
CLAUDE.md Template for Direct OpenAI API Integration

A production-ready CLAUDE.md template for direct OpenAI API integrations, focusing on strict structured outputs using Pydantic, native async streaming, precise token handling, and resilient network retry logic.

Tags: CLAUDE.md, OpenAI API, Structured Outputs, Python SDK, AsyncIO, Token Management, Prompt Engineering, AI Coding Assistant

Target User

Backend developers, AI engineers, systems architects, and SaaS builders who prefer lightweight, highly optimized, direct OpenAI SDK implementations over heavy orchestrators.

Use Cases

  • Building zero-overhead asynchronous direct OpenAI API wrappers
  • Enforcing strict Pydantic parsing with native OpenAI Structured Outputs
  • Implementing low-latency chunk-by-chunk async streaming loops
  • Configuring resilient exponential backoff and rate-limit retry logic
  • Optimizing raw context lengths and system/user prompt definitions


What is this CLAUDE.md template for?

This CLAUDE.md template instructs your AI coding assistant to design direct, native OpenAI API integrations without the code bloat of heavy orchestration frameworks. While wrapper libraries have their place, direct SDK implementations provide the lowest latency, full feature surface coverage, and predictable memory overhead for production applications.

This configuration establishes explicit guardrails for asynchronous execution, strict schema enforcement via native OpenAI Structured Outputs, streaming chunk processing, token usage optimization, and structured retry mechanics for resilience.

When to use this template

Use this template when implementing high-volume microservices, low-latency streaming endpoints, complex JSON extraction jobs, direct function/tool execution loops, or cost-sensitive data classification tasks where you need granular control over the raw payload mechanics.

Recommended project structure

project-root/
  app/
    services/
      openai_client.py
      prompt_manager.py
    schemas/
      completion_shapes.py
      tool_specs.py
    core/
      config.py
      exceptions.py
    main.py
  tests/
  .env.example
  CLAUDE.md
  requirements.txt

CLAUDE.md Template

# CLAUDE.md: Production OpenAI SDK Integration Guide

You are operating as a Senior AI Infrastructure Engineer specialized in low-latency direct LLM integrations, precise token utilization, and high-throughput asynchronous execution.

Your primary objective is to build clean, framework-free, resilient wrappers using the native OpenAI Python SDK.

## Core Integration Principles

- **Native Async Clients**: Always use the asynchronous client (`AsyncOpenAI`) for backend server loops. Never instantiate blocking synchronous clients within async workflows.
- **Guaranteed Structured Outputs**: Maximize data reliability by enforcing strict response formats. Pass your Pydantic model as `response_format` to `client.beta.chat.completions.parse()` for schema-validated responses (see the sketch after this list).
- **Granular Token & Context Control**: Always configure model parameters explicitly (`max_completion_tokens`, `temperature`, `seed`) to keep output length bounded and sampling as reproducible as the API allows.
- **Resilient Network Lifecycles**: Encapsulate completion tasks within explicit retry logic using exponential backoff to handle rate limits (HTTP 429) and transient gateway issues smoothly.
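
A minimal sketch of the structured-output pattern; the model name and the `TicketTriage` schema are illustrative assumptions (requires `openai>=1.40` and `pydantic>=2`):

```python
from openai import AsyncOpenAI
from pydantic import BaseModel

client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment


class TicketTriage(BaseModel):
    category: str
    priority: int
    summary: str


async def triage(ticket_text: str) -> TicketTriage:
    completion = await client.beta.chat.completions.parse(
        model="gpt-4o-2024-08-06",
        messages=[
            {"role": "system", "content": "Classify the support ticket."},
            {"role": "user", "content": ticket_text},
        ],
        response_format=TicketTriage,
        max_completion_tokens=256,
        temperature=0,
    )
    # .parsed is None on refusals; handle that case in production code.
    return completion.choices[0].message.parsed
```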

## Code Construction Rules

### 1. Client Lifecycle & Singleton Pattern
- Initialize a single, centralized `AsyncOpenAI` instance as an application lifecycle dependency or global singleton (see the sketch below). Never recreate client contexts dynamically per request.
- Load sensitive credentials via secure environment configuration using a settings-parsing library. Never hardcode API keys or project identifiers.
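
A minimal sketch of this lifecycle, assuming `pydantic-settings` for configuration (the library choice and names are illustrative):

```python
from functools import lru_cache

from openai import AsyncOpenAI
from pydantic_settings import BaseSettings, SettingsConfigDict


class Settings(BaseSettings):
    model_config = SettingsConfigDict(env_file=".env")

    openai_api_key: str  # read from OPENAI_API_KEY or .env


@lru_cache(maxsize=1)
def get_client() -> AsyncOpenAI:
    # One client per process: its connection pool is reused across requests.
    return AsyncOpenAI(api_key=Settings().openai_api_key)
```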

### 2. Payload Structure & Prompt Engineering
- Maintain a complete separation between application code and prompts. Isolate prompt definitions into dedicated template files or a versioned prompt registry.
- When defining tools for function calling, validate arguments with strict Pydantic models so the parsed arguments match the JSON schema sent in the `tools` array exactly (see the sketch below).
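
A minimal sketch using the SDK's `pydantic_function_tool` helper; the weather tool itself is a hypothetical example:

```python
import openai
from pydantic import BaseModel, Field


class GetWeather(BaseModel):
    """Look up current weather for a city."""

    city: str = Field(description="City name, e.g. 'Berlin'")
    unit: str = Field(description="'celsius' or 'fahrenheit'")


# Generates a strict JSON-schema tool spec directly from the model.
tools = [openai.pydantic_function_tool(GetWeather)]


def parse_tool_call(raw_arguments: str) -> GetWeather:
    # Validate the model's arguments against the same schema it was given.
    return GetWeather.model_validate_json(raw_arguments)
```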

### 3. Async Streaming & Event Handling
- Implement low-latency streaming with `await client.chat.completions.create(..., stream=True)` and iterate the result with `async for`.
- Process the stream chunk-by-chunk, verifying that each chunk carries delta content before appending it to buffers or server-sent event (SSE) streams (see the sketch below).
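
A minimal streaming sketch (the model name is an illustrative assumption; the SSE forwarding step is left abstract):

```python
from openai import AsyncOpenAI

client = AsyncOpenAI()


async def stream_answer(prompt: str):
    stream = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    async for chunk in stream:
        # Guard: some chunks (tool calls, final usage frames) carry no text.
        if chunk.choices and chunk.choices[0].delta.content:
            yield chunk.choices[0].delta.content  # feed an SSE buffer here
```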

### 4. Error Mapping & Defensiveness
- Isolate OpenAI-specific errors with high precision. Write distinct exception catches for `openai.RateLimitError`, `openai.APIConnectionError`, and `openai.BadRequestError`.
- Always capture token usage from the response body's `usage` object (not from HTTP headers) to ensure accurate logging for telemetry and cost tracking (see the retry sketch below).
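
A minimal sketch combining typed error handling, exponential backoff, and usage logging; the retry count and logger name are illustrative:

```python
import asyncio
import logging

import openai

log = logging.getLogger("llm")


async def complete_with_retry(client, max_retries: int = 5, **create_kwargs):
    for attempt in range(max_retries):
        try:
            response = await client.chat.completions.create(**create_kwargs)
            # Usage lives on the response body, not on HTTP headers.
            log.info(
                "tokens prompt=%d completion=%d",
                response.usage.prompt_tokens,
                response.usage.completion_tokens,
            )
            return response
        except (openai.RateLimitError, openai.APIConnectionError):
            await asyncio.sleep(2**attempt)  # 1s, 2s, 4s, ... backoff
        except openai.BadRequestError:
            raise  # malformed payloads are never retryable
    raise RuntimeError("OpenAI call failed after retries")
```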

## Formatting & Diagnostics
- Write unit tests using specialized mock client fixtures (`unittest.mock` or `pytest-mock`) to simulate precise completion shapes and failure codes without live API calls (see the sketch below).
- Maintain traceability by attaching trace identifiers (e.g., request IDs or user context) via custom request headers or the API's `user` parameter where applicable.
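
A minimal sketch of a mocked completion test; the response shape mirrors the SDK's objects, and in a real suite you would inject this mock into your service (requires the `pytest-asyncio` plugin):

```python
from types import SimpleNamespace
from unittest.mock import AsyncMock

import pytest


@pytest.mark.asyncio
async def test_completion_shape():
    fake_response = SimpleNamespace(
        choices=[SimpleNamespace(message=SimpleNamespace(content="ok"))],
        usage=SimpleNamespace(prompt_tokens=12, completion_tokens=3),
    )
    client = AsyncMock()
    client.chat.completions.create.return_value = fake_response

    response = await client.chat.completions.create(
        model="gpt-4o-mini", messages=[{"role": "user", "content": "hi"}]
    )
    assert response.choices[0].message.content == "ok"
    client.chat.completions.create.assert_awaited_once()
```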

Why this template matters

Direct SDK integrations require careful programming discipline. AI models left unchecked often mix up sync and async client initializations, generate incorrect payload keys, or fail to handle network timeouts. More importantly, they frequently write loose JSON prompts instead of using OpenAI's guaranteed parse() methods, which leads to intermittent schema failures in production.

This blueprint guards against these common bugs by enforcing async client lifecycles, strict Pydantic validation layers, and efficient stream-parsing configurations.

Recommended additions

  • Add targeted specifications for tracking exact input and output token consumption using tokenizer libraries like tiktoken.
  • Incorporate pre-built blueprints for executing parallel asynchronous batch completions using asyncio tasks (a combined sketch of both follows this list).
  • Define a clear structure for injecting custom caching wrappers (e.g., Redis semantic caching) ahead of direct API requests.
  • Include explicit guidelines for parsing multi-modal payloads (e.g., passing image data chunks securely).
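
A minimal sketch of the first two suggestions, assuming a recent `tiktoken` release that knows the chosen model (the model name and batch shape are illustrative):

```python
import asyncio

import tiktoken
from openai import AsyncOpenAI

client = AsyncOpenAI()
encoding = tiktoken.encoding_for_model("gpt-4o-mini")


def count_input_tokens(text: str) -> int:
    # Approximate prompt-side count; exact totals add message framing overhead.
    return len(encoding.encode(text))


async def classify_batch(texts: list[str]) -> list[str]:
    async def classify(text: str) -> str:
        response = await client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": text}],
        )
        return response.choices[0].message.content or ""

    # Fan out all requests concurrently on the shared client.
    return list(await asyncio.gather(*(classify(t) for t in texts)))
```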

FAQ

Why does this template favor direct SDK usage over LangChain or LlamaIndex?

Direct usage eliminates unnecessary abstraction layers, provides lower latency overhead, allows immediate access to new OpenAI features, and gives you absolute control over execution contexts and prompt payloads.

How does the template handle schema validation?

It explicitly mandates OpenAI's native Structured Outputs (`beta.chat.completions.parse()`), ensuring that returned responses are validated against your typed Pydantic models before your code consumes them.

Does it include automatic error recovery?

Yes. The rules require wrapping direct calls in explicit exception structures with exponential backoff logic, safely handling rate boundaries and network interruptions without crashing your backend application.

Can this handle streaming and function calling concurrently?

Absolutely. The architecture rules define separate guidelines for async streaming and typed tool-calling parameters, allowing you to combine them safely for low-latency interactive workflows.

About the author

Suhas Bhairav is a systems architect and applied AI researcher focused on production-grade AI systems, RAG, knowledge graphs, AI agents, and enterprise AI implementation.