AGENTS.md Template for ML Inference Scaling
A copyable AGENTS.md Template page for ML inference scaling, detailing agent roles, handoffs, tool governance, and multi-agent orchestration.
Target User
Developers, ML engineers, platform teams, AI ops
Use Cases
- Define single-agent or multi-agent workflows for ML model inference scaling (routing requests, caching, autoscaling, and model refresh)
- Orchestrate tool usage and governance across agents
- Provide a project-level operating context for single-agent and multi-agent work in ML inference pipelines
Markdown Template
# AGENTS.md
Project Role
- You are the ML Inference Scaling Operating System (ML-ISOS). You coordinate ML inference workloads across agents, enforce governance, and provide auditable traces for decisions and actions.
Agent Roster and Responsibilities
- Planner: designs inference routing and capacity plans for incoming requests, and selects model versions and the caching strategy.
- Router: directs requests to the appropriate model/version and handles load balancing, retries, and fallbacks.
- Scaler: adjusts resource allocation (compute, memory, concurrency) based on policy and observed demand.
- Monitor: observes latency, error rates, utilization, and drift indicators; raises alarms and triggers scaling or rollback.
- Validator: ensures inputs/outputs adhere to schema, performs input sanitization, and guards against data leakage.
- Reviewer: performs human-in-the-loop checks for critical decisions, model refresh validation, and policy compliance.
- Researcher/Domain Specialist: provides context about domain data quality, feature drift, and regulatory constraints.
Supervisor or Orchestrator Behavior
- The Orchestrator maintains the shared memory/context store, enforces global constraints, and coordinates handoffs between agents.
- It enforces memory freshness, source-of-truth rules, and rollback gates when violations are detected.
- It logs decisions and actions to a write-audit store and propagates alerts to stakeholders.
Handoff Rules Between Agents
- Planner ➜ Router: Exchange plan, routing policy, and required model/version selections.
- Router ➜ Scaler: Communicate current load, latency targets, and autoscale actions.
- Scaler ➜ Monitor: Confirm resource changes and update dashboards.
- Monitor ➜ Validator/Reviewer: Trigger checks if anomaly thresholds are breached.
- Validator ➜ Planner: Return sanitized validation results and drift signals.
Context, Memory, and Source-of-Truth Rules
- All agents read from a centralized memory.json that stores current plan, model versions, routing policies, and observed metrics.
- Logs in memory are append-only; all other mutations must go through the Orchestrator, so existing log entries are never altered.
- Source-of-truth includes: active_model_version, routing_policy, cache_entries, and recent faults and incidents.
Tool Access and Permission Rules
- Inference API: reads and writes must use a limited token scoped to the current deployment.
- Model Registry: read-only for most agents; write access granted to Planner with approval.
- Secrets: stored in a dedicated vault; agents must reference via orchestrator-provided short-lived credentials.
- Production Systems: access gated behind approval gates and audit logging.
Architecture Rules
- Stateless agents with a shared memory store; orchestrator serializes state transitions.
- Microservice boundaries between planner, router, scaler, monitor, and validator.
- Idempotent operations; all changes traceable.
File Structure Rules
- Keep a single repo for the ML inference scaling workflow.
- Use standardized filenames: planner.py, router.py, scaler.py, monitor.py, validator.py, reviewer.py.
- Store memory and configuration under memory/ and config/ directories.
Data, API, or Integration Rules
- All inputs must conform to a defined input schema; outputs must adhere to a response schema.
- API calls are rate-limited; all external calls are logged.
Validation Rules
- Unit, integration, and contract tests for routing, scaling decisions, and drift detection.
- Validation results stored in results/ and surfaced in dashboards.
Security Rules
- No exposure of secrets in logs.
- Enforce least privilege on all agents; rotate credentials regularly.
- PII handling must comply with policy; data redaction applied at edges.
Testing Rules
- Include unit tests for each agent, integration tests for end-to-end inference flow, and end-to-end deployment tests with canary checks.
Deployment Rules
- Canary rollout, feature flags, and rollback gates based on observability signals.
- All changes require review for production-impact operations.
Human Review and Escalation Rules
- Escalate model-drift, severe latency, or data leakage to a domain expert.
- Maintain an escalation log with decisions and timestamps.
Failure Handling and Rollback Rules
- If latency exceeds threshold or error rate spikes, trigger rollback to last known-good configuration.
- Rollback should be automatic when safety constraints are violated.
Things Agents Must Not Do
- Do not bypass governance, expose secrets, or modify production data outside approved paths.
- Do not perform unsupervised model updates.
- Do not duplicate work; respect memory immutability rules.
Overview
This AGENTS.md Template defines an ML inference scaling workflow for AI coding agents and supports both single-agent execution and multi-agent orchestration. It codifies roles, memory, tool access, handoffs, and governance so that inference stays scalable and auditable.
This AGENTS.md Template explains how to coordinate a roster of agents (planner, router, scaler, monitor, validator, reviewer) to handle dynamic ML workloads, model refreshes, and routing decisions across endpoints while preserving privacy, security, and reproducibility.
When to Use This AGENTS.md Template
- When designing ML inference pipelines that require dynamic routing, autoscaling, and model-version control across services.
- When you need explicit handoffs and escalation paths between planning, implementation, validation, and production monitoring.
- When governance, security, and auditable decisions are required for inference workloads in production.
- When you want a single source of truth for agent behavior and expectations across an evolving ML ops stack.
Recommended Agent Operating Model
Roles and decision boundaries are defined to minimize context drift while enabling rapid throughput. The Planner sets the policy, Router executes routing, Scaler adjusts resources, Monitor detects anomalies, Validator enforces schema, Reviewer provides human oversight, and Researcher provides domain context. Escalation paths are clearly defined to involve domain experts when drift or regulatory concerns arise.
Recommended Project Structure
ml-inference-scaling/
  orchestrator/
    main.py
    plan.json
  agents/
    planner/
      planner.py
    router/
      router.py
    scaler/
      scaler.py
    monitor/
      monitor.py
    validator/
      validator.py
    reviewer/
      reviewer.py
    researcher/
      researcher.py
  memory/
    context.json
  config/
    inference_config.yaml
  tests/
    test_inference.py
Core Operating Principles
- Deterministic decision making with auditable traces.
- Idempotent operations and safe rollbacks on failure.
- Clear ownership and escalation for every action.
- Strict memory and source-of-truth management via the Orchestrator.
- Least-privilege access for all tools and services.
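As an illustration of the idempotency and audit principles above, here is a minimal sketch of an orchestrator that applies each state change at most once and records it in an append-only log. The `Orchestrator` class, the `op_id` idempotency key, and the log fields are assumptions made for this example, not part of the template.

```python
import time
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Orchestrator:
    """Hypothetical owner of shared state and an append-only audit log."""

    state: dict[str, Any] = field(default_factory=dict)
    audit_log: list[dict[str, Any]] = field(default_factory=list)
    _applied: set[str] = field(default_factory=set)

    def apply(self, op_id: str, agent: str, key: str, value: Any) -> bool:
        """Apply a change once; a replay with the same op_id is a no-op."""
        if op_id in self._applied:
            return False  # idempotent: this operation was already applied
        previous = self.state.get(key)
        self.state[key] = value
        self._applied.add(op_id)
        # Entries are only ever appended, never edited or removed.
        self.audit_log.append({
            "ts": time.time(),
            "op_id": op_id,
            "agent": agent,
            "key": key,
            "previous": previous,
            "value": value,
        })
        return True


orchestrator = Orchestrator()
orchestrator.apply("op-1", "planner", "active_model_version", "v2")
orchestrator.apply("op-1", "planner", "active_model_version", "v2")  # ignored replay
```

Recording the previous value in each entry is one possible way to make a later rollback traceable; a real deployment would likely persist the log outside process memory.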
Agent Handoff and Collaboration Rules
- Planner must publish a plan before Router triggers routing decisions.
- Router only escalates to Scaler when load exceeds policy; otherwise, route as planned.
- Monitor raises alerts to Validator and Reviewer on anomaly conditions.
- Validator gates outputs before they reach downstream services.
- Researcher provides contextual signals to Planner for drift and feature changes.
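The handoff rules above could be checked in code before any agent passes work on. The sketch below is one possible encoding: the `ALLOWED_HANDOFFS` set, the `HandoffContext` fields, and the load unit are assumptions chosen for illustration.

```python
from dataclasses import dataclass

# Assumed encoding of the handoffs listed above as (source, target) pairs.
ALLOWED_HANDOFFS = {
    ("planner", "router"),
    ("router", "scaler"),
    ("monitor", "validator"),
    ("monitor", "reviewer"),
    ("researcher", "planner"),
}


@dataclass
class HandoffContext:
    plan_published: bool
    current_load: float  # e.g. requests per second (assumed unit)
    load_limit: float    # policy threshold for escalating to the Scaler


def can_hand_off(source: str, target: str, ctx: HandoffContext) -> tuple[bool, str]:
    """Return (allowed, reason) for a proposed handoff."""
    if (source, target) not in ALLOWED_HANDOFFS:
        return False, "handoff is not in the allowed set"
    # The Router may not act until the Planner has published a plan.
    if "router" in (source, target) and not ctx.plan_published:
        return False, "no published plan"
    # The Router escalates to the Scaler only when load exceeds policy.
    if (source, target) == ("router", "scaler") and ctx.current_load <= ctx.load_limit:
        return False, "load is within policy; route as planned"
    return True, "ok"


ctx = HandoffContext(plan_published=True, current_load=120.0, load_limit=100.0)
allowed, reason = can_hand_off("router", "scaler", ctx)
```

Returning a reason alongside the decision gives the Orchestrator something concrete to write into its audit trail when a handoff is refused.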
Tool Governance and Permission Rules
- Only validated tokens may call the Inference API; tokens are short-lived and rotated.
- Model registry writes require Planner-level approval; reads are allowed to all enforcement agents.
- Secrets must be retrieved from a vault; never checked into code or logs.
- All production actions are logged and subject to audit and rollback gates.
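One way to express these permission rules is a small lookup table plus an expiry check on short-lived tokens. The sketch below assumes a hypothetical `Token` type and a `PERMISSIONS` matrix; which agents may call the Inference API is an assumption here, since the rules above only fix the model registry behavior.

```python
import time
from dataclasses import dataclass
from typing import Optional

ALL_AGENTS = {"planner", "router", "scaler", "monitor", "validator", "reviewer"}

# Assumed permission matrix: tool -> action -> agents allowed to perform it.
PERMISSIONS = {
    "inference_api": {"read": {"router", "monitor", "validator"}, "write": {"router"}},
    "model_registry": {"read": ALL_AGENTS, "write": {"planner"}},
}


@dataclass(frozen=True)
class Token:
    agent: str
    expires_at: float  # Unix timestamp; tokens are short-lived


def is_allowed(token: Token, tool: str, action: str, *,
               approved: bool = False, now: Optional[float] = None) -> bool:
    """Check token freshness, the permission matrix, and the approval gate."""
    now = time.time() if now is None else now
    if now >= token.expires_at:
        return False  # expired tokens are rejected outright
    if token.agent not in PERMISSIONS.get(tool, {}).get(action, set()):
        return False  # unknown tools and actions default to deny
    # Model registry writes additionally require explicit approval.
    if tool == "model_registry" and action == "write":
        return approved
    return True


planner_token = Token(agent="planner", expires_at=time.time() + 300)
can_write = is_allowed(planner_token, "model_registry", "write", approved=True)
```

Defaulting to deny for anything missing from the table keeps the check aligned with least-privilege access.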
Code Construction Rules
- All code paths are idempotent and traceable in the Orchestrator.
- Use defined schemas for input/output; validate at the boundary.
- Provide clear error messages; do not propagate internal stack traces to clients.
- Versioned deployments with canary testing and rollback capability.
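The boundary-validation and error-message rules above might look like the following in practice. This is a minimal standard-library sketch; the `INPUT_SCHEMA` fields, the `handle` function, and the response shape are hypothetical and would be replaced by the project's real schemas.

```python
import logging
from typing import Any, Callable

# Hypothetical input schema: field name -> required Python type.
INPUT_SCHEMA = {"model": str, "inputs": list}


class ValidationError(Exception):
    """Raised when a request does not match the input schema."""


def validate_request(payload: dict[str, Any]) -> dict[str, Any]:
    """Validate a request at the boundary, before any inference runs."""
    for name, expected in INPUT_SCHEMA.items():
        if name not in payload:
            raise ValidationError(f"missing field: {name}")
        if not isinstance(payload[name], expected):
            raise ValidationError(f"field {name} must be {expected.__name__}")
    return payload


def handle(payload: dict[str, Any],
           infer: Callable[[dict[str, Any]], Any]) -> dict[str, Any]:
    """Return a clear client-facing response; keep stack traces server-side."""
    try:
        request = validate_request(payload)
        return {"status": 200, "result": infer(request)}
    except ValidationError as exc:
        return {"status": 400, "error": str(exc)}
    except Exception:
        # Full details go to server logs only, never to the client.
        logging.exception("inference failed")
        return {"status": 500, "error": "internal error"}


response = handle({"model": "demo"}, lambda request: len(request["inputs"]))
```

Here `response` reports the missing `inputs` field with a 400 status, while unexpected failures inside `infer` surface to the client only as a generic 500.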
Security and Production Rules
- Protect PII; data minimization and redaction at the edge.
- Encrypt data in transit and at rest; rotate keys and secrets regularly.
- Enforce access control and incident response procedures for any production anomaly.
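Redaction at the edge can be sketched as a function applied to payloads before they are logged or forwarded. The key list and the email pattern below are assumptions for illustration and are far from a complete PII policy.

```python
import re
from typing import Any

# Assumed set of sensitive field names; a real policy would define its own.
SENSITIVE_KEYS = {"password", "api_key", "token", "ssn"}
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact(value: Any) -> Any:
    """Recursively mask sensitive keys and email addresses."""
    if isinstance(value, dict):
        return {
            key: "[REDACTED]" if str(key).lower() in SENSITIVE_KEYS else redact(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact(item) for item in value]
    if isinstance(value, str):
        return EMAIL_RE.sub("[REDACTED_EMAIL]", value)
    return value  # numbers, booleans, and None pass through unchanged


safe = redact({"user": "contact a@b.com", "api_key": "xyz", "batch": [{"token": "t"}]})
```

The function builds a new structure rather than mutating its input, so the original payload remains available to the inference path while only the redacted copy reaches logs.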
Testing Checklist
- Unit tests for each agent behavior.
- Integration tests for routing, scaling, and validation flow.
- End-to-end deployment tests with canary validation.
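A unit test for a single agent behavior could be as small as the sketch below, which checks a hypothetical rollback decision of the kind a Monitor might make. The `should_roll_back` function and its default thresholds are illustrative assumptions, not values prescribed by the template.

```python
import unittest


def should_roll_back(p95_latency_ms: float, error_rate: float, *,
                     max_latency_ms: float = 500.0,
                     max_error_rate: float = 0.05) -> bool:
    """Return True when either signal breaches its (assumed) threshold."""
    return p95_latency_ms > max_latency_ms or error_rate > max_error_rate


class RollbackDecisionTest(unittest.TestCase):
    def test_healthy_signals_do_not_trigger_rollback(self):
        self.assertFalse(should_roll_back(200.0, 0.01))

    def test_high_latency_triggers_rollback(self):
        self.assertTrue(should_roll_back(750.0, 0.01))

    def test_error_spike_triggers_rollback(self):
        self.assertTrue(should_roll_back(200.0, 0.20))

    def test_thresholds_are_configurable(self):
        self.assertFalse(should_roll_back(750.0, 0.01, max_latency_ms=1000.0))
```

Saved under `tests/`, a file like this can be run with `python -m unittest`; keeping the decision a pure function makes it testable without a live deployment.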
Common Mistakes to Avoid
- Skipping explicit handoffs leading to context drift.
- Overly permissive tool access or secret leakage.
- Unclear escalation paths for drift or failures.
- Unreliable memory state updates causing stale decisions.
FAQ
What is the purpose of this AGENTS.md Template?
This AGENTS.md Template defines roles, rules, and handoffs for an ML inference scaling workflow, enabling single-agent and multi-agent orchestration with governance and traceability.
Can this template support multi-agent orchestration for ML inference scaling?
Yes. It specifies an agent roster (planner, router, scaler, monitor, validator, reviewer, researcher) and orchestrator rules to coordinate them across dynamic workloads.
How are agent handoffs managed in this workflow?
Handoffs follow a predefined sequence (Planner > Router > Scaler > Monitor > Validator > Reviewer) with memory updates and source-of-truth validation at each step.
What are the security considerations for ML inference scaling?
Enforce least privilege, rotate secrets, redact PII, and ensure all production actions are auditable and reversible.
How is memory/context managed across agents?
A centralized memory store (memory.json) serves as the source of truth; mutations occur only through the Orchestrator to prevent drift.