AGENTS.md Template for AI Evaluation Pipeline Design

AGENTS.md template for an AI evaluation pipeline, enabling multi-agent orchestration, handoffs, and governance.

Target User

Developers, AI/ML engineers, engineering leads, product teams

Use Cases

  • Define and govern an AI evaluation workflow with single-agent and multi-agent orchestration
  • Coordinate planner, runner, evaluator, and reviewer agents across evaluation runs
  • Maintain a single source of truth for evaluations and artifacts
  • Enforce tool access, secrets management, and auditability during experiments

Markdown Template

# AGENTS.md

Project Role: AI Evaluation Pipeline Lead
Agent roster and responsibilities:
  - PlannerAgent: designs evaluation plan, experiments, and timelines
  - RunnerAgent: executes evaluation tasks, runs experiments, and collects logs
  - EvaluatorAgent: computes metrics, compares models, and flags anomalies
  - ReviewerAgent: reviews results, approves conclusions, and documents decisions
  - ResearchAgent: sources datasets, benchmarks, and literature for context
  - DomainSpecialistAgent: applies domain constraints and safety guardrails

Supervisor or orchestrator behavior:
  - OrchestratorAgent coordinates task queues, maintains run_id, and enforces deadlines
  - It issues assignments, tracks progress, and surfaces blockers to human review

Handoff rules between agents:
  - Planner -> Runner: hand off the task list and run_id, with expected outputs
  - Runner -> Evaluator: hand off raw results and metrics
  - Evaluator -> Reviewer: hand off summarized findings and confidence
  - Researcher -> DomainSpecialist: hand off domain constraints and data quality notes

Context, memory, and source-of-truth rules:
  - All decisions and outputs are stored in the central evaluation store with run_id
  - References to data sources, benchmarks, and models are tracked in the evaluation manifest
  - Source-of-truth: evaluation_results table with stable identifiers

Tool access and permission rules:
  - Tools: evaluation harness, data lake, experiment tracker, secrets manager
  - Access: read/write to run artifacts, metrics, and logs; secrets are accessed only through the secured vault
  - Do not expose keys in code or logs

Architecture rules:
  - Microservice-like components communicate via well-defined interfaces
  - No direct cross-project file system access unless explicitly authorized
  - All tools must emit structured, schema-validated outputs

File structure rules:
  - Place all agent scripts under agents/ with clear naming
  - Keep configuration under configs/ and docs under docs/

Data, API, or integration rules:
  - Data sources must be versioned; API endpoints must be stable and documented
  - All external calls must be logged and retriable

Validation rules:
  - Validate data schema, run_id uniqueness, and metric integrity
  - Ensure results meet minimum confidence thresholds before promotion

Security rules:
  - Secrets must be stored in a vault; no plaintext in code
  - Access controls enforced per role; audit trails enabled

Testing rules:
  - Unit tests for each agent function; integration tests for end-to-end pipeline
  - Mock external services for offline testing

Deployment rules:
  - Deploy evaluation components in staged environments
  - Roll back if run results deviate beyond the defined thresholds

Human review and escalation rules:
  - Escalate failures or anomalies to human reviewers with context
  - Provide a summarized, interpretable justification for decisions

Failure handling and rollback rules:
  - Roll back evaluation run artifacts on failure; preserve logs for audit

Things Agents must not do:
  - Do not leak secrets; do not modify production data without approval
  - Do not perform unsupervised production changes
  - Do not drift from the defined evaluation protocol

Overview

This AGENTS.md template is an operating manual for an AI evaluation pipeline that supports both single-agent and multi-agent orchestration. It defines the roles, responsibilities, and governance rules that AI agents follow during evaluation tasks, and it guides teams through designing, validating, and auditing evaluation workflows with explicit handoffs and memory rules.

When to Use This AGENTS.md Template

  • When designing an evaluation workflow for AI models and agents
  • When coordinating multi-agent orchestration with planning, execution, and review
  • When enforcing tool governance, memory, and source-of-truth across runs
  • When establishing escalation and human-in-the-loop review for critical evaluations

Recommended Agent Operating Model

The recommended operating model assigns each agent a clear role, decision boundary, and escalation path. PlannerAgent, RunnerAgent, EvaluatorAgent, ReviewerAgent, ResearchAgent, and DomainSpecialistAgent operate under an OrchestratorAgent that coordinates handoffs and enforces governance and auditability. If a decision exceeds a predefined risk threshold, or would violate a memory or traceability rule, escalate it to human review.
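
The escalation rule can be sketched as a small gate function. This is a minimal illustration, not part of the template: the names `GatePolicy` and `gate`, the numeric thresholds, and the three outcome labels are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GatePolicy:
    # Hypothetical thresholds; a real project would load these from configs/.
    min_confidence: float = 0.8
    max_risk: float = 0.5


def gate(confidence: float, risk: float, policy: GatePolicy = GatePolicy()) -> str:
    """Decide what happens to a result after evaluation.

    Risk is checked first, so a high-risk result always reaches a human,
    even when the evaluator is confident.
    """
    if risk > policy.max_risk:
        return "escalate_to_human"
    if confidence < policy.min_confidence:
        return "hold"  # assumed label: not promoted, awaiting more evidence
    return "promote"
```

Checking risk before confidence is a design choice for this sketch; it keeps the human-review path from being bypassed by a confident but risky result.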

Recommended Project Structure

ai-eval-pipeline/
  agents/
    planner/
    runner/
    evaluator/
    reviewer/
    researcher/
    domain-specialist/
  pipelines/
    default/
  configs/
  data/
  docs/
  tests/
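
One way to create this layout is a short scaffold script. The sketch below uses only the directory names shown above; the function name `scaffold` and the `LAYOUT` constant are illustrative, not part of the template.

```python
from pathlib import Path

# Directory names copied from the recommended structure above.
LAYOUT = [
    "agents/planner",
    "agents/runner",
    "agents/evaluator",
    "agents/reviewer",
    "agents/researcher",
    "agents/domain-specialist",
    "pipelines/default",
    "configs",
    "data",
    "docs",
    "tests",
]


def scaffold(root: Path) -> list:
    """Create the project directories under root and return their paths.

    exist_ok=True makes the call safe to repeat on an existing project.
    """
    created = []
    for relative in LAYOUT:
        path = root / relative
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

For example, `scaffold(Path("ai-eval-pipeline"))` creates the tree in the current working directory.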

Core Operating Principles

  • Single source of truth per evaluation run identified by run_id
  • Explicit, documented handoffs with input/output contracts
  • Role-based access control and auditable action traces
  • Deterministic, testable agent behavior with clear failure handling
  • Separation of concerns: planning, execution, evaluation, and review

Agent Handoff and Collaboration Rules

  • Planner to Runner: provide task list, run_id, configuration, and deadlines
  • Runner to Evaluator: supply results, metrics, and potential data issues
  • Evaluator to Reviewer: summarize findings, confidence, and threshold status
  • Researcher to DomainSpecialist: attach domain constraints and data quality notes
  • All to Orchestrator: publish run status and any blockers
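
The handoff rules above can be expressed as input/output contracts that are checked before a receiving agent starts work. The following is a minimal sketch: the `Handoff` class, the `validate_handoff` function, and the exact payload field names are assumptions derived from the bullets, not a fixed schema.

```python
from dataclasses import dataclass, field

# Required payload fields per sender -> receiver pair.
# Field names are hypothetical, derived from the handoff bullets above.
REQUIRED_FIELDS = {
    ("Planner", "Runner"): {"task_list", "configuration", "deadlines"},
    ("Runner", "Evaluator"): {"results", "metrics", "data_issues"},
    ("Evaluator", "Reviewer"): {"findings", "confidence", "threshold_status"},
    ("Researcher", "DomainSpecialist"): {"domain_constraints", "data_quality_notes"},
}


@dataclass(frozen=True)
class Handoff:
    run_id: str
    sender: str
    receiver: str
    payload: dict = field(default_factory=dict)


def validate_handoff(handoff: Handoff) -> list:
    """Return a list of problems; an empty list means the handoff is acceptable."""
    problems = []
    if not handoff.run_id:
        problems.append("missing run_id")
    pair = (handoff.sender, handoff.receiver)
    if pair not in REQUIRED_FIELDS:
        problems.append(f"handoff {handoff.sender} -> {handoff.receiver} is not allowed")
        return problems
    # Report missing fields in sorted order so the output is deterministic.
    for name in sorted(REQUIRED_FIELDS[pair] - set(handoff.payload)):
        problems.append(f"missing payload field: {name}")
    return problems
```

Returning a list of problems rather than raising on the first one lets the orchestrator publish every blocker for a run in a single status update.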

Tool Governance and Permission Rules

  • Command execution must be logged with run_id and user identity
  • File edits require version control and review for production-affecting changes
  • API calls must be authenticated; secret tokens never logged
  • Production systems updated only through approved pipelines with rollback
  • All external service access must be auditable and rate-limited
  • Handoffs and gate checks require explicit approval when thresholds are breached
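
As an illustration of the logging rules, the sketch below builds one structured audit line per command and redacts values whose key looks sensitive. The function names and the list of sensitive markers are assumptions for this example, and only top-level keys are inspected.

```python
import json
from datetime import datetime, timezone

# Hypothetical key-name fragments treated as sensitive.
SENSITIVE_MARKERS = ("token", "secret", "password", "api_key")


def redact(params: dict) -> dict:
    """Replace values whose key looks sensitive with a fixed placeholder."""
    return {
        key: "[REDACTED]" if any(m in key.lower() for m in SENSITIVE_MARKERS) else value
        for key, value in params.items()
    }


def audit_record(run_id: str, user: str, command: str, params: dict) -> str:
    """Build one machine-readable audit line for a command execution."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "run_id": run_id,
        "user": user,
        "command": command,
        "params": redact(params),
    }
    return json.dumps(record, sort_keys=True)
```

Key-based redaction is a coarse filter. It does not catch secrets embedded in free-text values or nested structures, so it complements, rather than replaces, keeping secrets out of parameters in the first place.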

Code Construction Rules

  • Follow the interfaces defined in Planner and Orchestrator contracts
  • Validate all inputs against the evaluation schema before execution
  • Make operations idempotent per run_id so that retries do not create duplicate evaluations
  • Log structured outputs with machine-readable fields
  • Avoid hard-coding secrets or credentials in code
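
The idempotency rule can be sketched with an in-memory stand-in for the evaluation store. The `EvaluationStore` class and its `run_once` method are hypothetical; a real store would persist results and guard against concurrent writers.

```python
class EvaluationStore:
    """In-memory stand-in for the central evaluation store (hypothetical)."""

    def __init__(self):
        self._results = {}
        self.executions = 0  # counts how many evaluations actually ran

    def run_once(self, run_id, evaluate):
        """Call evaluate() only the first time a run_id is seen.

        Later calls with the same run_id return the stored result,
        so a retried task cannot create a duplicate evaluation.
        """
        if run_id in self._results:
            return self._results[run_id]
        result = evaluate()
        self.executions += 1
        self._results[run_id] = result
        return result
```

A usage example: calling `store.run_once("run-001", job)` twice runs `job` once and returns the same result both times.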

Security and Production Rules

  • Secrets stored in secure vaults; never in code or logs
  • Access controls by role; multi-factor authentication for sensitive actions
  • All deployment changes pass through staged environments and rollback plans
  • Regular audits of evaluation artifacts and access logs
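
Role-based access with an audit trail might look like the sketch below. The permission strings and the role-to-permission mapping are invented for illustration; the template does not prescribe them.

```python
# Hypothetical permission map; actual permissions depend on the project.
ROLE_PERMISSIONS = {
    "RunnerAgent": {"artifacts:write", "logs:write"},
    "EvaluatorAgent": {"artifacts:read", "metrics:write"},
    "ReviewerAgent": {"artifacts:read", "metrics:read", "decisions:write"},
}


def authorize(role: str, action: str, audit_trail: list) -> bool:
    """Check an action against the role's permissions and record the attempt.

    Unknown roles get an empty permission set, so they are denied by default.
    Denied attempts are recorded too, which keeps the trail useful for audits.
    """
    allowed = action in ROLE_PERMISSIONS.get(role, set())
    audit_trail.append({"role": role, "action": action, "allowed": allowed})
    return allowed
```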

Testing Checklist

  • Unit tests for each agent function
  • Integration tests for end-to-end evaluation runs
  • End-to-end tests in a staging environment with synthetic data
  • Performance checks for metric calculation and data transfer
  • Security and access control verification
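
A unit test with a mocked external service could be sketched as follows, using the standard library's `unittest.mock`. The helper `collect_metrics` and the client method `fetch_scores` are hypothetical stand-ins for an agent function and an external scoring service.

```python
import unittest
from unittest.mock import Mock


def collect_metrics(client, run_id):
    """Hypothetical agent helper: fetch raw scores and average them."""
    scores = client.fetch_scores(run_id)
    if not scores:
        raise ValueError(f"no scores returned for {run_id}")
    return {"run_id": run_id, "mean_score": sum(scores) / len(scores)}


class CollectMetricsTest(unittest.TestCase):
    def test_mean_is_computed_from_mocked_service(self):
        client = Mock()  # replaces the external service for offline testing
        client.fetch_scores.return_value = [0.5, 1.0]
        result = collect_metrics(client, "run-001")
        self.assertEqual(result, {"run_id": "run-001", "mean_score": 0.75})
        client.fetch_scores.assert_called_once_with("run-001")

    def test_empty_response_is_rejected(self):
        client = Mock()
        client.fetch_scores.return_value = []
        with self.assertRaises(ValueError):
            collect_metrics(client, "run-001")
```

Saved under tests/, a file like this can be run with `python -m unittest` without network access.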

Common Mistakes to Avoid

  • Skipping explicit handoffs, leading to context drift
  • Leaking secrets or credentials into logs or artifacts
  • Bypassing human review for high-risk evaluations
  • Unbounded memory growth without purging old run artifacts
  • Over-parameterizing prompts or agent behavior, causing nondeterminism

FAQ

What is the purpose of this AGENTS.md template for AI evaluation pipelines?

It provides a complete operating manual for designing, orchestrating, and governing AI evaluation workflows with multiple agents.

How should handoffs be orchestrated between Planner, Runner, Evaluator, and Reviewer?

Handoffs follow a strict sequence: Planner defines tasks, Runner executes, Evaluator analyzes results, and Reviewer approves before reporting.

How is memory maintained and what is the source of truth for evaluation results?

All decisions and outputs are stored in a central evaluation store, keyed by run_id. The evaluation_results table, with its stable identifiers, is the source of truth.

What are the security and permission rules for tools and secrets in this workflow?

Secrets are stored in a vault, access is role-based, and no plaintext secrets are logged or committed to code.

How do you validate and deploy evaluation results safely?

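Results are validated against data schemas and minimum confidence thresholds before promotion. Evaluation components are deployed through staged environments, with a rollback if run results deviate beyond the defined thresholds.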