AGENTS.md Template for AI Evaluation Pipeline Design

Overview

This AGENTS.md template is an operating manual for an AI evaluation pipeline that supports both single-agent and multi-agent orchestration. It defines roles, responsibilities, and governance for AI coding agents working on evaluation tasks, and it guides teams through designing, validating, and auditing evaluation workflows with explicit handoffs and memory rules.

When to Use This AGENTS.md Template

When designing an evaluation workflow for AI models and agents
When coordinating multi-agent orchestration with planning, execution, and review
When enforcing tool governance, memory, and source-of-truth across runs
When establishing escalation and human-in-the-loop review for critical evaluations

Copyable AGENTS.md Template

Paste this block into AGENTS.md to initialize the project operating context for an AI evaluation pipeline.

# AGENTS.md

Project Role: AI Evaluation Pipeline Lead
Agent roster and responsibilities:
  - PlannerAgent: designs evaluation plan, experiments, and timelines
  - RunnerAgent: executes evaluation tasks, runs experiments, and collects logs
  - EvaluatorAgent: computes metrics, compares models, and flags anomalies
  - ReviewerAgent: reviews results, approves conclusions, and documents decisions
  - ResearchAgent: sources datasets, benchmarks, and literature for context
  - DomainSpecialistAgent: applies domain constraints and safety guardrails

Supervisor or orchestrator behavior:
  - OrchestratorAgent coordinates task queues, maintains run_id, and enforces deadlines
  - It issues assignments, tracks progress, and surfaces blockers to human review

Handoff rules between agents:
  - Planner -> Runner: handoff task list and run_id, with expected outputs
  - Runner -> Evaluator: handoff raw results and metrics
  - Evaluator -> Reviewer: handoff summarized findings and confidence
  - Researcher -> DomainSpecialist: handoff domain constraints and data quality notes

Context, memory, and source-of-truth rules:
  - All decisions and outputs are stored in the central evaluation store with run_id
  - References to data sources, benchmarks, and models are tracked in the evaluation manifest
  - Source-of-truth: evaluation_results table with stable identifiers

Tool access and permission rules:
  - Tools: evaluation harness, data lake, experiment tracker, secrets manager
  - Access: read/write to run artifacts, metrics, logs; secrets are accessed via secured vault
  - Do not expose keys in code or logs

Architecture rules:
  - Microservice-like components communicate via well-defined interfaces
  - No direct cross-project file system access unless explicitly authorized
  - All tools must emit structured, schema-validated outputs

File structure rules:
  - Place all agent scripts under agents/ with clear naming
  - Keep configuration under configs/ and docs under docs/

Data, API, or integration rules:
  - Data sources must be versioned; API endpoints must be stable and documented
  - All external calls must be logged and retriable

Validation rules:
  - Validate data schema, run_id uniqueness, and metric integrity
  - Ensure results meet minimum confidence thresholds before promotion

Security rules:
  - Secrets must be stored in a vault; no plaintext in code
  - Access controls enforced per role; audit trails enabled

Testing rules:
  - Unit tests for each agent function; integration tests for end-to-end pipeline
  - Mock external services for offline testing

Deployment rules:
  - Deploy evaluation components in staged environments
  - Roll back if run results deviate beyond the defined thresholds

Human review and escalation rules:
  - Escalate failures or anomalies to human reviewers with context
  - Provide a summarized, interpretable justification for decisions

Failure handling and rollback rules:
  - Roll back evaluation run artifacts on failure; preserve logs for audit

Things agents must not do:
  - Do not leak secrets; do not modify production data without approval
  - Do not perform unsupervised production changes
  - Do not drift from the defined evaluation protocol

Recommended Agent Operating Model

The recommended operating model assigns clear roles with decision boundaries and escalation paths to support reliable AI evaluation. The key agents are PlannerAgent, RunnerAgent, EvaluatorAgent, ReviewerAgent, ResearchAgent, and DomainSpecialistAgent. They operate under an OrchestratorAgent that coordinates handoffs and ensures governance and auditability. If a decision exceeds a predefined risk threshold or violates a memory or traceability constraint, escalate it to human review.
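
The escalation rule above can be sketched as a small gate function. This is a minimal illustration, not part of the template: the `EscalationPolicy` class, its threshold values, and the `needs_human_review` function are hypothetical names chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationPolicy:
    # Hypothetical thresholds; tune them per project.
    min_confidence: float = 0.8
    max_risk: float = 0.5


def needs_human_review(
    confidence: float,
    risk: float,
    has_trace: bool,
    policy: EscalationPolicy = EscalationPolicy(),
) -> list[str]:
    """Return the reasons a decision must be escalated (empty list means proceed)."""
    reasons = []
    if confidence < policy.min_confidence:
        reasons.append("confidence below threshold")
    if risk > policy.max_risk:
        reasons.append("risk above threshold")
    if not has_trace:
        # Assumption: a decision without a stored trace cannot be audited later.
        reasons.append("missing traceability record")
    return reasons
```

Returning the list of reasons, rather than a bare boolean, gives the human reviewer the context the template asks for when a run is escalated.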

Recommended Project Structure

ai-eval-pipeline/
  agents/
    planner/
    runner/
    evaluator/
    reviewer/
    researcher/
    domain-specialist/
  pipelines/
    default/
  configs/
  data/
  docs/
  tests/

Core Operating Principles

Single source of truth per evaluation run identified by run_id
Explicit, documented handoffs with input/output contracts
Role-based access control and auditable action traces
Deterministic, testable agent behavior with clear failure handling
Separation of concerns: planning, execution, evaluation, and review

Agent Handoff and Collaboration Rules

Planner to Runner: provide task list, run_id, configuration, and deadlines
Runner to Evaluator: supply results, metrics, and potential data issues
Evaluator to Reviewer: summarize findings, confidence, and threshold status
Researcher to DomainSpecialist: attach domain constraints and data quality notes
All to Orchestrator: publish run status and any blockers
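
One way to make a handoff explicit is to give it a typed payload that is validated before it is passed on. The sketch below covers only the Planner to Runner step; the `PlannerToRunnerHandoff` class, its field names, and the `hand_off` helper are assumptions made for illustration, and the status list stands in for whatever channel the Orchestrator actually uses.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class PlannerToRunnerHandoff:
    """Hypothetical input/output contract for the Planner to Runner handoff."""
    run_id: str
    tasks: tuple[str, ...]
    config: dict = field(default_factory=dict)
    deadline_iso: str = ""

    def validate(self) -> None:
        if not self.run_id:
            raise ValueError("run_id is required")
        if not self.tasks:
            raise ValueError("at least one task is required")


def hand_off(payload: PlannerToRunnerHandoff, status_log: list) -> PlannerToRunnerHandoff:
    """Validate the payload, then publish a status event for the Orchestrator."""
    payload.validate()
    status_log.append(
        {"run_id": payload.run_id, "event": "planner_to_runner", "tasks": len(payload.tasks)}
    )
    return payload
```

The same pattern could be repeated for the Runner to Evaluator and Evaluator to Reviewer steps, each with its own payload type.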

Tool Governance and Permission Rules

Command execution must be logged with run_id and user identity
File edits require version control and review for production-affecting changes
API calls must be authenticated; secret tokens never logged
Production systems updated only through approved pipelines with rollback
All external service access must be auditable and rate-limited
Handoffs and gate checks require explicit approval when thresholds are breached
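
The logging rules above (record run_id and user identity, never log secret tokens) can be sketched as a function that builds a structured audit record. The `audit_record` function and the key-name pattern are hypothetical. The redaction here only inspects parameter names, so it is a sketch of the idea rather than a complete safeguard.

```python
import json
import re
from datetime import datetime, timezone

# Hypothetical pattern: redact values whose key names look like credentials.
SECRET_KEY_PATTERN = re.compile(r"(token|secret|password|api_key)", re.IGNORECASE)


def audit_record(run_id: str, user: str, command: str, params: dict) -> str:
    """Build a JSON audit line with run_id and user identity, redacting secrets."""
    safe_params = {
        key: "[REDACTED]" if SECRET_KEY_PATTERN.search(key) else value
        for key, value in params.items()
    }
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "run_id": run_id,
        "user": user,
        "command": command,
        "params": safe_params,
    }
    return json.dumps(record, sort_keys=True)
```

A call such as `audit_record("run-001", "alice", "run_eval", {"model": "m1", "api_token": "abc123"})` yields a line in which the token value is replaced by `[REDACTED]`.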

Code Construction Rules

Follow the interfaces defined in Planner and Orchestrator contracts
Validate all inputs against the evaluation schema before execution
Make operations idempotent per run_id so that retries do not create duplicate evaluations
Log structured outputs with machine-readable fields
Avoid hard-coding secrets or credentials in code
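
Input validation and idempotency per run_id can be combined in one entry point. The following is a minimal sketch: the `REQUIRED_FIELDS` schema, the `EvaluationStore` class, and its `run_once` method are hypothetical, and the in-memory dictionary stands in for the central evaluation store.

```python
# Hypothetical evaluation schema: field name -> required type.
REQUIRED_FIELDS = {"run_id": str, "model": str, "dataset_version": str}


def validate_request(request: dict) -> None:
    """Reject requests that are missing a field or carry the wrong type."""
    for name, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(request.get(name), expected_type):
            raise ValueError(f"field {name!r} must be a {expected_type.__name__}")


class EvaluationStore:
    """In-memory stand-in for the central evaluation store."""

    def __init__(self):
        self._results = {}

    def run_once(self, request: dict, evaluate) -> dict:
        """Run `evaluate` at most once per run_id; repeat calls return the stored result."""
        validate_request(request)
        run_id = request["run_id"]
        if run_id not in self._results:
            self._results[run_id] = evaluate(request)
        return self._results[run_id]
```

With this shape, a retried request for an existing run_id returns the stored result instead of triggering a second evaluation.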

Security and Production Rules

Secrets stored in secure vaults; never in code or logs
Access controls by role; multi-factor authentication for sensitive actions
All deployment changes pass through staged environments and rollback plans
Regular audits of evaluation artifacts and access logs

Testing Checklist

Unit tests for each agent function
Integration tests for end-to-end evaluation runs
End-to-end tests in a staging environment with synthetic data
Performance checks for metric calculation and data transfer
Security and access control verification
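
As an illustration of the first checklist item, the sketch below unit-tests a single agent function with the external service replaced by a mock from Python's standard `unittest.mock` module. The `collect_metrics` function and the `fetch_scores` client method are hypothetical and exist only for this example.

```python
import unittest
from unittest.mock import Mock


def collect_metrics(client, run_id: str) -> dict:
    """Hypothetical runner step: fetch raw scores from an external harness and average them."""
    scores = client.fetch_scores(run_id)
    if not scores:
        raise ValueError(f"no scores returned for {run_id}")
    return {"run_id": run_id, "mean_score": sum(scores) / len(scores)}


class CollectMetricsTest(unittest.TestCase):
    def test_mean_score_uses_mocked_client(self):
        client = Mock()
        client.fetch_scores.return_value = [0.5, 1.0]
        result = collect_metrics(client, "run-001")
        self.assertEqual(result, {"run_id": "run-001", "mean_score": 0.75})
        client.fetch_scores.assert_called_once_with("run-001")

    def test_empty_scores_raise(self):
        client = Mock()
        client.fetch_scores.return_value = []
        with self.assertRaises(ValueError):
            collect_metrics(client, "run-001")
```

Saved in a test module under tests/, these cases can be run offline with `python -m unittest`, since no real service is contacted.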

Common Mistakes to Avoid

Skipping explicit handoffs, leading to context drift
Leaking secrets or credentials into logs or artifacts
Bypassing human review for high-risk evaluations
Unbounded memory growth without purging old run artifacts
Over-parameterizing prompts or agent behavior, causing nondeterminism
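
To keep artifact storage bounded, a retention rule can select which runs are eligible for purging. This is a sketch under assumed defaults: the `select_runs_to_purge` function, the 90-day window, and the keep-latest count are hypothetical, and it only selects run artifacts, leaving audit logs untouched.

```python
from datetime import datetime, timedelta, timezone


def select_runs_to_purge(
    runs: dict[str, datetime],
    now: datetime,
    max_age_days: int = 90,
    keep_latest: int = 5,
) -> list[str]:
    """Return run_ids older than the cutoff, always protecting the newest runs."""
    newest_first = sorted(runs, key=runs.get, reverse=True)
    protected = set(newest_first[:keep_latest])
    cutoff = now - timedelta(days=max_age_days)
    return sorted(
        run_id
        for run_id, finished_at in runs.items()
        if finished_at < cutoff and run_id not in protected
    )
```

The function only returns candidates; the actual deletion would still go through whatever approval and audit steps the project defines.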

FAQ

What is the purpose of this AGENTS.md template for AI evaluation pipelines?

It provides a complete operating manual for designing, orchestrating, and governing AI evaluation workflows with multiple agents.

How should handoffs be orchestrated between Planner, Runner, Evaluator, and Reviewer?

Handoffs follow a strict sequence: Planner defines tasks, Runner executes, Evaluator analyzes results, and Reviewer approves before reporting.

How is memory maintained and what is the source of truth for evaluation results?

All decisions and outputs are stored in a central evaluation store, keyed by run_id. The evaluation_results table, with its stable identifiers, is the source of truth.

What are the security and permission rules for tools and secrets in this workflow?

Secrets are stored in a vault, access is role-based, and no plaintext secrets are logged or committed to code.

How do you validate and deploy evaluation results safely?

Results are validated against data schemas and thresholds, then deployed to staging with a rollback plan in case of failures.

Target User

Use Cases