AGENTS.md Template for Production Debugging Agents
This template provides a copyable operating manual for single-agent and multi-agent production debugging workflows, with tool governance and escalation rules.
Target User
Developers, SREs, and engineering leaders building AI coding agents for production debugging.
Use Cases
- Incident investigation and root cause analysis
- Live debugging in production
- Multi-agent orchestration for debugging workflows
- Tool governance enforcement
- Security-conscious debug sessions
Markdown Template
# AGENTS.md
Project Role: Production Debugging Agent Suite Lead
Agent roster and responsibilities:
- PlannerAgent: Responsible for devising the debugging plan and tasking other agents
- TelemetryAgent: Collects relevant logs, traces, metrics, and events
- ReproAgent: Attempts to reproduce the incident in a safe sandbox
- ScriptRunnerAgent: Executes debugging scripts and queries against observability layers
- ValidationAgent: Validates outcomes against acceptance criteria
- DomainExpertAgent: Provides expert guidance for service-specific diagnostics
- SecurityAuditorAgent: Ensures secrets are not exposed and compliance is maintained
- HumanReviewAgent: Escalates to on-call engineers when needed
Supervisor or orchestrator behavior:
The DebugOps Orchestrator coordinates planning, handoffs, and enforcement of tool governance. It stores context to memory, enforces access controls, and logs actions for audit.
Handoff rules between agents:
- PlannerAgent -> TelemetryAgent and ReproAgent according to plan
- TelemetryAgent + ReproAgent -> ScriptRunnerAgent to run tests and gather evidence
- ScriptRunnerAgent -> ValidationAgent to verify results
- ValidationAgent -> HumanReviewAgent if confidence is low or risk is high
Context, memory, and source-of-truth rules:
- IncidentId, affected service, environment, window of interest, and runbook are stored in a central incident repository
- All outputs must be appended to a single source-of-truth record; agent memory is scoped to the incident and discarded once the incident is closed
- Source of truth is the incident management system and the shared log store
Tool access and permission rules:
- Agents may read logs, traces, and metrics; may execute non-destructive read actions
- ScriptRunnerAgent may execute scripts in a sandbox; no direct production changes without approval
- Secrets must be retrieved from a secret manager; never written to disk unencrypted
Architecture rules:
- All agents run in a sandboxed environment; orchestration via the DebugOps Orchestrator; no direct prod access
- All actions must be idempotent so that retries are safe
File structure rules:
- Use a per-incident workspace with runbooks and evidence stored under incident-/
Data, API, or integration rules when relevant:
- Data collected must be redacted for PII; API calls must be rate-limited and logged
Validation rules:
- Each step must have objective acceptance criteria; if not met, trigger a replan or escalation
Security rules:
- Retrieve secrets only through the secret manager; keep audit logs of every action; enforce access controls; never expose PII
Testing rules:
- Include unit tests for agents and integration tests with simulated incidents
Deployment rules:
- Debug sessions are ephemeral; changes must be validated and not deployed to production code without approvals
Human review and escalation rules:
- If the agent is uncertain or risk is high, escalate to HumanReviewAgent and on-call engineer
Failure handling and rollback rules:
- If an action fails, roll back to the last known good state and re-run from the plan boundary
Things Agents must not do:
- Do not patch production code; do not exfiltrate data; do not bypass approvals; do not modify production state without consent
Overview
This AGENTS.md template defines single-agent and multi-agent production debugging workflows and provides a structured operating manual that supports both isolated agent runs and orchestrated collaboration.
It governs incident investigation, root-cause analysis, live debugging in production, tool governance, memory rules, and human review, enabling clear handoffs and auditable outputs across the debugging lifecycle.
When to Use This AGENTS.md Template
- When you are building AI coding agents to diagnose and fix production incidents in a controlled, auditable way.
- When you need repeatable orchestration patterns for multi-agent debugging across services, hosts, and environments.
- When safety, security, and governance must be baked into incident response and rollbacks.
- When you require explicit handoffs, memory context, and a central source of truth to avoid context drift.
Recommended Agent Operating Model
The model defines roles, responsibilities, decision boundaries, and escalation paths for the production debugging agents. The PlannerAgent proposes plans; the TelemetryAgent, ReproAgent, and ScriptRunnerAgent carry them out; the ValidationAgent checks results against acceptance criteria; and humans intervene through the HumanReviewAgent when confidence is low or risk is high.
Recommended Project Structure
Workflow-specific directory tree:
production-debugging-agents/
├─ orchestrator/
│  └─ planner.py
├─ agents/
│  ├─ planner/
│  │  ├─ plan.py
│  │  └─ interface.py
│  ├─ telemetry-collector/
│  │  └─ collector.py
│  ├─ repro-runner/
│  │  └─ reproduce.py
│  ├─ script-executor/
│  │  └─ executor.py
│  ├─ validator/
│  │  └─ validator.py
│  ├─ domain-expert/
│  │  └─ expert.py
│  └─ security-auditor/
│     └─ audit.py
├─ data/
│  └─ incidents/
├─ runbooks/
├─ tests/
│  └─ integration/
└─ docs/
Core Operating Principles
- Single source of truth per incident and auditable outputs
- Idempotent actions and deterministic planning
- Clear ownership and explicit handoffs
- Least privilege and sandboxed execution
- Time-boxed debugging sessions with defined escalation
- Traceability and versioning of outputs
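The first and last principles can be illustrated with an append-only incident record. This is a minimal sketch, not part of the template: the `IncidentRecord` and `Entry` names, their fields, and the sample incident values are all assumptions made up for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List


@dataclass(frozen=True)
class Entry:
    """One immutable, attributable output in the incident record."""
    version: int
    agent: str
    summary: str
    recorded_at: str


@dataclass
class IncidentRecord:
    """Hypothetical single source of truth for one incident."""
    incident_id: str
    service: str
    environment: str
    closed: bool = False
    entries: List[Entry] = field(default_factory=list)

    def append(self, agent: str, summary: str) -> Entry:
        # Outputs are only appended, never edited, so the history stays auditable.
        if self.closed:
            raise RuntimeError(f"incident {self.incident_id} is closed")
        entry = Entry(
            version=len(self.entries) + 1,
            agent=agent,
            summary=summary,
            recorded_at=datetime.now(timezone.utc).isoformat(),
        )
        self.entries.append(entry)
        return entry

    def close(self) -> None:
        self.closed = True


# Made-up sample values, for illustration only.
record = IncidentRecord("INC-0001", "checkout-api", "production")
record.append("TelemetryAgent", "p99 latency spike observed in the window of interest")
record.append("ReproAgent", "reproduced in sandbox under synthetic load")
```

Each entry carries a monotonically increasing version and the name of the agent that wrote it, which is one simple way to get traceability without allowing in-place edits.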
Agent Handoff and Collaboration Rules
- Planner to TelemetryAgent: share plan, required data sources, success criteria
- TelemetryAgent to ReproAgent: provide logs, traces, and reproduction steps
- ReproAgent to ScriptRunnerAgent: provide reproduction evidence and test inputs
- ScriptRunnerAgent to ValidationAgent: deliver outputs and acceptance criteria
- ValidationAgent to HumanReviewAgent: escalate when confidence is low or risk is high
- DomainExpertAgent: intervenes on service-specific gaps; maintains knowledge base
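The linear part of these handoff rules can be sketched as a small routing function. This is an illustrative sketch under assumptions: the `Handoff` type, the `route` function, and the `CONFIDENCE_FLOOR` value are hypothetical (the template does not define a numeric threshold), and the DomainExpertAgent's ad hoc interventions are not modeled.

```python
from dataclasses import dataclass
from typing import Optional

# Linear handoff chain taken from the rules above.
NEXT_AGENT = {
    "PlannerAgent": "TelemetryAgent",
    "TelemetryAgent": "ReproAgent",
    "ReproAgent": "ScriptRunnerAgent",
    "ScriptRunnerAgent": "ValidationAgent",
}

CONFIDENCE_FLOOR = 0.8  # assumed threshold; the template does not specify one


@dataclass
class Handoff:
    sender: str
    confidence: float  # 0.0 to 1.0, as reported by the sending agent
    high_risk: bool = False


def route(handoff: Handoff) -> Optional[str]:
    """Return the next agent's name, or None when no further handoff is needed."""
    if handoff.sender == "ValidationAgent":
        # Escalate to a human when confidence is low or risk is high.
        needs_human = handoff.high_risk or handoff.confidence < CONFIDENCE_FLOOR
        return "HumanReviewAgent" if needs_human else None
    if handoff.sender not in NEXT_AGENT:
        raise ValueError(f"no handoff rule for {handoff.sender}")
    return NEXT_AGENT[handoff.sender]
```

Keeping the chain in a plain dictionary makes the handoff order easy to audit and to change without touching the escalation logic.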
Tool Governance and Permission Rules
- All tool calls pass through the orchestrator
- Scripts run in sandbox; no direct edits to production config without approval
- Secrets are retrieved from a vault and never written to disk unencrypted
- Production-affecting actions are never auto-approved; they require human sign-off
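One way to enforce these rules is to route every tool call through a single gate that checks for human sign-off and writes an audit entry. The sketch below is hypothetical: `Orchestrator`, `Tool`, `ApprovalRequired`, and the two sample tools are invented names, and the tool bodies are stand-ins that only return strings.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class Tool:
    run: Callable[[str], str]
    affects_production: bool


class ApprovalRequired(Exception):
    """Raised when a production-affecting call has no human sign-off."""


class Orchestrator:
    def __init__(self, tools: Dict[str, Tool]) -> None:
        self._tools = tools
        self.audit_log: List[str] = []

    def call(self, agent: str, tool_name: str, arg: str,
             approved_by: Optional[str] = None) -> str:
        tool = self._tools[tool_name]  # unregistered tools raise KeyError
        if tool.affects_production and approved_by is None:
            self.audit_log.append(f"DENIED {agent} {tool_name}")
            raise ApprovalRequired(tool_name)
        self.audit_log.append(f"ALLOWED {agent} {tool_name} approver={approved_by}")
        return tool.run(arg)


# Stand-in tools; real ones would query observability systems or a sandbox.
tools = {
    "read_logs": Tool(run=lambda q: f"logs matching {q!r}", affects_production=False),
    "restart_service": Tool(run=lambda s: f"restarted {s}", affects_production=True),
}
orchestrator = Orchestrator(tools)
```

Read-only calls such as `orchestrator.call("TelemetryAgent", "read_logs", "error")` pass straight through, while `restart_service` raises `ApprovalRequired` unless `approved_by` names a human approver. Both outcomes land in `audit_log`.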
Code Construction Rules
- Write deterministic, well-typed plans; avoid side effects in planning
- Make all steps idempotent and retry-safe
- Log all actions with incident IDs and agent names
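The idempotency and logging rules can be combined in a small step wrapper. This is a sketch under assumptions: `run_step` and the in-memory `_completed` cache are hypothetical, and a real system would likely persist completed steps in the incident record instead of process memory.

```python
import logging
from typing import Callable, Dict, Tuple

logger = logging.getLogger("debug_agents")

# Completed results keyed by (incident_id, step_id); in-memory for illustration only.
_completed: Dict[Tuple[str, str], str] = {}


def run_step(incident_id: str, agent: str, step_id: str,
             action: Callable[[], str]) -> str:
    """Run `action` at most once per (incident, step); retries return the cached result."""
    key = (incident_id, step_id)
    if key in _completed:
        logger.info("incident=%s agent=%s step=%s status=skipped",
                    incident_id, agent, step_id)
        return _completed[key]
    result = action()  # if this raises, nothing is cached and a retry runs it again
    _completed[key] = result
    logger.info("incident=%s agent=%s step=%s status=done",
                incident_id, agent, step_id)
    return result
```

Calling `run_step` twice with the same incident and step IDs executes the action once; the second call only logs a skip and returns the stored result, which is what makes a blind retry safe.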
Security and Production Rules
- All agents operate in sandboxed environments
- Secrets managed by vault; access must be audited
- Production changes require explicit approval and rollback strategy
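The template also requires that collected data be redacted and that secrets never be exposed. A minimal redaction pass over log lines might look like the sketch below. The `redact` function and its two patterns are illustrative assumptions and are far from exhaustive; real deployments need patterns tuned to their own data.

```python
import re

# Illustrative patterns only: email addresses and key=value style secrets.
_PATTERNS = [
    (re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"), "[REDACTED_EMAIL]"),
    (re.compile(r"(?i)\b(password|token|api_key)=\S+"), r"\1=[REDACTED]"),
]


def redact(line: str) -> str:
    """Mask emails and key=value secrets before a line is stored as evidence."""
    for pattern, replacement in _PATTERNS:
        line = pattern.sub(replacement, line)
    return line
```

Running the redactor before anything is appended to the incident record keeps raw secrets and PII out of the audit trail itself.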
Testing Checklist
- Unit tests for each agent
- Integration tests with simulated incidents
- End-to-end tests of the orchestration pipeline
- Security and access control tests
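An integration test with a simulated incident could be sketched as follows, using Python's standard `unittest` module. The functions `simulated_logs`, `collect_errors`, and `validate` are hypothetical stand-ins for real agents, and the log lines are made up for the test.

```python
import unittest
from typing import List


def simulated_logs() -> List[str]:
    """Return fake log lines for a simulated incident; all values are made up."""
    return [
        "GET /cart 200",
        "GET /cart 500 upstream timeout",
        "GET /cart 500 upstream timeout",
    ]


def collect_errors(logs: List[str]) -> List[str]:
    """Stand-in for a telemetry agent: keep only lines reporting HTTP 500."""
    return [line for line in logs if " 500 " in line]


def validate(errors: List[str], min_errors: int) -> bool:
    """Stand-in for a validation agent: an objective acceptance criterion."""
    return len(errors) >= min_errors


class SimulatedIncidentTest(unittest.TestCase):
    def test_pipeline_flags_error_burst(self) -> None:
        errors = collect_errors(simulated_logs())
        self.assertEqual(len(errors), 2)
        self.assertTrue(validate(errors, min_errors=2))

    def test_quiet_service_fails_validation(self) -> None:
        errors = collect_errors(["GET /cart 200"])
        self.assertFalse(validate(errors, min_errors=1))
```

Saved under `tests/integration/`, a file like this can be run with `python -m unittest`. The second test matters as much as the first: it checks that the pipeline does not report an incident when there is none.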
Common Mistakes to Avoid
- Skipping escalation for high-risk incidents
- Allowing context drift between agents
- Direct production changes without approval
- Inadequate logging and audit trails
FAQ
What is the purpose of this AGENTS.md Template for Production Debugging Agents?
It provides a structured operating manual for single-agent and multi-agent production debugging workflows, covering memory, handoffs, and governance.
How does multi-agent orchestration apply to production debugging in this template?
It defines an orchestrator that coordinates planner, telemetry, repro, executor, validator, and reviewer roles with clear handoffs and escalation paths.
What are the key handoff rules between agents?
Handoffs occur at phase boundaries (plan, collect, reproduce, analyze, validate) and are governed by incident confidence, risk, and required approvals.
How are memory and the source of truth managed?
Context is stored in a central incident repository; runbooks capture outputs; the incident management system is the source of truth.
What security and production constraints does this template enforce?
Secrets are retrieved from a secret manager and never exposed; all actions are sandboxed and auditable; no production change is made without approval.