AGENTS.md Template for Chaos Engineering Reviews

Overview

This AGENTS.md template defines a chaos engineering review workflow with multi-agent orchestration, strict tool governance, and human review. It governs planning, injection, validation, and recovery across AI coding agents, so that resilience is assessed in staging without affecting production.

The template explains how individual agents operate and how they collaborate in a coordinated, multi-agent flow. It provides project-level operating context so you can paste the block into an AGENTS.md file and immediately start documenting roles, rules, and handoffs for chaos experiments.

When to Use This AGENTS.md Template

When you need a formal, repeatable chaos engineering review process that involves multiple agents and human oversight.
When you require clear handoffs between planner, injectors, validators, and reviewer to avoid context drift.
When you must enforce tool governance, secrets handling, and production safety while running simulated outages.
When you want a single source of truth for run plans, results, and rollback procedures.

Copyable AGENTS.md Template

# AGENTS.md

Project: Chaos Engineering Reviews for AI coding agents

Agent roster and responsibilities:
- Planner-Agent: defines experiment plan, success criteria, and orchestrator signals.
- Chaos Engineer-Agent: executes chaos injections in a controlled, isolated environment.
- Monitor-Agent: collects telemetry (metrics, logs, traces) during the experiment.
- Validator-Agent: verifies results against acceptance criteria and flags anomalies.
- Recovery-Agent: performs rollback and reverts to a safe state if needed.
- Reviewer-Agent: conducts formal review of results and documentation; escalates when safeguards fail.

Supervisor or orchestrator behavior:
- The Planner-Agent initiates the run, publishes a plan, and delegates injections to the Chaos Engineer-Agent.
- The Monitor-Agent streams telemetry; Validator-Agent analyzes signals and decides pass/fail.
- If failures occur, the Recovery-Agent triggers rollback; the Planner-Agent coordinates post-run analysis.
- Handoffs occur at plan approval, post-injection, post-validation, and post-rollback.

Context, memory, and source-of-truth rules:
- Context is scoped to a run_id and stored in the workspace; longer-lived memory is kept in a persistent store.
- Source of truth: runbook, telemetry dashboards, and results artifacts.
- Do not overwrite validated results; every decision must reference a stored artifact.

Tool access and permission rules:
- Chaos tooling access (injection, resource manipulation) is gated by RBAC and stage checks.
- Access to metrics and logs is restricted to approved agents.
- Secrets must be handled via a vault; do not log secrets.

Architecture rules:
- Run experiments in isolated namespaces; production systems are never mutated directly.
- Use canary or blue/green approaches for live tests where possible.

File structure rules:
- plans/plan-.md
- runs/run-/
  - logs/
  - results.json
  - artifacts/
- monitors/
- injectors/
- validators/
- recovery/
- docs/

Data, API, or integration rules:
- All data must be stored with run_id; API keys must be stored in vaults and referenced by agents.
- Do not hardcode credentials in code or config.

Validation rules:
- Pass if metrics meet acceptance criteria within the defined window; otherwise fail and trigger rollback.
- Validate that the system returns to a safe state after recovery.

Security rules:
- Enforce least privilege; no direct prod access without explicit approvals.
- Secrets must never be printed or logged.

Testing rules:
- Unit tests for agents; integration tests for the end-to-end chaos flow; smoke tests after deployment to staging.

Deployment rules:
- Run chaos reviews on staging; only promote experiments with approvals; require human sign-off for production-related tests.

Human review and escalation rules:
- Escalate to on-call SRE if safety thresholds are breached or if rollback fails.
- All decisions logged with run_id; use a post-mortem doc if failures occur.

Failure handling and rollback rules:
- Rollback to previous safe state using Recovery-Agent; verify restoration with Validator-Agent before resuming.
- If rollback fails, escalate and halt further injections until manual intervention.

Things Agents must not do:
- Do not mutate production data directly.
- Do not bypass approvals.
- Do not disclose secrets in logs.
- Do not run long-running injections without supervision.

Recommended Agent Operating Model

The agent operating model defines distinct roles with clear decision boundaries and escalation paths. The Planner-Agent acts as the decision-maker for run scope and thresholds; Chaos Engineer-Agent performs injections; Monitor-Agent collects telemetry; Validator-Agent makes go/no-go decisions; Recovery-Agent handles rollback; Reviewer-Agent provides final approvals and documentation checks. Escalation paths go to a human on-call SRE or engineering lead when a safety threshold is breached or when an anomalous result cannot be trusted.

Recommended Project Structure

chaos-engineering-reviews/
├── plans/
│   └── plan-.md
├── runs/
│   └── run-/
│       ├── logs/
│       │   └── *.log
│       ├── results.json
│       └── artifacts/
├── monitors/
├── injectors/
├── validators/
├── recovery/
├── docs/
├── scripts/
└── templates/
    └── matrix.md

Core Operating Principles

Operate only in staging or isolated environments; never mutate production without explicit approval.
Maintain a single source of truth for each run; reference artifacts for decisions.
Limit agent privileges to what is strictly needed for the task.
Keep experiments auditable with clear run IDs and owner mappings.
Prefer safe, reversible changes and automated rollback where possible.
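
The single-source-of-truth principle can be illustrated with a write-once results file. This is a minimal sketch, not part of the template: the function names `write_results_once` and `load_results` are hypothetical, and the caller supplies the path, so nothing here fixes how run directories are named.

```python
import json
from pathlib import Path


def write_results_once(results_path: Path, run_id: str, results: dict) -> None:
    """Store results for a run exactly once (hypothetical helper).

    Opening with mode "x" raises FileExistsError if the file already
    exists, so a stored result cannot be silently overwritten.
    """
    payload = {**results, "run_id": run_id}  # run_id always wins
    results_path.parent.mkdir(parents=True, exist_ok=True)
    with open(results_path, "x", encoding="utf-8") as fh:
        json.dump(payload, fh, indent=2, sort_keys=True)


def load_results(results_path: Path, run_id: str) -> dict:
    """Load a stored artifact and check that it belongs to the given run."""
    with open(results_path, encoding="utf-8") as fh:
        data = json.load(fh)
    if data.get("run_id") != run_id:
        raise ValueError(f"artifact does not belong to run {run_id!r}")
    return data
```

A second write to the same path fails loudly, which forces a corrected result to be stored as a new artifact rather than replacing the old one.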

Agent Handoff and Collaboration Rules

Planner → Chaos Engineer: handoff after plan acceptance and readiness check.
Chaos Engineer → Monitor/Validator: handoff once injection is executed and telemetry is streaming.
Monitor → Validator: handoff when enough signals are collected to decide on success.
Validator → Recovery (if needed): handoff on fail; Recovery ensures rollback and rechecks state.
Recovery → Planner: handoff after rollback verification and post-run analysis.
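
The handoffs listed above can be enforced in code rather than by convention. The sketch below is one possible shape, assuming lowercase role identifiers and a `RunContext` class that are illustrative names only. It rejects any handoff not listed above and requires each handoff to name a stored artifact.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Allowed (from, to) pairs, taken from the handoff rules above.
# The role identifiers are illustrative, not a fixed naming scheme.
ALLOWED_HANDOFFS = {
    ("planner", "chaos_engineer"),
    ("chaos_engineer", "monitor"),
    ("chaos_engineer", "validator"),
    ("monitor", "validator"),
    ("validator", "recovery"),
    ("recovery", "planner"),
}


@dataclass
class RunContext:
    """Tracks which agent currently owns a run (hypothetical class)."""

    run_id: str
    owner: str = "planner"
    history: List[Tuple[str, str, str]] = field(default_factory=list)

    def hand_off(self, to_agent: str, artifact: str) -> None:
        """Transfer ownership, recording the artifact that justifies it."""
        if (self.owner, to_agent) not in ALLOWED_HANDOFFS:
            raise ValueError(f"handoff {self.owner} -> {to_agent} is not allowed")
        if not artifact:
            raise ValueError("every handoff must reference a stored artifact")
        self.history.append((self.owner, to_agent, artifact))
        self.owner = to_agent
```

For example, `RunContext("example-001").hand_off("recovery", "results.json")` raises `ValueError`, because the planner cannot hand off directly to recovery.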

Tool Governance and Permission Rules

Injection tools and resource manipulations require role-based approvals and stage gating.
All tool actions must be logged and replayable for audits.
Secrets must be accessed via vaults; avoid printing or logging secrets.
External services must be accessed through approved endpoints with time-limited credentials.
Production access requires explicit human authorization and business justification.
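
As a rough illustration of role-based approval, stage gating, and audit logging, the sketch below combines the three checks in one function. The permission table, the environment names `staging` and `production`, and the `ToolRequest` fields are assumptions made for the example; a real deployment would delegate these decisions to its own RBAC system.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Illustrative permission table; real role and action names will differ.
ROLE_PERMISSIONS = {
    "chaos_engineer": {"inject_fault"},
    "recovery": {"rollback"},
    "monitor": {"read_metrics", "read_logs"},
    "validator": {"read_metrics", "read_logs"},
}


@dataclass(frozen=True)
class ToolRequest:
    agent_role: str
    action: str
    environment: str
    run_id: str
    human_approved: bool = False


def authorize(request: ToolRequest, audit_log: list) -> bool:
    """Check role and stage gates, and record the decision either way."""
    role_ok = request.action in ROLE_PERMISSIONS.get(request.agent_role, set())
    if request.environment == "staging":
        stage_ok = True
    elif request.environment == "production":
        stage_ok = request.human_approved  # explicit human authorization
    else:
        stage_ok = False  # unknown environments are denied by default
    allowed = role_ok and stage_ok
    audit_log.append({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "run_id": request.run_id,
        "agent_role": request.agent_role,
        "action": request.action,
        "environment": request.environment,
        "allowed": allowed,
    })
    return allowed
```

Denied requests are logged as well as granted ones, so the audit trail shows what was attempted and not only what ran.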

Code Construction Rules

Code representing experiments must be under version control; use descriptive commit messages.
Config and manifests must be templated and parameterized; avoid hardcoded values.
Use idempotent operations for injections and rollbacks to prevent drift.
All code must include tests that cover success, failure, and rollback paths.
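
To make the idempotency rule concrete, the sketch below expresses an injection as "set the desired state" instead of "apply a change". `FakeCluster` is an in-memory stand-in invented for this example and not a real chaos tooling API. The point is only that repeating an injection or a rollback leaves the state unchanged.

```python
class FakeCluster:
    """In-memory stand-in for a staging environment (hypothetical)."""

    def __init__(self) -> None:
        self.latency_ms = {}  # service name -> injected latency in ms


def inject_latency(cluster: FakeCluster, service: str, delay_ms: int) -> bool:
    """Ensure the service has exactly delay_ms of injected latency.

    Returns True only if the call changed anything, so a retry after a
    timeout does not stack a second fault on top of the first.
    """
    if cluster.latency_ms.get(service) == delay_ms:
        return False
    cluster.latency_ms[service] = delay_ms
    return True


def rollback_latency(cluster: FakeCluster, service: str) -> bool:
    """Remove injected latency; safe to call when nothing is injected."""
    return cluster.latency_ms.pop(service, None) is not None
```

Returning whether a change occurred gives tests and audit logs something to check without making a repeated call an error.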

Security and Production Rules

Apply the principle of least privilege to all agents; segregate duties.
No production data access without approval; sanitize data before export.
Audit trails must be preserved for all chaos experiments.
Security reviews are required before any production-related experiment is attempted.

Testing Checklist

Unit tests for each agent with deterministic inputs.
Integration tests for the end-to-end chaos flow in staging.
Smoke tests after deployment to ensure non-disruptive operation.
Validation tests to confirm rollback restores the safe state.
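
A unit test with deterministic inputs might look like the following sketch. It assumes one particular reading of the acceptance rule: every error-rate sample inside the observation window must stay at or below a threshold, and a window with no samples counts as a failure. The `Sample` type, the `validate` function, and the metric itself are illustrative choices.

```python
import unittest
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Sample:
    t: float           # seconds since the injection started
    error_rate: float  # fraction of failed requests, 0.0 to 1.0


def validate(samples: List[Sample], max_error_rate: float, window_s: float) -> str:
    """Return "pass" if all samples in [0, window_s] meet the threshold."""
    in_window = [s for s in samples if 0 <= s.t <= window_s]
    if not in_window:
        return "fail"  # no evidence is treated as failure (assumption)
    ok = all(s.error_rate <= max_error_rate for s in in_window)
    return "pass" if ok else "fail"


class ValidatorTests(unittest.TestCase):
    def test_pass_when_all_samples_within_threshold(self):
        samples = [Sample(1, 0.01), Sample(5, 0.02)]
        self.assertEqual(validate(samples, 0.05, 10), "pass")

    def test_fail_when_any_sample_exceeds_threshold(self):
        samples = [Sample(1, 0.01), Sample(5, 0.20)]
        self.assertEqual(validate(samples, 0.05, 10), "fail")

    def test_fail_when_window_is_empty(self):
        samples = [Sample(30, 0.01)]  # outside the 10-second window
        self.assertEqual(validate(samples, 0.05, 10), "fail")
```

Saved as a module, these tests can be run with `python -m unittest`. A "fail" verdict is what would trigger the rollback path described earlier.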

Common Mistakes to Avoid

Running experiments directly in production without approvals.
Bypassing handoffs or omitting context history.
Leaking secrets or credentials into logs or artifacts.
Running unbounded automation that makes uncontrolled changes.
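
As a backstop against the secret-leak mistake above, a logging filter can redact obvious credential patterns before they reach a handler. This is a sketch only: the regular expression covers a few common `key=value` shapes chosen for illustration, and it does not replace fetching secrets from a vault and keeping them out of log calls in the first place.

```python
import logging
import re

# Illustrative pattern: matches e.g. "api_key=abc", "token: xyz".
SECRET_PATTERN = re.compile(
    r"(?i)\b(token|password|api[_-]?key|secret)\s*[=:]\s*\S+"
)


class RedactSecretsFilter(logging.Filter):
    """Replace secret-looking values in log messages with [REDACTED]."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Render the message first so secrets passed as %-args are covered.
        message = record.getMessage()
        record.msg = SECRET_PATTERN.sub(
            lambda m: f"{m.group(1)}=[REDACTED]", message
        )
        record.args = ()  # message is already fully formatted
        return True  # keep the record, only its text is changed
```

Attaching the filter to a handler with `handler.addFilter(RedactSecretsFilter())` applies it to every record that handler emits, regardless of which logger produced it.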

FAQ

What is the purpose of this AGENTS.md Template for Chaos Engineering Reviews?

It defines a repeatable, auditable chaos engineering review workflow for AI coding agents with multi-agent collaboration and governance.

Who are the agents and what do they do?

Planner-Agent plans, Chaos Engineer-Agent injects faults, Monitor-Agent collects telemetry, Validator-Agent evaluates results, Recovery-Agent rolls back, Reviewer-Agent signs off and documents.

How are handoffs managed between agents?

Handoffs occur at plan approval, post-injection, post-validation, and post-rollback to preserve context and accountability.

What are the security guidelines for chaos experiments?

Run experiments in staging, gate tooling with RBAC, access secrets through a vault, and allow no production mutations without approval and a rollback plan.

What constitutes a successful chaos review?

A review succeeds when all acceptance criteria are met, rollback is verified, artifacts are stored with the run_id, and no anomalies remain unresolved.

Target User

Use Cases