Disaster Recovery Architecture AGENTS.md Template

Overview

The AGENTS.md Template for Disaster Recovery Architecture provides a formal operating manual for AI coding agents to execute, monitor, and govern disaster recovery workflows. It covers single-agent and multi-agent orchestration, including planner, implementer, tester, and domain specialists, with explicit handoffs, memory, and source-of-truth rules.

When to Use This AGENTS.md Template

When you need a repeatable DR procedure that can be executed by AI agents and humans alike.
When you require explicit handoffs and escalation paths between planner, implementer, tester, and reviewers.
When tool governance, secret handling, and secure deployment rules must be enforced.
When documenting the operating model for disaster recovery architecture in a single, copyable template.

Copyable AGENTS.md Template

# AGENTS.md

Project: Disaster Recovery Architecture

Agent roster and responsibilities:
- Planner: designs DR runbooks, defines RTO/RPO, coordinates tasks and handoffs.
- Implementer: executes failover/failback steps, applies configurations, rehydrates services.
- Tester: validates DR outcomes, records results, ensures services meet their SLAs.
- Recovery Specialist: confirms DR readiness, audits recovery state, signs off on restore validity.
- Domain Specialist (Networking/SRE): handles network failover, segmentation, DNS and service graph integrity.

Supervisor or orchestrator behavior:
- Orchestrator assigns work based on the DR runbook, tracks progress, enforces tool governance, and triggers validations.
- Maintains a single source of truth for state, decisions, and evidence.

Handoff rules between agents:
- Planner completes the DR plan and passes tasks to Implementer.
- Implementer completes tasks and passes results to Tester.
- Tester validates and passes to Recovery Specialist for sign-off.
- On failure, the orchestrator triggers a rollback and escalates to human review.

Context, memory, and source-of-truth rules:
- Source of truth: DR catalog, runbooks, monitoring dashboards, and config repos.
- Memory: all decisions and outputs are recorded in a shared, versioned store.
- Do not rely on ephemeral chat messages as sole memory.

Tool access and permission rules:
- Agents may not access production secrets directly; they must retrieve them from a vault through the orchestrator.
- Access to cloud consoles and IaC must be gated by approvals and audit logging.
- Secrets and tokens must be rotated per policy.

Architecture rules:
- DR architecture should be modular, with a declared topology (active-active or active-passive), clear RTO/RPO targets, and defined cutover logic.
- Use idempotent steps and deterministic outcomes wherever possible.

File structure rules:
- disaster-recovery/
  - orchestrator/
  - agents/
    - planner/
    - implementer/
    - tester/
    - recovery-specialist/
    - domain-specialist/
  - configs/
  - data/
  - tests/
  - docs/

Data, API, or integration rules when relevant:
- Inventory and service graphs must be synchronized with the orchestrator.
- DR APIs must be idempotent and follow versioned contracts.

Validation rules:
- Every step must be verifiable by tests and monitoring; results must be logged.
- Validation criteria must be defined in the DR runbook.

Security rules:
- Secrets are stored in vaults; no plaintext secrets in code or logs.
- All changes require approval and audit trail.

Testing rules:
- DR drills must be automated; results must be reportable.
- Regression tests must cover failover, failback, and data integrity checks.

Deployment rules:
- Changes to DR runbooks go through CI; deployments are auditable.
- Rollback procedures must be defined and tested.

Human review and escalation rules:
- If the risk score exceeds the defined threshold, escalate to the on-call lead.
- All exceptions must be reviewed and approved prior to execution.

Failure handling and rollback rules:
- On error, roll back to the last known-good state and verify reachability.
- Ensure data integrity checks pass before resuming services.

Things Agents must not do:
- Do not bypass approvals or execute non-DR tasks during a DR window.
- Do not modify production data without explicit preservation steps.
- Do not mutate runbooks without documentation and supervisor sign-off.

Recommended Agent Operating Model

Roles, responsibilities, decision boundaries, and escalation paths are defined to support reliable DR outcomes. The Planner defines intent and boundaries, while the Implementer executes per plan. The Tester guards quality, and the Recovery Specialist validates the final state. Escalations flow to the Orchestrator and, if needed, to human on-call engineers.

Recommended Project Structure

disaster-recovery-architecture/
├─ orchestrator/
│  └─ orchestrator.py
├─ agents/
│  ├─ planner/
│  │  ├─ planner.py
│  │  └─ plan.md
│  ├─ implementer/
│  │  ├─ implementer.py
│  │  └─ config.yaml
│  ├─ tester/
│  │  ├─ tester.py
│  │  └─ tests/
│  ├─ recovery-specialist/
│  │  └─ validate.py
│  └─ domain-specialist/
│     ├─ networking/
│     └─ databases/
├─ configs/
│  └─ dr-config.yaml
├─ data/
│  └─ inventory.csv
├─ tests/
└─ docs/

Core Operating Principles

Single source of truth for DR state and evidence.
Idempotent, deterministic steps with auditable results.
Explicit memory, context, and source-of-truth; no hidden state in agents.
Clear, testable handoffs and escalation paths.
Secure by default: secrets, keys, and access are governed.
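One way to picture the "single source of truth" and "no hidden state" principles is an append-only, versioned decision log in which each entry is chained to the previous one by hash, so later tampering is detectable. The sketch below is only an illustration: `EvidenceLog`, its field names, and the sample decisions are hypothetical and are not part of the template.

```python
import hashlib
import json
from dataclasses import dataclass, field


def _digest(body: dict) -> str:
    # Stable hash of an entry body (keys sorted so ordering cannot change it).
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()


@dataclass
class EvidenceLog:
    """Hypothetical append-only log; each entry stores the previous entry's hash."""
    entries: list = field(default_factory=list)

    def append(self, agent: str, decision: str) -> dict:
        prev_hash = self.entries[-1]["hash"] if self.entries else "genesis"
        body = {"version": len(self.entries) + 1, "agent": agent,
                "decision": decision, "prev": prev_hash}
        entry = {**body, "hash": _digest(body)}
        self.entries.append(entry)
        return entry

    def verify(self) -> bool:
        """Recompute every hash and confirm the chain is unbroken."""
        prev_hash = "genesis"
        for entry in self.entries:
            body = {k: v for k, v in entry.items() if k != "hash"}
            if entry["prev"] != prev_hash or entry["hash"] != _digest(body):
                return False
            prev_hash = entry["hash"]
        return True


log = EvidenceLog()
log.append("planner", "DR plan approved")          # sample decisions, illustrative only
log.append("implementer", "failover step applied")
print(log.verify())  # True
```

A real deployment would more likely keep this record in a version-controlled repository or a database with audit features; the point here is only that every decision is written down, ordered, and checkable.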

Agent Handoff and Collaboration Rules

Planner to Implementer: hand off plan artifacts, prerequisites, and success criteria.
Implementer to Tester: hand off restored services, data state, and validation results.
Tester to Recovery Specialist: hand off the validation report and readiness status.
Domain Specialist: supports network and data-layer issues as needed.
Orchestrator enforces time-boxed steps and triggers human review when confidence is low.
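The handoff sequence above can be sketched as a small pipeline that runs each agent in order, passes artifacts forward, and stops to escalate on failure or low confidence. This is a minimal sketch under stated assumptions: `StageResult`, `run_pipeline`, and the `0.8` threshold are hypothetical names and values, and time-boxing is left out for brevity.

```python
from dataclasses import dataclass
from typing import Callable

HANDOFF_ORDER = ["planner", "implementer", "tester", "recovery-specialist"]
CONFIDENCE_THRESHOLD = 0.8  # assumed value; the template does not fix a number


@dataclass
class StageResult:
    ok: bool            # did the stage succeed?
    confidence: float   # the agent's own confidence in its result
    artifacts: dict     # outputs handed to the next agent


def run_pipeline(stages: dict[str, Callable[[dict], StageResult]]) -> tuple[str, list[str]]:
    """Run stages in handoff order; escalate on failure or low confidence."""
    context: dict = {}
    trail: list[str] = []
    for name in HANDOFF_ORDER:
        result = stages[name](context)
        trail.append(name)
        if not result.ok:
            trail.append("rollback")  # failure: roll back, then escalate
            return "escalated", trail
        if result.confidence < CONFIDENCE_THRESHOLD:
            return "escalated", trail  # low confidence: human review
        context.update(result.artifacts)  # handoff to the next agent
    return "signed-off", trail


def ok_stage(key: str) -> Callable[[dict], StageResult]:
    return lambda ctx: StageResult(True, 0.95, {key: "done"})


stages = {name: ok_stage(name) for name in HANDOFF_ORDER}
print(run_pipeline(stages))
# ('signed-off', ['planner', 'implementer', 'tester', 'recovery-specialist'])

stages["tester"] = lambda ctx: StageResult(False, 0.0, {})
print(run_pipeline(stages))
# ('escalated', ['planner', 'implementer', 'tester', 'rollback'])
```

Keeping the order in one constant means the orchestrator, not the individual agents, decides who receives work next.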

Tool Governance and Permission Rules

All tool actions require orchestrator-authenticated commands; no direct agent-to-production tool access.
Secrets are retrieved through secure vaults; no secrets are stored in code or logs.
External service calls require approval gates and logging.
All changes are tracked as versioned artifacts, and a rollback path is defined for each.
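As an illustration of these governance rules, the sketch below routes every tool call through an orchestrator-owned gateway: the agent names a secret but never receives its value, unapproved actions are refused, and each attempt is logged without the secret. `ToolGateway`, the in-memory `vault` dictionary, and the action names are all hypothetical stand-ins; a real setup would use an actual secrets manager and approval system.

```python
import datetime


class ToolGateway:
    """Hypothetical orchestrator-side gateway for tool calls."""

    def __init__(self, vault: dict[str, str], approved_actions: set[str]):
        self._vault = vault                # stand-in for a real secrets manager
        self._approved = approved_actions  # stand-in for an approval system
        self.audit_log: list[dict] = []

    def run(self, agent: str, action: str, secret_name: str, tool) -> str:
        allowed = action in self._approved
        # Log the attempt with the secret's name only, never its value.
        self.audit_log.append({
            "time": datetime.datetime.now(datetime.timezone.utc).isoformat(),
            "agent": agent, "action": action,
            "secret": secret_name, "allowed": allowed,
        })
        if not allowed:
            raise PermissionError(f"{action!r} has no approval on record")
        # The secret goes straight to the tool; the agent only sees the result.
        return tool(self._vault[secret_name])


gateway = ToolGateway(vault={"dns-token": "s3cr3t"},
                      approved_actions={"dns-failover"})
result = gateway.run("implementer", "dns-failover", "dns-token",
                     lambda token: f"tool ran with a {len(token)}-character token")
print(result)  # tool ran with a 6-character token
```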

Code Construction Rules

Implementations must follow explicit DR runbook steps, ensure idempotency, and keep operations auditable. Do not invent steps outside the approved DR plan.
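A minimal sketch of what an idempotent, auditable runbook step might look like follows. The step IDs, `APPROVED_STEPS`, and `apply_step` are hypothetical names chosen for illustration; the idea is that re-running a step is a no-op and that steps outside the approved plan are rejected.

```python
from typing import Callable

# Hypothetical runbook: the only step IDs the approved DR plan contains.
APPROVED_STEPS = ("promote-replica", "switch-dns")


def apply_step(state: dict, step_id: str,
               action: Callable[[dict], None], audit: list[str]) -> bool:
    """Apply a runbook step at most once and record what happened."""
    if step_id not in APPROVED_STEPS:
        raise ValueError(f"{step_id!r} is not in the approved DR plan")
    completed = state.setdefault("completed_steps", [])
    if step_id in completed:
        audit.append(f"{step_id}: skipped (already applied)")
        return False
    action(state)
    completed.append(step_id)
    audit.append(f"{step_id}: applied")
    return True


def switch_dns(state: dict) -> None:
    state["dns_target"] = "secondary"


state = {"dns_target": "primary"}
audit: list[str] = []
apply_step(state, "switch-dns", switch_dns, audit)
apply_step(state, "switch-dns", switch_dns, audit)  # second run is a no-op
print(audit)
# ['switch-dns: applied', 'switch-dns: skipped (already applied)']
```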

Security and Production Rules

DR actions must be logged and monitored; alert thresholds must be defined.
Access to production systems requires multi-person approval for major changes.
Data ingress/egress rules must follow data governance policies.
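The multi-person approval rule can be expressed as a simple quorum check. The sketch below assumes two distinct approvers are required and that a requester cannot approve their own change; both are illustrative policy choices, and `has_quorum` is a hypothetical helper rather than part of the template.

```python
def has_quorum(approvals: list[str], requester: str, required: int = 2) -> bool:
    """True when enough distinct people other than the requester approved."""
    distinct = {name for name in approvals if name != requester}
    return len(distinct) >= required


print(has_quorum(["alice", "bob"], requester="carol"))    # True
print(has_quorum(["alice", "alice"], requester="carol"))  # False: one person twice
print(has_quorum(["carol", "alice"], requester="carol"))  # False: self-approval ignored
```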

Testing Checklist

Automated DR drills run on schedule and meet their success criteria.
Data integrity checks pass after failover and failback.
All steps are validated against runbooks and monitoring dashboards.
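One possible form for the data integrity check is to compare an order-independent fingerprint of the data on the primary with the same fingerprint on the recovered copy. The sketch below works on in-memory rows for illustration; `fingerprint` and `integrity_ok` are hypothetical helpers, and a real check would read from the actual data stores.

```python
import hashlib


def fingerprint(rows: list[tuple]) -> str:
    """Order-independent SHA-256 fingerprint of a collection of rows."""
    row_hashes = sorted(hashlib.sha256(repr(row).encode()).hexdigest()
                        for row in rows)
    return hashlib.sha256("".join(row_hashes).encode()).hexdigest()


def integrity_ok(primary_rows: list[tuple], recovered_rows: list[tuple]) -> bool:
    return fingerprint(primary_rows) == fingerprint(recovered_rows)


primary = [(1, "alpha"), (2, "beta"), (3, "gamma")]
recovered = [(3, "gamma"), (1, "alpha"), (2, "beta")]  # same rows, different order
print(integrity_ok(primary, recovered))      # True
print(integrity_ok(primary, recovered[:2]))  # False: a row is missing
```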

Common Mistakes to Avoid

Skipping formal approvals or bypassing runbooks.
Recording decisions inconsistently or keeping hidden state in agents.
Overlooking data integrity during rollback.

FAQ

What is the purpose of this Disaster Recovery AGENTS.md Template?

It provides a formal operating manual for AI coding agents to govern disaster recovery workflows, enabling single-agent and multi-agent orchestration with defined roles, handoffs, and governance.

Who should use this template?

DR engineers, SREs, DevOps teams, and AI product teams who want a structured, auditable, and repeatable DR workflow.

How does multi-agent orchestration work in DR?

An Orchestrator coordinates the Planner, Implementer, Tester, Recovery Specialist, and Domain Specialists, with clear handoffs and context shared through a source-of-truth repository and controlled secret access.

What are the handoff rules between agents?

Handoffs occur sequentially: Planner -> Implementer -> Tester -> Recovery Specialist. If confidence falls below the defined threshold, the work is escalated to human review. All steps must be validated.

What are the top security and governance rules in this template?

Secrets must be retrieved via orchestrator governance, production changes require approvals, and no agent overrides runbooks or bypasses tests.
