AGENTS.md Template for Data Center Failover Strategy

Overview

This AGENTS.md template defines an operating manual for AI coding agents that manage data center failover workflows. It supports both single-agent execution and multi-agent orchestration across primary and backup sites.

What it governs: an orchestrated pattern in which a planner coordinates an executor, verifier, observer, and auditor through explicit handoffs and a shared memory and context model. It enforces tool governance, a single source of truth, and escalation to human operators when needed.

When to Use This AGENTS.md Template

Use when you need a reproducible, auditable data center failover workflow that can be executed by AI coding agents.
Use when you require multi-agent orchestration with explicit handoff rules and supervision.
Use when you want a single source of truth for architecture, memory, and tool access across the failover pipeline.

Copyable AGENTS.md Template

# AGENTS.md
Project role: Data Center Failover Bot Master
Agent roster and responsibilities:
- planner: defines the failover plan considering site status, replication lag, and RPO/RTO
- executor: triggers actions (DNS switch, LB reconfig, VM failover)
- verifier: validates post-failover state
- observer: monitors telemetry and alerts on drift
- auditor: logs decisions and enforces compliance
Supervisor or orchestrator behavior:
- The orchestrator sequences steps, maintains the canonical memory, and routes handoffs
- Pauses for human approval when needed
Handoff rules between agents:
- Planner -> Executor: plan delivery
- Executor -> Verifier: actions completed
- Verifier -> Auditor: verification results
Context, memory, and source-of-truth rules:
- Memory is per failover run; sources are runbook, telemetry dashboards, and inventory records
- Source of truth is the canonical runbook and live telemetry
Tool access and permission rules:
- Access to orchestration APIs, DNS, load balancers; secrets via vault
- Actions restricted to approved endpoints
Architecture rules:
- Stateless executors with a shared memory module for the run
File structure rules:
- Separate agents, configs, docs; no hard-coded secrets
Data, API, or integration rules:
- Prefer idempotent calls; use retries with backoff
Validation rules:
- Validate end state matches the runbook
Security rules:
- Secrets are never logged; production actions require approval gates
Testing rules:
- Unit tests for plan generation; integration tests for sequences
Deployment rules:
- Rollback procedures; revert DNS/LB changes safely
Human review and escalation rules:
- Humans can approve critical steps; escalate on policy violations
Failure handling and rollback rules:
- Revert to previous known-good state; snapshot critical data if needed
Things Agents must not do:
- Do not perform destructive actions without explicit approval

Recommended Agent Operating Model

The agent roles cooperate under a central Planner that defines the failover strategy and a finite state machine that determines handoff points. Decision boundaries are explicit: planners generate plans, executors perform actions, verifiers check outcomes, and human operators can intervene at escalation points. Escalation paths apply when telemetry indicates data divergence or an unsafe state.
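One way to picture the finite state machine described above is the sketch below. The stage names, the `TRANSITIONS` table, and the `FailoverRun` class are illustrative assumptions rather than part of the template. In particular, what happens after an escalation (return to planning or close the run) is a guess at a reasonable policy.

```python
from enum import Enum


class Stage(Enum):
    PLANNING = "planning"
    EXECUTING = "executing"
    VERIFYING = "verifying"
    AUDITING = "auditing"
    ESCALATED = "escalated"  # waiting on a human operator
    DONE = "done"


# Allowed handoffs (assumed); any transition not listed here is rejected.
TRANSITIONS = {
    Stage.PLANNING: {Stage.EXECUTING, Stage.ESCALATED},
    Stage.EXECUTING: {Stage.VERIFYING, Stage.ESCALATED},
    Stage.VERIFYING: {Stage.AUDITING, Stage.ESCALATED},
    Stage.AUDITING: {Stage.DONE, Stage.ESCALATED},
    Stage.ESCALATED: {Stage.PLANNING, Stage.DONE},
    Stage.DONE: set(),
}


class FailoverRun:
    """Tracks which stage a single failover run is in."""

    def __init__(self, run_id: str) -> None:
        self.run_id = run_id
        self.stage = Stage.PLANNING
        self.history = [Stage.PLANNING]

    def advance(self, target: Stage) -> None:
        """Move to the target stage, or raise if the handoff is not allowed."""
        if target not in TRANSITIONS[self.stage]:
            raise ValueError(
                f"illegal handoff: {self.stage.value} -> {target.value}"
            )
        self.stage = target
        self.history.append(target)
```

Keeping the allowed handoffs in one table means an attempt to skip verification, for example going from `EXECUTING` straight to `DONE`, fails loudly instead of silently succeeding.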

Recommended Project Structure

dc-failover/
├── agents/
│   ├── planner/
│   │   └── plan.yaml
│   ├── executor/
│   │   └── executor.py
│   ├── verifier/
│   │   └── verifier.py
│   ├── observer/
│   │   └── observer.py
│   └── auditor/
│       └── auditor.py
├── configs/
│   └── failover-runbook.yaml
├── docs/
│   └── AGENTS.md
├── tests/
│   └── test_failover.py
└── src/
    └── libs/

Core Operating Principles

Operate with a single source of truth per failover run.
Make all actions idempotent and replayable.
Keep decisions deterministic and auditable with clear logs.
Isolate memory to the current run; avoid cross-run leakage.
Require human review for high-risk or policy-violating steps.
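The idempotency and run-isolation principles above could be approximated as follows. This is a minimal sketch: `RunMemory`, `run_step`, and the step-key scheme are hypothetical names chosen for illustration, and a real system would persist this state rather than hold it in process memory.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class RunMemory:
    """State scoped to one failover run; each new run gets a fresh instance."""

    run_id: str
    completed: Dict[str, str] = field(default_factory=dict)  # step key -> result
    log: List[Tuple[str, str, str]] = field(default_factory=list)

    def run_step(self, key: str, action: Callable[[], str]) -> str:
        """Run the action once per key; replaying a key returns the stored result."""
        if key in self.completed:
            self.log.append((self.run_id, key, "skipped"))
            return self.completed[key]
        result = action()
        self.completed[key] = result
        self.log.append((self.run_id, key, "executed"))
        return result
```

Replaying the same step key within a run returns the recorded result without calling the action again, while a second `RunMemory` instance starts empty, so nothing leaks between runs.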

Agent Handoff and Collaboration Rules

The Planner hands off to the Executor only after the plan is validated. The Executor hands off to the Verifier after its actions complete. The Verifier hands off to the Auditor with the full result, and the Auditor escalates to human review if issues arise. Domain specialists provide input where needed, and Researchers may supply telemetry hypotheses during runbook tuning.

Tool Governance and Permission Rules

Only approved tools may be invoked. DNS and LB changes require orchestrator authorization; secrets must be retrieved from a vault. All API calls must be logged with run identifiers and session IDs. No production changes without approval gates.
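The governance rules above might be enforced by a thin wrapper around every tool call. In this sketch the tool names, the `invoke_tool` function, the `handlers` mapping, and the `authorized_by` parameter are all assumptions made for illustration. Secret retrieval from a vault is left out.

```python
import logging
from typing import Callable, Dict, Optional

logger = logging.getLogger("failover.tools")

# Hypothetical tool names; a real deployment would load these from config.
APPROVED_TOOLS = {"dns.switch", "lb.reconfigure", "vm.failover"}
REQUIRES_AUTHORIZATION = {"dns.switch", "lb.reconfigure"}


class ToolDenied(Exception):
    """Raised when a call violates the governance rules."""


def invoke_tool(
    tool: str,
    run_id: str,
    session_id: str,
    handlers: Dict[str, Callable[[], str]],
    authorized_by: Optional[str] = None,
) -> str:
    # Log every attempt, including ones that are later denied.
    logger.info(
        "tool=%s run_id=%s session_id=%s authorized_by=%s",
        tool, run_id, session_id, authorized_by,
    )
    if tool not in APPROVED_TOOLS:
        raise ToolDenied(f"{tool} is not an approved tool")
    if tool in REQUIRES_AUTHORIZATION and authorized_by is None:
        raise ToolDenied(f"{tool} requires orchestrator authorization")
    return handlers[tool]()
```

Logging before the checks means denied attempts also appear in the log with their run and session identifiers.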

Code Construction Rules

Write modular, idempotent code; avoid hard-coded endpoints; respect runbook constraints; use the canonical data sources; validate inputs and outputs; include retry and circuit-breaker logic.
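The retry and circuit-breaker logic mentioned above could look roughly like this. It is a simplified sketch: `CircuitBreaker`, `retry_with_backoff`, and their default values are assumptions. This breaker only counts consecutive failures and has no timed reset (half-open state), which a production breaker would normally add.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class CircuitOpen(Exception):
    """Raised when the breaker refuses further calls."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3) -> None:
        self.failure_threshold = failure_threshold
        self.failures = 0  # consecutive failures seen so far

    @property
    def is_open(self) -> bool:
        return self.failures >= self.failure_threshold

    def call(self, fn: Callable[[], T]) -> T:
        if self.is_open:
            raise CircuitOpen("too many consecutive failures")
        try:
            result = fn()
        except Exception:
            self.failures += 1
            raise
        self.failures = 0  # any success resets the count
        return result


def retry_with_backoff(
    fn: Callable[[], T],
    breaker: CircuitBreaker,
    attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn through the breaker, doubling the delay after each failure."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return breaker.call(fn)
        except CircuitOpen:
            raise  # do not retry once the breaker has opened
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError("unreachable")
```

Retries are only safe here because the wrapped calls are expected to be idempotent. The `sleep` parameter is injectable so tests can run without real delays.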

Security and Production Rules

Encrypt secrets at rest and in transit; apply access control lists per role; keep audit logs for all failover actions; require deployments to pass security checks before production. Production changes must be reversible and logged for traceability.
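Because every failover action is audit-logged while the template also requires that secrets never be logged, a redaction step in the logging path is one possible safeguard. The sketch below uses Python's standard `logging.Filter`. The `RedactSecrets` class and the key names in `SECRET_PATTERN` are assumptions, and pattern matching of this kind is a backstop rather than a guarantee.

```python
import logging
import re

# Assumed key names; extend to match whatever the real tools emit.
SECRET_PATTERN = re.compile(r"(?i)\b(token|password|secret|api_key)=\S+")


class RedactSecrets(logging.Filter):
    """Replace key=value secrets in log messages before they are emitted."""

    def filter(self, record: logging.LogRecord) -> bool:
        message = record.getMessage()  # apply %-style args first
        record.msg = SECRET_PATTERN.sub(
            lambda m: f"{m.group(1)}=[REDACTED]", message
        )
        record.args = None  # message is already fully formatted
        return True  # keep the record, just with the secret removed
```

A filter like this would typically be attached with `handler.addFilter(RedactSecrets())` on each handler that writes audit logs.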

Testing Checklist

Unit tests for strategies and helper components.
Integration tests for end-to-end failover sequences.
Smoke tests in staging mirroring production topology.
Rollback tests to verify safe revert paths.

Common Mistakes to Avoid

Assuming telemetry is always accurate; implement verification checks.
Missing explicit handoffs between agents.
Skipping human review on high-risk steps.

FAQ

What is the purpose of this AGENTS.md Template for Data Center Failover Strategy?

It defines the operating manual for AI coding agents coordinating a data center failover workflow, including multi-agent handoffs and governance.

How does multi-agent orchestration work in this template?

A Planner generates a failover plan, Executors carry it out, a Verifier validates the result, and Observers monitor telemetry throughout. Handoff points ensure clean transitions and auditable traces.

How are handoffs enforced between agents?

Handoffs occur only at predefined boundaries; the orchestrator routes plan, actions, and verification results to the next agent and requires confirmations before proceeding.

How is success validated and what happens on failure?

Success is validated against the runbook and telemetry; on failure, rollback procedures are invoked and human review may be triggered.
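As a rough illustration of that check, the sketch below compares an expected end state taken from the runbook with the state observed in telemetry. The function names, the flat key-value representation of state, and the example keys are hypothetical. Real runbooks and telemetry would need their own parsing.

```python
from typing import Dict, List


def validate_end_state(
    expected: Dict[str, str], observed: Dict[str, str]
) -> List[str]:
    """Return one message per runbook expectation that telemetry does not meet."""
    mismatches = []
    for key, want in expected.items():
        got = observed.get(key)  # a missing key counts as a mismatch
        if got != want:
            mismatches.append(f"{key}: expected {want!r}, observed {got!r}")
    return mismatches


def decide(mismatches: List[str]) -> str:
    """Choose the next action: any mismatch triggers rollback."""
    return "rollback" if mismatches else "success"
```

For example, with `expected = {"dns_target": "backup-site"}` and telemetry still reporting the primary site, `validate_end_state` returns one mismatch and `decide` returns `"rollback"`.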

What security practices does this template enforce?

Secrets are never logged and are retrieved only from a vault, production actions require approval gates, and all failover actions are recorded in audit logs.

Target User

Use Cases