
AGENTS.md Template for Observability Architecture

A copyable AGENTS.md template for observability architecture, covering AI coding agents, multi-agent orchestration, and tool governance.


Target User

Developers, platform teams, SREs, and engineering leaders designing observability architecture with AI agents

Use Cases

  • Define agent roles for observability architecture
  • Coordinate multi-agent workflows for telemetry collection, correlation, alerting, and remediation
  • Establish governance and handoffs between agents and human reviewers

Overview

This AGENTS.md template is an operating manual for an observability architecture workflow run by AI coding agents. It covers both single-agent execution and multi-agent orchestration, and it defines roles, handoffs, memory, tool governance, security, and escalation. Teams can paste it into an observability project to align automation, data collection, correlation, alerting, and remediation across services.

Direct answer: This template formalizes a reproducible, auditable workflow for AI-driven observability, including agent roles, handoffs, memory, and governance, enabling reliable cross-agent collaboration and safe production usage.

When to Use This AGENTS.md Template

  • When designing an AI-driven observability pipeline with telemetry sources (traces, metrics, logs) and automated remediation paths.
  • When you need explicit agent rosters, handoff rules, and source-of-truth governance to prevent context drift.
  • When multi-agent orchestration is required to correlate, analyze, alert, and remediate without unsafe automation.
  • When you want a copyable, project-level context document that new team members can adopt quickly.

Copyable AGENTS.md Template

Copy the block below into your AGENTS.md and customize it for your observability stack.

# AGENTS.md

Project role
- Observability Architect leading AI-driven telemetry collection, correlation, alerting, and remediation in production.

Agent roster and responsibilities
- PlannerAgent: defines the observability workflow, sources, run cadence, and handoff points.
- TelemetryCollectorAgent: collects traces, metrics, and logs from configured sources.
- CorrelationAgent: correlates telemetry across services, namespaces, and environments.
- AnomalyDetectorAgent: detects drift or anomalies in traces, metrics, and logs.
- AlertRouterAgent: routes alerts to on-call tooling and human reviewers.
- RemediationAgent: triggers safe remediation actions (e.g., non-destructive flags, feature toggles, rollback hooks).
- ReviewerAgent: validates outputs and ensures governance policies are followed.
- DataValidationAgent: verifies data quality, schema conformance, and lineage.
- DomainSpecialistAgent: provides domain-specific rules and SLIs/SLOs for the service under observation.

Supervisor or orchestrator behavior
- Orchestrator maintains global memory, ensures idempotence, and enforces run-id-scoped state across agents.
- Enforces dependency order, backoff, and retry limits; halts on policy violation or critical errors.
- Logs all decisions and data access to a centralized memory store for traceability.

Handoff rules between agents
- PlannerAgent defines next tasks and passes context (run-id, sources, and goals) to TelemetryCollectorAgent.
- TelemetryCollectorAgent gathers data and forwards to CorrelationAgent with lineage information.
- CorrelationAgent passes context to AnomalyDetectorAgent and DataValidationAgent.
- If anomalies are detected, AlertRouterAgent is engaged; RemediationAgent may be invoked if approved.
- All outputs are reviewed by ReviewerAgent before final deployment or alert changes.

Context, memory, and source-of-truth rules
- Central memory store holds run state keyed by run-id; data producers are the sources of truth for telemetry.
- Outputs must cite data sources; lineage must be preserved across handoffs.
- Confidential data is never written to the shared memory in plaintext.

Tool access and permission rules
- Agents have read access to telemetry sources; write access only to memory and approved outputs.
- Secrets are retrieved from a secure vault; no plaintext secrets in code or memory.
- API calls must pass through the orchestrator with explicit approvals for production actions.

Architecture rules
- Design is modular, event-driven, and idempotent; components communicate via well-defined interfaces.
- All actions are auditable and replayable; avoid side-effects without explicit approval.

File structure rules
- All agent code lives under agents/ with one folder per agent role; avoid duplicating logic across agents.
- Shared utilities go under common/; configuration lives under configs/.

Data, API, or integration rules
- Use OpenTelemetry-compatible collectors; standardize data formats and schemas for telemetry.
- All API usage must adhere to rate limits and authentication requirements; never bypass auth.

Validation rules
- Validate schema conformance for all telemetry data; validate the end-to-end workflow with run-id-level checks.
- Assert data freshness and completeness before triggering downstream actions.

Security rules
- Secrets managed in vault; no plaintext credentials in repo or memory.
- Access governed by least privilege; production actions require explicit approval gates.

Testing rules
- Unit tests for each agent; integration tests for orchestrator and end-to-end flow.
- Test failure modes, retries, and rollback paths.

Deployment rules
- Automated deployment with canary strategy; roll back if telemetry or remediation paths regress.
- Observability checks must pass before promoting to production.

Human review and escalation rules
- Any anomaly above predefined thresholds triggers human review and escalation workflow.
- All escalations logged with run-id and context.

Failure handling and rollback rules
- On agent failure, re-run with a clean run-id or roll back to the last good state.
- Preserve artifacts and logs for audit.

Things Agents must not do
- Do not access secrets directly; do not perform destructive actions without approval.
- Do not bypass orchestrator or governance gates; do not drift away from the defined workflow.
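
The handoff and validation rules in the template can be sketched in code. The Python below is an illustration that sits outside the template itself: `HandoffContext`, `hand_off`, and `assert_fresh` are hypothetical names, the field layout is an assumption, and the five-minute freshness window is an arbitrary example value.

```python
from dataclasses import dataclass, replace
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class HandoffContext:
    """Context passed between agents; all field names are illustrative."""
    run_id: str
    sources: tuple[str, ...]
    goals: tuple[str, ...]
    collected_at: datetime
    lineage: tuple[str, ...] = ()  # ordered record of agents that handled the data


def hand_off(ctx: HandoffContext, from_agent: str) -> HandoffContext:
    """Return a new context with the sending agent appended to the lineage."""
    return replace(ctx, lineage=ctx.lineage + (from_agent,))


def assert_fresh(ctx: HandoffContext, now: datetime, max_age: timedelta) -> None:
    """Raise if the telemetry is older than max_age (assumed freshness rule)."""
    age = now - ctx.collected_at
    if age > max_age:
        raise ValueError(f"run {ctx.run_id}: telemetry is stale ({age} old)")


# Example: Planner -> Collector -> Correlator, with a freshness check
# before any downstream action.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
ctx = HandoffContext(
    run_id="run-001",
    sources=("otel-collector",),
    goals=("detect latency drift",),
    collected_at=now - timedelta(minutes=2),
)
ctx = hand_off(ctx, "PlannerAgent")
ctx = hand_off(ctx, "TelemetryCollectorAgent")
assert_fresh(ctx, now, max_age=timedelta(minutes=5))
```

The context is immutable, so each handoff produces a new object and the lineage can only grow. That is one way to keep lineage intact across handoffs; a real store would also persist each version.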

Recommended Agent Operating Model

  • Roles and responsibilities: Planner coordinates, Collectors gather data, Correlators unify events, Analyzers detect anomalies, Reviewers validate, Remediation acts, DomainExperts tailor rules.
  • Decision boundaries: Only the Planner may modify workflow; domain rules enforced by DomainSpecialistAgent.
  • Escalation paths: anomalies escalate to on-call and human reviewer; security incidents escalate to incident response.

Recommended Project Structure

observability-agents/
├── agents/
│   ├── planner/
│   ├── collector/
│   ├── correlator/
│   ├── anomaly_detector/
│   ├── alerting/
│   ├── remediation/
│   ├── reviewer/
│   ├── data_validation/
│   └── domain_specialist/
├── common/
├── orchestrator/
├── memory/
├── sources/
├── integrations/
├── configs/
├── docs/
└── tests/

Core Operating Principles

  • Clarity: every agent has a finite, auditable responsibility.
  • Safety: strict gating for production changes; no circumvention of quotas or approvals.
  • Determinism: idempotent tasks and reproducible artifacts.
  • Traceability: full run-id history with data provenance.
  • Minimal surface area: least privilege for all tools and services.

Agent Handoff and Collaboration Rules

  • Planner → Collector: pass run-id, sources, goals, and schema.
  • Collector → Correlator: pass collected traces, metrics, and logs with lineage.
  • Correlator → AnomalyDetector: pass context and thresholding rules.
  • AnomalyDetector → Reviewer/AlertRouter: route findings and justify decisions.
  • Reviewer → Orchestrator: approve outputs or request changes.
  • DomainSpecialistAgent: inject domain-specific SLOs and thresholds before final actions.

Tool Governance and Permission Rules

  • Command execution requires orchestrator approval for production actions.
  • File edits are restricted to agents with write permission; no direct filesystem edits by data-gathering agents.
  • API calls are audited and limited by tokens with expiry; secrets are stored in vaults.
  • Production services can be touched only through approved endpoints and change management gates.
  • All changes require a prior run-id and post-action validation.

Code Construction Rules

  • Write idempotent functions with deterministic outputs; avoid hard-coded values without configuration.
  • Standardize all data formats; use schemas for telemetry and events.
  • Log and audit every decision; never log secrets.

Security and Production Rules

  • Secrets are managed in a vault; rotate them regularly and segregate duties.
  • Production workflows require approval gates and incident-response readiness.
  • No cross-project secrets; isolation boundaries are enforced.

Testing Checklist

  • Unit tests for each agent; integration tests for orchestration flow; end-to-end tests for telemetry path.
  • Regression tests for handoffs and state persistence; failover tests for node or service outages.
  • Security tests for access controls and secret handling.

Common Mistakes to Avoid

  • Skipping governance gates or bypassing the orchestrator.
  • Duplicating logic across agents, which lets the single source of truth drift.
  • Ignoring memory management and run-id scoping.
  • Exposing secrets in logs or in code.

FAQ

What is the purpose of this AGENTS.md Template for Observability Architecture?

It provides a copyable, project-level operating manual to govern AI coding agents in an observability workflow, including roles, handoffs, memory, tool governance, and escalation paths.

How does multi-agent orchestration coordinate data collection and correlation?

The orchestrator defines the run flow, coordinates task handoffs, preserves context in a central memory store, and enforces gates before transitioning between agents.
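
As a rough illustration of that flow, the sketch below runs agent steps in a fixed order, writes each step's context to a store keyed by run-id, and checks a gate before moving on. Everything here is hypothetical: `run_pipeline`, `PolicyViolation`, the `Step` and `Gate` aliases, and the plain dict standing in for the central memory store are assumptions made for the example, not a prescribed design.

```python
from typing import Callable

# An agent step takes the run context and returns an updated context.
Step = Callable[[dict], dict]
# A gate inspects the context and returns True when the handoff may proceed.
Gate = Callable[[dict], bool]


class PolicyViolation(Exception):
    """Raised when a gate refuses a handoff."""


def run_pipeline(run_id: str, steps: list[tuple[str, Step, Gate]], memory: dict) -> dict:
    """Run steps in order, checking each step's gate before the next handoff."""
    ctx: dict = {"run_id": run_id, "lineage": []}
    for name, step, gate in steps:
        ctx = step(dict(ctx))                     # each agent works on a copy
        ctx["lineage"] = ctx["lineage"] + [name]  # record who handled the context
        memory[f"{run_id}:{name}"] = ctx          # preserve context per run-id and agent
        if not gate(ctx):
            raise PolicyViolation(f"{name} failed its gate for run {run_id}")
    return ctx


# Example: a collector followed by a threshold-based detector.
def collect(ctx: dict) -> dict:
    return {**ctx, "latency_ms": [120, 130, 900]}


def detect(ctx: dict) -> dict:
    return {**ctx, "anomalies": [v for v in ctx["latency_ms"] if v > 500]}


store: dict = {}
result = run_pipeline(
    "run-001",
    [
        ("collector", collect, lambda ctx: bool(ctx.get("latency_ms"))),
        ("detector", detect, lambda ctx: True),
    ],
    store,
)
```

Raising on a failed gate halts the run at the first violation, which matches the "halt on policy violation" behavior the template asks of the orchestrator.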

How are memory and the source of truth managed?

Telemetry sources are trusted sources of truth; memory is scoped by run-id, and outputs reference source provenance. Secrets never reside in memory as plaintext.
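
A minimal sketch of run-id-scoped memory with provenance follows. `RunMemory` is a hypothetical in-memory stand-in for a real store, and the key-name check for sensitive values is an assumed convention chosen only to make the "no plaintext secrets" rule concrete.

```python
class RunMemory:
    """In-memory stand-in for a central store, scoped by run-id."""

    # Assumed convention: keys containing these words hold sensitive values.
    SENSITIVE_MARKERS = ("secret", "password", "token")

    def __init__(self) -> None:
        self._runs: dict[str, dict[str, dict]] = {}

    def write(self, run_id: str, key: str, value: object, source: str) -> None:
        """Store a value with its provenance; refuse keys that look sensitive."""
        if any(marker in key.lower() for marker in self.SENSITIVE_MARKERS):
            raise ValueError(f"refusing to store sensitive key {key!r} in plaintext")
        self._runs.setdefault(run_id, {})[key] = {"value": value, "source": source}

    def read(self, run_id: str, key: str) -> dict:
        """Return the entry for this run only; other runs' data is not visible."""
        return self._runs[run_id][key]


memory = RunMemory()
memory.write("run-001", "p99_latency_ms", 840, source="otel-collector/payments")
entry = memory.read("run-001", "p99_latency_ms")
```

Every entry carries its `source`, so any output built from memory can cite where the data came from.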

What are the security guidelines for secrets and production access?

Secrets live in a vault; agents have least-privilege access; production actions require explicit approvals and audit trails.
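
One way to picture the approval requirement is a gate that blocks production actions until a reviewer has approved them and records every decision. `ApprovalGate` and its method names are hypothetical, and the environment string `"production"` is an assumed convention.

```python
from dataclasses import dataclass, field


@dataclass
class ApprovalGate:
    """Tracks approvals and keeps an audit trail; names are illustrative."""
    approved: set[tuple[str, str]] = field(default_factory=set)
    audit_log: list[dict] = field(default_factory=list)

    def approve(self, run_id: str, action: str, reviewer: str) -> None:
        """Record that a human reviewer approved this action for this run."""
        self.approved.add((run_id, action))
        self.audit_log.append(
            {"event": "approved", "run_id": run_id, "action": action, "by": reviewer}
        )

    def execute(self, run_id: str, action: str, environment: str) -> bool:
        """Allow non-production actions; require prior approval in production."""
        allowed = environment != "production" or (run_id, action) in self.approved
        self.audit_log.append(
            {"event": "executed" if allowed else "blocked",
             "run_id": run_id, "action": action, "environment": environment}
        )
        return allowed


gate = ApprovalGate()
first_try = gate.execute("run-001", "rollback payments", "production")   # blocked
gate.approve("run-001", "rollback payments", reviewer="on-call SRE")
second_try = gate.execute("run-001", "rollback payments", "production")  # allowed
```

Approvals are keyed by run-id and action, so an approval granted for one run does not carry over to another.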

How are failures handled and rollbacks executed?

Failures trigger retries with backoff, state rollback to last good run-id, and escalation to human reviewers when needed.
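
The retry-then-rollback behavior might look like the sketch below. `run_with_retries` is a hypothetical helper; the attempt limit, the base delay, and the choice to return the last good state together with an escalation flag are assumptions made for the example.

```python
import time
from typing import Callable


def run_with_retries(
    task: Callable[[], dict],
    last_good: dict,
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> tuple[dict, bool]:
    """Run task with exponential backoff between failed attempts.

    Returns (state, escalate). On success, state is the task result and
    escalate is False. After max_attempts failures, state is last_good
    and escalate is True so a human reviewer can be notified.
    """
    for attempt in range(max_attempts):
        try:
            return task(), False
        except Exception:
            if attempt < max_attempts - 1:
                sleep(base_delay * 2 ** attempt)  # delay doubles after each failure
    return last_good, True


# Example: a collector that fails twice before succeeding.
attempts = {"count": 0}


def flaky_collect() -> dict:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("telemetry source unavailable")
    return {"run_id": "run-002", "status": "ok"}


delays: list[float] = []  # capture delays instead of actually sleeping
state, escalate = run_with_retries(
    flaky_collect,
    last_good={"run_id": "run-001", "status": "ok"},
    sleep=delays.append,
)
```

Passing `sleep` as a parameter keeps the helper testable without real waits; production code would leave the default in place.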
