AGENTS.md Template for Google Cloud Production System Design

Overview

This AGENTS.md template codifies a production system design workflow for Google Cloud, covering single-agent and multi-agent orchestration with clear roles, handoffs, and governance.

This page provides a copyable AGENTS.md template that you can paste into a project so teams can operate a design, implementation, review, testing, and deployment loop for AI coding agents on Google Cloud.

When to Use This AGENTS.md Template

Standardize production system design for AI coding agents on Google Cloud across teams.
Document orchestration patterns for multi-agent coordination using GCP services such as Cloud Run, GKE, Cloud Functions, Pub/Sub, and Cloud Build.
Serve as an auditable playbook for architecture decisions, tool governance, and handoffs.
Provide a canonical context for humans and agents to operate within defined boundaries and safety rules.

Copyable AGENTS.md Template

# AGENTS.md
Project roles: Cloud Infra Architect, DevOps Engineer, Cloud Security Lead, SRE, Data Engineer
Agent roster and responsibilities:
- Planner Agent: defines the design and orchestration plan and aligns with Google Cloud security baselines
- Implementer Agent: codifies infrastructure as code using Terraform and configures Google Cloud resources in prod-safe patterns
- Reviewer Agent: reviews IaC for correctness, security baselines, and policy conformance
- Tester Agent: runs unit and integration tests against the design and deployment pipeline
- Researcher Agent: gathers official Google Cloud docs and reference architectures
- Domain Specialist Agent: ensures data governance, privacy, and regulatory compliance
- Operator Agent: monitors prod deployments and triggers runbooks

Supervisor or orchestrator behavior:
- The Orchestrator Agent coordinates tasks, enforces memory rules, and maintains the source of truth

Handoff rules between agents:
- Planner to Implementer when design is approved
- Implementer to Reviewer after IaC draft
- Reviewer to Tester for integration validation
- Domain Specialist to Reviewer for security and compliance review
- Researcher to Planner for updated guidance
- Operator coordinates runbooks and production readiness with all agents

Context, memory, and source-of-truth rules:
- Use a single source of truth in a remote Git repository with remote Terraform state
- Memory is scoped to an orchestration run with explicit decision logs stored in a central logs bucket

Tool access and permission rules:
- Least-privilege service accounts for gcloud calls, Secret Manager access, and Cloud KMS key usage
- No hard-coded credentials; all secrets retrieved at runtime
- Deployments require policy checks and approvals

Architecture rules:
- Google Cloud managed services with clean separation of concerns
- Use Pub/Sub for coordination, Cloud Run or GKE for workloads, and Cloud Build for CI/CD
- Centralized monitoring with Cloud Monitoring and logging with Cloud Logging

File structure rules:
- Infrastructure as code modules in infra/modules; environment-specific configs in infra/environments/prod and infra/environments/staging
- Application code under apps; pipelines under pipelines
- Documentation under docs

Data, API, or integration rules:
- Data artifacts stored in Cloud Storage with proper IAM controls
- APIs governed by IAM roles and service accounts; secrets in Secret Manager

Validation rules:
- terraform validate and terraform plan; policy checks; unit tests for modules
- integration tests in a staging environment before prod

Security rules:
- VPC Service Controls; all secrets encrypted with KMS
- production deployments gated by approvals and audit trails

Testing rules:
- unit, integration, end-to-end tests; canary deployments; rollback tests

Deployment rules:
- CI/CD pipelines in Cloud Build; canary first; promote with approvals
- Documented rollback plan; rollbacks are time-boxed

Human review and escalation rules:
- Human review required for prod deployments; escalation to SRE and Security

Failure handling and rollback rules:
- If failures are detected, roll back resources and revert code; notify owners; switch to safe mode

Things Agents must not do:
- Do not bypass approvals; do not store secrets in code; do not drift away from the source of truth

Recommended Agent Operating Model

Roles and decision boundaries: the Planner defines architecture; the Implementer creates resources; the Reviewer validates; the Tester confirms; the Domain Specialist ensures data governance, privacy, and compliance; the Operator monitors; the Researcher gathers knowledge. Escalation paths: to the Orchestrator Agent for blocking issues; to SRE for production incidents.
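The boundaries and escalation paths above can be written down as data so that an orchestrator can check them mechanically. The following is a minimal sketch, not part of the template itself: the action names, the `is_within_boundary` and `escalation_target` helpers, and the issue labels are all hypothetical names chosen for illustration.

```python
# Hypothetical mapping of each role to the single action it owns.
# Action names are illustrative labels, not defined by the template.
ROLE_ACTIONS = {
    "planner": "define_architecture",
    "implementer": "create_resources",
    "reviewer": "validate",
    "tester": "confirm",
    "domain_specialist": "security_and_compliance",
    "operator": "monitor",
    "researcher": "gather_knowledge",
}

# Escalation paths as stated in the operating model.
ESCALATION_PATHS = {
    "blocking_issue": "orchestrator",
    "production_incident": "sre",
}


def is_within_boundary(role: str, action: str) -> bool:
    """Return True only if the action is the one assigned to the role."""
    return ROLE_ACTIONS.get(role) == action


def escalation_target(issue_type: str) -> str:
    """Return who receives an escalation for the given issue type."""
    if issue_type not in ESCALATION_PATHS:
        raise ValueError(f"unknown issue type: {issue_type}")
    return ESCALATION_PATHS[issue_type]
```

For example, `is_within_boundary("tester", "create_resources")` is false, which an orchestrator could treat as a reason to refuse the request.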

Recommended Project Structure

The directory tree below follows a modular, Google Cloud native IaC layout focused on production readiness:

gcp-prod-agents-md-template/
  infra/
    modules/
      vpc/
      network-security/
      compute/
    environments/
      prod/
        main.tf
        variables.tf
        outputs.tf
        backend.tf
  apps/
    ingest-service/
      main.tf
      variables.tf
  pipelines/
    ci-cd/
      cloudbuild.yaml
  docs/
    ops-notes.md
  scripts/
    bootstrap.sh
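A layout like this is easier to keep intact if a script checks it. The sketch below assumes the repository root is passed in as a path and reports which of a few paths from the tree are missing; the `EXPECTED_PATHS` selection and the `missing_paths` helper are illustrative choices, not something the template prescribes.

```python
from pathlib import Path

# A subset of the tree above, chosen for illustration.
EXPECTED_PATHS = [
    "infra/modules",
    "infra/environments/prod",
    "apps",
    "pipelines/ci-cd/cloudbuild.yaml",
    "docs",
    "scripts/bootstrap.sh",
]


def missing_paths(repo_root: str) -> list:
    """Return the expected relative paths that do not exist under repo_root."""
    root = Path(repo_root)
    return [rel for rel in EXPECTED_PATHS if not (root / rel).exists()]
```

A CI step could call `missing_paths(".")` and fail the build when the returned list is not empty.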

Core Operating Principles

Single source of truth for design decisions and IaC state
Idempotent, auditable actions with clear versioning
Least privilege and secret management enforced by design
Explicit memory of decisions with time-stamped logs
Human-in-the-loop for prod changes and incident response
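The "explicit memory of decisions with time-stamped logs" principle could be implemented as one JSON line per decision. This is a sketch under assumptions: the field names (`run_id`, `agent`, `decision`, `rationale`, `timestamp`) and both helper functions are hypothetical, and writing the lines to a central logs bucket is left out.

```python
import json
from datetime import datetime, timezone


def make_decision_record(run_id, agent, decision, rationale, now=None):
    """Build a decision record with a UTC ISO 8601 timestamp.

    `now` can be injected for reproducible tests; it defaults to the current time.
    """
    timestamp = (now or datetime.now(timezone.utc)).isoformat()
    return {
        "run_id": run_id,
        "agent": agent,
        "decision": decision,
        "rationale": rationale,
        "timestamp": timestamp,
    }


def to_log_line(record):
    """Serialize a record as a single JSON line with stable key order."""
    return json.dumps(record, sort_keys=True)
```

Scoping each record to a `run_id` matches the template rule that memory is scoped to an orchestration run.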

Agent Handoff and Collaboration Rules

Planner to Implementer for design realization
Implementer to Reviewer for IaC validation
Reviewer to Tester for integration checks
Domain Specialist to Reviewer for security and compliance
Researcher to Planner for updates
Operator to all for runbooks and prod readiness
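The handoff rules above form a small directed graph, which can be encoded as a lookup table. The sketch below is one possible encoding; the agent identifiers and the `can_hand_off` helper are hypothetical, and treating "Operator to all" as "the Operator may hand off to any other agent" is an interpretation rather than something the rules state precisely.

```python
ALL_AGENTS = {
    "planner", "implementer", "reviewer", "tester",
    "researcher", "domain_specialist", "operator",
}

# Directed handoffs taken from the list above.
ALLOWED_HANDOFFS = {
    "planner": {"implementer"},
    "implementer": {"reviewer"},
    "reviewer": {"tester"},
    "domain_specialist": {"reviewer"},
    "researcher": {"planner"},
}


def can_hand_off(source: str, target: str) -> bool:
    """Return True if the handoff from source to target is permitted."""
    if source not in ALL_AGENTS or target not in ALL_AGENTS:
        raise ValueError(f"unknown agent in handoff: {source} -> {target}")
    if source == target:
        return False
    # Assumption: "Operator to all" means the Operator may reach every other agent.
    if source == "operator":
        return True
    return target in ALLOWED_HANDOFFS.get(source, set())
```

With this table, a direct Implementer to Tester handoff is rejected, so IaC cannot skip review.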

Tool Governance and Permission Rules

CLI tools using service accounts with scoped roles
Secrets accessed via Secret Manager; never embedded in code
APIs accessed through restricted IAM policies and audit logging
Production changes gated by approvals and automated policy checks
Rollback and kill-switch procedures in place for outages
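The approval gate for production changes can be reduced to a small predicate. The sketch below makes two assumptions that the rules do not spell out: policy checks are required in every environment, and one approver is enough for prod. `ChangeRequest`, `REQUIRED_PROD_APPROVALS`, and `may_apply` are hypothetical names.

```python
from dataclasses import dataclass, field

# Assumption: a single human approval is sufficient for prod.
REQUIRED_PROD_APPROVALS = 1


@dataclass
class ChangeRequest:
    environment: str
    approvers: set = field(default_factory=set)
    policy_checks_passed: bool = False


def may_apply(change: ChangeRequest) -> bool:
    """Gate a change: policy checks everywhere, approvals additionally for prod."""
    if not change.policy_checks_passed:
        return False
    if change.environment == "prod":
        return len(change.approvers) >= REQUIRED_PROD_APPROVALS
    return True
```

Keeping the gate as a pure function makes it easy to unit test and to call from a pipeline step before any apply runs.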

Code Construction Rules

Modular IaC using Terraform; modules in infra/modules
Outputs define downstream dependencies; inputs parameterized
Parallelizable tasks where possible; avoid race conditions
Idempotent apply operations; drift detection enabled
Documentation derived from code; changelogs maintained
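One way to approach the drift detection rule is `terraform plan -detailed-exitcode`, which exits with 0 when the plan is empty, 1 on error, and 2 when changes are present. The sketch below wraps that call; the function names and result labels are hypothetical, and a non-empty plan can mean either real drift or configuration changes that have not been applied yet, so the result needs human interpretation.

```python
import subprocess


def classify_plan_exit_code(code: int) -> str:
    """Map the exit code of `terraform plan -detailed-exitcode` to a label."""
    if code == 0:
        return "in_sync"
    if code == 2:
        return "changes_present"
    return "error"


def check_drift(workdir: str) -> str:
    """Run a non-interactive plan in workdir and classify the outcome.

    Assumes `terraform` is on PATH and the directory is already initialized.
    """
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return classify_plan_exit_code(result.returncode)
```

Running `check_drift` on a schedule against each environment directory would surface untracked changes between applies.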

Security and Production Rules

VPC Service Controls and private endpoints for prod networks
Data encrypted at rest with Cloud KMS; encryption in transit enforced
Audit trails for every deploy; incident response runbooks
Access controls based on least privilege; mandatory MFA for sensitive actions
Regular rotation of credentials and secrets
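The rule that secrets are never embedded in code can be backed by a simple pre-commit or CI scan. The sketch below is illustrative only: the three patterns and the `find_secrets` helper are assumptions, they will miss many real secrets, and a dedicated scanner is a better choice for production use.

```python
import re

# Illustrative patterns only; not an exhaustive or authoritative list.
SECRET_PATTERNS = {
    "private_key_block": re.compile(r"-----BEGIN (?:RSA |EC )?PRIVATE KEY-----"),
    "google_api_key": re.compile(r"AIza[0-9A-Za-z_\-]{35}"),
    "inline_credential": re.compile(
        r"\b(?:password|secret|token)\s*=\s*\"[^\"]{8,}\"", re.IGNORECASE
    ),
}


def find_secrets(text: str) -> list:
    """Return (line_number, pattern_name) pairs for lines that look like secrets."""
    findings = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for name, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                findings.append((line_number, name))
    return findings
```

A reference such as `secret_id = var.secret_id` is not flagged, because the value is a variable lookup rather than a quoted literal.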

Testing Checklist

Terraform validate and terraform plan; static checks
Unit tests for modules; integration tests in staging
Canary deployments and feature flag tests
End-to-end tests for data flows and APIs
Security and compliance scans; dependency checks
Disaster recovery drills and rollback verification
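The checklist above is ordered, so a runner can execute it step by step and stop at the first failure. This is a minimal sketch; `run_checklist` is a hypothetical helper, and each check is assumed to be a zero-argument callable that returns True on success.

```python
from typing import Callable, List, Optional, Tuple

Check = Tuple[str, Callable[[], bool]]


def run_checklist(checks: List[Check]) -> Tuple[List[str], Optional[str]]:
    """Run checks in order and stop at the first one that fails.

    Returns the names that passed and the name of the failing check, if any.
    """
    passed = []
    for name, check in checks:
        if not check():
            return passed, name
        passed.append(name)
    return passed, None
```

Stopping early keeps expensive stages, such as disaster recovery drills, from running when a cheap static check has already failed.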

Common Mistakes to Avoid

Skipping approvals or bypassing policy checks
Hard coding secrets or credentials in code
Architectural drift between design document and prod
Inadequate monitoring or insufficient logging
Untracked changes to infrastructure state

FAQ

What is this AGENTS.md Template for Google Cloud Production System Design?

It provides a formal operating manual to govern single-agent and multi-agent workflows on Google Cloud, including roles, handoffs, tool governance, and escalation paths.

How does multi-agent orchestration work in this template?

It defines an orchestrator and an agent roster with explicit handoff points, memory rules, and source-of-truth to coordinate tasks across GCP services.

What tools and services are governed by this template?

Cloud IAM, Secret Manager, Cloud Storage, Terraform, Cloud Build, Cloud Run, GKE, Pub/Sub, Cloud Monitoring, and Cloud Logging.

How are security and production rules enforced?

By least privilege access, secret management, encryption, policy checks, approval gates, and controlled deployment workflows with audit trails.

What are the escalation paths if something goes wrong?

Escalate to the Orchestrator Agent, trigger a rollback, halt non-critical services, and notify the SRE and Security teams for remediation.

Target User

Use Cases