ETL Pipeline AGENTS.md Template
Copyable AGENTS.md Template for ETL pipeline architecture with multi-agent orchestration, handoffs, tool governance, and human review.
Target User
Developers, data engineers, engineering leaders
Use Cases
- ETL pipeline orchestration with multi-agent coordination
- role-based agent responsibilities and handoffs
- tool governance, secrets management, and security in ETL workflows
- human review escalation and rollback planning
Markdown Template
# AGENTS.md
Project role
- ETL Platform Engineer (ETL Architect): owner of the pipeline architecture, interfaces, and governance.
Agent roster and responsibilities
- Planner/Orchestrator: sequences ETL steps (extract → transform → load), enforces dependencies, retries with backoff, and triggers validation and monitoring.
- Extract Agent: connects to source systems, performs incremental extraction, handles idempotency, writes to staging.
- Transform Agent: applies cleansing, normalization, business rules, and schema conformance.
- Load Agent: writes to target data store (data warehouse/data lake), manages partitions, and ensures atomic upserts.
- Validator Agent: enforces data quality checks, schema validation, and lineage verification; raises remediation requests on anomalies.
- Monitor Agent: collects throughput, latency, failure signals; emits metrics to observability stacks.
- Security Reviewer: audits credentials usage, restricts access, and ensures policy compliance.
Supervisor or orchestrator behavior
- The Planner maintains a run plan, enforces data dependencies, and triggers agents via a message bus. On failure it halts downstream steps, retries the failed step with backoff up to a configured maximum number of attempts, and escalates to human review once retries are exhausted.
Handoff rules between agents
- After Extract completes, pass context (run_id, source_system, batch_id, version) to Transform.
- After Transform completes, hand off to Load with target_table, partition, and data_model_version.
- After Load completes, pass to Validator for quality checks.
- If Validator passes, notify Monitor; if it fails, trigger remediation and alerting.
Context, memory, and source-of-truth rules
- Context is stored in a metadata store with keys: run_id, batch_id, source_system, version, and timestamp.
- Memory is transient per run; source-of-truth is the data catalog and data lake/warehouse with lineage tracked.
Tool access and permission rules
- Access to sources via secret manager; production credentials require explicit approvals.
- Agents may call only permitted APIs; read-only or restricted write permissions per role; secrets rotated on schedule.
Architecture rules
- Modular ETL stages with well-defined interfaces; decoupled via an orchestration layer or message bus; standardized data formats and schemas.
File structure rules
- /etl-pipelines/etl-pipeline/
  - configs/
  - docs/
  - src/
    - extract/
    - transform/
    - load/
  - tests/
  - agents/
  - workflows/
Data, API, or integration rules
- Data formats: Parquet (columnar) or Avro (row-oriented) for datasets; JSON for API payloads; all schemas versioned.
- Source connectors and APIs must be auditable; respect rate limits and retries; idempotent writes preferred.
Validation rules
- Row counts and schema checks; non-null primary keys; quality gates with configurable tolerances; end-to-end validation in staging before prod.
Security rules
- Encryption at rest and in transit; least privilege access; secret rotation; audit logging and anomaly detection.
Testing rules
- Unit tests for each ETL step; integration tests for connectors; end-to-end tests in staging; automated regression checks.
Deployment rules
- CI/CD gates; canary or blue/green deployments for schema changes; feature flags for new transformations; rollback plans in production.
Human review and escalation rules
- Escalate to data governance or the data owner for schema changes, sensitive data access changes, or failed quality gates beyond retry limits.
Failure handling and rollback rules
- On failure, revert to last-good checkpoint; preserve run log; notify stakeholders and trigger remediation tasks; ensure checkpoint recoverability.
Things Agents must not do
- Do not bypass approvals, secret management, or policy checks. Do not modify production data outside approved ETL runs. Do not reuse credentials or escalate privileges. Do not drift from the defined data contracts.
Overview
This AGENTS.md Template defines an ETL pipeline architecture governed by AI coding agents and multi-agent orchestration. It prescribes roles, handoffs, governance, and human-in-the-loop review to ensure reliable, auditable data processing from extract to load while enabling scalable collaboration among agents and human experts. It supports both single-agent execution and full multi-agent orchestration with clear boundaries and source-of-truth rules.
Direct answer: The ETL pipeline AGENTS.md Template outlines the agent roles, coordination rules, data governance, and operational constraints needed to run reliable ETL workflows with AI-powered agents and explicit handoffs.
When to Use This AGENTS.md Template
- Designing a new ETL data pipeline with automated orchestration across extract, transform, and load stages.
- Establishing a repeatable, governance-backed agent operating model for data pipelines.
- Defining clear handoff rules and source-of-truth across agents to avoid context drift.
- Implementing tool governance, secrets management, and security controls in ETL workflows.
- Creating a project-wide AGENTS.md that teams can copy for future ETL pipelines and multi-agent patterns.
Recommended Agent Operating Model
The ETL agent operating model assigns clear roles with decision boundaries and escalation paths. A Planner orchestrates but does not execute data transformations directly; Extract/Transform/Load agents implement business logic. Validators and Monitors provide checks and visibility. Escalations move to human review when automation cannot resolve issues within defined SLAs.
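The split between orchestration and execution can be sketched in a few lines. This is a minimal illustration, not a reference implementation: `run_pipeline`, `run_stage_with_retries`, `EscalationRequired`, and the default retry settings are hypothetical names and values chosen for the example, and each stage is assumed to be a plain callable that takes and returns a context dict.

```python
import time
from typing import Callable, Sequence

Stage = Callable[[dict], dict]


class EscalationRequired(Exception):
    """Raised when a stage keeps failing and a human must review the run."""


def run_stage_with_retries(
    name: str,
    stage: Stage,
    context: dict,
    max_retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Run one stage, retrying with exponential backoff before escalating."""
    for attempt in range(max_retries + 1):
        try:
            return stage(context)
        except Exception as exc:  # a real planner would catch narrower errors
            if attempt == max_retries:
                raise EscalationRequired(
                    f"{name} failed after {attempt + 1} attempts"
                ) from exc
            sleep(base_delay * 2 ** attempt)  # 1x, 2x, 4x, ... the base delay


def run_pipeline(
    stages: Sequence[tuple[str, Stage]], context: dict, **retry_opts
) -> dict:
    """Sequence stages in order; the planner only passes context along."""
    for name, stage in stages:
        context = run_stage_with_retries(name, stage, context, **retry_opts)
    return context
```

The planner here never touches the data itself. It only decides order, retry timing, and when to stop and hand the run to a person.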
Recommended Project Structure
etl-pipeline/
├── configs/
├── docs/
├── src/
│   ├── extract/
│   ├── transform/
│   └── load/
├── tests/
└── agents/
    ├── planner.md
    ├── extractor.md
    ├── transformer.md
    ├── loader.md
    ├── validator.md
    └── monitor.md
Core Operating Principles
- Idempotent steps and deterministic results for each run.
- Clear contracts between agents with explicit input/output schemas.
- Source-of-truth and lineage are maintained across runs.
- Principle of least privilege for all tool access and secrets.
- Always escalate when quality or security gates fail after retries.
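One possible way to make a step idempotent is to derive a deterministic key from the run context and refuse to write the same key twice. The sketch below assumes an in-memory `StagingStore` as a stand-in for a real staging area; `idempotency_key` and `write_once` are illustrative names, not part of any specific tool.

```python
import hashlib


def idempotency_key(source_system: str, batch_id: str, version: str) -> str:
    """Derive a deterministic key: the same inputs always give the same key."""
    raw = "|".join([source_system, batch_id, version])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


class StagingStore:
    """In-memory stand-in for a staging area keyed by idempotency key."""

    def __init__(self) -> None:
        self._batches: dict[str, list[dict]] = {}

    def write_once(self, key: str, rows: list[dict]) -> bool:
        """Write rows unless the key was already written; return True if written."""
        if key in self._batches:
            return False  # a retry of the same batch is a no-op
        self._batches[key] = list(rows)
        return True

    def row_count(self) -> int:
        """Total number of staged rows across all batches."""
        return sum(len(rows) for rows in self._batches.values())
```

With this shape, a retried Extract run for the same batch leaves the staging area unchanged, which is what makes retries safe.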
Agent Handoff and Collaboration Rules
- Planner → Extractor: pass run_id, source_system, batch_id, version, and last_seen.
- Extractor → Transformer: pass staged data references and metadata; ensure idempotency keys flow.
- Transformer → Loader: supply target schema, partitions, and data model version.
- Loader → Validator: provide counts, schema snapshot, and expected data quality gates.
- Validator → Monitor: emit health signals and success/failure status.
- On failure: Planner triggers remediation, or escalates to human review.
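The first two handoffs above could be expressed as immutable, versioned payloads. This is a sketch under assumptions: the class names, `staged_ref`, `CONTRACT_VERSION`, and the reading of `last_seen` as an extraction watermark are all hypothetical choices for illustration.

```python
import json
from dataclasses import asdict, dataclass

CONTRACT_VERSION = "1.0.0"  # hypothetical version tag for the handoff contract


@dataclass(frozen=True)
class ExtractRequest:
    """Planner -> Extractor payload."""

    run_id: str
    source_system: str
    batch_id: str
    version: str
    last_seen: str  # assumed to be a watermark of the last extracted record
    contract_version: str = CONTRACT_VERSION


@dataclass(frozen=True)
class TransformRequest:
    """Extractor -> Transformer payload."""

    run_id: str
    batch_id: str
    staged_ref: str  # reference to staged data, such as a path or object key
    idempotency_key: str
    contract_version: str = CONTRACT_VERSION


def to_message(payload) -> str:
    """Serialize a handoff payload to JSON with a stable key order."""
    return json.dumps(asdict(payload), sort_keys=True)
```

Frozen dataclasses keep a receiving agent from mutating the context it was handed, and the embedded contract version lets a consumer reject payloads it does not understand.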
Tool Governance and Permission Rules
- Commands must be scoped to the permitted ETL operations; no admin actions in pipelines.
- File edits follow a review gate; production changes require approval.
- API calls to sources/targets must use read/write permissions aligned with role.
- Secrets must be accessed via a central secret manager with rotation policies.
- Production endpoints require feature flags and observability signals.
- Handoff and data contracts must be versioned.
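Role-aligned permissions can be enforced with a simple allowlist check before any tool call. The role names and action strings below are illustrative assumptions; a real deployment would load them from its own policy source.

```python
# Hypothetical role-to-action allowlist; anything not listed is denied.
ROLE_PERMISSIONS: dict[str, frozenset[str]] = {
    "extractor": frozenset({"source:read", "staging:write"}),
    "transformer": frozenset({"staging:read", "staging:write"}),
    "loader": frozenset({"staging:read", "warehouse:write"}),
    "validator": frozenset({"warehouse:read"}),
}


class PermissionDenied(Exception):
    """Raised when an agent attempts an action outside its role."""


def authorize(role: str, action: str) -> None:
    """Allow the action only if the role's allowlist contains it."""
    allowed = ROLE_PERMISSIONS.get(role, frozenset())
    if action not in allowed:
        raise PermissionDenied(f"role {role!r} may not perform {action!r}")
```

Unknown roles get an empty allowlist, so the check fails closed by default.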
Code Construction Rules
- Use explicit input/output schemas for every agent action.
- Keep transforms pure; avoid side effects outside the pipeline store.
- Implement idempotent writes and safe retries with backoff.
- Log sufficient context for auditability in every step.
- Validate data types and nullability at each stage.
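A stage-level check for types and nullability might look like the following. It is a minimal sketch: the `Schema` shape and the `ORDERS_SCHEMA` columns are invented for the example, and real pipelines would more likely rely on a schema library or the storage format's own schema.

```python
from typing import Any

# column name -> (expected Python type, whether None is allowed)
Schema = dict[str, tuple[type, bool]]

# Hypothetical schema used only for illustration.
ORDERS_SCHEMA: Schema = {
    "order_id": (int, False),
    "amount": (float, False),
    "coupon": (str, True),
}


def validate_row(row: dict[str, Any], schema: Schema) -> list[str]:
    """Return a list of violations for one row; an empty list means valid."""
    errors: list[str] = []
    for column, (expected_type, nullable) in schema.items():
        if column not in row:
            errors.append(f"{column}: missing")
            continue
        value = row[column]
        if value is None:
            if not nullable:
                errors.append(f"{column}: null not allowed")
        elif not isinstance(value, expected_type):
            errors.append(
                f"{column}: expected {expected_type.__name__}, "
                f"got {type(value).__name__}"
            )
    return errors
```

Returning all violations at once, instead of raising on the first, gives the run log enough context for later audit.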
Security and Production Rules
- Encrypt data in transit and at rest; enforce least privilege for all agents.
- Rotate credentials on a scheduled cadence; monitor for suspicious access.
- No hard-coded secrets; use a secure vault with access controls.
- Implement audit trails for all data-access events.
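As a rough illustration of "no hard-coded secrets" combined with an audit trail, the sketch below reads a secret from an environment variable, which stands in for a vault client here, and records every access attempt. `get_secret` and `AUDIT_LOG` are hypothetical names, and the in-memory list is only a placeholder for an append-only audit sink.

```python
import os
from datetime import datetime, timezone

AUDIT_LOG: list[dict] = []  # stand-in for an append-only audit sink


def get_secret(name: str, agent: str) -> str:
    """Fetch a secret from the environment and audit the access attempt."""
    value = os.environ.get(name)
    AUDIT_LOG.append(
        {
            "event": "secret_access",
            "secret": name,  # log the secret's name only, never its value
            "agent": agent,
            "granted": value is not None,
            "at": datetime.now(timezone.utc).isoformat(),
        }
    )
    if value is None:
        raise KeyError(f"secret {name!r} is not provisioned for this environment")
    return value
```

Failed lookups are logged as well, since repeated denied attempts are one of the signals worth monitoring.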
Testing Checklist
- Unit tests for extract/transform/load logic.
- Integration tests for connectors and data formats.
- End-to-end tests in staging with representative data volumes.
- Regression tests for schema evolution and backward compatibility.
- Performance tests to verify throughput and latency SLAs.
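For the unit-test item, a transform written as a pure function is straightforward to cover with the standard `unittest` module. `normalize_email` is a made-up transform used only to show the shape of such tests.

```python
import unittest


def normalize_email(row: dict) -> dict:
    """Pure transform: returns a new row and leaves the input untouched."""
    return {**row, "email": row["email"].strip().lower()}


class NormalizeEmailTest(unittest.TestCase):
    def test_trims_and_lowercases(self):
        result = normalize_email({"id": 1, "email": "  Ada@Example.COM "})
        self.assertEqual(result, {"id": 1, "email": "ada@example.com"})

    def test_does_not_mutate_input(self):
        row = {"id": 2, "email": " A@B.C "}
        normalize_email(row)
        self.assertEqual(row["email"], " A@B.C ")

    def test_is_idempotent(self):
        once = normalize_email({"id": 3, "email": " X@Y.Z"})
        self.assertEqual(normalize_email(once), once)
```

Saved in a file under `tests/`, these cases can be run with `python -m unittest`. The idempotency test mirrors the operating principle that re-running a step should not change its result.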
Common Mistakes to Avoid
- Skipping data quality checks or running without lineage visibility.
- Allowing uncontrolled handoffs without contracts or versioning.
- Bypassing security reviews or secret management during deployments.
- Failing to implement idempotency and checkpointing in ETL steps.
- Letting the Planner execute transformations directly instead of delegating them to the Transform Agent.
FAQ
What is the purpose of this AGENTS.md Template for ETL pipelines?
It defines the end-to-end agent roles, handoffs, and governance for ETL workflows to enable reliable, auditable multi-agent orchestration.
How do agents coordinate in a multi-agent ETL pipeline?
Agents communicate via a planner/orchestrator that sequences Extract, Transform, Load, and Validate steps, passing context and results through well-defined contracts.
How is data quality validated in the template?
Quality gates are enforced by a Validator Agent that checks row counts, schema conformance, nullability, and domain-specific rules, with remediation and escalation when needed.
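A quality gate with a configurable tolerance could be sketched as follows. The function names and the relative-drift definition of tolerance are assumptions made for this example; the template itself does not prescribe how tolerance is measured.

```python
def row_count_gate(source_count: int, loaded_count: int, tolerance: float = 0.0) -> bool:
    """Pass when the relative row-count drift is within the tolerance."""
    if source_count == 0:
        return loaded_count == 0
    drift = abs(source_count - loaded_count) / source_count
    return drift <= tolerance


def non_null_key_gate(rows: list[dict], key: str) -> bool:
    """Pass when every row carries a non-null value for the primary key."""
    return all(row.get(key) is not None for row in rows)


def run_quality_gates(
    source_count: int, rows: list[dict], key: str, tolerance: float
) -> dict[str, bool]:
    """Evaluate all gates and report each result by name."""
    return {
        "row_count": row_count_gate(source_count, len(rows), tolerance),
        "non_null_key": non_null_key_gate(rows, key),
    }
```

A Validator built this way can pass the per-gate report to the Monitor and trigger remediation only for the gates that failed.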
What happens on failure or rollback in the ETL workflow?
A failed step is retried with backoff. If it still fails after the maximum number of retries, the run rolls back to the last known-good checkpoint, and the failure is escalated to human review when automation cannot resolve it.
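The rollback behavior can be illustrated with an in-memory table that keeps a last-known-good snapshot. `CheckpointedTable` and `load_with_rollback` are hypothetical names; a real target store would use its own snapshot, transaction, or time-travel mechanism instead of deep copies.

```python
import copy
from typing import Callable


class CheckpointedTable:
    """In-memory stand-in for a target table with a last-known-good checkpoint."""

    def __init__(self) -> None:
        self.rows: list[dict] = []
        self._checkpoint: list[dict] = []
        self.run_log: list[str] = []  # preserved across rollbacks

    def commit_checkpoint(self, run_id: str) -> None:
        """Record the current rows as the new last-known-good state."""
        self._checkpoint = copy.deepcopy(self.rows)
        self.run_log.append(f"{run_id}: checkpoint committed")

    def rollback(self, run_id: str, reason: str) -> None:
        """Restore the last-known-good rows and keep a log entry."""
        self.rows = copy.deepcopy(self._checkpoint)
        self.run_log.append(f"{run_id}: rolled back ({reason})")


def load_with_rollback(
    table: CheckpointedTable,
    run_id: str,
    new_rows: list[dict],
    validate: Callable[[list[dict]], bool],
) -> bool:
    """Apply rows, then either checkpoint on success or roll back on failure."""
    table.rows.extend(new_rows)
    if validate(table.rows):
        table.commit_checkpoint(run_id)
        return True
    table.rollback(run_id, "validation failed")
    return False
```

The run log survives the rollback, so the failed attempt stays visible to whoever picks up the escalation.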
How is security maintained in the ETL AGENTS.md Template?
Access is governed by least privilege, secrets are stored in a central vault, credentials rotate on schedule, and all data-access events are audited.