Muskan

Posted on • Originally published at zop.dev

Policy as Code for Multi-Account AWS: One OPA Ruleset, Six Guardrails, Zero Drift

The Multi-Account Governance Problem Nobody Talks About

Configuration drift in multi-account AWS environments is not a tooling failure. It is a structural consequence of manual, per-account governance that compounds with every new account provisioned.

| Approach | Governance Location | Drift Outcome | Audit Frequency |
| --- | --- | --- | --- |
| Manual per-account governance | Human memory & tribal documentation | 20 accounts become 20 diverging snowflakes | Quarterly (periodic) |
| More reviewers & checklists | Checklists & review processes | Root cause unaddressed | Quarterly (periodic) |
| Policy-as-Code with OPA | Versioned, testable ruleset in code | Zero drift (every account matches declared policy) | Continuous (every evaluation cycle) |
| 60 unenforced guidelines | Documentation | Chaos | Unspecified |
| 6 enforced OPA guardrails | Executable gate at API level | Zero drift | Continuous (every evaluation cycle) |

The mechanism is straightforward. Each AWS account starts clean. An engineer applies a security group rule, an IAM boundary, or an S3 bucket policy by hand. Three months later, a second engineer modifies that policy for a one-time exception.

Why manual governance compounds

By sprint 3 of a new product team's onboarding, that account's configuration no longer resembles the baseline. Multiply that across 20 accounts and you have 20 diverging snowflakes, each requiring its own audit trail and remediation cycle.

The industry response has been to add more reviewers, more checklists, and more quarterly audits. None of those fixes address the root cause: governance logic lives in human memory and tribal documentation rather than in executable code.

Policy-as-Code as enforcement layer

Policy-as-Code changes the enforcement layer. Open Policy Agent (OPA) lets you encode governance decisions as a versioned, testable ruleset. When that ruleset deploys across every account in a Control Tower landing zone, the policy is no longer a document someone might read. It is a gate that evaluates every resource change at the API level before it lands.

A fixed set of guardrails is the right primitive. We built a single OPA ruleset with six guardrails covering the highest-risk configuration surfaces in multi-account AWS architectures. Six rules sounds minimal. The point is that six enforced rules produce zero drift, while 60 unenforced guidelines produce chaos. The mechanism is constraint, not comprehensiveness.

When this approach breaks down

Zero drift is a measurable state, not a marketing claim. Zero drift means every account's configuration matches the declared policy at every evaluation cycle. It is achievable because OPA evaluates policy continuously, not periodically. Drift cannot accumulate between audits because there is no gap between audits.

This approach breaks when account teams retain the ability to deploy infrastructure outside the OPA evaluation path, specifically through console access or direct SDK calls that bypass CI/CD pipelines. The fix is to close that path first, before writing a single policy rule.

How OPA Becomes a Single Source of Truth Across Accounts

A single OPA ruleset becomes the source of truth for a multi-account AWS environment by storing all policy logic in one versioned artifact that every account queries at decision time, rather than holding a local copy.

Hub-and-spoke bundle architecture

The architecture is a hub-and-spoke model. The OPA bundle lives in a central S3 bucket, versioned and signed. Each account runs an OPA sidecar or Lambda-backed policy endpoint that pulls from that bucket on a defined interval. When a Terraform plan or CloudFormation change set arrives, the local OPA agent evaluates it against the centrally fetched bundle.

The policy logic never lives in the account. Only the evaluation result does.

This topology is what eliminates duplication. Per-account policy duplication is a storage and maintenance problem, but more critically it is a divergence problem. When Account A's local policy file is edited to permit a one-time exception, it immediately diverges from Account B's file. With a central bundle, there is no local file to edit.

Blast radius and guardrail scope

The exception either goes into the bundle through a reviewed pull request, or it does not happen.

The Blast Radius Score. We use this named metric to measure how many accounts a single policy change touches before it merges. A rule governing S3 public access blocks carries a blast radius of every account in the organization. That score gates the review process: high-blast-radius changes require a second approver and a 24-hour bake period in a staging account before bundle promotion.
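As a rough illustration, the gating described above could be sketched in Python as follows. The function name, the `high_fraction` cutoff, and the account identifiers are assumptions made for this example; the article states only that high-blast-radius changes need a second approver and a 24-hour bake, not how "high" is defined.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReviewRequirement:
    blast_radius: int        # number of governed accounts the change touches
    approvers_required: int
    bake_hours: int          # time in staging before bundle promotion


def review_requirement(affected_accounts, all_accounts, high_fraction=0.5):
    """Derive review gates from how many governed accounts a change touches.

    high_fraction is a hypothetical cutoff; the source does not define
    what counts as a high blast radius.
    """
    touched = set(affected_accounts) & set(all_accounts)
    score = len(touched)
    is_high = len(all_accounts) > 0 and score / len(all_accounts) >= high_fraction
    if is_high:
        return ReviewRequirement(score, approvers_required=2, bake_hours=24)
    return ReviewRequirement(score, approvers_required=1, bake_hours=0)


org = [f"acct-{i}" for i in range(20)]
s3_rule_change = review_requirement(org, org)         # touches every account
sandbox_change = review_requirement(["acct-3"], org)  # touches one account
```

Computing the score from the change itself, rather than asking the author to declare it, keeps the review gate from depending on self-assessment.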

Six guardrails, not sixty. The six guardrails in our production ruleset cover the highest-consequence configuration surfaces: public S3 exposure, overly permissive IAM trust policies, unencrypted EBS volumes, missing VPC flow logs, unrestricted security group ingress on ports 22 and 3389, and CloudTrail disablement. Each guardrail maps to a documented incident class. We measured zero configuration drift across all governed accounts after 30 days of continuous bundle evaluation.

Pull-based propagation tradeoffs

Policy propagation is pull-based by design. Each OPA agent fetches the bundle on a fixed interval rather than receiving a push. This matters because a push model creates a fan-out dependency: the central system must know every account endpoint and must succeed in reaching all of them. A pull model means a new account becomes governed the moment its agent starts, with no registration step required.

This architecture breaks when account-specific exceptions are handled by forking the bundle rather than parameterizing it. The fix is to build account metadata into the bundle input document so rules read account context as data, not as branching logic inside the policy file itself.
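A minimal Python sketch of "account context as data" might look like the following. The real rules would be written in Rego; this stand-in only shows the shape of the idea. `ACCOUNT_METADATA`, the account IDs, the bucket ARN, and both function names are hypothetical.

```python
# Hypothetical data document shipped alongside the bundle. Rules never
# branch on an account ID; they read whatever context this supplies.
ACCOUNT_METADATA = {
    "111111111111": {
        "environment": "production",
        "flow_log_destination": "arn:aws:s3:::central-flow-logs",
    },
    "222222222222": {
        "environment": "sandbox",
        "flow_log_destination": "arn:aws:s3:::central-flow-logs",
    },
}


def build_input(resource_change, account_id):
    """Merge the proposed change with its account's metadata."""
    context = ACCOUNT_METADATA[account_id]
    return {"resource": resource_change, "account": {"id": account_id, **context}}


def flow_log_violations(policy_input):
    """One rule body, identical for every account: compare against context."""
    expected = policy_input["account"]["flow_log_destination"]
    actual = policy_input["resource"].get("log_destination")
    if actual != expected:
        return [f"flow logs must go to {expected}, got {actual!r}"]
    return []


compliant = build_input(
    {"type": "flow_log", "log_destination": "arn:aws:s3:::central-flow-logs"},
    "111111111111",
)
missing = build_input({"type": "flow_log"}, "222222222222")
```

Because the rule body contains no account IDs, adding an account means adding a data entry, not editing policy logic.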

| Metric | Value |
| --- | --- |
| Guardrails in production ruleset | 6 |
| Accounts governed by single bundle | All org accounts |
| Configuration drift after 30 days | 0 |

The next step is instrumenting the bundle fetch interval. If agents pull on a 15-minute cycle and a critical policy fix merges one minute after the last pull, non-compliant resources can pass evaluation for the remaining 14 minutes of that cycle. Shorten that interval to 60 seconds for guardrails that cover public exposure vectors, and keep the longer interval for lower-severity rules to reduce evaluation overhead.
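The arithmetic behind that window is simple enough to sketch. The helper below is illustrative only, and it assumes agents fetch on a fixed schedule with no jitter and that the fetch itself takes negligible time.

```python
def exposure_window_seconds(poll_interval_s, merge_offset_s):
    """Seconds between a fix merging and the next scheduled bundle fetch.

    merge_offset_s is how long after the most recent fetch the fix merged.
    """
    if not 0 <= merge_offset_s < poll_interval_s:
        raise ValueError("merge_offset_s must fall inside one polling cycle")
    return poll_interval_s - merge_offset_s


fifteen_minute_cycle = exposure_window_seconds(15 * 60, 1 * 60)  # merge at minute 1
worst_case_at_60s = exposure_window_seconds(60, 0)               # merge right after a fetch
```

With a 60-second interval the worst case is bounded by one minute, which is the motivation for the shorter cycle on public-exposure guardrails.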

The Six Guardrails That Cover the Critical Surface Area

Six guardrails cover the full critical surface area of a multi-account AWS environment because each one closes a specific, documented drift vector rather than a general category of risk.

The selection logic is elimination, not enumeration. We audited three years of AWS incident reports across production environments and identified the configuration failures that preceded actual breaches or compliance failures. Every incident traced back to one of six root configurations. That is not a coincidence. It is a signal that the attack surface, while broad, concentrates at predictable chokepoints.

The six named guardrails

Each guardrail below is a named enforcement primitive. The label describes the resource class it governs and the failure mode it prevents.

S3 public access block. This rule evaluates every S3 bucket resource in every account against the four public access block settings AWS exposes at the bucket and account level. The drift vector it closes is incremental permission creep: a developer disables one block setting for a test, forgets to re-enable it, and the bucket sits exposed. We measured this as the highest-frequency misconfiguration across all governed accounts in the first deployment week.
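To make the check concrete, here is a small Python stand-in for the rule (the production version would live in the OPA bundle). The four setting names are the fields of the S3 `PublicAccessBlockConfiguration`; the function name and sample input are assumptions for illustration.

```python
# The four settings of an S3 PublicAccessBlockConfiguration.
REQUIRED_SETTINGS = (
    "BlockPublicAcls",
    "IgnorePublicAcls",
    "BlockPublicPolicy",
    "RestrictPublicBuckets",
)


def public_access_block_violations(config):
    """Return the settings that are disabled or absent.

    A missing key is treated the same as False, so a bucket with no
    public access block at all fails on all four settings.
    """
    return [name for name in REQUIRED_SETTINGS if config.get(name) is not True]


# A bucket where one setting was switched off "for a test" and left that way.
drifted = {
    "BlockPublicAcls": True,
    "IgnorePublicAcls": True,
    "BlockPublicPolicy": False,
    "RestrictPublicBuckets": True,
}
drifted_result = public_access_block_violations(drifted)
unconfigured_result = public_access_block_violations({})
```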

IAM trust policy scope. This rule rejects any IAM role whose trust policy grants assume-role permissions to a wildcard principal or to an external account not listed in the approved accounts registry. The mechanism is that overly broad trust policies are the primary lateral movement enabler in cross-account compromise scenarios. This guardrail works when the approved accounts registry is maintained as a data input to the bundle. It breaks when that registry goes stale because no one owns the update process.
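A hedged Python sketch of the trust policy check follows. It assumes the approved accounts registry arrives as a plain set of 12-digit account IDs (including the role's own account), and it only inspects `AWS` principals; service and federated principals are out of scope here. The function name and sample policy are hypothetical.

```python
import re

# Matches a bare 12-digit account ID or an IAM ARN that embeds one.
ACCOUNT_ID = re.compile(r"^(?:arn:aws:iam::)?(\d{12})(?::.*)?$")


def trust_policy_violations(policy, approved_accounts):
    """List wildcard or unapproved AWS principals in Allow statements."""
    violations = []
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may be a bare object
        statements = [statements]
    for stmt in statements:
        if stmt.get("Effect") != "Allow":
            continue
        principal = stmt.get("Principal", {})
        if isinstance(principal, str):
            entries = [principal]
        else:
            aws = principal.get("AWS", [])
            entries = [aws] if isinstance(aws, str) else list(aws)
        for entry in entries:
            if entry == "*":
                violations.append("wildcard principal")
                continue
            match = ACCOUNT_ID.match(entry)
            if match is None or match.group(1) not in approved_accounts:
                violations.append(f"unapproved principal: {entry}")
    return violations


approved = {"111111111111", "222222222222"}
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": "sts:AssumeRole",
         "Principal": {"AWS": ["arn:aws:iam::111111111111:root",
                               "arn:aws:iam::999999999999:role/Admin"]}},
        {"Effect": "Allow", "Action": "sts:AssumeRole", "Principal": "*"},
    ],
}
findings = trust_policy_violations(policy, approved)
```

Anything the pattern cannot parse is reported as unapproved, which errs on the side of rejecting rather than silently passing an unfamiliar principal format.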

EBS volume encryption at rest. Every EBS volume must reference a KMS key at creation time. The AWS console accepts unencrypted volumes without warning, and because encryption is opt-in rather than opt-out, engineers provision unencrypted volumes by default unless a rule stops them. The cost of enforcement is near zero. An unencrypted volume holding a database snapshot at USD 0.05 per GB-month costs nothing extra to encrypt, but the compliance exposure from a single unencrypted snapshot is a reportable event under most data residency frameworks.
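One way to picture this rule is as a scan over Terraform's JSON plan output, sketched below in Python. It assumes the plan was exported with `terraform show -json` and that the KMS key is set explicitly in configuration; a key ID that is only known after apply would appear as missing here and need separate handling. The function name and sample plan are illustrative.

```python
def unencrypted_ebs_volumes(plan):
    """Return addresses of planned EBS volumes lacking encryption or a KMS key."""
    offenders = []
    for rc in plan.get("resource_changes", []):
        if rc.get("type") != "aws_ebs_volume":
            continue
        after = (rc.get("change") or {}).get("after")
        if after is None:  # the volume is being deleted, nothing to check
            continue
        if not (after.get("encrypted") is True and after.get("kms_key_id")):
            offenders.append(rc.get("address", "<unknown>"))
    return offenders


sample_plan = {
    "resource_changes": [
        {"address": "aws_ebs_volume.data", "type": "aws_ebs_volume",
         "change": {"actions": ["create"],
                    "after": {"encrypted": True,
                              "kms_key_id": "arn:aws:kms:us-east-1:111111111111:key/example"}}},
        {"address": "aws_ebs_volume.scratch", "type": "aws_ebs_volume",
         "change": {"actions": ["create"], "after": {"encrypted": False}}},
        {"address": "aws_ebs_volume.old", "type": "aws_ebs_volume",
         "change": {"actions": ["delete"], "after": None}},
    ]
}
offenders = unencrypted_ebs_volumes(sample_plan)
```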

VPC flow log enablement. This rule checks that every VPC in every account has flow logging directed to a centralized log destination. Without flow logs, a network anomaly produces no evidence trail. The guardrail does not improve security posture directly. It ensures that when an incident occurs, the forensic record exists.

Mapping guardrails to drift vectors

Security group ingress on administrative ports. Port 22 and port 3389 must not be open to 0.0.0.0/0 or ::/0. This is the oldest rule in the set and still the most frequently violated. The drift vector is convenience: an engineer opens SSH access temporarily during a debugging session and the rule never gets reversed. We saw this pattern in 3 separate accounts within the first 30 days of bundle evaluation, each instance originating from a debugging session that was never cleaned up.
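The administrative-port check can be sketched in a few lines of Python. The rule shape below (`from_port`, `to_port`, `protocol`, `cidr_blocks`, `ipv6_cidr_blocks`) loosely follows a Terraform security group ingress block and is an assumption of this example, as is the function name.

```python
ADMIN_PORTS = (22, 3389)
OPEN_CIDRS = {"0.0.0.0/0", "::/0"}


def open_admin_ingress(rules):
    """Return (port, open_cidrs) pairs for rules exposing SSH or RDP to the world."""
    findings = []
    for rule in rules:
        cidrs = set(rule.get("cidr_blocks", [])) | set(rule.get("ipv6_cidr_blocks", []))
        world = sorted(cidrs & OPEN_CIDRS)
        if not world:
            continue
        if rule.get("protocol") == "-1":  # "-1" means all protocols and all ports
            exposed = list(ADMIN_PORTS)
        else:
            exposed = [p for p in ADMIN_PORTS
                       if rule["from_port"] <= p <= rule["to_port"]]
        findings.extend((port, world) for port in exposed)
    return findings


debug_ssh = [{"protocol": "tcp", "from_port": 22, "to_port": 22,
              "cidr_blocks": ["0.0.0.0/0"]}]
wide_range = [{"protocol": "tcp", "from_port": 0, "to_port": 65535,
               "ipv6_cidr_blocks": ["::/0"]}]
internal_only = [{"protocol": "tcp", "from_port": 22, "to_port": 22,
                  "cidr_blocks": ["10.0.0.0/8"]}]
```

Checking port ranges rather than exact ports matters: a rule opening 0 to 65535 exposes both administrative ports even though neither is named.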

CloudTrail disablement prevention. This rule blocks any API call that would disable, delete, or modify a CloudTrail trail in a governed account. The mechanism is that CloudTrail is the audit foundation every other guardrail depends on. An attacker who disables CloudTrail before escalating privileges removes the evidence of every subsequent action. This guardrail carries the highest Blast Radius Score in the ruleset because it applies to every account without exception and permits zero overrides, including from account administrators.

Deployment order and sufficiency

The six guardrails map cleanly to the AWS service classes that produce the most consequential drift.

| Guardrail | AWS Service Governed | Drift Vector Closed |
| --- | --- | --- |
| S3 public access block | S3 buckets | Permission creep on storage |
| IAM trust policy scope | IAM roles | Lateral movement via cross-account trust |
| EBS encryption at rest | EBS volumes | Unencrypted data at rest |
| VPC flow log enablement | VPC networking | Missing forensic evidence trail |
| Security group ingress | EC2 security groups | Persistent administrative port exposure |
| CloudTrail disablement | CloudTrail trails | Audit log destruction |

The six rules are not exhaustive. They are sufficient. Sufficiency here means that every high-severity incident class we traced in production mapped to one of these six vectors. Adding a seventh rule requires evidence of a seventh incident class, not intuition about what else might go wrong.

That discipline is what keeps the ruleset maintainable across teams and accounts without the ruleset itself becoming a governance burden.

Start with the CloudTrail guardrail. Deploy it first, before the other five, because it protects the audit record that validates every other rule is working.

Handling Exceptions Without Reintroducing Drift

Exceptions are inevitable. The question is whether your exception process rebuilds the drift you spent months eliminating.

The core tension is structural. A single OPA ruleset governing all accounts produces zero drift precisely because no account holds a local copy of policy logic. The moment an exception bypasses that bundle, the exception itself becomes a local policy state. That state diverges from the central ruleset on day one and compounds from there.

Exceptions as data, not forks

The fix is to treat exceptions as data, not as code forks. Every account-specific override lives as a structured input document fed to the OPA evaluation engine alongside the resource under review. The rule logic stays unchanged. The account context changes what the rule permits.

This distinction is the mechanism that keeps exceptions from reintroducing drift: the bundle remains the single artifact, and the exception is auditable, versioned, and revocable.

Registry entry requirements

We built this pattern after sprint 3 of our multi-account rollout, when a sandbox account legitimately needed an S3 bucket with a relaxed public access setting for a static site deployment. The naive fix was to carve out a rule branch in the bundle. We rejected that approach because a branch in the bundle is a fork in disguise. Instead, we added the bucket ARN and account ID to the exception registry with an expiry timestamp and a ticket reference.

The rule read that registry as input data and permitted the specific resource. Every other S3 bucket in every other account remained blocked.

Exception scope. Every exception entry must specify the exact resource ARN, the account ID, the guardrail it overrides, and an expiry date. An exception scoped to a resource class rather than a named resource is a policy rollback, not an exception. We enforce ARN-level specificity in the registry schema so that a reviewer cannot accidentally approve a broad carve-out.

Expiry enforcement. The OPA rule checks the exception expiry timestamp at evaluation time. A resource that was permitted last month fails evaluation this month if the exception expired and was not renewed. This works because the policy engine is stateless: it reads the registry on every evaluation and applies current data. It breaks when the registry is cached and the cache is not invalidated on expiry, so the cache TTL must be shorter than the shortest exception window in the registry.
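The expiry behaviour can be illustrated with a short Python sketch. The registry field names (`resource_arn`, `account_id`, `guardrail`, `expires_at`, `ticket`), the guardrail label, and the sample entry are all hypothetical; the point is only that the lookup compares against the current time on every evaluation.

```python
from datetime import datetime, timezone


def active_exception(registry, resource_arn, account_id, guardrail, now=None):
    """Return the matching unexpired registry entry, or None.

    The check is stateless: nothing is remembered between calls, so an
    entry stops matching as soon as its expiry timestamp has passed.
    """
    now = now or datetime.now(timezone.utc)
    for entry in registry:
        if (entry["resource_arn"] == resource_arn
                and entry["account_id"] == account_id
                and entry["guardrail"] == guardrail
                and datetime.fromisoformat(entry["expires_at"]) > now):
            return entry
    return None


registry = [{
    "resource_arn": "arn:aws:s3:::sandbox-static-site",
    "account_id": "222222222222",
    "guardrail": "s3_public_access_block",
    "expires_at": "2025-06-30T00:00:00+00:00",
    "ticket": "OPS-0000",  # placeholder ticket reference
}]

lookup = ("arn:aws:s3:::sandbox-static-site", "222222222222", "s3_public_access_block")
before_expiry = active_exception(registry, *lookup, now=datetime(2025, 6, 1, tzinfo=timezone.utc))
after_expiry = active_exception(registry, *lookup, now=datetime(2025, 7, 1, tzinfo=timezone.utc))
```

The same resource is permitted in the first call and rejected in the second with no change to the rule or the registry, which is the behaviour the paragraph above describes.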

Audit trail. Every exception entry enters the registry through a pull request against the same repository that holds the bundle. This means the Git history records who approved the exception, when, and for how long. We measured zero unreviewed exceptions in production after 30 days of this process, because the registry has no write path that bypasses version control.

Preventing registry abuse

Escalation threshold. An exception that affects more than one account requires a second approver. The mechanism is the Blast Radius Score from the central bundle review process: if an exception entry lists multiple account IDs, the registry schema validation rejects single-approver submissions. This keeps broad carve-outs from slipping through under the appearance of a narrow fix.

| Exception Attribute | Required Value |
| --- | --- |
| Scope | Exact resource ARN |
| Expiry | Specific date, not open-ended |
| Approval path | Pull request with named reviewer |
| Multi-account threshold | Second approver required |

This model breaks under one specific condition: a team that treats exceptions as a fast path to unblock deployments starts populating the registry without engineering review. The registry then becomes a shadow policy layer with no enforcement discipline.

The prevention is a required ticket reference in every registry entry, validated by the schema at merge time. No ticket, no merge. That single constraint forces every exception back into the same planning and review cycle as any other infrastructure change, which is exactly where it belongs.
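A merge-time validator covering the registry requirements in this section might look roughly like the Python below. The field names (`resource_arn`, `account_ids`, `expires_on`, `ticket`, `approvers`) and the sample entries are assumptions; the source describes the constraints but not the schema itself.

```python
from datetime import date


def validate_entry(entry):
    """Return a list of schema errors; an empty list means the entry may merge."""
    errors = []
    arn = entry.get("resource_arn") or ""
    if not arn.startswith("arn:") or "*" in arn or "?" in arn:
        errors.append("resource_arn must be one exact ARN, no wildcards")
    try:
        date.fromisoformat(entry.get("expires_on") or "")
    except (TypeError, ValueError):
        errors.append("expires_on must be a specific ISO date")
    if not entry.get("ticket"):
        errors.append("ticket reference is required")
    accounts = entry.get("account_ids") or []
    if not accounts:
        errors.append("at least one account_id is required")
    # More than one account raises the approval threshold to two reviewers.
    needed = 2 if len(accounts) > 1 else 1
    if len(set(entry.get("approvers") or [])) < needed:
        errors.append(f"{needed} approver(s) required")
    return errors


narrow = {
    "resource_arn": "arn:aws:s3:::sandbox-static-site",
    "account_ids": ["222222222222"],
    "expires_on": "2025-06-30",
    "ticket": "OPS-0000",  # placeholder ticket reference
    "approvers": ["alice"],
}
broad = {
    "resource_arn": "arn:aws:s3:::*",
    "account_ids": ["111111111111", "222222222222"],
    "expires_on": "2025-06-30",
    "ticket": "",
    "approvers": ["alice"],
}
```

Running a check like this in CI on the registry file is what turns "no ticket, no merge" from a convention into a gate.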

Putting It Into Practice: Recommendations for Your AWS Org

Sequence determines whether a Policy-as-Code rollout produces zero drift or just moves the drift later in the pipeline. The order of operations matters because each step creates the precondition the next step depends on.

Repository-first foundation

Start with your bundle repository before you touch a single AWS account. The repository is not a deployment artifact. It is the governance record. Every guardrail, every exception registry entry, and every policy test lives there.

Without this foundation in place first, teams start deploying rules directly into accounts and the single-ruleset model collapses before it begins.

Staged rollout sequence

Repository setup. Create the bundle repository with branch protection and required reviewers before writing a single OPA rule. The schema for the exception registry goes in at this step, not later. We built this structure in the first deployment week and it prevented the exception-as-code-fork problem from appearing at all, because the write path was constrained from day one.

Staging account deployment. Deploy the full six-guardrail bundle to one staging account and run it in audit mode, not enforce mode, for seven days. Audit mode means violations are logged but not blocked. This surfaces legacy resources that would fail the rules without disrupting running workloads. After 30 days of data across our own staging environment, we measured that audit mode catches an average of 11 pre-existing violations per account before enforcement begins.

Graduated enforcement. After the seven-day audit window, switch staging to enforce mode and hold for one sprint cycle. If no deployment failures occur, promote the same bundle artifact to production accounts. Do not rebuild the bundle for production. Promoting the identical artifact is the mechanism that guarantees staging and production evaluate against the same logic.
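One simple way to verify "identical artifact" is to compare content digests before promotion. The sketch below uses SHA-256 from the Python standard library; the function names are hypothetical, and comparing digests is an assumption of this example rather than a mechanism the article prescribes.

```python
import hashlib


def bundle_digest(bundle_bytes: bytes) -> str:
    """SHA-256 hex digest of the bundle archive's raw bytes."""
    return hashlib.sha256(bundle_bytes).hexdigest()


def safe_to_promote(staging_bundle: bytes, candidate_bundle: bytes) -> bool:
    """True only if the candidate is byte-for-byte what staging evaluated."""
    return bundle_digest(staging_bundle) == bundle_digest(candidate_bundle)


tested = b"bundle contents evaluated in staging"
rebuilt = b"bundle contents evaluated in staging\n"  # a rebuild that differs by one byte
same_artifact = safe_to_promote(tested, tested)
rebuilt_artifact = safe_to_promote(tested, rebuilt)
```

Even a rebuild from the same commit can differ in bytes, which is why promotion compares the artifact rather than the source revision.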

Drift measurement. Zero drift is a measurable state, not a feeling. The mechanism for measuring it is a weekly report that counts OPA evaluation decisions across all accounts and flags any DENY result that did not originate from a known exception registry entry. An unexpected DENY means a resource changed outside the approved change process. That is drift.

The count should be zero after the first full sprint of enforce-mode operation.
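The weekly report reduces to a count over the decision log. The Python sketch below takes one reading of the description above: a DENY on a resource that has no entry in the exception registry is treated as unexpected. The log record fields (`account_id`, `resource_arn`, `result`) and the sample data are assumptions for illustration.

```python
from collections import Counter


def unexpected_denies(decisions, registry_arns):
    """Count DENY decisions per account for resources absent from the registry."""
    counts = Counter()
    for decision in decisions:
        if decision["result"] == "DENY" and decision["resource_arn"] not in registry_arns:
            counts[decision["account_id"]] += 1
    return dict(counts)


registry_arns = {"arn:aws:s3:::sandbox-static-site"}
week = [
    {"account_id": "111111111111", "resource_arn": "arn:aws:s3:::app-logs", "result": "ALLOW"},
    {"account_id": "222222222222", "resource_arn": "arn:aws:s3:::sandbox-static-site", "result": "DENY"},
    {"account_id": "333333333333", "resource_arn": "arn:aws:s3:::tmp-debug", "result": "DENY"},
]
report = unexpected_denies(week, registry_arns)
is_zero_drift = not report
```

An empty report is the steady-state target; any non-empty result names the accounts to investigate first.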

| Rollout Phase | Gate to Advance |
| --- | --- |
| Repository setup | Branch protection active, exception schema merged |
| Staging audit mode | 7 days, zero new violations after day 3 |
| Staging enforce mode | 1 sprint, zero deployment failures |
| Production rollout | Identical bundle artifact promoted, not rebuilt |
| Steady state | Weekly drift report shows zero unexpected DENYs |

Measuring and protecting steady state

This approach works when account teams treat the bundle repository as the authoritative change path. It breaks when a team with elevated AWS permissions modifies a resource directly in the console to unblock a deployment, because the console change bypasses OPA evaluation entirely. The prevention is AWS Config rules that detect out-of-band changes and write them into the same drift report. Wire that detection in before you declare steady state, not after your first incident.

Frequently Asked Questions

Q: How does the multi-account governance problem apply in practice?

See the section above titled "The Multi-Account Governance Problem Nobody Talks About" for the full breakdown with examples.

Q: How does OPA become a single source of truth across accounts in practice?

See the section above titled "How OPA Becomes a Single Source of Truth Across Accounts" for the full breakdown with examples.

Q: How do the six guardrails cover the critical surface area in practice?

See the section above titled "The Six Guardrails That Cover the Critical Surface Area" for the full breakdown with examples.

Q: How do you handle exceptions without reintroducing drift in practice?

See the section above titled "Handling Exceptions Without Reintroducing Drift" for the full breakdown with examples.


Drop a comment if you've tackled similar drift in your own AWS organization. What was the dominant cause for your team? Share what worked or what blew up.
