Terraform 301 — Reverse-engineering an inherited IaC repo
The senior engineer left. You have a Git repo, a few files, AWS developer access in your name, and Git/Terraform knowledge. No documentation. No tribal knowledge. This is your forensic playbook — what to read, what to query, what to map — before you make any change.
01 The forensic mindset
"Read before you run, run before you write."
Inheriting an unknown IaC repo is like walking onto a crime scene. Touch nothing first. Photograph everything. Then move outward in expanding circles. Five phases, in this order: repo archaeology, AWS read-only recon, state inspection, pipeline trace, and writing your own runbook.
The output of this whole exercise is your own runbook — a single markdown file that becomes the next senior engineer's onboarding doc. You're filling in the gaps that should have been documented but weren't.
02 Phase 1 · Repo archaeology
"Before you call AWS even once, harvest everything the repo already tells you."
The repo, if it's a normal Terraform setup, contains the answers to ~80% of your questions: which AWS accounts, which buckets hold state, which IAM roles get assumed, which CIDRs each env uses. You just have to read systematically.
2.1 · The first eight commands you run after cloning
```bash
# 1. Get the lay of the land
git clone <repo-url> && cd <repo>
ls -la
find . -maxdepth 3 -type d | grep -v '^\./\.git' | sort

# 2. What does the repo say about itself?
cat README.md            # whatever is there, however bad
cat CONTRIBUTING.md      # sometimes has the real instructions
cat CODEOWNERS           # ownership map - tells you who knew what
cat .terraform-version   # or .tool-versions - pinned binary version

# 3. Find ALL Terraform root folders (places where backend.tf lives)
find . -name 'backend.tf' -not -path '*/.terraform/*'

# 4. Find ALL modules
find . -path '*/modules/*' -name 'main.tf' | head

# 5. Read every backend.tf - this is your gold
find . -name 'backend.tf' -not -path '*/.terraform/*' \
  -exec echo '=== {} ===' \; -exec cat {} \;

# 6. Read every providers.tf - reveals account ids and roles
find . -name 'providers.tf' -not -path '*/.terraform/*' \
  -exec echo '=== {} ===' \; -exec cat {} \;

# 7. Read every *.tfvars - reveals env values, account ids, CIDRs
find . -name '*.tfvars' -not -path '*/.terraform/*' \
  -exec echo '=== {} ===' \; -exec cat {} \;

# 8. Look at recent activity - who's been touching this
git log --since='6 months ago' --pretty=format:'%h %an %ad %s' --date=short | head -40
git log --since='12 months ago' --pretty=format:'%an %ae' | sort | uniq -c | sort -rn | head -10
```
2.2 · What you're looking for in each file
| File | What to extract | Where to write it down |
|---|---|---|
| `backend.tf` (one per env) | S3 bucket name, key path, region, DynamoDB table, KMS key id | Worksheet section "State backends" |
| `providers.tf` | Region, `role_arn` (account id is in the ARN), default tags | Worksheet section "Accounts & roles" |
| `*.tfvars` | Per-env CIDRs, instance sizes, account ids, owner tags | Worksheet section "Environments" |
| `variables.tf` | Inputs the env accepts (especially validation lists — they list every known env name) | Worksheet section "Environments" |
| `.terraform.lock.hcl` | Exact provider versions in use | Worksheet section "Versions" |
| `.terraform-version` / `.tool-versions` | Terraform binary version — install this exact version | Worksheet section "Versions" |
| `.github/workflows/*.yml` or equivalent CI | Triggers (push to main, paths), OIDC role assumed, what commands run | Worksheet section "Pipeline" |
| `CODEOWNERS` | Which teams/people own which paths — this is your contact list | Worksheet section "People" |
| `modules/*/main.tf` | What each module builds; what providers it pins | Worksheet section "Modules" |
| Top-level `README.md` | Everything — even partial info | Whole worksheet |
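If you want to script the backend extraction rather than eyeball it, the settings are easy to rip out with a regex. A rough sketch (it assumes the conventional flat `backend "s3"` block; the function name is mine, and a real HCL parser such as `python-hcl2` is the robust route):

```python
import re

def parse_backend(hcl_text):
    """Pull the s3 backend settings out of a backend.tf file's text.

    Assumes the usual flat `backend "s3" { key = "value" ... }` layout.
    """
    settings = {}
    for key, value in re.findall(r'(\w+)\s*=\s*"([^"]*)"', hcl_text):
        if key in ("bucket", "key", "region", "dynamodb_table", "kms_key_id"):
            settings[key] = value
    return settings

sample = '''
terraform {
  backend "s3" {
    bucket         = "lf-tfstate-prod-111"
    key            = "prod/network.tfstate"
    region         = "us-east-1"
    dynamodb_table = "lf-tfstate-locks"
  }
}
'''
print(parse_backend(sample))
```

Run it over every `backend.tf` your `find` turned up and paste the results straight into the worksheet's "State backends" section.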
2.3 · Git forensics — the human history
```bash
# Who's been the de-facto owner recently?
git shortlog -sne --since='6 months ago' | head

# What was the last big change to prod?
git log --all --pretty=format:'%h %ad %an %s' --date=short -- envs/prod/ | head -20

# Who originally created the most-critical files?
git log --diff-filter=A --pretty=format:'%h %an %ad' --date=short -- envs/prod/backend.tf

# What modules have been touched recently? (signals active vs dormant)
git log --since='3 months ago' --name-only --pretty=format: -- modules/ | sort | uniq -c | sort -rn

# Dump every PR-style commit message (if squashing was used)
git log --pretty=format:'%s' --grep='INFRA-' | head -50

# Tags = release markers. Some teams use them to mark applies.
git tag --sort=-creatordate | head -10
```
2.4 · Worksheet — fill in as you read
Save this as RUNBOOK.md in your own scratch folder (don't commit yet). Fill the right column from what you find.
| A. Versions | |
|---|---|
| Terraform binary version | |
| AWS provider version | |
| Other providers (random, tls…) | |

| B. Environments (one row per env folder under `envs/`) | |
|---|---|
| Env name | |
| AWS account id (from `role_arn`) | |
| Region | |
| VPC CIDR (from tfvars) | |
| State bucket + key | |
| Lock table | |
| Deploy role ARN | |

| C. Modules | |
|---|---|
| Names of modules under `modules/` | |
| External / remote modules used | |
| Modules with `lifecycle.prevent_destroy` | |

| D. Pipeline | |
|---|---|
| CI system (GitHub Actions / GitLab / Jenkins) | |
| OIDC IAM role used by CI | |
| What triggers a plan vs an apply | |
| Who has merge rights on prod paths | |

| E. People | |
|---|---|
| Last 3 most-active committers (months & counts) | |
| CODEOWNERS teams | |
| Who reviewed the most prod PRs | |
03 Phase 2 · AWS read-only recon
"Confirm what the repo claims, find what the repo doesn't say."
You have AWS developer access in your name. Configure SSO, log in, and run only describe-*, list-*, get-* calls. No mutations. No applies. No console clicks that change things.
3.1 · Confirm who you are
```bash
# 1. Sanity-check your identity in each account you have access to
aws sso login --sso-session <your-sso>
aws sts get-caller-identity           # your developer-role identity

# 2. List every account you can see (org-level only)
aws organizations list-accounts       # may be denied if you're not in the mgmt account
aws ec2 describe-regions --output table   # confirm region access
```
3.2 · Find the state buckets & lock tables
Repo backend.tf told you the bucket name. Now confirm it exists and inspect its protection.
```bash
# Confirm the bucket from backend.tf actually exists and you can read it
aws s3api head-bucket --bucket lf-tfstate-prod-111
aws s3api get-bucket-versioning --bucket lf-tfstate-prod-111
aws s3api get-bucket-encryption --bucket lf-tfstate-prod-111
aws s3api get-public-access-block --bucket lf-tfstate-prod-111
aws s3api get-bucket-policy --bucket lf-tfstate-prod-111 --output text \
  | python3 -m json.tool

# List state files (one per env)
aws s3 ls s3://lf-tfstate-prod-111/ --recursive

# Confirm the DDB lock table
aws dynamodb describe-table --table-name lf-tfstate-locks

# Search for OTHER possible state buckets the repo didn't mention
aws s3api list-buckets \
  --query 'Buckets[?contains(Name,`tfstate`)||contains(Name,`terraform`)].Name' --output table
```
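If you'd rather script the protection audit across several buckets, the three responses can be graded with a small pure function. A sketch that assumes the response shapes the CLI and boto3 return for those calls (the function name and issue wording are mine):

```python
def state_bucket_issues(versioning, encryption, public_access_block):
    """Given the three API responses for a state bucket, list what's missing.

    Arguments are the dict shapes of get-bucket-versioning,
    get-bucket-encryption, and get-public-access-block respectively.
    """
    issues = []
    if versioning.get("Status") != "Enabled":
        issues.append("versioning disabled - you can't roll back a bad state write")
    rules = encryption.get("ServerSideEncryptionConfiguration", {}).get("Rules", [])
    if not rules:
        issues.append("no default encryption on state bucket")
    pab = public_access_block.get("PublicAccessBlockConfiguration", {})
    if not all(pab.get(k) for k in ("BlockPublicAcls", "IgnorePublicAcls",
                                    "BlockPublicPolicy", "RestrictPublicBuckets")):
        issues.append("public access not fully blocked")
    return issues
```

An empty list means the bucket passes all three checks; anything else goes straight into the runbook's "open questions".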
3.3 · Map out the deploy IAM roles
```bash
# From providers.tf you have role ARNs like arn:aws:iam::111111111111:role/TerraformDeploy
aws iam get-role --role-name TerraformDeploy
aws iam get-role --role-name TerraformDeploy --query 'Role.AssumeRolePolicyDocument' \
  | python3 -m json.tool    # who CAN assume it
aws iam list-attached-role-policies --role-name TerraformDeploy
aws iam list-role-policies --role-name TerraformDeploy

# Get every inline policy's contents
for p in $(aws iam list-role-policies --role-name TerraformDeploy \
             --query 'PolicyNames[]' --output text); do
  echo "=== $p ==="
  aws iam get-role-policy --role-name TerraformDeploy --policy-name $p \
    --query 'PolicyDocument' | python3 -m json.tool
done

# Find OTHER roles that may be related (CI's OIDC role, break-glass, etc.)
aws iam list-roles \
  --query 'Roles[?contains(RoleName,`Terraform`)||contains(RoleName,`CI`)||contains(RoleName,`Deploy`)].[RoleName,Arn]' \
  --output table
```
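Once you've pulled a trust policy, summarising "who can assume this role" is mechanical. A sketch over the standard `AssumeRolePolicyDocument` shape (the helper name is mine):

```python
def who_can_assume(trust_policy):
    """List every principal an Allow statement in the trust policy names."""
    principals = []
    for stmt in trust_policy.get("Statement", []):
        if stmt.get("Effect") != "Allow":
            continue
        p = stmt.get("Principal", {})
        for kind in ("AWS", "Service", "Federated"):
            v = p.get(kind)
            if isinstance(v, str):   # single principal may be a bare string
                v = [v]
            principals.extend(v or [])
    return principals
```

Feed it the JSON from `aws iam get-role --query 'Role.AssumeRolePolicyDocument'` and copy the result into the worksheet's "Accounts & roles" section.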
3.4 · Map VPCs & key resources to envs
```bash
# All VPCs + their tags (the Environment tag should match what tfvars said)
aws ec2 describe-vpcs --query 'Vpcs[].{Id:VpcId,Cidr:CidrBlock,Tags:Tags}' --output json \
  | python3 -m json.tool

# EC2 instances grouped by env tag
aws ec2 describe-instances \
  --query 'Reservations[].Instances[?State.Name==`running`].[InstanceId,InstanceType,Tags[?Key==`Environment`]|[0].Value]' \
  --output table

# RDS clusters / instances
aws rds describe-db-clusters --query 'DBClusters[].[DBClusterIdentifier,Engine,EngineVersion,MultiAZ]' --output table
aws rds describe-db-instances --query 'DBInstances[].[DBInstanceIdentifier,DBInstanceClass,Engine]' --output table

# Load balancers
aws elbv2 describe-load-balancers --query 'LoadBalancers[].[LoadBalancerName,Scheme,Type,VpcId]' --output table

# Route53 zones (helps you understand DNS)
aws route53 list-hosted-zones --query 'HostedZones[].[Name,Id,Config.PrivateZone]' --output table
```
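While you're enumerating, one more sweep worth scripting: security groups open to the world on sensitive ports. A sketch over the `SecurityGroups` list shape that `aws ec2 describe-security-groups` returns (the port set is illustrative; tune it to your stack):

```python
SENSITIVE_PORTS = {22, 3389, 3306, 5432}   # ssh, rdp, mysql, postgres - adjust as needed

def open_to_world(security_groups):
    """Flag ingress rules open to 0.0.0.0/0 that touch a sensitive port."""
    findings = []
    for sg in security_groups:
        for perm in sg.get("IpPermissions", []):
            if not any(r.get("CidrIp") == "0.0.0.0/0" for r in perm.get("IpRanges", [])):
                continue
            lo, hi = perm.get("FromPort"), perm.get("ToPort")
            if lo is None:   # IpProtocol "-1": all traffic, no port range
                findings.append((sg["GroupId"], "all"))
            elif any(lo <= p <= hi for p in SENSITIVE_PORTS):
                findings.append((sg["GroupId"], f"{lo}-{hi}"))
    return findings
```

Anything this flags belongs in the runbook even if you don't fix it yet.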
3.5 · Cross-check repo vs reality
For each env in your worksheet, confirm:
| Repo claims | Confirm with | What "wrong" looks like |
|---|---|---|
| VPC CIDR `10.20.0.0/16` | `aws ec2 describe-vpcs --filters Name=tag:Environment,Values=uat` | No VPC with that CIDR → either env never applied or it's in a different account |
| State key `uat/network.tfstate` | `aws s3 ls s3://lf-tfstate-nonprod-222/uat/` | No file → the env has never been applied (or backend.tf is stale) |
| Role `TerraformDeploy` in 222… | `aws iam get-role --role-name TerraformDeploy` after assuming nothing | `NoSuchEntity` → role hasn't been bootstrapped, or it's named differently in this account |
| Aurora cluster running | `aws rds describe-db-clusters` | Cluster missing or in a different state → manual changes happened |
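The cross-check itself can be mechanised once both columns of your worksheet are filled. A minimal sketch (the nested-dict shapes are my own convention, not anything the repo defines):

```python
def cross_check(repo_claims, observed):
    """Compare what the repo claims per env against what AWS recon found.

    Both arguments: {env: {field: value}}. Missing or differing values
    come back as findings to chase down (env never applied, stale
    backend.tf, wrong account...).
    """
    findings = []
    for env, claims in repo_claims.items():
        seen = observed.get(env, {})
        for field, expected in claims.items():
            actual = seen.get(field)
            if actual is None:
                findings.append(f"{env}: {field} not found in AWS (repo claims {expected})")
            elif actual != expected:
                findings.append(f"{env}: {field} is {actual}, repo says {expected}")
    return findings
```

An empty result means repo and reality agree for everything you bothered to record; every finding is an "open questions" entry.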
3.6 · Find untracked-but-running infrastructure
```bash
# List of all VPCs vs list in tfvars - any orphans?
aws ec2 describe-vpcs \
  --query 'Vpcs[?Tags==null || !contains(Tags[].Key, `ManagedBy`)].[VpcId,CidrBlock]' --output table
# ↑ VPCs with no ManagedBy tag are likely click-ops or legacy

# EC2 instances NOT tagged ManagedBy=terraform
aws ec2 describe-instances \
  --query 'Reservations[].Instances[?!(Tags && Tags[?Key==`ManagedBy` && Value==`terraform`])].[InstanceId,LaunchTime,Tags]' \
  --output json | python3 -m json.tool | head -60

# Recent admin-level changes (last 24h) - CloudTrail
aws cloudtrail lookup-events --max-items 50 \
  --lookup-attributes AttributeKey=EventName,AttributeValue=ConsoleLogin
aws cloudtrail lookup-events --max-items 50 \
  --lookup-attributes AttributeKey=EventName,AttributeValue=AssumeRole \
  --query 'Events[].[EventTime,Username,Resources[0].ResourceName]' --output table
```
04 Phase 3 · State is the truth (read it carefully)
"State knows what Terraform thinks it owns. AWS knows what's actually running. The diff between them is your project for the next month."
State is the JSON file Terraform writes after every apply. Reading it (carefully, read-only) is the single highest-value thing you can do this week. It tells you the exact resource graph, which modules were used, what addresses things have, and what attributes Terraform considers authoritative.
Three rules before you touch anything:

- Always work on a local copy. Never edit the live file in S3.
- Use `terraform state` commands; never hand-edit with `jq` and then `aws s3 cp ... s3://...` the file back.
- State files contain secrets in plaintext (RDS passwords, etc.). Don't paste them in Slack. Delete the local copy when done.
4.1 · Safe read of every state file
```bash
# 1. Make a sandbox folder OUTSIDE the repo
mkdir -p ~/tf-recon && cd ~/tf-recon

# 2. Pull every state file (one per env) to inspect
for env in test uat prod-support prod ; do
  bucket=$(awk -F'"' '/bucket/{print $2}' ~/repo/envs/$env/backend.tf)
  key=$(awk -F'"' '/key/{print $2}' ~/repo/envs/$env/backend.tf)
  echo "=== $env: s3://$bucket/$key ==="
  aws s3 cp s3://$bucket/$key ./$env.tfstate.json
done

# 3. Quick stats per env: how many resources, what kinds
for f in *.tfstate.json ; do
  echo "=== $f ==="
  python3 -c "
import json, sys, collections
d = json.load(open('$f'))
print('terraform_version:', d.get('terraform_version'))
print('serial:', d.get('serial'))
res = d.get('resources', [])
print('resources:', len(res))
c = collections.Counter([r['type'] for r in res])
for t, n in c.most_common(15):
    print(f'  {n:4d} {t}')"
done
```
4.2 · Use Terraform itself, not ad-hoc grep
Once you've run terraform init in each env folder, these commands work against the live state without modifying it:
```bash
cd ~/repo/envs/uat
terraform init                    # once per env folder; no apply
terraform state list              # every resource address
terraform state list | wc -l      # count
terraform state list | awk -F. '{print $1"."$2}' | sort -u   # top-level groups
terraform state show 'module.network.aws_vpc.this'           # full attributes of one
terraform providers               # providers used + versions
terraform graph | head -30        # DOT format dependency graph
terraform output                  # what the env exports

# The big one: what would Terraform CHANGE if you ran apply right now?
terraform plan -var-file=uat.tfvars -lock=false -refresh-only -no-color | tee plan.txt
# refresh-only = no resource changes, just a sync from AWS reality - safe
```
4.3 · The state-side questions to answer
| Question | How to answer | Why it matters |
|---|---|---|
| Which modules are referenced? | `terraform state list \| awk -F. '/^module/{print $2}' \| sort -u` | Confirms which folder under `modules/` is "live" |
| Are there resources NOT in any module? | `terraform state list \| grep -v '^module\.'` | "Loose" resources at the env root — common for KMS, IAM |
| Which RDS engines / sizes? | `terraform state show \| grep -E 'engine\|class\|allocated'` | Tells you cost & scaling story |
| Any `prevent_destroy` tripwires? | Search code: `grep -rn prevent_destroy modules/ envs/` | These resources are intentionally hard to delete |
| Any `ignore_changes` markers? | `grep -rn ignore_changes modules/ envs/` | Tells you what Terraform deliberately won't reconcile |
| Which Terraform version wrote each state, and when? | `jq '.terraform_version,.serial' <env>.tfstate.json` | Old serial → env hasn't been applied recently |
| What's the last-modified time of state? | `aws s3api head-object --bucket B --key K` | Tells you how stale the env is |
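The first two rows of the table can be answered in one pass over `terraform state list` output. A sketch (helper name is mine):

```python
import collections

def group_state_addresses(addresses):
    """Summarise `terraform state list` output.

    Returns (resources-per-module counter, list of 'loose' resources
    that live at the env root rather than inside any module).
    """
    per_module = collections.Counter()
    loose = []
    for addr in addresses:
        if addr.startswith("module."):
            per_module[addr.split(".")[1]] += 1
        else:
            loose.append(addr)
    return per_module, loose
```

Pipe `terraform state list` into a file and feed its lines in; the counter tells you which module folders are live, and `loose` is your KMS/IAM-at-the-root list.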
4.4 · The signature commands to know
```bash
terraform state list                     # inventory
terraform state show '<addr>'            # attrs of one
terraform state pull > local.tfstate     # dump current state
terraform plan -refresh-only             # sync state to AWS, show drift
terraform providers                      # providers + versions
terraform output -json | python3 -m json.tool   # exports
terraform graph | dot -Tpng > graph.png  # visual (needs graphviz)
```
Off-limits during recon: `terraform apply`, `terraform destroy`, `terraform state rm`, `terraform state mv`, `terraform import`, `terraform taint`. All of these mutate state. Reading is free; writing is irreversible.

05 Phase 4 · Pipeline trace
"Reproduce mentally what happens between <merge to main> and <EC2 modified>. If you can't, you don't yet own the pipeline."
Pipelines are the part that's easiest to get wrong because they're spread across three places: the workflow YAML in the repo, the OIDC IAM role in AWS, and the CI/CD platform's own settings.
5.1 · Read the workflow files line by line
```bash
# GitHub Actions, GitLab, Jenkins, etc - find them all
ls -la .github/workflows/ 2>/dev/null
ls -la .gitlab-ci.yml 2>/dev/null
ls -la Jenkinsfile 2>/dev/null
ls -la .circleci/ 2>/dev/null

# For each workflow, answer these in order:
grep -E 'on:|push:|paths:|branches:|workflow_dispatch:' .github/workflows/terraform-plan.yml   # trigger
grep -E 'permissions:|id-token:|contents:' .github/workflows/terraform-plan.yml                # OIDC
grep -E 'role-to-assume|aws-region' .github/workflows/terraform-plan.yml                       # IAM role
grep -E 'terraform fmt|validate|plan|apply' .github/workflows/terraform-plan.yml               # commands
```
5.2 · What to extract from the workflow
| Question | Where to find the answer |
|---|---|
| What triggers a plan? | on: block. Usually pull_request + push to feature branches. |
| What triggers an apply? | on: usually push: to main with paths: filter to a specific env folder. |
| How does CI authenticate to AWS? | The aws-actions/configure-aws-credentials step + role-to-assume ARN. That ARN is your CI role. |
| What env gets applied first? | Job ordering: typically test → uat → prod-support → prod with needs: dependencies. |
| Are there manual approval gates? | GitHub Environments / required reviewers settings; environment: keyword in jobs. |
| What secrets does CI consume? | ${{ secrets.* }} references → cross-check Settings → Secrets in the platform UI. |
| What runs as part of plan? | Look for `tflint`, `tfsec`/`checkov`, `infracost`, `terraform-docs`. |
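The grep-level extraction above can be wrapped in a script. A regex-based sketch (deliberately not a YAML parser — good enough for recon; the field names match what GitHub Actions workflows use, and the function name is mine):

```python
import re

def summarise_workflow(yaml_text):
    """Rough, regex-level summary of a GitHub Actions workflow's text."""
    return {
        "role_to_assume": re.findall(r"role-to-assume:\s*(\S+)", yaml_text),
        "branches": re.findall(r"branches:\s*\[([^\]]*)\]", yaml_text),
        "terraform_cmds": re.findall(r"terraform\s+(fmt|validate|plan|apply)", yaml_text),
    }
```

Run it over every file in `.github/workflows/` and paste the role ARNs and triggers into the pipeline worksheet; for anything ambiguous, fall back to reading the YAML by hand.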
5.3 · Find the OIDC trust on the AWS side
```bash
# Find the OIDC provider GitHub uses (or whichever CI)
aws iam list-open-id-connect-providers
# expected: arn:aws:iam::<acct>:oidc-provider/token.actions.githubusercontent.com

# Find the role that CI assumes (you saw the ARN in the workflow YAML)
aws iam get-role --role-name lf-github-ci
aws iam get-role --role-name lf-github-ci \
  --query 'Role.AssumeRolePolicyDocument' | python3 -m json.tool
# Look at "Condition" - it should pin sub: repo:<org/repo>:ref:refs/heads/main or PR

# What can the CI role do? (this is what your CI is allowed to do, no more)
aws iam list-attached-role-policies --role-name lf-github-ci
aws iam list-role-policies --role-name lf-github-ci
```
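Once you've pulled the trust policy, the "is the `sub` pinned?" check is mechanical. A sketch over the standard GitHub OIDC condition shape (the strictness rules here are my own reading of "pinned"; adapt if your org uses environment-scoped subjects):

```python
def oidc_sub_is_pinned(trust_policy, org_repo):
    """True if every Allow statement's sub condition pins the expected repo
    and isn't a trailing wildcard. `org_repo` like 'lf/platform-infra'."""
    for stmt in trust_policy.get("Statement", []):
        if stmt.get("Effect") != "Allow":
            continue
        cond = stmt.get("Condition", {})
        subs = []
        for op in ("StringEquals", "StringLike"):
            v = cond.get(op, {}).get("token.actions.githubusercontent.com:sub")
            if isinstance(v, str):
                subs.append(v)
            elif isinstance(v, list):
                subs.extend(v)
        if not subs:
            return False   # no sub condition at all: any identity with a valid token could assume
        for s in subs:
            if not s.startswith(f"repo:{org_repo}:") or s.endswith(":*"):
                return False
    return True
```

A `False` here is exactly the pipeline smell described later: the role trusts more than one repo/branch, so tighten it before trusting CI applies.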
5.4 · CloudTrail tells you what actually happens
You don't have to guess what the pipeline does — CloudTrail records every API call.
```bash
# Recent applies (anything that called PutObject on the state bucket)
aws cloudtrail lookup-events --max-items 30 \
  --lookup-attributes AttributeKey=ResourceName,AttributeValue=lf-tfstate-prod-111 \
  --query 'Events[].[EventTime,EventName,Username]' --output table

# Who has been assuming the deploy role recently?
aws cloudtrail lookup-events --max-items 30 \
  --lookup-attributes AttributeKey=EventName,AttributeValue=AssumeRole \
  --query 'Events[?contains(Resources[].ResourceName, `TerraformDeploy`)].[EventTime,Username]' \
  --output table

# Recent ec2 / rds modifications - cross-check with PR history
aws cloudtrail lookup-events --max-items 50 \
  --lookup-attributes AttributeKey=EventName,AttributeValue=ModifyLaunchTemplate
```
5.5 · Draw the pipeline you found (mental check)
By the end of phase 4 you should be able to fill this in for your repo:
| Pipeline trace | |
|---|---|
| CI system | |
| Workflow file path(s) | |
| Plan triggers (when does CI plan run?) | |
| Apply triggers (what causes apply?) | |
| Apply order across envs | |
| OIDC provider ARN | |
| CI's IAM role ARN | |
| CI's role policy summary | |
| Manual approval gates | |
| Secrets consumed by CI | |
| Last 5 successful prod applies (date, author) | |
06 Phase 5 · Build YOUR runbook
"If you got hit by a bus tomorrow, the next person should be able to run this stack from the document you wrote."
The deliverable from phases 1-4 is one markdown file: RUNBOOK.md. Open a PR adding it to the repo. This is also your interview with the codebase — the act of writing the runbook is what makes you the new owner.
6.1 · Runbook template (paste, fill, commit)
```markdown
# RUNBOOK · lf-platform infrastructure

## 1. Accounts
| Account name | Account id   | Purpose         | Who has admin  |
|--------------|--------------|-----------------|----------------|
| Prod         | 111111111111 | prod, prod-supp | infra-seniors  |
| Non-prod     | 222222222222 | uat, test       | infra-platform |

## 2. Environments
| Env          | Account | Region    | VPC CIDR     | State bucket / key                   |
|--------------|---------|-----------|--------------|--------------------------------------|
| prod         | 111…    | us-east-1 | 10.10.0.0/16 | lf-tfstate-prod-111 / prod/…         |
| prod-support | 111…    | us-east-1 | 10.11.0.0/16 | lf-tfstate-prod-111 / prod-support/… |
| uat          | 222…    | us-east-1 | 10.20.0.0/16 | lf-tfstate-nonprod-222 / uat/…       |
| test         | 222…    | us-east-1 | 10.30.0.0/16 | lf-tfstate-nonprod-222 / test/…      |

## 3. Deploy roles
- arn:aws:iam::111…:role/TerraformDeploy — assumed by engineers and CD for prod & prod-support
- arn:aws:iam::222…:role/TerraformDeploy — assumed for uat & test
- arn:aws:iam::<ci-acct>:role/lf-github-ci — OIDC role assumed by GitHub Actions

## 4. Pipeline
- CI system: GitHub Actions
- Plan job: on every PR, posts plan as comment
- Apply job: on push to main, env detected via changed paths
- Apply order: test → uat → prod-support → prod (with manual gate before prod)
- Required approvals: 1 for non-prod, 2 senior + security for prod

## 5. State protection
- Buckets versioned + KMS encrypted + public access blocked
- DDB lock table: lf-tfstate-locks (shared)
- prevent_destroy on: aws_rds_cluster.prod, aws_kms_key.app, state buckets

## 6. Local engineer setup
1. aws sso login --sso-session lf
2. AWS_PROFILE=tf-nonprod (or tf-prod)
3. tfenv install (uses .terraform-version)
4. cd envs/<env> && terraform init
5. terraform plan -var-file=<env>.tfvars

## 7. Make a change (BAU SOP summary)
1. Sync main; baseline plan should say "No changes."
2. Branch infra-####-slug
3. Edit the smallest possible change (usually only tfvars)
4. fmt + validate + plan locally
5. Commit, push, open PR
6. Wait for CI plan comment, get approvals
7. Squash merge; CD applies
8. Post-apply: terraform plan should say "No changes."

## 8. Known anomalies / drift / non-Terraform resources
- <list things you found in phase 2 that aren't in code>
- <list things in code that prevent_destroy or ignore_changes mark as special>

## 9. Recovery
- Roll back: git revert <sha> on main, CD reapplies. Don't fix in console.
- State lock stuck: terraform force-unlock <ID> (only after confirming no live run).

## 10. Owners & contacts
- Platform: @lf/infra-platform
- Security review: @lf/security
- On-call rotation: PagerDuty schedule "infra-platform"

## 11. Open questions
- <things you still don't know - that's OK, write them down>
```
6.2 · What goes in section 11 ("Open questions")
This is the most important section. It's where you list everything you couldn't reverse-engineer with confidence. Examples:
- "There's an EC2 instance `i-0deadbeef` in prod tagged `ManagedBy=manual` — nobody knows why it exists. Pinged finance and ops; awaiting answer."
- "Module `modules/legacy-vpn` hasn't been touched in 18 months and isn't referenced anywhere. Confirm with networking team it's safe to delete."
- "The CI role has `iam:*` permissions in prod. That's broader than it needs. Worth tightening once we have parity tests."
- "`prod-support` env's last apply was 11 months ago. Either it's stable, or its state has drifted significantly. Need to `plan -refresh-only` and review."
07 Red flags & smells — what to worry about
"You're not just inheriting a working system. You're inheriting all the things the previous owner was meaning to fix."
Walk through this list against your repo and AWS account. Anything that fires is a candidate for the runbook's "open questions" section — not necessarily something to fix immediately, but something to know.
7.1 · Repo smells
- No `.terraform.lock.hcl` committed. Provider versions can shift between runs — CI plan and local plan can diverge silently. Add it on your first PR.
- `terraform.tfstate` in the repo, or no `backend.tf`. State has been on someone's laptop. Stop. Migrate to S3 + DDB before any further apply.
- Committed credentials. `grep -rn 'AKIA' .` — AWS access keys committed are an immediate rotation event. Same for any `aws_secret_access_key` string.
- Secrets in `.tfvars`. RDS passwords, API tokens. Move to Secrets Manager / SSM. Rotate the leaked ones.
- Modules sourced from personal repos (`git@github.com:someone-personal/…`). Move them in-repo or into your org's namespace.
- `.tfvars` auto-loaded everywhere. Files named `terraform.tfvars` or `*.auto.tfvars` auto-load and silently override. Prefer explicit `-var-file=<env>.tfvars`.
- Module sources pinned to `main`. `source = "git::…//modules/x?ref=main"` means apply behavior shifts as upstream changes. Pin to a tag/sha.

7.2 · AWS smells
- State bucket not versioned or not encrypted. Check with `aws s3api get-bucket-versioning` + `get-bucket-encryption`.
- State bucket not fully private. `aws s3api get-public-access-block` — all four blocks should be true. State files contain secrets.
- Resources without a `ManagedBy` tag. Either drift or legacy. Decide: import or delete.
- Security groups open to `0.0.0.0/0` on non-public ports. Especially port 22, 3389, 3306. Run `aws ec2 describe-security-groups` and grep.
- Deploy role with `*:*` on `Resource: *`. Especially if assumed by CI. Tighten before you trust it for production applies.

7.3 · Pipeline smells
- OIDC trust policy that doesn't pin both the repo (`repo:org/repo`) AND the branch/env (`ref:refs/heads/main`). If it allows any branch, any fork can apply.

7.4 · State smells
- State `serial` hasn't moved in 6+ months. Means the env has no recent applies — either rock-stable or dangerously drifted. Run `terraform plan -refresh-only` in a sandbox to find out.
- Resources that `terraform plan` wants to destroy and nobody can explain why. Likely a previous engineer started a refactor and didn't finish.
- Resources running in AWS but absent from state. Candidates for `terraform import`.

08 The first-safe-change ritual
"You don't really own the system until you've successfully applied a change to it. Make that first change as small and reversible as humanly possible."
This is the rite of passage. It proves the whole pipeline still works for you — SSO, role assumption, CI, CD, state, reality. You'll discover what's broken about your access in the smallest, lowest-stakes way possible.
8.1 · Pick the right "smallest possible" change
| Change | Risk | Reversibility | Recommended for first? |
|---|---|---|---|
| Add a tag (e.g. `RunbookOwner = "your-name"`) to one env's `common_tags` | Almost zero | Trivial revert | Yes — do this |
| Add a comment to a tfvars file | Zero (no plan diff) | Trivial | Doesn't actually exercise apply |
| Bump `desired_capacity` by 1 in test ASG | Low | Revert PR | Good second change |
| Add an output to a module | Low | Revert | Good second change |
| Anything in prod | High | Full BAU SOP | No. Not yet. |
| Provider version bump | Medium-high | Sometimes painful | Definitely not first |
8.2 · The 11-step ritual
1. Pick the `test` environment. (Smallest blast radius.)
2. `aws sso login`; `export AWS_PROFILE=tf-nonprod`; `aws sts get-caller-identity` — confirm role.
3. `cd envs/test`; `terraform init`. Read every line of init output. Note any provider version warnings.
4. `terraform plan -var-file=test.tfvars`. Expect "No changes." If you see changes you didn't author, STOP and put it in the runbook's open questions. Don't proceed.
5. Make the smallest change: add `RunbookOwner = "<your-name>"` to `common_tags` in `test.tfvars`.
6. `terraform fmt`; `terraform validate`.
7. `terraform plan -var-file=test.tfvars`. The plan should show only tag-update diffs (typically `~ tags` on every taggable resource). Read all of them.
8. Branch, commit, push, open PR. Title: "INFRA-XXXX: claim test env ownership via tag".
9. Watch CI. The CI plan in the PR comment should match your local plan exactly. If it doesn't — STOP and figure out why.
10. Get a reviewer (if no one's on the team, ping CODEOWNERS), squash-merge.
11. Watch CD apply. Then run `terraform plan -var-file=test.tfvars` locally one more time. Expect "No changes."
8.3 · What to do if step 4 (baseline plan) wasn't clean
Likely you've inherited drift. Don't try to "fix" it in your first PR. Instead:
- Capture the plan output to a file: `terraform plan -var-file=test.tfvars -no-color | tee baseline-drift.txt`.
- Categorise each diff: drift (AWS changed), code-rot (someone edited code without applying), or in-flight PR (someone else's open PR you haven't pulled).
- Add it to the runbook under "Known anomalies." Discuss with the team before reconciling.
- Pick a different env (one that does show "No changes") for your first-safe-change ritual. Or, if every env is dirty, fixing one drift item is your first change — but document it carefully and have it reviewed.
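When categorising, it helps to first get raw counts out of `baseline-drift.txt`. A sketch that parses the summary line Terraform prints at the end of a plan (function name is mine):

```python
import re

def plan_counts(plan_text):
    """Pull the add/change/destroy counts out of `terraform plan` output.

    Returns (0, 0, 0) for a clean "No changes." plan.
    """
    m = re.search(r"Plan:\s*(\d+) to add,\s*(\d+) to change,\s*(\d+) to destroy",
                  plan_text)
    if m:
        return tuple(int(x) for x in m.groups())
    return (0, 0, 0)
```

Anything other than `(0, 0, 0)` on a baseline plan means you're in the drift-triage path above, and a nonzero destroy count deserves the most scrutiny.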
8.4 · Graduating from this guide
You've completed the inheritance when:
- `RUNBOOK.md` is merged to main.
- Your first-safe-change tag is live in test.
- You've run a clean `plan` against every env in the repo.
- You can answer all the worksheet questions from memory.
- You've identified at least three "open questions" worth following up on.
- Your name appears in CODEOWNERS for at least one path.
Now go back to Part 2 and run a real ticket through the BAU SOP. You're no longer inheriting — you're operating.