Terraform 301 — Reverse-engineering an inherited IaC repo

The senior engineer left. You have a Git repo, a few files, AWS developer access in your name, and Git/Terraform knowledge. No documentation. No tribal knowledge. This is your forensic playbook — what to read, what to query, what to map — before you make any change.

Companion to Part 1, Part 2, Part 3.

01 The forensic mindset

"Read before you run, run before you write."

Inheriting an unknown IaC repo is like working a crime scene: touch nothing first, photograph everything, then move outward in expanding circles. Five phases, in this order:

The 5-phase reverse-engineering plan (do them in order):

| Phase | What | How | Effort |
|-------|------|-----|--------|
| 1 | Repo archaeology | offline reading, no AWS calls | ~1 hour |
| 2 | AWS read-only recon | describe-* only, no apply | ~half day |
| 3 | State inspection | read-only copy, never edit | ~half day |
| 4 | Pipeline trace | workflows + OIDC, recent CloudTrail | ~few hours |
| 5 | Your runbook + first safe change | document findings, prove control in test | ~half day |

After phase 5 you are running it: prod is yours now, so keep updating the runbook.

Three rules keep you safe through all five phases:

1. Read before you run. Run before you write. Phase 1 has zero AWS calls; phase 2 has zero writes.
2. Document as you go. Every recon command's output goes into a markdown doc you'll keep updating.
3. Touch test first, prod last. Your "first safe change" goes into the smallest, most disposable env.

The output of this whole exercise is your own runbook — a single markdown file that becomes the next senior engineer's onboarding doc. You're filling in the gaps that should have been documented but weren't.

02 Phase 1 · Repo archaeology

"Before you call AWS even once, harvest everything the repo already tells you."

The repo, if it's a normal Terraform setup, contains the answers to ~80% of your questions: which AWS accounts, which buckets hold state, which IAM roles get assumed, which CIDRs each env uses. You just have to read systematically.

2.1 · The first eight commands you run after cloning

# 1. Get the lay of the land
git clone <repo-url> && cd <repo>
ls -la
find . -maxdepth 3 -type d | grep -v '^\./\.git' | sort

# 2. What does the repo say about itself?
cat README.md            # whatever is there, however bad
cat CONTRIBUTING.md      # sometimes has the real instructions
cat CODEOWNERS           # ownership map - tells you who knew what
cat .terraform-version   # or .tool-versions - pinned binary version

# 3. Find ALL Terraform root folders (places where backend.tf lives)
find . -name 'backend.tf' -not -path '*/.terraform/*'

# 4. Find ALL modules
find . -path '*/modules/*' -name 'main.tf' | head

# 5. Read every backend.tf - this is your gold
find . -name 'backend.tf' -not -path '*/.terraform/*' -exec echo '=== {} ===' \; -exec cat {} \;

# 6. Read every providers.tf - reveals account ids and roles
find . -name 'providers.tf' -not -path '*/.terraform/*' -exec echo '=== {} ===' \; -exec cat {} \;

# 7. Read every *.tfvars - reveals env values, account ids, CIDRs
find . -name '*.tfvars' -not -path '*/.terraform/*' -exec echo '=== {} ===' \; -exec cat {} \;

# 8. Look at recent activity - who's been touching this
git log --since='6 months ago' --pretty=format:'%h %an %ad %s' --date=short | head -40
git log --since='12 months ago' --pretty=format:'%an %ae' | sort | uniq -c | sort -rn | head -10

2.2 · What you're looking for in each file

| File | What to extract | Where to write it down |
|------|-----------------|------------------------|
| backend.tf (one per env) | S3 bucket name, key path, region, DynamoDB table, KMS key id | Worksheet section "State backends" |
| providers.tf | Region, role_arn (the account id is in the ARN), default tags | Worksheet section "Accounts & roles" |
| *.tfvars | Per-env CIDRs, instance sizes, account ids, owner tags | Worksheet section "Environments" |
| variables.tf | Inputs the env accepts (especially validation lists, which enumerate every known env name) | Worksheet section "Environments" |
| .terraform.lock.hcl | Exact provider versions in use | Worksheet section "Versions" |
| .terraform-version / .tool-versions | Terraform binary version (install this exact version) | Worksheet section "Versions" |
| .github/workflows/*.yml or equivalent CI | Triggers (push to main, paths), OIDC role assumed, what commands run | Worksheet section "Pipeline" |
| CODEOWNERS | Which teams/people own which paths (this is your contact list) | Worksheet section "People" |
| modules/*/main.tf | What each module builds; what providers it pins | Worksheet section "Modules" |
| Top-level README.md | Everything, even partial info | Whole worksheet |
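
The lock-file row is easy to automate: here is a minimal sketch that dumps the pinned provider versions from every committed lock file in the repo (it assumes lock files sit at each env root, which is where Terraform writes them).

# Dump pinned provider versions from every committed lock file
find . -name '.terraform.lock.hcl' -not -path '*/.terraform/*' \
  -exec sh -c 'echo "=== $1 ==="; grep -E "^provider|^  version" "$1"' _ {} \;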

2.3 · Git forensics — the human history

# Who's been the de-facto owner recently?
git shortlog -sne --since='6 months ago' | head

# What was the last big change to prod?
git log --all --pretty=format:'%h %ad %an %s' --date=short -- envs/prod/ | head -20

# Who originally created the most-critical files?
git log --diff-filter=A --pretty=format:'%h %an %ad' --date=short -- envs/prod/backend.tf

# What modules have been touched recently? (signals active vs dormant)
git log --since='3 months ago' --name-only --pretty=format: -- modules/ | sort | uniq -c | sort -rn

# Dump every PR-style commit message (if squashing was used)
git log --pretty=format:'%s' --grep='INFRA-' | head -50

# Tags = release markers. Some teams use them to mark applies.
git tag --sort=-creatordate | head -10

2.4 · Worksheet — fill in as you read

Save this as RUNBOOK.md in your own scratch folder (don't commit yet). Fill the right column from what you find.

A. Versions
Terraform binary version
AWS provider version
Other providers (random, tls…)
B. Environments (one row per env folder under envs/)
Env name
AWS account id (from role_arn)
Region
VPC CIDR (from tfvars)
State bucket + key
Lock table
Deploy role ARN
C. Modules
Names of modules under modules/
External / remote modules used
Modules with lifecycle.prevent_destroy
D. Pipeline
CI system (GitHub Actions / GitLab / Jenkins)
OIDC IAM role used by CI
What triggers a plan vs an apply
Who has merge rights on prod paths
E. People
Last 3 most-active committers (months & counts)
CODEOWNERS teams
Who reviewed the most prod PRs
Your milestone for end of phase 1: you can sketch the 2-account / N-env diagram from Part 1 §5 from memory using only what you read in the repo. If you can't, re-read until you can.

03 Phase 2 · AWS read-only recon

"Confirm what the repo claims, find what the repo doesn't say."

You have AWS developer access in your name. Configure SSO, log in, and run only describe-*, list-*, get-* calls. No mutations. No applies. No console clicks that change things.

3.1 · Confirm who you are

# 1. Sanity-check your identity in each account you have access to
aws sso login --sso-session <your-sso>
aws sts get-caller-identity                   # your developer-role identity

# 2. List every account you can see (org-level only)
aws organizations list-accounts                # may be denied if you're not in the mgmt account
aws ec2 describe-regions --output table        # confirm region access

3.2 · Find the state buckets & lock tables

Repo backend.tf told you the bucket name. Now confirm it exists and inspect its protection.

# Confirm the bucket from backend.tf actually exists and you can read it
aws s3api head-bucket --bucket lf-tfstate-prod-111
aws s3api get-bucket-versioning --bucket lf-tfstate-prod-111
aws s3api get-bucket-encryption --bucket lf-tfstate-prod-111
aws s3api get-public-access-block --bucket lf-tfstate-prod-111
aws s3api get-bucket-policy --bucket lf-tfstate-prod-111 --query Policy --output text \
  | python3 -m json.tool

# List state files (one per env)
aws s3 ls s3://lf-tfstate-prod-111/ --recursive

# Confirm the DDB lock table
aws dynamodb describe-table --table-name lf-tfstate-locks

# Search for OTHER possible state buckets the repo didn't mention
aws s3api list-buckets --query 'Buckets[?contains(Name,`tfstate`)||contains(Name,`terraform`)].Name' --output table

3.3 · Map out the deploy IAM roles

# From providers.tf you have role ARNs like arn:aws:iam::111111111111:role/TerraformDeploy
aws iam get-role --role-name TerraformDeploy
aws iam get-role --role-name TerraformDeploy --query 'Role.AssumeRolePolicyDocument' \
  | python3 -m json.tool                      # who CAN assume it
aws iam list-attached-role-policies --role-name TerraformDeploy
aws iam list-role-policies        --role-name TerraformDeploy

# Get every inline policy's contents
for p in $(aws iam list-role-policies --role-name TerraformDeploy --query 'PolicyNames[]' --output text); do
  echo "=== $p ===" ; aws iam get-role-policy --role-name TerraformDeploy --policy-name $p --query 'PolicyDocument' | python3 -m json.tool
done

# Find OTHER roles that may be related (CI's OIDC role, break-glass, etc.)
aws iam list-roles --query 'Roles[?contains(RoleName,`Terraform`)||contains(RoleName,`CI`)||contains(RoleName,`Deploy`)].[RoleName,Arn]' --output table

3.4 · Map VPCs & key resources to envs

# All VPCs + their tags (the Environment tag should match what tfvars said)
aws ec2 describe-vpcs --query 'Vpcs[].{Id:VpcId,Cidr:CidrBlock,Tags:Tags}' --output json \
  | python3 -m json.tool

# EC2 instances grouped by env tag
aws ec2 describe-instances \
  --query 'Reservations[].Instances[?State.Name==`running`].[InstanceId,InstanceType,Tags[?Key==`Environment`]|[0].Value]' \
  --output table

# RDS clusters / instances
aws rds describe-db-clusters --query 'DBClusters[].[DBClusterIdentifier,Engine,EngineVersion,MultiAZ]' --output table
aws rds describe-db-instances --query 'DBInstances[].[DBInstanceIdentifier,DBInstanceClass,Engine]' --output table

# Load balancers
aws elbv2 describe-load-balancers --query 'LoadBalancers[].[LoadBalancerName,Scheme,Type,VpcId]' --output table

# Route53 zones (helps you understand DNS)
aws route53 list-hosted-zones --query 'HostedZones[].[Name,Id,Config.PrivateZone]' --output table

3.5 · Cross-check repo vs reality

For each env in your worksheet, confirm:

| Repo claims | Confirm with | What "wrong" looks like |
|-------------|--------------|-------------------------|
| VPC CIDR 10.20.0.0/16 | aws ec2 describe-vpcs --filters Name=tag:Environment,Values=uat | No VPC with that CIDR → either the env was never applied, or it lives in a different account |
| State key uat/network.tfstate | aws s3 ls s3://lf-tfstate-nonprod-222/uat/ | No file → the env has never been applied (or backend.tf is stale) |
| Role TerraformDeploy in 222… | aws iam get-role --role-name TerraformDeploy, run from your plain developer identity (no role assumed) | NoSuchEntity → the role hasn't been bootstrapped, or it's named differently in this account |
| Aurora cluster running | aws rds describe-db-clusters | Cluster missing or in a different state → manual changes happened |
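
To run the first check across the non-prod envs in one pass, a loop like this works. It is a hedged sketch: the tfvars variable name vpc_cidr, the env list, and the ~/repo path are assumptions from Phase 1, so substitute whatever your repo actually uses.

# Does the CIDR each env claims actually exist in this account?
for env in test uat ; do
  cidr=$(awk -F'"' '/vpc_cidr/{print $2}' ~/repo/envs/$env/$env.tfvars)
  echo "=== $env claims $cidr ==="
  aws ec2 describe-vpcs --filters Name=cidr,Values="$cidr" \
    --query 'Vpcs[].[VpcId,Tags[?Key==`Environment`]|[0].Value]' --output table
done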

3.6 · Find untracked-but-running infrastructure

# List of all VPCs vs list in tfvars - any orphans?
aws ec2 describe-vpcs --query 'Vpcs[?Tags==null || !contains(Tags[].Key, `ManagedBy`)].[VpcId,CidrBlock]' --output table
# ↑ VPCs with no ManagedBy tag are likely click-ops or legacy

# EC2 instances NOT tagged ManagedBy=terraform
aws ec2 describe-instances \
  --query 'Reservations[].Instances[?!(Tags && Tags[?Key==`ManagedBy` && Value==`terraform`])].[InstanceId,LaunchTime,Tags]' \
  --output json | python3 -m json.tool | head -60

# Recent admin-level changes (last 24h) - CloudTrail
aws cloudtrail lookup-events --max-items 50 \
  --lookup-attributes AttributeKey=EventName,AttributeValue=ConsoleLogin
aws cloudtrail lookup-events --max-items 50 \
  --lookup-attributes AttributeKey=EventName,AttributeValue=AssumeRole \
  --query 'Events[].[EventTime,Username,Resources[0].ResourceName]' --output table
End-of-phase-2 deliverable: add a column to your worksheet "confirmed in AWS? Y/N" and a free-text column "anomalies." Anomalies are the gold — they're the things the senior never wrote down.

04 Phase 3 · State is the truth (read it carefully)

"State knows what Terraform thinks it owns. AWS knows what's actually running. The diff between them is your project for the next month."

State is the JSON file Terraform writes after every apply. Reading it (carefully, read-only) is the single highest-value thing you can do this week. It tells you the exact resource graph, which modules were used, what addresses things have, and what attributes Terraform considers authoritative.

Rules of the road for state inspection:
· Always work on a local copy. Never edit the live file in S3.
· Use terraform state commands to read; never hand-edit the JSON with jq and aws s3 cp ... s3://... it back.
· State files contain secrets in plaintext (RDS passwords, etc.). Don't paste in Slack. Delete the local copy when done.

4.1 · Safe read of every state file

# 1. Make a sandbox folder OUTSIDE the repo
mkdir -p ~/tf-recon && cd ~/tf-recon

# 2. Pull every state file (one per env) to inspect
for env in test uat prod-support prod ; do
  bucket=$(awk -F'"' '/bucket/{print $2}' ~/repo/envs/$env/backend.tf)
  key=$(awk    -F'"' '/key/{print $2}'    ~/repo/envs/$env/backend.tf)
  echo "=== $env: s3://$bucket/$key ==="
  aws s3 cp s3://$bucket/$key ./$env.tfstate.json
done

# 3. Quick stats per env: how many resources, what kinds
for f in *.tfstate.json ; do
  echo "=== $f ==="
  python3 -c "
import json,sys,collections
d=json.load(open('$f'))
print('terraform_version:', d.get('terraform_version'))
print('serial:', d.get('serial'))
res = d.get('resources', [])
print('resources:', len(res))
c = collections.Counter([r['type'] for r in res])
for t,n in c.most_common(15): print(f'  {n:4d}  {t}')"
done
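
Given the plaintext-secrets rule above, it's worth knowing which resources in each local copy carry sensitive-looking attributes before you go spelunking. This is a rough heuristic sketch: the attribute names it greps for are common offenders, not an exhaustive list.

# 4. Which resources carry likely-sensitive attributes? (rough heuristic)
for f in *.tfstate.json ; do
  echo "=== $f ==="
  python3 -c "
import json
d = json.load(open('$f'))
for r in d.get('resources', []):
    for inst in r.get('instances', []):
        attrs = json.dumps(inst.get('attributes', {}))
        if any(k in attrs for k in ('password', 'secret', 'private_key', 'token')):
            print('  ', r['type'] + '.' + r.get('name', ''))
            break
"
done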

4.2 · Use Terraform itself, not ad-hoc grep

Once you've run terraform init in each env folder, these commands work against the live state without modifying it:

cd ~/repo/envs/uat
terraform init                                # once per env folder; no apply

terraform state list                          # every resource address
terraform state list | wc -l                  # count
terraform state list | awk -F. '{print $1"."$2}' | sort -u  # top-level groups

terraform state show 'module.network.aws_vpc.this'   # full attributes of one

terraform providers                           # providers used + versions
terraform graph | head -30                    # DOT format dependency graph

terraform output                              # what the env exports

# The big one: what would Terraform CHANGE if you ran apply right now?
terraform plan -var-file=uat.tfvars -lock=false -refresh-only -no-color | tee plan.txt
# a refresh-only plan proposes no resource changes; it only shows where AWS reality differs from state - safe

4.3 · The state-side questions to answer

For each question: how to answer it, and why it matters.

· Which modules are referenced?
  How: terraform state list | awk -F. '/^module/{print $2}' | sort -u
  Why: confirms which folders under modules/ are "live".

· Are there resources NOT in any module?
  How: terraform state list | grep -v '^module\.'
  Why: "loose" resources at the env root are common for KMS and IAM.

· Which RDS engines / sizes?
  How: terraform show | grep -E 'engine|class|allocated'
  Why: tells you the cost and scaling story.

· Any prevent_destroy tripwires?
  How: grep -rn prevent_destroy modules/ envs/
  Why: these resources are intentionally hard to delete.

· Any ignore_changes markers?
  How: grep -rn ignore_changes modules/ envs/
  Why: tells you what Terraform deliberately won't reconcile.

· Which Terraform version and serial wrote each state?
  How: jq '.terraform_version,.serial' <env>.tfstate.json
  Why: an old serial means the env hasn't been applied recently.

· What's the last-modified time of the state object?
  How: aws s3api head-object --bucket B --key K
  Why: tells you how stale the env is.
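
The last two questions generalise into a quick staleness sweep across every env. A sketch, reusing the backend.tf parsing trick from §4.1 (the env names and ~/repo path are the ones assumed throughout this guide):

# When was each env's state last written?
for env in test uat prod-support prod ; do
  bucket=$(awk -F'"' '/bucket/{print $2}' ~/repo/envs/$env/backend.tf)
  key=$(awk    -F'"' '/key/{print $2}'    ~/repo/envs/$env/backend.tf)
  printf '%-14s ' "$env"
  aws s3api head-object --bucket "$bucket" --key "$key" \
    --query 'LastModified' --output text
done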

4.4 · The signature commands to know

terraform state list                                     # inventory
terraform state show '<addr>'                            # attrs of one
terraform state pull > local.tfstate                      # dump current state
terraform plan -refresh-only                              # show drift between state and AWS (writes nothing)
terraform providers                                       # providers + versions
terraform output -json | python3 -m json.tool             # exports
terraform graph | dot -Tpng > graph.png                   # visual (needs graphviz)
Things you must NOT run during recon:
terraform apply, terraform destroy, terraform state rm, terraform state mv, terraform import, terraform taint. All of these mutate state. Reading is free; writing is irreversible.

05 Phase 4 · Pipeline trace

"Reproduce mentally what happens between <merge to main> and <EC2 modified>. If you can't, you don't yet own the pipeline."

Pipelines are the part that's easiest to get wrong because they're spread across three places: the workflow YAML in the repo, the OIDC IAM role in AWS, and the CI/CD platform's own settings.

5.1 · Read the workflow files line by line

# GitHub Actions, GitLab, Jenkins, etc - find them all
ls -la .github/workflows/  2>/dev/null
ls -la .gitlab-ci.yml      2>/dev/null
ls -la Jenkinsfile         2>/dev/null
ls -la .circleci/          2>/dev/null

# For each workflow, answer these in order:
cat .github/workflows/terraform-plan.yml  | grep -E 'on:|push:|paths:|branches:|workflow_dispatch:'     # trigger
cat .github/workflows/terraform-plan.yml  | grep -E 'permissions:|id-token:|contents:'                   # OIDC
cat .github/workflows/terraform-plan.yml  | grep -E 'role-to-assume|aws-region'                          # IAM role
cat .github/workflows/terraform-plan.yml  | grep -E 'terraform fmt|validate|plan|apply'                    # commands

5.2 · What to extract from the workflow

| Question | Where to find the answer |
|----------|--------------------------|
| What triggers a plan? | The on: block. Usually pull_request plus push to feature branches. |
| What triggers an apply? | Usually on: push: to main, with a paths: filter scoped to a specific env folder. |
| How does CI authenticate to AWS? | The aws-actions/configure-aws-credentials step plus its role-to-assume ARN. That ARN is your CI role. |
| What env gets applied first? | Job ordering: typically test → uat → prod-support → prod, chained with needs: dependencies. |
| Are there manual approval gates? | GitHub Environments / required-reviewers settings; the environment: keyword in jobs. |
| What secrets does CI consume? | ${{ secrets.* }} references; cross-check Settings → Secrets in the platform UI. |
| What runs as part of plan? | Look for tflint, tfsec/checkov, infracost, terraform-docs. |

5.3 · Find the OIDC trust on the AWS side

# Find the OIDC provider GitHub uses (or whichever CI)
aws iam list-open-id-connect-providers
# expected: arn:aws:iam::<acct>:oidc-provider/token.actions.githubusercontent.com

# Find the role that CI assumes (you saw the ARN in the workflow YAML)
aws iam get-role --role-name lf-github-ci
aws iam get-role --role-name lf-github-ci --query 'Role.AssumeRolePolicyDocument' | python3 -m json.tool
# Look at "Condition" - it should pin sub:repo:<org/repo>:ref:refs/heads/main or PR

# What can the CI role do? (this is what your CI is allowed to do, no more)
aws iam list-attached-role-policies --role-name lf-github-ci
aws iam list-role-policies        --role-name lf-github-ci
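
To eyeball just that Condition block without scrolling through the whole trust policy, something like this works. The role name and the example sub value are illustrative; substitute whatever your workflow YAML actually references.

# Pull only the Condition block of the CI role's trust policy
aws iam get-role --role-name lf-github-ci \
  --query 'Role.AssumeRolePolicyDocument.Statement[0].Condition' \
  | python3 -m json.tool
# A healthy pin names the repo AND the ref, e.g.
#   "token.actions.githubusercontent.com:sub": "repo:<org>/<repo>:ref:refs/heads/main"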

5.4 · CloudTrail tells you what actually happens

You don't have to guess what the pipeline does — CloudTrail records every API call.

# Recent applies (anything that called PutObject on the state bucket)
aws cloudtrail lookup-events --max-items 30 \
  --lookup-attributes AttributeKey=ResourceName,AttributeValue=lf-tfstate-prod-111 \
  --query 'Events[].[EventTime,EventName,Username]' --output table

# Who has been assuming the deploy role recently?
aws cloudtrail lookup-events --max-items 30 \
  --lookup-attributes AttributeKey=EventName,AttributeValue=AssumeRole \
  --query 'Events[?contains(Resources[].ResourceName, `TerraformDeploy`)].[EventTime,Username]' \
  --output table

# Recent ec2 / rds modifications - cross-check with PR history
aws cloudtrail lookup-events --max-items 50 \
  --lookup-attributes AttributeKey=EventName,AttributeValue=ModifyLaunchTemplate

5.5 · Draw the pipeline you found (mental check)

By the end of phase 4 you should be able to fill this in for your repo:

Pipeline trace
CI system
Workflow file path(s)
Plan triggers (when does CI plan run?)
Apply triggers (what causes apply?)
Apply order across envs
OIDC provider ARN
CI's IAM role ARN
CI's role policy summary
Manual approval gates
Secrets consumed by CI
Last 5 successful prod applies (date, author)
Power move. Open the most recent successful prod apply in the CI UI and read every log line. If you can't reproduce that mentally, your pipeline understanding is incomplete.

06 Phase 5 · Build YOUR runbook

"If you got hit by a bus tomorrow, the next person should be able to run this stack from the document you wrote."

The deliverable from phases 1-4 is one markdown file: RUNBOOK.md. Open a PR adding it to the repo. This is also your interview with the codebase — the act of writing the runbook is what makes you the new owner.

6.1 · Runbook template (paste, fill, commit)

# RUNBOOK · lf-platform infrastructure

## 1. Accounts
| Account name | Account id    | Purpose          | Who has admin |
|--------------|---------------|------------------|---------------|
| Prod         | 111111111111  | prod, prod-supp  | infra-seniors |
| Non-prod     | 222222222222  | uat, test        | infra-platform|

## 2. Environments
| Env          | Account | Region    | VPC CIDR       | State bucket / key                 |
|--------------|---------|-----------|----------------|------------------------------------|
| prod         | 111…     | us-east-1 | 10.10.0.0/16   | lf-tfstate-prod-111 / prod/…     |
| prod-support | 111…     | us-east-1 | 10.11.0.0/16   | lf-tfstate-prod-111 / prod-support/…|
| uat          | 222…     | us-east-1 | 10.20.0.0/16   | lf-tfstate-nonprod-222 / uat/…   |
| test         | 222…     | us-east-1 | 10.30.0.0/16   | lf-tfstate-nonprod-222 / test/…  |

## 3. Deploy roles
- arn:aws:iam::111…:role/TerraformDeploy — assumed by engineers and CD for prod & prod-support
- arn:aws:iam::222…:role/TerraformDeploy — assumed for uat & test
- arn:aws:iam::<ci-acct>:role/lf-github-ci — OIDC role assumed by GitHub Actions

## 4. Pipeline
- CI system: GitHub Actions
- Plan job: on every PR, posts plan as comment
- Apply job: on push to main, env detected via changed paths
- Apply order: test → uat → prod-support → prod (with manual gate before prod)
- Required approvals: 1 for non-prod, 2 senior + security for prod

## 5. State protection
- Buckets versioned + KMS encrypted + public access blocked
- DDB lock table: lf-tfstate-locks (shared)
- prevent_destroy on: aws_rds_cluster.prod, aws_kms_key.app, state buckets

## 6. Local engineer setup
1. aws sso login --sso-session lf
2. AWS_PROFILE=tf-nonprod (or tf-prod)
3. tfenv install (uses .terraform-version)
4. cd envs/<env> && terraform init
5. terraform plan -var-file=<env>.tfvars

## 7. Make a change (BAU SOP summary)
1. Sync main, baseline plan should say "No changes."
2. Branch infra-####-slug
3. Edit smallest possible change (usually only tfvars)
4. fmt + validate + plan locally
5. Commit, push, open PR
6. Wait for CI plan comment, get approvals
7. Squash merge; CD applies
8. Post-apply: terraform plan should say "No changes."

## 8. Known anomalies / drift / non-Terraform resources
- <list things you found in phase 2 that aren't in code>
- <list things in code that prevent_destroy or ignore_changes mark as special>

## 9. Recovery
- Roll back: git revert <sha> on main, CD reapplies. Don't fix in console.
- State lock stuck: terraform force-unlock <ID> (only after confirming no live run).

## 10. Owners & contacts
- Platform: @lf/infra-platform
- Security review: @lf/security
- On-call rotation: PagerDuty schedule "infra-platform"

## 11. Open questions
- <things you still don't know - that's OK, write them down>

6.2 · What goes in section 11 ("Open questions")

This is the most important section. It's where you list everything you couldn't reverse-engineer with confidence. Examples:

  • "There's an EC2 instance i-0deadbeef in prod tagged ManagedBy=manual — nobody knows why it exists. Pinged finance and ops; awaiting answer."
  • "Module modules/legacy-vpn hasn't been touched in 18 months and isn't referenced anywhere. Confirm with networking team it's safe to delete."
  • "The CI role has iam:* permissions in prod. That's broader than it needs. Worth tightening once we have parity tests."
  • "prod-support env's last apply was 11 months ago. Either it's stable, or its state has drifted significantly. Need to plan-refresh-only and review."
Open the PR for RUNBOOK.md early. Even at 50% complete. Reviewers (the few people still on the team who knew the senior) will fill the gaps in their PR comments, and that's free knowledge transfer. Don't wait until it's "done."

07 Red flags & smells — what to worry about

"You're not just inheriting a working system. You're inheriting all the things the previous owner was meaning to fix."

Walk through this list against your repo and AWS account. Anything that fires is a candidate for the runbook's "open questions" section — not necessarily something to fix immediately, but something to know.

7.1 · Repo smells

! No .terraform.lock.hcl committed. Provider versions can shift between runs, so CI plans and local plans can diverge silently. Add it in your first PR (a sketch follows this list).
! State stored locally / in Git. If you find a terraform.tfstate in the repo, or no backend.tf at all, state has been living on someone's laptop. Stop. Migrate to S3 + DynamoDB before any further apply.
! Hard-coded credentials. grep -rn 'AKIA' . finds them; committed AWS access keys are an immediate rotation event. Same for any aws_secret_access_key string.
! Secrets in .tfvars. RDS passwords, API tokens. Move them to Secrets Manager / SSM and rotate the leaked ones.
! One folder for all environments (using workspaces). Workspaces share a backend, so a misconfigured plan can hit prod from a "test" intent. Plan to split into folder-per-env.
! Modules sourced from forks / personal repos. git@github.com:someone-personal/…. Move them in-repo or into your org's namespace.
! .tfvars auto-loaded everywhere. Files named terraform.tfvars or *.auto.tfvars load automatically and silently override. Prefer an explicit -var-file=<env>.tfvars.
! Module versions pinned to main. source = "git::…//modules/x?ref=main" means apply behavior shifts as upstream changes. Pin to a tag or SHA.
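
Two of these are quick enough to script on day one. A minimal sketch; the platform list for the lock file is an assumption, so pin whatever platforms your CI and your laptops actually run on.

# Generate a lock file covering both CI (linux) and laptop (mac) platforms
cd ~/repo/envs/test
terraform providers lock \
  -platform=linux_amd64 -platform=darwin_amd64 -platform=darwin_arm64

# Sweep the repo for committed AWS keys and secret-looking tfvars values
cd ~/repo
grep -rnE 'AKIA[0-9A-Z]{16}' . --exclude-dir=.git --exclude-dir=.terraform
grep -rniE '(password|secret|token)\s*=' --include='*.tfvars' .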

7.2 · AWS smells

! State bucket without versioning or encryption. Lose state once and you're rebuilding the world. Confirm with aws s3api get-bucket-versioning and get-bucket-encryption.
! State bucket public. aws s3api get-public-access-block: all four blocks should be true. State files contain secrets.
! Resources running with no ManagedBy tag. Either drift or legacy. Decide: import or delete.
! EC2 instances on stale launch templates. The ASG's launch template is at v3 while instances are still on v1, meaning an instance refresh never completed. Often safe but worth confirming.
! Security groups open to 0.0.0.0/0 on non-public ports. Especially 22, 3389, 3306. Run aws ec2 describe-security-groups and grep (see the sketch after this list).
! RDS without Multi-AZ in prod, or without backups. A single instance failure becomes an outage.
! IAM roles allowing *:* on Resource: *. Especially if assumed by CI. Tighten before you trust it for production applies.
! Old AssumeRole CloudTrail entries from individual humans. Individuals shouldn't be assuming the prod deploy role; only CD should. Investigate.
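
For the security-group check specifically, the describe-security-groups filters can do the grep for you. A sketch: the first command lists every group with an ingress rule from 0.0.0.0/0, and the group id in the second command is a placeholder for one of the hits.

# Security groups with ANY ingress rule open to the world - triage by port
aws ec2 describe-security-groups \
  --filters Name=ip-permission.cidr,Values=0.0.0.0/0 \
  --query 'SecurityGroups[].[GroupId,GroupName,VpcId]' --output table

# Then inspect the offending ports for each hit
aws ec2 describe-security-groups --group-ids sg-0123456789abcdef0 \
  --query 'SecurityGroups[].IpPermissions[].[FromPort,ToPort,IpProtocol]' --output table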

7.3 · Pipeline smells

! CI applies without a plan-review gate. Auto-apply on push to main without human eyes on the plan means the next bad merge becomes the next outage.
! OIDC role trust policy too loose. The trust should pin both the repo (repo:org/repo) AND the branch/env (ref:refs/heads/main). If it allows any ref, any branch in the repo can trigger an apply.
! Long-lived AWS access keys for CI. Rotate to OIDC. Long-lived keys mean any leaked secret is full account access.
! Required approvals = 0 on prod paths. Check the repo's branch-protection settings; prod should require 2 approvals plus CODEOWNERS review.
! No drift-detection job. Drift detection catches problems while you sleep; without it, drift surfaces during your 09:00 apply, which is far worse (a sketch follows this list).
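
A drift-detection job doesn't need to be fancy: the guts are a scheduled read-only plan whose exit code tells you whether anything drifted. A sketch of what the nightly job would run; wiring it into your CI's scheduler and alerting is up to you.

# Nightly drift check: -detailed-exitcode returns 0 = clean, 2 = drift, 1 = error
cd envs/prod
terraform init -input=false
terraform plan -var-file=prod.tfvars -refresh-only -lock=false \
  -detailed-exitcode -no-color > drift.txt
case $? in
  0) echo "no drift" ;;
  2) echo "DRIFT DETECTED - post drift.txt to the team channel" ;;
  *) echo "plan failed - investigate" ;;
esac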

7.4 · State smells

! State serial hasn't moved in 6+ months. The env has had no recent applies: either it's rock-stable or dangerously drifted. Run terraform plan -refresh-only in a sandbox to find out.
! Resources in state but not in code. terraform plan wants to destroy them. Likely a previous engineer started a refactor and didn't finish.
! Resources in code but not in state. The plan wants to create something that already exists in AWS. Fix with terraform import.
! Multiple state files writing to the same AWS resources. Two roots, both managing the same VPC. Whichever runs second wins. Audit by searching all state files for the same resource id (see the sketch after this list).
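
To audit that last one across the state copies you pulled in Phase 3, compare the ids each file claims to own. A rough sketch over the *.tfstate.json copies in ~/tf-recon; it flags any AWS id that appears in more than one file.

# Find AWS resource ids claimed by more than one state file
cd ~/tf-recon
python3 -c "
import json, glob, collections
owners = collections.defaultdict(set)
for f in glob.glob('*.tfstate.json'):
    d = json.load(open(f))
    for r in d.get('resources', []):
        for inst in r.get('instances', []):
            rid = inst.get('attributes', {}).get('id')
            if rid:
                owners[rid].add(f)
for rid, files in owners.items():
    if len(files) > 1:
        print(rid, '->', sorted(files))
"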
Triage rule: a red flag is a known unknown. Write it in the runbook's "Open questions" section. Don't try to fix everything in week 1 — you'll create more problems than you solve. Fix the smallest, safest thing first (next section).

08 The first-safe-change ritual

"You don't really own the system until you've successfully applied a change to it. Make that first change as small and reversible as humanly possible."

This is the rite of passage. It proves the whole pipeline still works for you — SSO, role assumption, CI, CD, state, reality. You'll discover what's broken about your access in the smallest, lowest-stakes way possible.

8.1 · Pick the right "smallest possible" change

| Change | Risk | Reversibility | Recommended for first? |
|--------|------|---------------|------------------------|
| Add a tag (e.g. RunbookOwner = "your-name") to one env's common_tags | Almost zero | Trivial revert | Yes, do this |
| Add a comment to a tfvars file | Zero (no plan diff) | Trivial | Doesn't actually exercise apply |
| Bump desired_capacity by 1 in a test ASG | Low | Revert PR | Good second change |
| Add an output to a module | Low | Revert | Good second change |
| Anything in prod | High | Full BAU SOP | No. Not yet. |
| Provider version bump | Medium-high | Sometimes painful | Definitely not first |

8.2 · The 11-step ritual

  1. Pick the test environment. (Smallest blast radius.)
  2. aws sso login; export AWS_PROFILE=tf-nonprod; aws sts get-caller-identity — confirm role.
  3. cd envs/test; terraform init. Read every line of init output. Note any provider version warnings.
  4. terraform plan -var-file=test.tfvars. Expect "No changes." If you see changes you didn't author, STOP and put it in the runbook's open questions. Don't proceed.
  5. Make the smallest change: add RunbookOwner = "<your-name>" to common_tags in test.tfvars.
  6. terraform fmt; terraform validate.
  7. terraform plan -var-file=test.tfvars. The plan should show only tag-update diffs (typically ~ tags on every taggable resource). Read all of them.
  8. Branch, commit, push, open PR. Title: "INFRA-XXXX: claim test env ownership via tag".
  9. Watch CI. The CI plan in the PR comment should match your local plan exactly. If it doesn't — STOP and figure out why.
  10. Get a reviewer (if no one's on the team, ping CODEOWNERS), squash-merge.
  11. Watch CD apply. Then run terraform plan -var-file=test.tfvars locally one more time. Expect "No changes."
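
Collected as one copy-paste block, steps 2 through 7 look roughly like this. A sketch: the profile name, env folder, and tfvars file are the ones assumed throughout this guide, so swap in yours.

# Steps 2-7 of the ritual, in order
aws sso login --sso-session <your-sso> && export AWS_PROFILE=tf-nonprod
aws sts get-caller-identity                   # confirm the role you expect

cd envs/test
terraform init                                # read every line of output
terraform plan -var-file=test.tfvars          # baseline - expect "No changes."

# ...edit test.tfvars: add RunbookOwner = "<your-name>" to common_tags...

terraform fmt && terraform validate
terraform plan -var-file=test.tfvars          # expect only ~ tag diffs
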
What success means. You have just demonstrated: SSO works for you, role assumption works, init/plan works, CI sees you correctly, OIDC role applies on your behalf, state writes succeed, AWS reflected the change, post-apply reconciles clean. You now own the pipeline.

8.3 · What to do if step 4 (baseline plan) wasn't clean

Likely you've inherited drift. Don't try to "fix" it in your first PR. Instead:

  1. Capture the plan output to a file: terraform plan -var-file=test.tfvars -no-color | tee baseline-drift.txt.
  2. Categorise each diff: drift (AWS changed), code-rot (someone edited code without applying), or in-flight PR (someone else's open PR you haven't pulled).
  3. Add it to the runbook under "Known anomalies." Discuss with the team before reconciling.
  4. Pick a different env (one that does show "No changes") for your first-safe-change ritual. Or, if every env is dirty, fixing one drift item is your first change — but document carefully and have it reviewed.

8.4 · Graduating from this guide

You've completed the inheritance when:

  • RUNBOOK.md is merged to main.
  • Your first-safe-change tag is live in test.
  • You've run a clean plan against every env in the repo.
  • You can answer all the worksheet questions from memory.
  • You've identified at least three "open questions" worth following up on.
  • Your name appears in CODEOWNERS for at least one path.

Now go back to Part 2 and run a real ticket through the BAU SOP. You're no longer inheriting — you're operating.

Final thought. The previous owner left without writing a runbook. You're not them. Write the runbook as you go. Update it every time you discover something. The next inheritor is going to thank you — and that next inheritor, in three years, might be a future version of you.