Terraform BAU — SOPs for AWS Infra Engineers
Practical day-to-day operations: deep file teaching, env-vars, the SOP for creating a new environment, the SOP for editing existing infra, the new-engineer onboarding checklist, and the senior's unwritten rules.
01 Files deep-dive — what each one does, line by line
New engineers open the repo and see a folder of files with confusing names. Here is what each is for, why it exists, and what goes inside it. Terraform reads every .tf file in the current folder and stitches them together as one big config — file order doesn't matter, because references resolve automatically. The filenames are pure convention — but follow the convention, because it is what every reviewer expects.
1.1 — versions.tf · the contract
Pins the Terraform binary version and provider versions. First file the senior writes; last one to change.
```hcl
terraform {
  required_version = ">= 1.6.0, < 2.0.0" # your binary must be in this range

  required_providers {
    aws = {
      source  = "hashicorp/aws" # registry namespace
      version = "~> 5.74"       # 5.74.x ok, 6.x not ok
    }
    random = { source = "hashicorp/random", version = "~> 3.6" }
    tls    = { source = "hashicorp/tls", version = "~> 4.0" }
  }
}
```
1.2 — providers.tf · how Terraform talks to AWS
```hcl
provider "aws" { # the default (un-aliased) provider
  region = var.aws_region

  assume_role { # engineer SSO -> deploy role
    role_arn     = var.deploy_role_arn
    session_name = "tf-${var.environment}"
  }

  default_tags { # tag EVERY resource automatically
    tags = {
      Environment = var.environment
      ManagedBy   = "terraform"
    }
  }
}

provider "aws" { # aliased provider, e.g. another region
  alias  = "us_west"
  region = "us-west-2"
}

# Inside a resource use: provider = aws.us_west
```
1.3 — backend.tf · where state lives
```hcl
terraform {
  backend "s3" {
    bucket         = "lf-tfstate-nonprod-222" # the bucket per account
    key            = "uat/network.tfstate"    # <-- the per-env knob
    region         = "us-east-1"
    dynamodb_table = "lf-tfstate-locks"
    encrypt        = true
  }
}
```
One backend.tf per env folder, or supply the per-env settings with terraform init -backend-config=....

1.4 — variables.tf · declarations only, no values
This file declares what the config accepts as input. It never holds values. Values come from *.tfvars, -var, or TF_VAR_* env vars.
```hcl
# variables.tf - declarations
variable "environment" {
  type        = string # required type
  description = "prod | prod-support | uat | test"

  validation { # enforce shape at plan time
    condition     = contains(["prod", "prod-support", "uat", "test"], var.environment)
    error_message = "environment must be prod, prod-support, uat, or test."
  }
}

variable "vpc_cidr" {
  type        = string
  description = "VPC CIDR block, /16"
  default     = "10.0.0.0/16" # default = optional input
}

variable "db_password" {
  type        = string
  sensitive   = true # hides value in plan output
  description = "Master DB password (typically supplied by Secrets Manager, not tfvars)"
}

variable "app_servers" { # complex types are first-class
  type = list(object({
    name          = string
    instance_type = string
    public        = bool
  }))
  default = []
}
```
| Variable attribute | What it does |
|---|---|
| type | string, number, bool, list(...), set(...), map(...), object({...}), tuple([...]) |
| description | Shows up in terraform plan hints and module docs. Always write it. |
| default | Optional. If absent, a value MUST be supplied at plan time. |
| sensitive | Redacts the value from plan/apply output. Still saved to state — protect state. |
| nullable | false means callers cannot end up with null; a null input falls back to the default. |
| validation | Reject bad values at plan time with a friendly error. |
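One subtlety worth a sketch: nullable interacts with default. When nullable is false and a default exists, a caller passing null gets the default instead of an error. A hedged example (the variable name is illustrative, not from the repo):

```hcl
variable "log_retention_days" {
  type     = number
  nullable = false # callers may omit the variable, but can never end up with null
  default  = 30    # a caller explicitly passing null falls back to this default
}
```

This makes 30 a true floor: module callers cannot accidentally propagate a null through to the resource.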
1.5 — data.tf · read-only lookups
Data sources query AWS without managing the resource. They re-evaluate on every run — good for AMIs (you want the newest), but a source of unexpected diffs. Pin AMIs in production.
```hcl
# data.tf
data "aws_caller_identity" "current" {} # who am I?
data "aws_region" "current" {}          # the region

data "aws_availability_zones" "available" {
  state = "available"
}

data "aws_ami" "al2023" {
  most_recent = true
  owners      = ["amazon"]

  filter {
    name   = "name"
    values = ["al2023-ami-*-x86_64"]
  }
}

# Use anywhere as data.<type>.<name>.<attr>
# e.g. data.aws_caller_identity.current.account_id
```
Rule of thumb: if this folder owns the thing's lifecycle, make it a resource. If you only want to read it (someone else owns it), make it a data source.

1.6 — main.tf · the table of contents
Despite the name, main.tf rarely contains the bulk of code in a real repo. Resources live in modules. The env-level main.tf is just "this env composes these modules".
```hcl
# envs/uat/main.tf

# 1. local values - computed once, used in many places
locals {
  name_prefix = "lf-${var.environment}"
  account_id  = data.aws_caller_identity.current.account_id
  tags = merge(var.common_tags, {
    Environment = var.environment
    Account     = local.account_id
  })
}

# 2. module composition - the actual stack for this env
module "network" {
  source     = "../../modules/network"
  name       = "${local.name_prefix}-vpc"
  cidr_block = var.vpc_cidr
  tags       = local.tags
}

module "security" {
  source      = "../../modules/security"
  name_prefix = local.name_prefix
  vpc_id      = module.network.vpc_id # cross-module reference
  tags        = local.tags
}
```
Split main.tf when it grows. Past ~150 lines, split by concern: network.tf, compute.tf, database.tf, iam.tf. Terraform concatenates them anyway.

1.7 — locals.tf · computed values used internally
Variables are inputs. Outputs are exports. Locals are computed values used inside. Private to the folder.
```hcl
locals {
  name_prefix = "lf-${var.environment}"

  base_tags = {
    Environment = var.environment
    ManagedBy   = "terraform"
    CostCenter  = lookup(var.common_tags, "CostCenter", "unallocated")
  }
  tags = merge(local.base_tags, var.common_tags)

  is_prod_like = contains(["prod", "prod-support"], var.environment)

  azs             = slice(data.aws_availability_zones.available.names, 0, 2)
  public_subnets  = [for i, az in local.azs : cidrsubnet(var.vpc_cidr, 8, i + 1)]
  private_subnets = [for i, az in local.azs : cidrsubnet(var.vpc_cidr, 8, i + 11)]
}
```
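To make the cidrsubnet arithmetic concrete: assuming the default vpc_cidr of 10.0.0.0/16 and two AZs, the expressions above should evaluate as below (worth verifying in terraform console):

```hcl
# cidrsubnet("10.0.0.0/16", 8, n) adds 8 bits to the prefix (/24) and picks
# network number n within the /16:
#   public_subnets  = ["10.0.1.0/24",  "10.0.2.0/24"]   # netnums i + 1  -> 1, 2
#   private_subnets = ["10.0.11.0/24", "10.0.12.0/24"]  # netnums i + 11 -> 11, 12
```

The +1 / +11 offsets leave gaps (netnums 3-10) for future AZs without renumbering existing subnets.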
1.8 — outputs.tf · the public surface
```hcl
output "vpc_id" {
  value       = module.network.vpc_id
  description = "VPC id of this environment"
}

output "private_subnet_ids" {
  value = module.network.private_subnet_ids
}

output "db_secret_arn" {
  value     = module.db.secret_arn
  sensitive = true
}
```
Other roots can read these via terraform_remote_state and consume vpc_id.

1.9 — *.tfvars files · the values
Variables declare; tfvars supply. Per-env tfvars is the single most important pattern in this guide.
```hcl
# envs/uat/uat.tfvars
environment     = "uat"
aws_region      = "us-east-1"
deploy_role_arn = "arn:aws:iam::222222222222:role/TerraformDeploy"
vpc_cidr        = "10.20.0.0/16"
instance_type   = "t3.medium"

common_tags = {
  CostCenter = "CC-1042"
  Owner      = "infra-platform"
  DataClass  = "internal"
}
```
Load it explicitly: terraform plan -var-file=uat.tfvars. Anything named terraform.tfvars or *.auto.tfvars auto-loads — avoid those in multi-env work.
1.10 — load order recap
- Reads backend.tf first — the backend must be known before anything else.
- Loads every .tf in the folder (load order is irrelevant; references resolve automatically).
- Resolves variable values, lowest to highest: defaults → TF_VAR_* env vars → terraform.tfvars → *.auto.tfvars (alphabetical) → -var-file / -var on the CLI (in order given; last wins).
- Resolves data sources (queries AWS).
- Builds the resource graph and plans diffs against state.
02 Environment variables & credentials — how Terraform finds AWS
"It worked on my laptop but failed in CI." Almost always an env-var or credential issue.
2.1 — Terraform's own environment variables
| Variable | What it does | When to set it |
|---|---|---|
| TF_VAR_<name> | Provides a value for input variable name. Lowest precedence after defaults — tfvars files and -var both override it. | CI: pass secrets without writing tfvars to disk. |
| TF_LOG | TRACE / DEBUG / INFO / WARN / ERROR. | Debugging weird provider errors. |
| TF_LOG_PATH | File path to write logs to instead of stderr. | Capture without polluting your terminal. |
| TF_INPUT | 0 = never prompt for missing input. | CI — you want failure, not a hung job. |
| TF_IN_AUTOMATION | Any non-empty value. Suppresses interactive hints. | CI. |
| TF_PLUGIN_CACHE_DIR | Cache providers across runs — massive speedup. | Developer laptop, CI runners. |
| TF_DATA_DIR | Override .terraform/ location. | Rare. Only needed for unusual layouts. |
| TF_CLI_ARGS / TF_CLI_ARGS_plan | Extra args injected into every (or one) command. | CI: TF_CLI_ARGS_plan="-no-color" |
```shell
# Examples - shell or CI config
export TF_VAR_db_password="$(aws secretsmanager get-secret-value --secret-id db/master --query SecretString --output text)"
export TF_LOG=DEBUG
export TF_LOG_PATH=/tmp/tf-$(date +%s).log
export TF_PLUGIN_CACHE_DIR="$HOME/.terraform.d/plugin-cache"
export TF_IN_AUTOMATION=1
export TF_INPUT=0
```
2.2 — AWS provider credential chain (the real source of bugs)
The AWS provider tries these in order and uses the first one it finds. Knowing this order saves hours.
- Static credentials in the provider block (don't do this).
- AWS_ACCESS_KEY_ID + AWS_SECRET_ACCESS_KEY + AWS_SESSION_TOKEN env vars.
- AWS_PROFILE → reads ~/.aws/credentials and ~/.aws/config.
- EC2 instance metadata (IMDS) when running on EC2 with an IAM role.
- ECS / EKS task role.
| Variable | What it does |
|---|---|
| AWS_PROFILE | Selects a named profile from ~/.aws/config. Most common on laptops. |
| AWS_REGION / AWS_DEFAULT_REGION | Region used when the provider doesn't pin one. |
| AWS_SDK_LOAD_CONFIG | Set to 1. Tells the SDK to honour ~/.aws/config (sso, role_arn, source_profile). |
| AWS_ROLE_SESSION_NAME | Used when assuming a role — shows up in CloudTrail. Set it to your name in CI. |
2.3 — Recommended setup · AWS IAM Identity Center (SSO)
```ini
# ~/.aws/config
[profile sso-base]
sso_session    = lf
sso_account_id = 333333333333
sso_role_name  = DeveloperAccess
region         = us-east-1

[profile tf-nonprod]
source_profile    = sso-base
role_arn          = arn:aws:iam::222222222222:role/TerraformDeploy
region            = us-east-1
role_session_name = pponnam-tf-nonprod

[profile tf-prod]
source_profile    = sso-base
role_arn          = arn:aws:iam::111111111111:role/TerraformDeploy
region            = us-east-1
role_session_name = pponnam-tf-prod

[sso-session lf]
sso_start_url = https://lf.awsapps.com/start
sso_region    = us-east-1
```
```shell
# Daily flow
aws sso login --sso-session lf
export AWS_PROFILE=tf-nonprod
aws sts get-caller-identity # verify
cd envs/uat
terraform plan -var-file=uat.tfvars
```
Never put an aws_access_key_id in tfvars or in provider blocks. Use SSO + assume-role; in CI, use OIDC federation.

2.4 — Per-folder env vars with direnv
```shell
# envs/uat/.envrc - committed (no secrets)
export AWS_PROFILE=tf-nonprod
export AWS_REGION=us-east-1
export AWS_SDK_LOAD_CONFIG=1
export TF_PLUGIN_CACHE_DIR="$HOME/.terraform.d/plugin-cache"
```
```shell
cd envs/uat
direnv allow # env loads on every cd
```
2.5 — Variable precedence (memorise)
From lowest to highest priority — later wins:
- default in variables.tf.
- TF_VAR_<name> environment variables.
- terraform.tfvars (auto-loaded).
- *.auto.tfvars in alphabetical order.
- -var-file=foo.tfvars on the CLI (in order given).
- -var name=value on the CLI.
Secrets belong in TF_VAR_* sourced from Secrets Manager / Vault — never in a tfvars file in Git.

2.6 — Quick credential debug recipe
```shell
aws sts get-caller-identity                            # whoami
echo "AWS_PROFILE=$AWS_PROFILE AWS_REGION=$AWS_REGION" # active profile
aws iam simulate-principal-policy \
  --policy-source-arn arn:aws:iam::222222222222:role/TerraformDeploy \
  --action-names ec2:CreateVpc                         # can role do this?
TF_LOG=DEBUG TF_LOG_PATH=/tmp/tf.log terraform plan -var-file=uat.tfvars
tail -f /tmp/tf.log                                    # debug output
```
03 SOP — create a new environment from scratch
Use case: the team needs a new env called preprod in the non-prod account for a regulated workload before it gets promoted to prod. Senior gives you the ticket. This is what you do, top to bottom. Do not skip steps; do not reorder them.
Prerequisites: (1) the agreed env name preprod; (2) a CIDR allocation that doesn't overlap with the four existing envs; (3) approval from the network team for VPC peering if needed; (4) a Jira/ServiceNow ticket id.

Get the green-light artefacts
Confirm in writing (ticket comment): account id, region, CIDR, intended use, owner team, retention/backup expectations, target go-live date, two reviewers.
Pre-create AWS-side prerequisites (one-time, separate PR)
The deploy IAM role and the state-bucket key path must exist before you can run any Terraform for the new env.
```
# What needs to exist before step 4:
arn:aws:iam::222222222222:role/TerraformDeploy      # reused
s3://lf-tfstate-nonprod-222/preprod/network.tfstate # key implicit on first put
DynamoDB table lf-tfstate-locks                     # shared
```
Branch from main — ticket-named
```shell
git checkout main && git pull --ff-only
git checkout -b infra-2104-add-preprod-env
```
Copy the closest existing env as template
```shell
cp -r envs/uat envs/preprod
cd envs/preprod
ls
# backend.tf main.tf outputs.tf providers.tf uat.tfvars variables.tf versions.tf .envrc
```
Rename the tfvars file (use git mv, preserves history)
```shell
git mv uat.tfvars preprod.tfvars
```
Edit backend.tf — the state key MUST be unique
```hcl
terraform {
  backend "s3" {
    bucket         = "lf-tfstate-nonprod-222"
    key            = "preprod/network.tfstate" # <-- changed
    region         = "us-east-1"
    dynamodb_table = "lf-tfstate-locks"
    encrypt        = true
  }
}
```
Edit providers.tf — account-aware
```hcl
provider "aws" {
  region = var.aws_region

  assume_role {
    role_arn     = var.deploy_role_arn
    session_name = "tf-preprod-${terraform.workspace}"
  }

  default_tags {
    tags = {
      Environment = "preprod"
      ManagedBy   = "terraform"
    }
  }
}
```
Edit preprod.tfvars — the per-env knobs
```hcl
environment           = "preprod"
aws_region            = "us-east-1"
deploy_role_arn       = "arn:aws:iam::222222222222:role/TerraformDeploy"
vpc_cidr              = "10.40.0.0/16" # non-overlapping
instance_type         = "m6i.large"    # prod-shaped
cluster_size          = 2
backup_retention_days = 14

common_tags = {
  CostCenter = "CC-2104"
  Owner      = "infra-platform"
  DataClass  = "confidential"
}
```
Update validation list in variables.tf
```hcl
variable "environment" {
  type = string

  validation {
    condition     = contains(["prod", "prod-support", "uat", "test", "preprod"], var.environment)
    error_message = "environment must be one of the supported values."
  }
}
```
Add the env to CODEOWNERS
/envs/preprod/ @lf/infra-platform @lf/security
Initialise the new env
```shell
cd envs/preprod
terraform fmt -recursive ../../
terraform init
# If it offers to copy state from an old key - say NO. Fresh env, fresh state.
```
Validate & plan
```shell
terraform validate
terraform plan -var-file=preprod.tfvars -out=tfplan
```
For a brand new env, expect the full set of resources as + create. Count should match what UAT had.
Commit, push, open PR
```shell
git add -A
git commit -m "INFRA-2104: add preprod environment in non-prod account"
git push -u origin HEAD
gh pr create --fill
```
PR description: ticket link, account id, CIDR, plan summary, network team's approval reference, rollback plan.
CI runs — wait for green
CI re-runs fmt -check, validate, plan, tflint, tfsec, posts plan as PR comment.
Reviews & merge
Two CODEOWNERS approvals. Squash merge into main.
CD applies — watch it
RDS cluster creation is the long pole — expect 6-10 minutes.
Post-apply verification
```shell
cd envs/preprod
terraform plan -var-file=preprod.tfvars
# EXPECT: "No changes. Your infrastructure matches the configuration."
terraform output
aws ec2 describe-vpcs --vpc-ids $(terraform output -raw vpc_id)
```
Update docs & close the ticket
Update README's "supported environments" list. Add an entry to the per-env knobs table. Comment on ticket with VPC id and merge commit. Close ticket.
04 SOP — daily BAU: edit existing infra and apply
Use case: a ticket lands — "INFRA-2210: increase UAT app tier from m6i.large to m6i.xlarge for load testing next Tuesday." Internalise this routine.
Read the ticket fully — including the comments
Confirm: which env(s)? what is changing? deadline? requestor? "do not destroy" notes? Write a one-line plan of attack in the ticket comment before touching code.
Sync main and confirm a clean baseline
```shell
git checkout main && git pull --ff-only
cd envs/uat
direnv allow
terraform init
terraform plan -var-file=uat.tfvars
# EXPECT: "No changes. Your infrastructure matches the configuration."
```
Branch from main
```shell
git checkout -b infra-2210-uat-app-tier-xlarge
```
Find the file that owns the value
```shell
grep -n instance_type envs/uat/uat.tfvars
# envs/uat/uat.tfvars:5: instance_type = "m6i.large"

# If the value isn't in tfvars, walk up: env main.tf -> module variables.tf
grep -rn instance_type modules/compute/
```
Make the smallest possible change
```diff
# envs/uat/uat.tfvars (diff)
-instance_type = "m6i.large"
+instance_type = "m6i.xlarge"
```
Format and validate
```shell
terraform fmt -recursive ../../
terraform validate
```
Plan and read every line
```shell
terraform plan -var-file=uat.tfvars -out=tfplan
# module.app.aws_launch_template.app will be updated in-place
#   ~ instance_type = "m6i.large" -> "m6i.xlarge"
# Plan: 0 to add, 1 to change, 0 to destroy.
```
Red flags in a plan: a - destroy for a stateful resource (RDS, EBS); a replacement (-/+); changes you didn't author. Stop and ask a senior.

Commit with the ticket id, push, open PR
```shell
git diff
git add -A
git commit -m "INFRA-2210: bump UAT app tier to m6i.xlarge for load test"
git push -u origin HEAD
gh pr create --fill
```
Wait for CI — the bot replays your plan
CI: fmt -check → validate → plan → security scans (tflint, tfsec) → posts plan as PR comment. Should be identical to your local plan.
Tag the right reviewer
UAT: one platform reviewer. Prod: two senior + security per CODEOWNERS. Don't re-request review until you've addressed comments.
Squash & merge
```shell
gh pr merge --squash --delete-branch
```
Squash keeps the merge to a single commit — easy to git revert if it goes wrong.

CD applies on merge — monitor it
For an instance-type bump, ASG instance refresh takes 5-10 minutes.
```shell
gh run watch # GitHub Actions live
# Drop a Slack note in #infra-changes when apply starts and finishes.
```
Post-apply verification — never skip
```shell
terraform plan -var-file=uat.tfvars
# EXPECT: "No changes."
aws autoscaling describe-auto-scaling-groups \
  --auto-scaling-group-names lf-uat-app-asg \
  --query 'AutoScalingGroups[].LaunchTemplate.Version'
aws ec2 describe-instances \
  --filters "Name=tag:Environment,Values=uat" "Name=instance-state-name,Values=running" \
  --query 'Reservations[].Instances[].[InstanceId,InstanceType]'
```
Close the loop
Comment on ticket: "Applied at 14:32 UTC, instance refresh complete, all 4 instances now m6i.xlarge, plan is clean." Close ticket. Move on.
4.x — The "always-on" minimum-viable cadence
| Cadence | Action | Why |
|---|---|---|
| Every PR | Plan locally + plan on CI — compare | Catch missing-file commits, env drift |
| Daily (test/uat) | Drift-detection job: terraform plan -detailed-exitcode | Detect console clicks before they bite |
| Weekly | Review provider release notes; bump .terraform.lock.hcl in a dedicated PR | Stay current without surprise |
| Monthly | Cost review: infracost diff PR posts | Catch a junior accidentally provisioning db.r6g.16xlarge |
| Quarterly | Module audit: anything not used? anything that should be a module? | Repo doesn't rot |
4.y — "I broke something" recovery flow
- Stop. Don't make it worse with another apply.
- Re-read the latest plan/apply log. Capture the error.
- Decide: roll forward (small fix PR) or roll back (git revert <sha> on main, merge, CD applies the previous state).
- If state and reality disagree, see Part 1 section 9 troubleshooting. Use terraform import / state rm rather than fighting plan.
- Document what happened in the ticket. If customer-impacting, file a postmortem.
05 New-engineer onboarding — Day 1, Week 1, Month 1
Hand this to a new engineer on day one. Senior validates each box as it gets checked. By the end of month one, they should be running BAU tickets unsupervised on test/uat and shadowing on prod.
- Get added to lf/infra-platform. Confirm push access to terraform-aws-platform.
- Confirm aws sso login --sso-session lf works.
- Set up ~/.aws/config with the tf-nonprod profile (section 2.3).
- Run aws sts get-caller-identity --profile tf-nonprod. Role arn should be TerraformDeploy.
- Install tfenv, then tfenv install from the repo's .terraform-version.
- Install terraform-docs, tflint, tfsec, direnv, gh CLI.
- Run terraform fmt -check -recursive; expect no output (clean).
- Walk the uat VPC in the console. Match what you see to the diagram.
- Ship a first change in test — ideally a tag change or a small SG rule.
- Shadow a plan review.
- Own test tickets that week.
- Practise recovery deliberately in test: force-unlock, state rm, import, observe.
- Get access to infra-platform (read-only).
- Read the secrets module (modules/secrets).

06 Senior's insights — the unwritten rules
Things you'd learn the hard way over five years. Read them now.
One change, one ticket, one commit, one apply. Bundling "while I'm here, let me also…" is how outages happen. Resist it. Open a separate PR.
Treat - destroy on a stateful resource (RDS, EBS, KMS, S3) as an emergency until proven otherwise.
Code, state, and the console can disagree; your job is to bring the three back into agreement. terraform plan is the diff between code and state. The console is the only honest answer for "what is actually running."
for_each over count, almost always.
count uses positional indices — remove the second item and items 3, 4, 5 shift to 2, 3, 4 and Terraform recreates them. for_each uses keys — stable across edits.
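A minimal sketch of the difference, reusing the app_servers variable shape from section 1.4 (resource name is illustrative):

```hcl
# for_each keyed by server name: deleting the entry named "b" from
# var.app_servers touches only aws_instance.app["b"] - nothing renumbers.
# With count, the same deletion would shift every later index and
# recreate those instances.
resource "aws_instance" "app" {
  for_each      = { for s in var.app_servers : s.name => s }
  ami           = data.aws_ami.al2023.id
  instance_type = each.value.instance_type
  tags          = { Name = each.key }
}
```

Addresses in state become aws_instance.app["a"], aws_instance.app["b"], which also makes targeted plans and state moves far less error-prone.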
Tag everything with default_tags. Required tags: Environment, Owner, CostCenter, DataClass, ManagedBy=terraform. Without tags you cannot bill, audit, or page the right team.
Never put secrets in tfvars. Tfvars files end up in Git, in CI logs, in someone's screenshare. Use Secrets Manager / Parameter Store / Vault; fetch at runtime via data sources or TF_VAR_*.
If your module is consumed by other roots, treat its inputs and outputs as public API. Renaming a variable is a breaking change.
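One way to honour that contract is to deprecate rather than rename. A sketch (the variable names and default are hypothetical):

```hcl
# Sketch: introduce the new input name without breaking existing callers.
variable "instance_type" { # old name, kept as a deprecated alias
  type    = string
  default = null
}

variable "app_instance_type" { # new, preferred name
  type    = string
  default = null
}

locals {
  # coalesce prefers the new input, falls back to the old, then a safe default
  instance_type = coalesce(var.app_instance_type, var.instance_type, "t3.medium")
}
```

Callers on the old name keep working; a later major version of the module can drop the alias with a proper changelog entry.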
"Latest" is fine in test. In prod, "latest" means "your config can change between two applies that ran the same code."
prevent_destroy on the irreplaceable.
```hcl
resource "aws_rds_cluster" "prod" {
  lifecycle {
    prevent_destroy = true
    ignore_changes  = [master_password]
  }
}
```
create_before_destroy for things behind a load balancer.
Default is destroy-then-create. For an ALB target group, that means downtime. Set create_before_destroy = true on launch templates and target groups.
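A sketch of the pattern for a target group (names are illustrative; name_prefix rather than name sidesteps the naming collision while old and new briefly coexist):

```hcl
resource "aws_lb_target_group" "app" {
  name_prefix = "lfapp-" # AWS appends a unique suffix, so old + new can coexist
  port        = 8080
  protocol    = "HTTP"
  vpc_id      = module.network.vpc_id

  lifecycle {
    create_before_destroy = true # build the replacement before tearing down
  }
}
```

With a fixed name, Terraform could not create the replacement first because the name would already be taken — name_prefix and create_before_destroy go together.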
ignore_changes for things you don't own.
If autoscaling adjusts desired_capacity, or a deploy tool changes a tag, Terraform fights it. Add the attribute to ignore_changes — sparingly — with a comment explaining why.
terraform plan is read-only and free. Run it constantly.
Before lunch, after pulling main, before opening a PR, after reviewing someone else's. The cheapest test in the toolbox.
Data sources re-evaluate every run. A new AMI was published → your launch template wants to update. Pin the AMI in prod or accept the diff.
Use terraform import and terraform state mv, not "let me destroy and recreate to match."
One folder = one state = one env = one apply. Resist the urge to merge envs into one folder for "DRY" — the duplication is intentional and protective.
What the reviewer reads is the plan posted by CI on the PR. If your local plan and CI plan differ, you forgot to commit something.
lf-prod-web-asg, not production_web_autoscaling_group_v2.
Never apply from a laptop. Apply happens in CD, with an auditable trail. Even if the build is broken — fix the build.
Automate drift detection: a nightly terraform plan -detailed-exitcode per env that pages on a non-zero exit (1 = error, 2 = drift). Drift caught at 02:00 is not an outage; drift caught during a 09:00 apply is.
The code shows what. Comments and PR descriptions exist for the why. "Increased to xlarge for load test on Tuesday" is a commit message. "// xlarge" is noise.