Terraform BAU — SOPs for AWS Infra Engineers
Practical day-to-day operations: deep file teaching, env-vars, the SOP for creating a new environment, the SOP for editing existing infra, the new-engineer onboarding checklist, and the senior's unwritten rules.
01 Files deep-dive — what each one does, line by line
New engineers open the repo and see a folder of files with confusing names. Here is what each is for, why it exists, and what goes inside it. Terraform reads every .tf file in the current folder and stitches them together as one big config — file order doesn't matter, because references resolve automatically. The filenames are pure convention — but follow the convention, because it is what every reviewer expects.
1.1 — versions.tf · the contract
Pins the Terraform binary version and provider versions. First file the senior writes; last one to change.
```hcl
terraform {
  required_version = ">= 1.6.0, < 2.0.0" # your binary must be in this range

  required_providers {
    aws = {
      source  = "hashicorp/aws" # registry namespace
      version = "~> 5.74"       # 5.74.x ok, 6.x not ok
    }
    random = { source = "hashicorp/random", version = "~> 3.6" }
    tls    = { source = "hashicorp/tls", version = "~> 4.0" }
  }
}
```
1.2 — providers.tf · how Terraform talks to AWS
```hcl
provider "aws" { # the default (un-aliased) provider
  region = var.aws_region

  assume_role { # engineer SSO -> deploy role
    role_arn     = var.deploy_role_arn
    session_name = "tf-${var.environment}"
  }

  default_tags { # tag EVERY resource automatically
    tags = {
      Environment = var.environment
      ManagedBy   = "terraform"
    }
  }
}

provider "aws" { # aliased provider, e.g. another region
  alias  = "us_west"
  region = "us-west-2"
}

# Inside a resource use: provider = aws.us_west
```
1.3 — backend.tf · where state lives
```hcl
terraform {
  backend "s3" {
    bucket         = "lf-tfstate-nonprod-222" # the bucket per account
    key            = "uat/network.tfstate"    # <-- the per-env knob
    region         = "us-east-1"
    dynamodb_table = "lf-tfstate-locks"
    encrypt        = true
  }
}
```
One backend.tf per env folder, or supply the per-env settings with terraform init -backend-config=....

1.4 — variables.tf · declarations only, no values
This file declares what the config accepts as input. It never holds values. Values come from *.tfvars, -var, or TF_VAR_* env vars.
```hcl
# variables.tf - declarations
variable "environment" {
  type        = string # required type
  description = "prod | prod-support | uat | test"

  validation { # enforce shape at plan time
    condition     = contains(["prod", "prod-support", "uat", "test"], var.environment)
    error_message = "environment must be prod, prod-support, uat, or test."
  }
}

variable "vpc_cidr" {
  type        = string
  description = "VPC CIDR block, /16"
  default     = "10.0.0.0/16" # default = optional input
}

variable "db_password" {
  type        = string
  sensitive   = true # hides value in plan output
  description = "Master DB password (typically supplied by Secrets Manager, not tfvars)"
}

variable "app_servers" { # complex types are first-class
  type = list(object({
    name          = string
    instance_type = string
    public        = bool
  }))
  default = []
}
```
| Variable attribute | What it does |
|---|---|
| type | string, number, bool, list(...), set(...), map(...), object({...}), tuple([...]) |
| description | Shows up in terraform plan hints and module docs. Always write it. |
| default | Optional. If absent, a value MUST be supplied at plan time. |
| sensitive | Redacts the value from plan/apply output. Still saved to state — protect state. |
| nullable | false means callers cannot end up with null; a null input falls back to the default. |
| validation | Reject bad values at plan time with a friendly error. |
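One subtlety worth a sketch: nullable interacts with default. When nullable is false and a default exists, a caller passing null gets the default instead of an error. A hedged example (the variable name is illustrative, not from the repo):

```hcl
variable "log_retention_days" {
  type     = number
  nullable = false # callers may omit the variable, but can never end up with null
  default  = 30    # a caller explicitly passing null falls back to this default
}
```

This makes 30 a true floor: module callers cannot accidentally propagate a null through to the resource.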
1.5 — data.tf · read-only lookups
Data sources query AWS without managing the resource. They re-evaluate on every run — good for AMIs (you want the newest), but a source of unexpected diffs. Pin AMIs in production.
```hcl
# data.tf
data "aws_caller_identity" "current" {} # who am I?
data "aws_region" "current" {}          # the region

data "aws_availability_zones" "available" {
  state = "available"
}

data "aws_ami" "al2023" {
  most_recent = true
  owners      = ["amazon"]

  filter {
    name   = "name"
    values = ["al2023-ami-*-x86_64"]
  }
}

# Use anywhere as data.<type>.<name>.<attr>
# e.g. data.aws_caller_identity.current.account_id
```
Rule of thumb: if this folder owns the thing's lifecycle, make it a resource. If you only want to read it (someone else owns it), make it a data source.

1.6 — main.tf · the table of contents
Despite the name, main.tf rarely contains the bulk of code in a real repo. Resources live in modules. The env-level main.tf is just "this env composes these modules".
```hcl
# envs/uat/main.tf

# 1. local values - computed once, used in many places
locals {
  name_prefix = "lf-${var.environment}"
  account_id  = data.aws_caller_identity.current.account_id
  tags = merge(var.common_tags, {
    Environment = var.environment
    Account     = local.account_id
  })
}

# 2. module composition - the actual stack for this env
module "network" {
  source     = "../../modules/network"
  name       = "${local.name_prefix}-vpc"
  cidr_block = var.vpc_cidr
  tags       = local.tags
}

module "security" {
  source      = "../../modules/security"
  name_prefix = local.name_prefix
  vpc_id      = module.network.vpc_id # cross-module reference
  tags        = local.tags
}
```
Split main.tf when it grows. Past ~150 lines, split by concern: network.tf, compute.tf, database.tf, iam.tf. Terraform concatenates them anyway.

1.7 — locals.tf · computed values used internally
Variables are inputs. Outputs are exports. Locals are computed values used inside. Private to the folder.
```hcl
locals {
  name_prefix = "lf-${var.environment}"

  base_tags = {
    Environment = var.environment
    ManagedBy   = "terraform"
    CostCenter  = lookup(var.common_tags, "CostCenter", "unallocated")
  }
  tags = merge(local.base_tags, var.common_tags)

  is_prod_like = contains(["prod", "prod-support"], var.environment)

  azs             = slice(data.aws_availability_zones.available.names, 0, 2)
  public_subnets  = [for i, az in local.azs : cidrsubnet(var.vpc_cidr, 8, i + 1)]
  private_subnets = [for i, az in local.azs : cidrsubnet(var.vpc_cidr, 8, i + 11)]
}
```
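To make the cidrsubnet arithmetic concrete: assuming the default vpc_cidr of 10.0.0.0/16 and two AZs, the expressions above should evaluate as below (worth verifying in terraform console):

```hcl
# cidrsubnet("10.0.0.0/16", 8, n) adds 8 bits to the prefix (/24) and picks
# network number n within the /16:
#   public_subnets  = ["10.0.1.0/24",  "10.0.2.0/24"]   # netnums i + 1  -> 1, 2
#   private_subnets = ["10.0.11.0/24", "10.0.12.0/24"]  # netnums i + 11 -> 11, 12
```

The +1 / +11 offsets leave gaps (netnums 3-10) for future AZs without renumbering existing subnets.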
1.8 — outputs.tf · the public surface
```hcl
output "vpc_id" {
  value       = module.network.vpc_id
  description = "VPC id of this environment"
}

output "private_subnet_ids" {
  value = module.network.private_subnet_ids
}

output "db_secret_arn" {
  value     = module.db.secret_arn
  sensitive = true
}
```
Other roots can read these via terraform_remote_state and consume vpc_id.

1.9 — *.tfvars files · the values
Variables declare; tfvars supply. Per-env tfvars is the single most important pattern in this guide.
```hcl
# envs/uat/uat.tfvars
environment     = "uat"
aws_region      = "us-east-1"
deploy_role_arn = "arn:aws:iam::222222222222:role/TerraformDeploy"
vpc_cidr        = "10.20.0.0/16"
instance_type   = "t3.medium"

common_tags = {
  CostCenter = "CC-1042"
  Owner      = "infra-platform"
  DataClass  = "internal"
}
```
Load it explicitly: terraform plan -var-file=uat.tfvars. Anything named terraform.tfvars or *.auto.tfvars auto-loads — avoid those in multi-env work.
1.10 — load order recap
- Reads backend.tf first — the backend must be known before anything else.
- Loads every .tf in the folder (load order is irrelevant; references resolve automatically).
- Resolves variable values, lowest to highest: defaults → TF_VAR_* env vars → terraform.tfvars → *.auto.tfvars (alphabetical) → -var-file / -var on the CLI (in order given; last wins).
- Resolves data sources (queries AWS).
- Builds the resource graph and plans diffs against state.
02 Environment variables & credentials — how Terraform finds AWS
"It worked on my laptop but failed in CI." Almost always an env-var or credential issue.
2.1 — Terraform's own environment variables
| Variable | What it does | When to set it |
|---|---|---|
| TF_VAR_<name> | Provides a value for input variable name. Lowest precedence after defaults — tfvars files and -var both override it. | CI: pass secrets without writing tfvars to disk. |
| TF_LOG | TRACE / DEBUG / INFO / WARN / ERROR. | Debugging weird provider errors. |
| TF_LOG_PATH | File path to write logs to instead of stderr. | Capture without polluting your terminal. |
| TF_INPUT | 0 = never prompt for missing input. | CI — you want failure, not a hung job. |
| TF_IN_AUTOMATION | Any non-empty value. Suppresses interactive hints. | CI. |
| TF_PLUGIN_CACHE_DIR | Cache providers across runs — massive speedup. | Developer laptop, CI runners. |
| TF_DATA_DIR | Override .terraform/ location. | Rare. Only needed for unusual layouts. |
| TF_CLI_ARGS / TF_CLI_ARGS_plan | Extra args injected into every (or one) command. | CI: TF_CLI_ARGS_plan="-no-color" |
```shell
# Examples - shell or CI config
export TF_VAR_db_password="$(aws secretsmanager get-secret-value --secret-id db/master --query SecretString --output text)"
export TF_LOG=DEBUG
export TF_LOG_PATH=/tmp/tf-$(date +%s).log
export TF_PLUGIN_CACHE_DIR="$HOME/.terraform.d/plugin-cache"
export TF_IN_AUTOMATION=1
export TF_INPUT=0
```
2.2 — AWS provider credential chain (the real source of bugs)
The AWS provider tries these in order and uses the first one it finds. Knowing this order saves hours.
- Static credentials in the provider block (don't do this).
- AWS_ACCESS_KEY_ID + AWS_SECRET_ACCESS_KEY + AWS_SESSION_TOKEN env vars.
- AWS_PROFILE → reads ~/.aws/credentials and ~/.aws/config.
- EC2 instance metadata (IMDS) when running on EC2 with an IAM role.
- ECS / EKS task role.
| Variable | What it does |
|---|---|
| AWS_PROFILE | Selects a named profile from ~/.aws/config. Most common on laptops. |
| AWS_REGION / AWS_DEFAULT_REGION | Region used when the provider doesn't pin one. |
| AWS_SDK_LOAD_CONFIG | Set to 1. Tells the SDK to honour ~/.aws/config (sso, role_arn, source_profile). |
| AWS_ROLE_SESSION_NAME | Used when assuming a role — shows up in CloudTrail. Set it to your name in CI. |
2.3 — Recommended setup · AWS IAM Identity Center (SSO)
```ini
# ~/.aws/config
[profile sso-base]
sso_session    = lf
sso_account_id = 333333333333
sso_role_name  = DeveloperAccess
region         = us-east-1

[profile tf-nonprod]
source_profile    = sso-base
role_arn          = arn:aws:iam::222222222222:role/TerraformDeploy
region            = us-east-1
role_session_name = pponnam-tf-nonprod

[profile tf-prod]
source_profile    = sso-base
role_arn          = arn:aws:iam::111111111111:role/TerraformDeploy
region            = us-east-1
role_session_name = pponnam-tf-prod

[sso-session lf]
sso_start_url = https://lf.awsapps.com/start
sso_region    = us-east-1
```
```shell
# Daily flow
aws sso login --sso-session lf
export AWS_PROFILE=tf-nonprod
aws sts get-caller-identity # verify
cd envs/uat
terraform plan -var-file=uat.tfvars
```
Never put an aws_access_key_id in tfvars or in provider blocks. Use SSO + assume-role; in CI, use OIDC federation.

2.4 — Per-folder env vars with direnv
```shell
# envs/uat/.envrc - committed (no secrets)
export AWS_PROFILE=tf-nonprod
export AWS_REGION=us-east-1
export AWS_SDK_LOAD_CONFIG=1
export TF_PLUGIN_CACHE_DIR="$HOME/.terraform.d/plugin-cache"
```
```shell
cd envs/uat
direnv allow # env loads on every cd
```
2.5 — Variable precedence (memorise)
From lowest to highest priority — later wins:
- default in variables.tf.
- TF_VAR_<name> environment variables.
- terraform.tfvars (auto-loaded).
- *.auto.tfvars in alphabetical order.
- -var-file=foo.tfvars on the CLI (in order given).
- -var name=value on the CLI.
Secrets belong in TF_VAR_* sourced from Secrets Manager / Vault — never in a tfvars file in Git.

2.6 — Quick credential debug recipe
```shell
aws sts get-caller-identity                            # whoami
echo "AWS_PROFILE=$AWS_PROFILE AWS_REGION=$AWS_REGION" # active profile
aws iam simulate-principal-policy \
  --policy-source-arn arn:aws:iam::222222222222:role/TerraformDeploy \
  --action-names ec2:CreateVpc                         # can role do this?
TF_LOG=DEBUG TF_LOG_PATH=/tmp/tf.log terraform plan -var-file=uat.tfvars
tail -f /tmp/tf.log                                    # debug output
```
03 SOP — create a new environment from scratch
Use case: the team needs a new env called preprod in the non-prod account for a regulated workload before it gets promoted to prod. Senior gives you the ticket. This is what you do, top to bottom. Do not skip steps; do not reorder them.
Prerequisites: (1) the agreed env name preprod; (2) a CIDR allocation that doesn't overlap with the four existing envs; (3) approval from the network team for VPC peering if needed; (4) a Jira/ServiceNow ticket id.

Get the green-light artefacts
Confirm in writing (ticket comment): account id, region, CIDR, intended use, owner team, retention/backup expectations, target go-live date, two reviewers.
Pre-create AWS-side prerequisites (one-time, separate PR)
The deploy IAM role and the state-bucket key path must exist before you can run any Terraform for the new env.
```
# What needs to exist before step 4:
arn:aws:iam::222222222222:role/TerraformDeploy      # reused
s3://lf-tfstate-nonprod-222/preprod/network.tfstate # key implicit on first put
DynamoDB table lf-tfstate-locks                     # shared
```
Branch from main — ticket-named
```shell
git checkout main && git pull --ff-only
git checkout -b infra-2104-add-preprod-env
```
Copy the closest existing env as template
```shell
cp -r envs/uat envs/preprod
cd envs/preprod
ls
# backend.tf main.tf outputs.tf providers.tf uat.tfvars variables.tf versions.tf .envrc
```
Rename the tfvars file (use git mv, preserves history)
```shell
git mv uat.tfvars preprod.tfvars
```
Edit backend.tf — the state key MUST be unique
```hcl
terraform {
  backend "s3" {
    bucket         = "lf-tfstate-nonprod-222"
    key            = "preprod/network.tfstate" # <-- changed
    region         = "us-east-1"
    dynamodb_table = "lf-tfstate-locks"
    encrypt        = true
  }
}
```
Edit providers.tf — account-aware
```hcl
provider "aws" {
  region = var.aws_region

  assume_role {
    role_arn     = var.deploy_role_arn
    session_name = "tf-preprod-${terraform.workspace}"
  }

  default_tags {
    tags = {
      Environment = "preprod"
      ManagedBy   = "terraform"
    }
  }
}
```
Edit preprod.tfvars — the per-env knobs
```hcl
environment           = "preprod"
aws_region            = "us-east-1"
deploy_role_arn       = "arn:aws:iam::222222222222:role/TerraformDeploy"
vpc_cidr              = "10.40.0.0/16" # non-overlapping
instance_type         = "m6i.large"    # prod-shaped
cluster_size          = 2
backup_retention_days = 14

common_tags = {
  CostCenter = "CC-2104"
  Owner      = "infra-platform"
  DataClass  = "confidential"
}
```
Update validation list in variables.tf
```hcl
variable "environment" {
  type = string

  validation {
    condition     = contains(["prod", "prod-support", "uat", "test", "preprod"], var.environment)
    error_message = "environment must be one of the supported values."
  }
}
```
Add the env to CODEOWNERS
/envs/preprod/ @lf/infra-platform @lf/security
Initialise the new env
```shell
cd envs/preprod
terraform fmt -recursive ../../
terraform init
# If it offers to copy state from an old key - say NO. Fresh env, fresh state.
```
Validate & plan
```shell
terraform validate
terraform plan -var-file=preprod.tfvars -out=tfplan
```
For a brand new env, expect the full set of resources as + create. Count should match what UAT had.
Commit, push, open PR
```shell
git add -A
git commit -m "INFRA-2104: add preprod environment in non-prod account"
git push -u origin HEAD
gh pr create --fill
```
PR description: ticket link, account id, CIDR, plan summary, network team's approval reference, rollback plan.
CI runs — wait for green
CI re-runs fmt -check, validate, plan, tflint, tfsec, posts plan as PR comment.
Reviews & merge
Two CODEOWNERS approvals. Squash merge into main.
CD applies — watch it
RDS cluster creation is the long pole — expect 6-10 minutes.
Post-apply verification
```shell
cd envs/preprod
terraform plan -var-file=preprod.tfvars
# EXPECT: "No changes. Your infrastructure matches the configuration."
terraform output
aws ec2 describe-vpcs --vpc-ids $(terraform output -raw vpc_id)
```
Update docs & close the ticket
Update README's "supported environments" list. Add an entry to the per-env knobs table. Comment on ticket with VPC id and merge commit. Close ticket.
04 SOP — daily BAU: edit existing infra and apply
Use case: a ticket lands — "INFRA-2210: increase UAT app tier from m6i.large to m6i.xlarge for load testing next Tuesday." Internalise this routine.
Read the ticket fully — including the comments
Confirm: which env(s)? what is changing? deadline? requestor? "do not destroy" notes? Write a one-line plan of attack in the ticket comment before touching code.
Sync main and confirm a clean baseline
```shell
git checkout main && git pull --ff-only
cd envs/uat
direnv allow
terraform init
terraform plan -var-file=uat.tfvars
# EXPECT: "No changes. Your infrastructure matches the configuration."
```
Branch from main
```shell
git checkout -b infra-2210-uat-app-tier-xlarge
```
Find the file that owns the value
```shell
grep -n instance_type envs/uat/uat.tfvars
# envs/uat/uat.tfvars:5: instance_type = "m6i.large"

# If the value isn't in tfvars, walk up: env main.tf -> module variables.tf
grep -rn instance_type modules/compute/
```
Make the smallest possible change
```diff
# envs/uat/uat.tfvars (diff)
-instance_type = "m6i.large"
+instance_type = "m6i.xlarge"
```
Format and validate
```shell
terraform fmt -recursive ../../
terraform validate
```
Plan and read every line
```shell
terraform plan -var-file=uat.tfvars -out=tfplan
# module.app.aws_launch_template.app will be updated in-place
#   ~ instance_type = "m6i.large" -> "m6i.xlarge"
# Plan: 0 to add, 1 to change, 0 to destroy.
```
Red flags in a plan: a - destroy for a stateful resource (RDS, EBS); a replacement (-/+); changes you didn't author. Stop and ask a senior.

Commit with the ticket id, push, open PR
```shell
git diff
git add -A
git commit -m "INFRA-2210: bump UAT app tier to m6i.xlarge for load test"
git push -u origin HEAD
gh pr create --fill
```
Wait for CI — the bot replays your plan
CI: fmt -check → validate → plan → security scans (tflint, tfsec) → posts plan as PR comment. Should be identical to your local plan.
Tag the right reviewer
UAT: one platform reviewer. Prod: two senior + security per CODEOWNERS. Don't re-request review until you've addressed comments.
Squash & merge
```shell
gh pr merge --squash --delete-branch
```
Squash keeps the merge to a single commit — easy to git revert if it goes wrong.

CD applies on merge — monitor it
For an instance-type bump, ASG instance refresh takes 5-10 minutes.
```shell
gh run watch # GitHub Actions live
# Drop a Slack note in #infra-changes when apply starts and finishes.
```
Post-apply verification — never skip
```shell
terraform plan -var-file=uat.tfvars
# EXPECT: "No changes."
aws autoscaling describe-auto-scaling-groups \
  --auto-scaling-group-names lf-uat-app-asg \
  --query 'AutoScalingGroups[].LaunchTemplate.Version'
aws ec2 describe-instances \
  --filters "Name=tag:Environment,Values=uat" "Name=instance-state-name,Values=running" \
  --query 'Reservations[].Instances[].[InstanceId,InstanceType]'
```
Close the loop
Comment on ticket: "Applied at 14:32 UTC, instance refresh complete, all 4 instances now m6i.xlarge, plan is clean." Close ticket. Move on.
4.x — The "always-on" minimum-viable cadence
| Cadence | Action | Why |
|---|---|---|
| Every PR | Plan locally + plan on CI — compare | Catch missing-file commits, env drift |
| Daily (test/uat) | Drift-detection job: terraform plan -detailed-exitcode | Detect console clicks before they bite |
| Weekly | Review provider release notes; bump .terraform.lock.hcl in a dedicated PR | Stay current without surprise |
| Monthly | Cost review: infracost diff PR posts | Catch a junior accidentally provisioning db.r6g.16xlarge |
| Quarterly | Module audit: anything not used? anything that should be a module? | Repo doesn't rot |
4.y — "I broke something" recovery flow
- Stop. Don't make it worse with another apply.
- Re-read the latest plan/apply log. Capture the error.
- Decide: roll forward (small fix PR) or roll back (git revert <sha> on main, merge, CD applies the previous state).
- If state and reality disagree, see Part 1 section 9 troubleshooting. Use terraform import / state rm rather than fighting plan.
- Document what happened in the ticket. If customer-impacting, file a postmortem.
05 New-engineer onboarding — Day 1, Week 1, Month 1
Hand this to a new engineer on day one. Senior validates each box as it gets checked. By the end of month one, they should be running BAU tickets unsupervised on test/uat and shadowing on prod.
- Get added to lf/infra-platform. Confirm push access to terraform-aws-platform.
- Confirm aws sso login --sso-session lf works.
- Set up ~/.aws/config with the tf-nonprod profile (section 2.3).
- Run aws sts get-caller-identity --profile tf-nonprod. Role arn should be TerraformDeploy.
- Install tfenv, then tfenv install from the repo's .terraform-version.
- Install terraform-docs, tflint, tfsec, direnv, gh CLI.
- Run terraform fmt -check -recursive; expect no output (clean).
- Walk the uat VPC in the console. Match what you see to the diagram.
- Ship a first change in test — ideally a tag change or a small SG rule.
- Shadow a plan review.
- Own test tickets that week.
- Practise recovery deliberately in test: force-unlock, state rm, import, observe.
- Get access to infra-platform (read-only).
- Read the secrets module (modules/secrets).

06 Senior's insights — the unwritten rules
Things you'd learn the hard way over five years. Read them now.
One change, one ticket, one commit, one apply. Bundling "while I'm here, let me also…" is how outages happen. Resist it. Open a separate PR.
Treat - destroy on a stateful resource (RDS, EBS, KMS, S3) as an emergency until proven otherwise.
Code, state, and the console can disagree; your job is to bring the three back into agreement. terraform plan is the diff between code and state. The console is the only honest answer for "what is actually running."
for_each over count, almost always.
count uses positional indices — remove the second item and items 3, 4, 5 shift to 2, 3, 4 and Terraform recreates them. for_each uses keys — stable across edits.
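A minimal sketch of the difference, reusing the app_servers variable shape from section 1.4 (resource name is illustrative):

```hcl
# for_each keyed by server name: deleting the entry named "b" from
# var.app_servers touches only aws_instance.app["b"] - nothing renumbers.
# With count, the same deletion would shift every later index and
# recreate those instances.
resource "aws_instance" "app" {
  for_each      = { for s in var.app_servers : s.name => s }
  ami           = data.aws_ami.al2023.id
  instance_type = each.value.instance_type
  tags          = { Name = each.key }
}
```

Addresses in state become aws_instance.app["a"], aws_instance.app["b"], which also makes targeted plans and state moves far less error-prone.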
Tag everything with default_tags. Required tags: Environment, Owner, CostCenter, DataClass, ManagedBy=terraform. Without tags you cannot bill, audit, or page the right team.
Never put secrets in tfvars. Tfvars files end up in Git, in CI logs, in someone's screenshare. Use Secrets Manager / Parameter Store / Vault; fetch at runtime via data sources or TF_VAR_*.
If your module is consumed by other roots, treat its inputs and outputs as public API. Renaming a variable is a breaking change.
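One way to honour that contract is to deprecate rather than rename. A sketch (the variable names and default are hypothetical):

```hcl
# Sketch: introduce the new input name without breaking existing callers.
variable "instance_type" { # old name, kept as a deprecated alias
  type    = string
  default = null
}

variable "app_instance_type" { # new, preferred name
  type    = string
  default = null
}

locals {
  # coalesce prefers the new input, falls back to the old, then a safe default
  instance_type = coalesce(var.app_instance_type, var.instance_type, "t3.medium")
}
```

Callers on the old name keep working; a later major version of the module can drop the alias with a proper changelog entry.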
"Latest" is fine in test. In prod, "latest" means "your config can change between two applies that ran the same code."
prevent_destroy on the irreplaceable.
```hcl
resource "aws_rds_cluster" "prod" {
  lifecycle {
    prevent_destroy = true
    ignore_changes  = [master_password]
  }
}
```
create_before_destroy for things behind a load balancer.
Default is destroy-then-create. For an ALB target group, that means downtime. Set create_before_destroy = true on launch templates and target groups.
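A sketch of the pattern for a target group (names are illustrative; name_prefix rather than name sidesteps the naming collision while old and new briefly coexist):

```hcl
resource "aws_lb_target_group" "app" {
  name_prefix = "lfapp-" # AWS appends a unique suffix, so old + new can coexist
  port        = 8080
  protocol    = "HTTP"
  vpc_id      = module.network.vpc_id

  lifecycle {
    create_before_destroy = true # build the replacement before tearing down
  }
}
```

With a fixed name, Terraform could not create the replacement first because the name would already be taken — name_prefix and create_before_destroy go together.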
ignore_changes for things you don't own.
If autoscaling adjusts desired_capacity, or a deploy tool changes a tag, Terraform fights it. Add the attribute to ignore_changes — sparingly — with a comment explaining why.
terraform plan is read-only and free. Run it constantly.
Before lunch, after pulling main, before opening a PR, after reviewing someone else's. The cheapest test in the toolbox.
Data sources re-evaluate every run. A new AMI was published → your launch template wants to update. Pin the AMI in prod or accept the diff.
Use terraform import and terraform state mv, not "let me destroy and recreate to match."
One folder = one state = one env = one apply. Resist the urge to merge envs into one folder for "DRY" — the duplication is intentional and protective.
What the reviewer reads is the plan posted by CI on the PR. If your local plan and CI plan differ, you forgot to commit something.
lf-prod-web-asg, not production_web_autoscaling_group_v2.
Never apply from a laptop. Apply happens in CD, with an auditable trail. Even if the build is broken — fix the build.
Automate drift detection: a nightly terraform plan -detailed-exitcode per env that pages on a non-zero exit (1 = error, 2 = drift). Drift caught at 02:00 is not an outage; drift caught during a 09:00 apply is.
The code shows what. Comments and PR descriptions exist for the why. "Increased to xlarge for load test on Tuesday" is a commit message. "// xlarge" is noise.