Terraform 301 — AWS Infrastructure Engineering
A hands-on training guide for engineers who already know AWS and want to provision it as code. We will go from "what is Terraform?" to a working multi-account, multi-environment layout (prod, prod-support, uat, test) with VPCs, subnets, security groups, IAM, EC2, and an SQL cluster — plus the Git workflow that wraps it.
Files deep-dive (line-by-line: main.tf, variables.tf, data.tf, locals, outputs), env-vars & SSO credentials, the SOP for creating a new environment, the BAU SOP for editing existing infra, the new-engineer onboarding checklist, and the senior's 20 unwritten rules: see the companion page terraform-301-bau-sop.html.
01 What is Terraform & how it fits into IaC
Infrastructure as Code (IaC) is the practice of describing your cloud resources — VPCs, subnets, EC2, RDS, IAM — in declarative text files kept in version control, then having a tool reconcile reality with that description. The benefit is not "scripting AWS faster"; it is making infrastructure reviewable, reproducible, and auditable.
Where Terraform sits
| Style | Tool | What you write | How it runs |
|---|---|---|---|
| Imperative scripts | AWS CLI, boto3, PowerShell | Steps ("create VPC, then subnet…") | You re-run carefully; no built-in idea of "current state" |
| Declarative, AWS-native | CloudFormation, CDK | Desired state in YAML/JSON or code | Runs inside AWS as a Stack |
| Declarative, multi-cloud | Terraform / OpenTofu | Desired state in HCL .tf files | Compares your code to a state file and produces a plan, then applies it |
| Config management | Ansible, Chef, Puppet | Steps for in-OS config | Runs against running servers (complementary, not a replacement) |
Terraform compares your .tf files (the desired state) with terraform.tfstate (the recorded state), then asks the cloud provider to make reality match.
Why teams adopt it
- Peer review. Infra changes go through pull requests just like application code.
- Reproducibility. The same module deploys an identical VPC in test, uat, prod-support and prod — only the .tfvars differs.
- Drift detection. If someone click-ops a change in the console, the next terraform plan shows it.
- Blast-radius control. A single environment lives in its own state file and can be destroyed/rebuilt without touching the others.
OpenTofu note: if you run the terraform binary, everything in this guide applies; substitute tofu if you have switched.
02 Core concepts you must internalise
Provider
A plugin that talks to an API. hashicorp/aws talks to AWS. You configure region and credentials on it. You can have multiple aliased providers — that's how we hit two AWS accounts from one root.
Resource
A managed thing. aws_vpc, aws_subnet, aws_security_group. Each resource has a type and a local name you reference elsewhere as aws_vpc.main.id.
Data source
A read-only lookup. data "aws_ami" "al2023" finds an AMI without managing it. Use these for things you don't own (e.g. an account you only read from).
Variable
Inputs declared in variables.tf. Values come from *.tfvars, -var flags, env vars (TF_VAR_name), or defaults.
Output
What the module exposes after apply — e.g. the VPC id, subnet ids. Other configurations (or humans) consume them.
State
The JSON file that records what Terraform created. Source of truth for the diff engine. In real environments it lives in S3 with a DynamoDB lock, never on a laptop.
Module
A reusable folder of .tf files with inputs and outputs. "Network module", "ec2 module", "rds module". Modules are how you stop copy-pasting between environments.
Backend
Where state is stored. Configured once per root. We use the S3 + DynamoDB backend so multiple engineers can collaborate safely.
Workspace
A named slot inside one backend. Useful for very small setups. For real multi-env work we prefer separate folders + backends per env — clearer blast radius and IAM.
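For completeness, the workspace commands look like this (a sketch; the four envs in this guide use separate folders instead):

```shell
terraform workspace list        # show all workspaces in this backend
terraform workspace new scratch # create one - state is stored alongside default's
terraform workspace select scratch
# Inside config, the current name is available as terraform.workspace
```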
The lifecycle in five commands
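A typical sequence (flags are illustrative; the -out file lets you apply exactly the plan you reviewed):

```shell
terraform init                                    # download providers, connect to the backend
terraform fmt -recursive                          # normalise formatting across the folder
terraform validate                                # catch syntax and type errors before planning
terraform plan -var-file=test.tfvars -out=tfplan  # diff code + tfvars against state
terraform apply tfplan                            # make AWS match, using the reviewed plan
```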
Never edit terraform.tfstate by hand. Use terraform state mv, terraform state rm, and terraform import instead. We cover these in Troubleshooting.
03 The files: main.tf, variables.tf, *.tfvars
Terraform doesn't care what you call your files — it concatenates every .tf in a folder. But there is a strong convention every team should follow:
| File | What goes in it | Edited per env? |
|---|---|---|
| main.tf | The actual resources and module calls (the "what to build"). | No — same code for every env. |
| variables.tf | Declarations of inputs — name, type, description, default, validation. | No. |
| outputs.tf | Things to expose after apply (VPC id, subnet ids, ALB DNS). | No. |
| providers.tf | Provider config — region, alias, assume-role. | Sometimes (account id changes). |
| backend.tf | Where state lives (S3 bucket, key, DynamoDB lock table). | Yes — the state key is per-env. |
| versions.tf | Required Terraform & provider versions. | No. |
| terraform.tfvars | Default variable values (auto-loaded). | Avoid in multi-env — prefer named files. |
| prod.tfvars, test.tfvars… | Values per environment — instance sizes, CIDR blocks, account ids. | Yes — this is the per-env knob. |
Tiny example to make it concrete
variable "environment" {
type = string
description = "prod | prod-support | uat | test"
validation {
condition = contains(["prod","prod-support","uat","test"], var.environment)
error_message = "environment must be one of prod, prod-support, uat, test."
}
}
variable "vpc_cidr" {
type = string
default = "10.0.0.0/16"
}
variable "instance_type" {
type = string
default = "t3.small"
}
variable "common_tags" {
type = map(string)
default = {}
}
locals {
name_prefix = "lf-${var.environment}"
tags = merge(var.common_tags, {
Environment = var.environment
ManagedBy = "terraform"
})
}
module "network" {
source = "../../modules/network"
name = "${local.name_prefix}-vpc"
cidr_block = var.vpc_cidr
tags = local.tags
}
# test.tfvars - non-prod account, smallest footprint
environment = "test"
vpc_cidr = "10.30.0.0/16"
instance_type = "t3.small"
common_tags = {
CostCenter = "CC-1042"
Owner = "infra-platform"
DataClass = "internal"
}
# prod.tfvars - prod account, hardened sizing
environment = "prod"
vpc_cidr = "10.10.0.0/16"
instance_type = "m6i.large"
common_tags = {
CostCenter = "CC-1001"
Owner = "infra-platform"
DataClass = "confidential"
Compliance = "sox"
}
output "vpc_id" {
value = module.network.vpc_id
description = "VPC id created in this environment"
}
output "private_subnet_ids" {
value = module.network.private_subnet_ids
}
Variable precedence, highest wins: -var on the CLI > -var-file > *.auto.tfvars > terraform.tfvars > TF_VAR_* env vars > default in variables.tf. Knowing this lets you override safely in CI.
04 Multi-account × multi-environment layout
The real situation: two AWS accounts (a prod account and a non-prod account) hosting four logical environments:
Prod AWS account
- prod — customer-facing workloads
- prod-support — jump hosts, monitoring, backup tooling that needs to see prod
Non-prod AWS account
- uat — user acceptance, prod-shaped data
- test — integration / dev sandbox, smallest sizing
Recommended directory structure
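A sketch of the layout (folder names are assumptions, chosen to match the snippets elsewhere in this guide):

```
terraform-aws-platform/
├── modules/
│   ├── network/        # VPC, subnets, IGW, NAT
│   ├── security/       # SG chain ALB -> web -> app -> db
│   ├── iam/            # roles, instance profiles
│   ├── compute/        # launch template + ASG
│   └── database/       # Aurora cluster
└── envs/
    ├── prod/           # own backend.tf, providers.tf, prod.tfvars
    ├── prod-support/
    ├── uat/
    └── test/
```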
Separate folders and backends mean test cannot touch prod's state. Workspaces share the backend — risky for production.
Wiring two AWS accounts: provider aliases + assume-role
Each environment's root assumes a deployment role in the right account. Engineers' local AWS profile only needs permission to assume those roles; they never carry account access keys.
# envs/prod/providers.tf
provider "aws" {
region = "us-east-1"
assume_role {
role_arn = "arn:aws:iam::111111111111:role/TerraformDeploy"
session_name = "tf-prod-${terraform.workspace}"
}
default_tags {
tags = {
Environment = "prod"
ManagedBy = "terraform"
}
}
}
# Optional second alias to read-only data from non-prod account
# (e.g. peering, AMI sharing)
provider "aws" {
alias = "nonprod_ro"
region = "us-east-1"
assume_role {
role_arn = "arn:aws:iam::222222222222:role/TerraformReadOnly"
}
}
Wiring backend (S3) per environment
# envs/prod/backend.tf
terraform {
backend "s3" {
bucket = "lf-tfstate-prod-111111111111"
key = "prod/network.tfstate"
region = "us-east-1"
dynamodb_table = "lf-tfstate-locks"
encrypt = true
}
}
The four envs differ in three places only: backend key, provider role_arn, and the *.tfvars values. Everything else — main.tf, modules, code review — is shared.
Per-env knobs at a glance
| Knob | prod | prod-support | uat | test |
|---|---|---|---|---|
| AWS account | 111111111111 | 111111111111 | 222222222222 | 222222222222 |
| VPC CIDR | 10.10.0.0/16 | 10.11.0.0/16 | 10.20.0.0/16 | 10.30.0.0/16 |
| EC2 size | m6i.large | t3.medium | t3.medium | t3.small |
| RDS Multi-AZ | true | n/a | true | false |
| Backups | 35d, PITR | n/a | 14d | 1d |
| NAT | HA per-AZ | HA per-AZ | 1 NAT | 1 NAT |
| Approvers | 2 senior | 2 senior | 1 | 1 |
05 AWS architecture — the picture you'll provision
Two diagrams. The first is the org-wide layout: two AWS accounts hosting four environments, all driven by the same Terraform repo. The second is one VPC zoomed in — the actual networking, compute, and data plane Terraform creates.
5.1 — Two AWS accounts × four environments
5.2 — Anatomy of one VPC (zoomed in)
This is what you actually write Terraform for. Trace the request path: client → ALB SG → web SG → app SG → db SG. Trace the IAM path: instance profile → role → managed + inline policies.
06 Real Terraform code — VPC, SG, subnets, IAM, EC2, RDS
Below is the kind of code you'd actually write. Each tab is a real, copy-pasteable snippet. The convention: modules own the resources, envs/<env>/main.tf just calls those modules with environment-specific inputs.
# modules/network/main.tf - builds VPC + IGW + 2 public + 2 private subnets + NAT
resource "aws_vpc" "this" {
cidr_block = var.cidr_block
enable_dns_hostnames = true
enable_dns_support = true
tags = merge(var.tags, { Name = var.name })
}
resource "aws_internet_gateway" "this" {
vpc_id = aws_vpc.this.id
tags = merge(var.tags, { Name = "${var.name}-igw" })
}
data "aws_availability_zones" "available" {
state = "available"
}
resource "aws_subnet" "public" {
for_each = { for idx, az in slice(data.aws_availability_zones.available.names, 0, 2) : az => idx }
vpc_id = aws_vpc.this.id
availability_zone = each.key
cidr_block = cidrsubnet(var.cidr_block, 8, each.value + 1)
map_public_ip_on_launch = true
tags = merge(var.tags, {
Name = "${var.name}-public-${each.key}"
Tier = "public"
})
}
resource "aws_subnet" "private" {
for_each = { for idx, az in slice(data.aws_availability_zones.available.names, 0, 2) : az => idx }
vpc_id = aws_vpc.this.id
availability_zone = each.key
cidr_block = cidrsubnet(var.cidr_block, 8, each.value + 11)
tags = merge(var.tags, {
Name = "${var.name}-private-${each.key}"
Tier = "private"
})
}
resource "aws_eip" "nat" {
for_each = aws_subnet.public
domain = "vpc"
}
resource "aws_nat_gateway" "this" {
for_each = aws_subnet.public
allocation_id = aws_eip.nat[each.key].id
subnet_id = each.value.id
tags = merge(var.tags, { Name = "${var.name}-nat-${each.key}" })
}
# modules/network/outputs.tf
output "vpc_id" { value = aws_vpc.this.id }
output "public_subnet_ids" { value = [for s in aws_subnet.public : s.id] }
output "private_subnet_ids" { value = [for s in aws_subnet.private : s.id] }
# modules/security/main.tf - the SG chain ALB -> web -> app -> db
# Use aws_vpc_security_group_*_rule (Terraform AWS provider 5.x) instead of inline rules.
# That way each rule is its own resource - cleaner diffs, no churn.
resource "aws_security_group" "alb" {
name = "${var.name_prefix}-alb"
description = "ALB ingress from internet"
vpc_id = var.vpc_id
}
resource "aws_vpc_security_group_ingress_rule" "alb_https" {
security_group_id = aws_security_group.alb.id
cidr_ipv4 = "0.0.0.0/0"
from_port = 443
to_port = 443
ip_protocol = "tcp"
}
resource "aws_security_group" "web" {
name = "${var.name_prefix}-web"
description = "Web tier - only ALB can reach it"
vpc_id = var.vpc_id
}
resource "aws_vpc_security_group_ingress_rule" "web_from_alb" {
security_group_id = aws_security_group.web.id
referenced_security_group_id = aws_security_group.alb.id # <- by id, not CIDR
from_port = 80
to_port = 80
ip_protocol = "tcp"
}
resource "aws_security_group" "app" {
  name   = "${var.name_prefix}-app"
  vpc_id = var.vpc_id
}
resource "aws_vpc_security_group_ingress_rule" "app_from_web" {
  security_group_id            = aws_security_group.app.id
  referenced_security_group_id = aws_security_group.web.id
  from_port   = 8080
  to_port     = 8080
  ip_protocol = "tcp"
}
resource "aws_security_group" "db" {
  name   = "${var.name_prefix}-db"
  vpc_id = var.vpc_id
}
resource "aws_vpc_security_group_ingress_rule" "db_from_app" {
  security_group_id            = aws_security_group.db.id
  referenced_security_group_id = aws_security_group.app.id
  from_port   = 3306
  to_port     = 3306
  ip_protocol = "tcp"
}
# Egress: explicit, not implicit "all"
# Each tier gets exactly what it needs.
resource "aws_vpc_security_group_egress_rule" "app_to_db" {
  security_group_id            = aws_security_group.app.id
  referenced_security_group_id = aws_security_group.db.id
  from_port   = 3306
  to_port     = 3306
  ip_protocol = "tcp"
}
# modules/iam/main.tf - EC2 instance profile (least privilege)
data "aws_iam_policy_document" "assume_ec2" {
statement {
actions = ["sts:AssumeRole"]
principals {
type = "Service"
identifiers = ["ec2.amazonaws.com"]
}
}
}
resource "aws_iam_role" "ec2_app" {
name = "${var.name_prefix}-ec2-app"
assume_role_policy = data.aws_iam_policy_document.assume_ec2.json
tags = var.tags
}
# Managed policy: SSM Session Manager (no SSH needed, ever)
resource "aws_iam_role_policy_attachment" "ssm" {
role = aws_iam_role.ec2_app.name
policy_arn = "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore"
}
# Inline policy: app-specific permissions
data "aws_iam_policy_document" "app_inline" {
statement {
sid = "ReadAppArtifacts"
actions = ["s3:GetObject", "s3:ListBucket"]
resources = [
"arn:aws:s3:::${var.artifacts_bucket}",
"arn:aws:s3:::${var.artifacts_bucket}/*",
]
}
statement {
sid = "DecryptWithAppKey"
actions = ["kms:Decrypt"]
resources = [var.app_kms_key_arn]
}
}
resource "aws_iam_role_policy" "app_inline" {
name = "app-inline"
role = aws_iam_role.ec2_app.id
policy = data.aws_iam_policy_document.app_inline.json
}
# Bind role to an instance profile - this is what you attach to EC2
resource "aws_iam_instance_profile" "ec2_app" {
name = "${var.name_prefix}-ec2-app"
role = aws_iam_role.ec2_app.name
}
output "instance_profile_name" { value = aws_iam_instance_profile.ec2_app.name }
# Key pair (break-glass only - normal access is via SSM)
resource "tls_private_key" "break_glass" {
algorithm = "ED25519"
}
resource "aws_key_pair" "break_glass" {
key_name = "${var.name_prefix}-break-glass"
public_key = tls_private_key.break_glass.public_key_openssh
}
# Store the private key in SSM Parameter Store, not on a laptop
resource "aws_ssm_parameter" "break_glass_priv" {
name = "/${var.environment}/keys/break-glass"
type = "SecureString"
value = tls_private_key.break_glass.private_key_pem
tags = var.tags
}
# Extra EBS volume for app data (separate from root)
resource "aws_ebs_volume" "app_data" {
availability_zone = var.az
size = var.data_volume_size_gb # e.g. 100
type = "gp3"
iops = 3000
throughput = 125
encrypted = true
kms_key_id = var.app_kms_key_arn
tags = merge(var.tags, { Name = "${var.name_prefix}-data" })
}
resource "aws_volume_attachment" "app_data" {
device_name = "/dev/xvdf"
volume_id = aws_ebs_volume.app_data.id
instance_id = aws_instance.app.id
}
# modules/compute/main.tf - launch template + autoscaling group + ALB target group
data "aws_ami" "al2023" {
most_recent = true
owners = ["amazon"]
filter {
name = "name"
values = ["al2023-ami-*-x86_64"]
}
}
resource "aws_launch_template" "app" {
name_prefix = "${var.name_prefix}-app-"
image_id = data.aws_ami.al2023.id
instance_type = var.instance_type
key_name = var.key_pair_name
vpc_security_group_ids = [var.app_sg_id]
iam_instance_profile {
name = var.instance_profile_name
}
block_device_mappings {
device_name = "/dev/xvda"
ebs {
volume_size = 30
volume_type = "gp3"
encrypted = true
}
}
user_data = base64encode(templatefile("${path.module}/userdata.sh.tpl", {
environment = var.environment
}))
tag_specifications {
resource_type = "instance"
tags = merge(var.tags, { Name = "${var.name_prefix}-app" })
}
}
resource "aws_autoscaling_group" "app" {
name = "${var.name_prefix}-app-asg"
vpc_zone_identifier = var.private_subnet_ids
min_size = var.min_size
max_size = var.max_size
desired_capacity = var.desired_capacity
target_group_arns = [aws_lb_target_group.app.arn]
health_check_type = "ELB"
launch_template {
id = aws_launch_template.app.id
version = "$Latest"
}
tag {
key = "Environment"
value = var.environment
propagate_at_launch = true
}
}
# modules/database/main.tf - Aurora MySQL cluster (writer + 2 readers, Multi-AZ)
resource "aws_db_subnet_group" "this" {
name = "${var.name_prefix}-db"
subnet_ids = var.private_subnet_ids
tags = var.tags
}
resource "aws_rds_cluster_parameter_group" "this" {
name = "${var.name_prefix}-aurora-mysql"
family = "aurora-mysql8.0"
parameter {
name = "binlog_format"
value = "ROW"
}
}
resource "aws_rds_cluster" "this" {
cluster_identifier = "${var.name_prefix}-aurora"
engine = "aurora-mysql"
engine_version = "8.0.mysql_aurora.3.05.2"
database_name = var.db_name
master_username = var.db_user
master_password = random_password.db.result
db_subnet_group_name = aws_db_subnet_group.this.name
vpc_security_group_ids = [var.db_sg_id]
db_cluster_parameter_group_name = aws_rds_cluster_parameter_group.this.name
storage_encrypted = true
kms_key_id = var.kms_key_arn
backup_retention_period = var.backup_retention_days
preferred_backup_window = "03:00-04:00"
deletion_protection = var.environment == "prod"
skip_final_snapshot = var.environment != "prod"
tags = var.tags
}
resource "aws_rds_cluster_instance" "this" {
count = var.cluster_size # prod=3, uat=2, test=1
identifier = "${var.name_prefix}-aurora-${count.index}"
cluster_identifier = aws_rds_cluster.this.id
instance_class = var.db_instance_class
engine = aws_rds_cluster.this.engine
engine_version = aws_rds_cluster.this.engine_version
db_subnet_group_name = aws_db_subnet_group.this.name
tags = var.tags
}
resource "random_password" "db" {
length = 32
special = true
}
resource "aws_secretsmanager_secret" "db" {
name = "${var.name_prefix}/db/master"
}
resource "aws_secretsmanager_secret_version" "db" {
secret_id = aws_secretsmanager_secret.db.id
secret_string = jsonencode({ username = var.db_user, password = random_password.db.result })
}
# envs/uat/main.tf - the env root just composes modules with uat values
module "network" {
source = "../../modules/network"
name = "lf-${var.environment}"
cidr_block = var.vpc_cidr
tags = local.tags
}
module "security" {
source = "../../modules/security"
name_prefix = "lf-${var.environment}"
vpc_id = module.network.vpc_id
}
module "iam" {
source = "../../modules/iam"
name_prefix = "lf-${var.environment}"
artifacts_bucket = "lf-artifacts-${var.environment}"
app_kms_key_arn = aws_kms_key.app.arn
tags = local.tags
}
module "app" {
source = "../../modules/compute"
name_prefix = "lf-${var.environment}"
environment = var.environment
private_subnet_ids = module.network.private_subnet_ids
app_sg_id = module.security.app_sg_id
instance_profile_name = module.iam.instance_profile_name
instance_type = var.instance_type
min_size = 2
max_size = 4
desired_capacity = 2
tags = local.tags
}
module "db" {
source = "../../modules/database"
name_prefix = "lf-${var.environment}"
environment = var.environment
private_subnet_ids = module.network.private_subnet_ids
db_sg_id = module.security.db_sg_id
kms_key_arn = aws_kms_key.app.arn
cluster_size = 2
db_instance_class = "db.r6g.large"
backup_retention_days = 14
db_name = "appdb"
db_user = "appadmin"
tags = local.tags
}
07 Interactive walkthrough — provision a real change
Pick a scenario below and step through it. The terminal simulates what you would actually see when running these commands against the layout in section 4. Try the Terraform first apply path first, then the Ticketed change path which weaves Git in.
Remember the loop: init → fmt → validate → plan → review → apply.
08 Git workflow — from a ticket to merged main
Terraform is only as safe as the change-management process around it. The pattern below is the most boring, most reliable one I've seen work at scale. Trunk-based, short-lived branches, every plan visible on the PR.
The flow
Branch naming, commit messages, PR title
| Item | Convention | Example |
|---|---|---|
| Branch | <ticket-id-lower>-<short-slug> | infra-1842-uat-partner-cidr |
| Commit | TICKET: imperative summary ≤ 72 chars | INFRA-1842: open uat ALB to partner 203.0.113.0/24 |
| PR title | Same as commit | (GitHub auto-fills it) |
| PR body | What / why / blast-radius / plan output / rollback | see template below |
Useful Git commands for infra work
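A typical day-to-day set (a sketch; the branch and message follow the conventions above):

```shell
git switch -c infra-1842-uat-partner-cidr    # one branch per ticket
git add -p                                   # stage hunk by hunk - review your own diff
git commit -m "INFRA-1842: open uat ALB to partner 203.0.113.0/24"
git fetch origin && git rebase origin/main   # stay current; re-plan after every rebase
git log --oneline -- envs/uat/               # who touched this env, and when
```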
Files every Terraform repo must have
.gitignore
# Terraform internals - never commit these
.terraform/
# NOTE: do NOT ignore .terraform.lock.hcl - commit it (see below)
*.tfstate
*.tfstate.*
*.tfplan
tfplan
crash.log
crash.*.log
# Local overrides
*.auto.tfvars
override.tf
override.tf.json
# IDE / OS
.idea/
.vscode/
.DS_Store
Never commit *.tfstate. It contains secrets in plaintext (RDS passwords, etc.). State lives in S3 + KMS, not in Git.
.terraform.lock.hcl — commit it
Pins the exact provider versions used to apply. Commit it so every engineer + CI runs identical providers. Without it, a new bug-fix release of aws provider can silently change your plan.
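Upgrading providers then becomes an explicit, reviewable act (commands are illustrative):

```shell
terraform init -upgrade    # move to the newest versions allowed by versions.tf
git add .terraform.lock.hcl
git commit -m "INFRA-####: bump aws provider"   # dedicated PR, plan output attached
```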
CODEOWNERS
# Default owner
* @lf/infra-platform
# Modules - any senior engineer can review
/modules/ @lf/infra-platform
# Per-env: prod requires senior + security
/envs/prod/ @lf/infra-seniors @lf/security
/envs/prod-support/ @lf/infra-seniors
/envs/uat/ @lf/infra-platform
/envs/test/ @lf/infra-platform
PR template (.github/PULL_REQUEST_TEMPLATE.md)
## Ticket
INFRA-####
## What
<1-2 sentences>
## Why
<business or compliance driver>
## Blast radius
- Envs touched: test / uat / prod-support / prod
- Resources affected: + N, ~ M, - K
## Plan output
<paste or link CI artifact>
## Rollback plan
git revert + apply, OR `terraform apply` of previous tag
Protecting your state bucket
- S3 bucket: versioning ON, KMS encryption, block public access, MFA Delete on prod bucket.
- DynamoDB lock table: provisioned, with point-in-time recovery.
- IAM: only the TerraformDeploy role can write; engineers get read-only.
- Cross-account: prod state bucket lives in a third "tooling" account — or in the prod account with a tight bucket policy.
09 When something goes wrong — identify & recover
The hardest part of Terraform is not writing it — it is recovering when reality and state disagree. Here are the most common failure modes and how to think about them.
| Symptom | What it means | How to fix |
|---|---|---|
| Error acquiring the state lock | Another plan/apply is running, OR a previous one crashed without releasing the lock. | Find who holds the lock (the error shows their identity). Wait or talk to them. Only as a last resort: terraform force-unlock <LOCK_ID>. |
| Plan wants to destroy a resource you didn't change | Either someone removed it from .tf, or the resource address changed (rename, module move). | If it was a rename: terraform state mv old.addr new.addr. To keep the resource but stop managing it: terraform state rm. |
| Plan wants to create something that already exists in AWS | Resource was created in the console (drift), or a previous apply lost it from state. | terraform import <addr> <aws-id>, then re-plan to confirm a clean diff. |
| InvalidParameterValue: VPC has dependencies on destroy | Resources outside Terraform's state (manually created ENIs, peerings) are blocking deletion. | Find them in the console, decide to import or delete out-of-band, then retry. |
| Plan output is huge / nondeterministic | Often a tag set differs (e.g. AWS auto-injects a tag), or an attribute is computed. | Use ignore_changes in lifecycle for known noise. Don't silence everything — you'll miss real diffs. |
| Apply fails halfway | Some resources got created, some didn't. State reflects what succeeded. | Re-run plan. Terraform will only create what's missing. Don't init -reconfigure in panic. |
| Provider version drift across the team | Someone ran without committing .terraform.lock.hcl. | Commit the lock file. Use terraform init -upgrade intentionally, in a dedicated PR. |
| Cycle in module graph | Two resources reference each other. | Break the cycle with a third resource (e.g. an SG rule referencing two SGs by id) or via a data lookup. |
| Git merge conflict in *.tfvars or main.tf | Two PRs touched the same file. | git rebase origin/main, resolve, terraform fmt, terraform validate, force-push the branch. Plan again before merging. |
| Force-pushed over someone else's commit | Lost work in the branch (not main). | git reflog to find the lost commit, git cherry-pick it back. |
| terraform plan shows resources you never wrote | You're pointing at the wrong backend / state file. Common when copy-pasting backend.tf. | Check .terraform/terraform.tfstate → it points to the bucket. Stop, fix the backend block, terraform init -reconfigure. |
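For the drift case, Terraform 1.5+ also supports declaring the import in code so the adoption shows up in plan before anything is recorded. A minimal sketch (the address and id are placeholders):

```hcl
# Put this next to the resource definition; remove it after apply.
import {
  to = aws_vpc.main              # the address the resource will get in state
  id = "vpc-0abc123de45f6789"    # the existing AWS resource id
}
# terraform plan now shows "will be imported"; apply records it in state.
```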
The "is it Terraform or Git?" decision tree
When in doubt, run terraform plan against an empty branch first to see what state thinks reality looks like. plan never changes anything.
10 Cheatsheet
Terraform commands you'll use weekly
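A typical weekly set (flags are illustrative):

```shell
terraform fmt -recursive && terraform validate   # pre-commit hygiene
terraform plan -var-file=uat.tfvars -out=tfplan  # always plan to a file
terraform apply tfplan                           # apply exactly what was reviewed
terraform state list                             # what does state think exists?
terraform output -json                           # consume outputs in scripts
terraform console                                # evaluate expressions interactively
```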
Git commands for infra PRs
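The PR cycle in commands (a sketch; <merge-sha> is a placeholder):

```shell
git push -u origin HEAD                                 # publish the branch, open the PR
git commit --amend --no-edit && git push --force-with-lease   # fold in review fixes
git rebase origin/main && git push --force-with-lease   # refresh a stale PR safely
git revert <merge-sha>                                  # rollback = a new PR, then re-apply
```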
Mental model recap
1. Code is desired state
Your .tf files describe what you want. Terraform's job is to make AWS match.
2. State is recorded reality
S3 + DynamoDB. Never on a laptop. Never edited by hand.
3. Modules > copy-paste
Same module, different tfvars per env. That's how you get four reproducible environments.
4. Plan before apply
Read every line of the plan. If you don't understand a diff, stop.
5. Git is the audit trail
Every infra change is a PR. CI plan in PR comments. CODEOWNERS gate prod.
6. Blast radius via folders
One folder = one state = one env. Test can never break prod by accident.
11 Files deep-dive — what each one does, line by line
New engineers often look at a Terraform folder and see seven files with confusing names. Here is what each is for, why it exists, and what goes inside it. Terraform reads every .tf file in the current folder and stitches them together as one big config; file order never matters, because evaluation follows the dependency graph. The filenames are pure convention — but follow the convention because that is what every reviewer expects.
11.1 — versions.tf · the contract
This file pins the Terraform version and provider versions. It is the first file the senior writes and the last one to change.
terraform {
required_version = ">= 1.6.0, < 2.0.0" # your binary must be in this range
required_providers {
aws = {
source = "hashicorp/aws" # registry namespace
version = "~> 5.74" # 5.74.x ok, 6.x not ok
}
random = { source = "hashicorp/random", version = "~> 3.6" }
tls = { source = "hashicorp/tls", version = "~> 4.0" }
}
}
11.2 — providers.tf · how Terraform talks to AWS
provider "aws" { # the default (un-aliased) provider
region = var.aws_region
assume_role { # engineer's SSO → deploy role
role_arn = var.deploy_role_arn
session_name = "tf-${var.environment}"
}
default_tags { # tag EVERY resource automatically
tags = {
Environment = var.environment
ManagedBy = "terraform"
Repo = "terraform-aws-platform"
}
}
}
provider "aws" { # aliased provider, e.g. another region
alias = "us_west"
region = "us-west-2"
}
# Inside a resource use: provider = aws.us_west
11.3 — backend.tf · where state lives
terraform {
backend "s3" {
bucket = "lf-tfstate-nonprod-222" # the bucket per account
key = "uat/network.tfstate" # <-- the per-env knob
region = "us-east-1"
dynamodb_table = "lf-tfstate-locks"
encrypt = true
}
}
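Some teams instead keep the backend block partial and inject the per-env values at init time. A hypothetical sketch (file name and values are assumptions):

```hcl
# uat.s3.tfbackend - plain key = value pairs, one file per env.
# Used with: terraform init -backend-config=uat.s3.tfbackend
bucket         = "lf-tfstate-nonprod-222"
key            = "uat/network.tfstate"
region         = "us-east-1"
dynamodb_table = "lf-tfstate-locks"
encrypt        = true
```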
Keep one backend.tf per env folder (recommended), or use terraform init -backend-config=....
11.4 — variables.tf · declarations only, no values
This file declares what the config accepts as input. It never holds values. Values come from *.tfvars, -var, or TF_VAR_* env vars (covered in section 12).
# variables.tf - declarations
variable "environment" {
type = string # required type
description = "Logical env name: prod | prod-support | uat | test"
validation { # enforce shape at plan time
condition = contains(["prod","prod-support","uat","test"], var.environment)
error_message = "environment must be prod, prod-support, uat, or test."
}
}
variable "vpc_cidr" {
type = string
description = "VPC CIDR block, /16"
default = "10.0.0.0/16" # default = optional input
}
variable "db_password" {
type = string
sensitive = true # hides value in plan output and outputs
description = "Master DB password (typically supplied by Secrets Manager, not tfvars)"
}
variable "app_servers" { # complex types are first-class
type = list(object({
name = string
instance_type = string
public = bool
}))
default = []
description = "App tier sizing per server"
}
variable "common_tags" {
type = map(string)
default = {}
}
| Variable attribute | What it does |
|---|---|
| type | string, number, bool, list(...), set(...), map(...), object({...}), tuple([...]) |
| description | Shows up in terraform plan hints and module docs. Always write it. |
| default | Optional. If absent, a value MUST be supplied at plan time. |
| sensitive | Redacts from plan/apply output. Still saved to state — protect state. |
| nullable | false means callers cannot pass null. |
| validation | Reject bad values at plan time with a friendly error. |
11.5 — data.tf · read-only lookups
Data sources let you query AWS without managing the resource. Latest AMI, an existing Route53 zone, the caller identity. Output of a data source is fresh on every run; that is good for AMIs (you want the newest) but means you can get unexpected diffs — pin AMIs in production.
# data.tf
data "aws_caller_identity" "current" {} # who am I? - returns account_id, arn, user_id
data "aws_region" "current" {} # the region you're in
data "aws_availability_zones" "available" { # list of AZs in the region
state = "available"
}
data "aws_ami" "al2023" {
most_recent = true
owners = ["amazon"]
filter {
name = "name"
values = ["al2023-ami-*-x86_64"]
}
}
data "aws_route53_zone" "corp" { # reference an existing zone
name = "example.internal."
private_zone = true
}
# Use them anywhere as data.<type>.<name>.<attr>
# e.g. data.aws_caller_identity.current.account_id
If you want Terraform to manage the thing, make it a resource. If you only want to read it (because someone else owns it), make it a data source.
11.6 — main.tf · the table of contents
Despite the name, main.tf rarely contains the bulk of your code in a real repo. Resources live in modules. The env-level main.tf is just "this env composes these modules".
# envs/uat/main.tf
# 1. local values - computed once, used in many places
locals {
name_prefix = "lf-${var.environment}"
account_id = data.aws_caller_identity.current.account_id
tags = merge(var.common_tags, {
Environment = var.environment
Account = local.account_id
ManagedBy = "terraform"
})
}
# 2. module composition - the actual stack for this env
module "network" {
source = "../../modules/network"
name = "${local.name_prefix}-vpc"
cidr_block = var.vpc_cidr
tags = local.tags
}
module "security" {
source = "../../modules/security"
name_prefix = local.name_prefix
vpc_id = module.network.vpc_id # cross-module reference
tags = local.tags
}
# 3. one-off resources are fine here too if they're truly env-specific
resource "aws_kms_key" "app" {
description = "App KMS key for ${var.environment}"
deletion_window_in_days = var.environment == "prod" ? 30 : 7
tags = local.tags
}
Split main.tf when it grows. When the file gets past ~150 lines, split by concern: network.tf, compute.tf, database.tf, iam.tf. Terraform doesn't care — it concatenates anyway.