Terraform 301 — AWS Infrastructure Engineering

A hands-on training guide for engineers who already know AWS and want to provision it as code. We will go from "what is Terraform?" to a working multi-account, multi-environment layout (prod, prod-support, uat, test) with VPCs, subnets, security groups, IAM, EC2, and an SQL cluster — plus the Git workflow that wraps it.

Terraform 1.6+ · AWS Provider 5.x · 2 accounts × 4 environments · 3-tier + standalone + SQL cluster
Part 2 · SOPs & BAU is in a companion file.
Files deep-dive (line-by-line: main.tf, variables.tf, data.tf, locals, outputs), env-vars & SSO credentials, the SOP for creating a new environment, the BAU SOP for editing existing infra, the new-engineer onboarding checklist, and the senior's 20 unwritten rules are all in terraform-301-bau-sop.html.

01 What is Terraform & how it fits into IaC

Infrastructure as Code (IaC) is the practice of describing your cloud resources — VPCs, subnets, EC2, RDS, IAM — in declarative text files kept in version control, then having a tool reconcile reality with that description. The benefit is not "scripting AWS faster"; it is making infrastructure reviewable, reproducible, and auditable.

Where Terraform sits

Style | Tool | What you write | How it runs
Imperative scripts | AWS CLI, boto3, PowerShell | Steps ("create VPC, then subnet…") | You re-run carefully; no built-in idea of "current state"
Declarative, AWS-native | CloudFormation, CDK | Desired state in YAML/JSON or code | Runs inside AWS as a Stack
Declarative, multi-cloud | Terraform / OpenTofu | Desired state in HCL .tf files | Compares your code to a state file and produces a plan, then applies it
Config management | Ansible, Chef, Puppet | Steps for in-OS config | Runs against running servers (complementary, not a replacement)
Mental model: Terraform is a diff engine. It reads your .tf files (the desired state) and the terraform.tfstate (the recorded state), then asks the cloud provider to make reality match.

Why teams adopt it

  • Peer review. Infra changes go through pull requests just like application code.
  • Reproducibility. The same module deploys an identical VPC in test, uat, prod-support and prod — only the .tfvars differs.
  • Drift detection. If someone clicks-ops a change in the console, the next terraform plan shows it.
  • Blast-radius control. A single environment lives in its own state file and can be destroyed/rebuilt without touching the others.
Terraform vs OpenTofu: OpenTofu is the open-source fork created after HashiCorp's switch to the BSL. Commands and HCL are compatible. If your team uses the terraform binary, everything in this guide applies; substitute tofu if you have switched.

02 Core concepts you must internalise

Provider

A plugin that talks to an API. hashicorp/aws talks to AWS. You configure region and credentials on it. You can have multiple aliased providers — that's how we hit two AWS accounts from one root.

Resource

A managed thing. aws_vpc, aws_subnet, aws_security_group. Each resource has a type and a local name you reference elsewhere as aws_vpc.main.id.

Data source

A read-only lookup. data "aws_ami" "al2023" finds an AMI without managing it. Use these for things you don't own (e.g. an account you only read from).

Variable

Inputs declared in variables.tf. Values come from *.tfvars, -var flags, env vars (TF_VAR_name), or defaults.

Output

What the module exposes after apply — e.g. the VPC id, subnet ids. Other configurations (or humans) consume them.

State

The JSON file that records what Terraform created. Source of truth for the diff engine. In real environments it lives in S3 with a DynamoDB lock, never on a laptop.

Module

A reusable folder of .tf files with inputs and outputs. "Network module", "ec2 module", "rds module". Modules are how you stop copy-pasting between environments.
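To make "inputs and outputs" concrete, here is a minimal, hypothetical module and a call to it (the module name and values are illustrative only):

# modules/tags/variables.tf  - a deliberately tiny, hypothetical module
variable "environment" {
  type = string
}

# modules/tags/outputs.tf
output "tags" {
  value = {
    Environment = var.environment
    ManagedBy   = "terraform"
  }
}

# Any caller can now reuse it:
module "tags" {
  source      = "../modules/tags"
  environment = "test"
}
# ...and consume the result as module.tags.tags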

Backend

Where state is stored. Configured once per root. We use the S3 + DynamoDB backend so multiple engineers can collaborate safely.

Workspace

A named slot inside one backend. Useful for very small setups. For real multi-env work we prefer separate folders + backends per env — clearer blast radius and IAM.

The lifecycle, command by command

terraform init
terraform fmt -check && terraform validate
terraform plan -var-file=test.tfvars -out=tfplan
terraform apply tfplan
terraform destroy -var-file=test.tfvars
terraform state list
Never edit terraform.tfstate by hand. Use terraform state mv, terraform state rm, terraform import. We cover these in Troubleshooting.

03 The files: main.tf, variables.tf, *.tfvars

Terraform doesn't care what you call your files — it concatenates every .tf in a folder. But there is a strong convention every team should follow:

File | What goes in it | Edited per env?
main.tf | The actual resources and module calls (the "what to build"). | No — same code for every env.
variables.tf | Declarations of inputs — name, type, description, default, validation. | No.
outputs.tf | Things to expose after apply (VPC id, subnet ids, ALB DNS). | No.
providers.tf | Provider config — region, alias, assume-role. | Sometimes (account id changes).
backend.tf | Where state lives (S3 bucket, key, DynamoDB lock table). | Yes — the state key is per-env.
versions.tf | Required Terraform & provider versions. | No.
terraform.tfvars | Default variable values (auto-loaded). | Avoid in multi-env — prefer named files.
prod.tfvars, test.tfvars | Values per environment — instance sizes, CIDR blocks, account ids. | Yes — this is the per-env knob.

Tiny example to make it concrete

variable "environment" {
  type        = string
  description = "prod | prod-support | uat | test"
  validation {
    condition     = contains(["prod","prod-support","uat","test"], var.environment)
    error_message = "environment must be one of prod, prod-support, uat, test."
  }
}

variable "vpc_cidr" {
  type    = string
  default = "10.0.0.0/16"
}

variable "instance_type" {
  type    = string
  default = "t3.small"
}

variable "common_tags" {
  type    = map(string)
  default = {}
}
locals {
  name_prefix = "lf-${var.environment}"
  tags = merge(var.common_tags, {
    Environment = var.environment
    ManagedBy   = "terraform"
  })
}

module "network" {
  source      = "../../modules/network"
  name        = "${local.name_prefix}-vpc"
  cidr_block  = var.vpc_cidr
  tags        = local.tags
}
# test.tfvars  - non-prod account, smallest footprint
environment   = "test"
vpc_cidr      = "10.30.0.0/16"
instance_type = "t3.small"
common_tags = {
  CostCenter = "CC-1042"
  Owner      = "infra-platform"
  DataClass  = "internal"
}
# prod.tfvars  - prod account, hardened sizing
environment   = "prod"
vpc_cidr      = "10.10.0.0/16"
instance_type = "m6i.large"
common_tags = {
  CostCenter = "CC-1001"
  Owner      = "infra-platform"
  DataClass  = "confidential"
  Compliance = "sox"
}
output "vpc_id" {
  value       = module.network.vpc_id
  description = "VPC id created in this environment"
}

output "private_subnet_ids" {
  value = module.network.private_subnet_ids
}
Variable precedence (highest wins): -var and -var-file flags on the CLI (evaluated in the order given, later wins) > *.auto.tfvars files (in alphabetical order) > terraform.tfvars > TF_VAR_* env vars > the default in variables.tf. Knowing this lets you override safely in CI.

04 Multi-account × multi-environment layout

The real-world setup this guide provisions: two AWS accounts (a prod account and a non-prod account) hosting four logical environments:

Prod AWS account

  • prod — customer-facing workloads
  • prod-support — jump hosts, monitoring, backup tooling that needs to see prod

Non-prod AWS account

  • uat — user acceptance, prod-shaped data
  • test — integration / dev sandbox, smallest sizing

Recommended directory structure

terraform-aws-platform/
├── modules/                  # reusable building blocks
│   ├── network/              # VPC, subnets, NAT, IGW, routes
│   ├── security/             # SGs, NACLs, KMS keys
│   ├── iam/                  # roles, instance profiles, policies
│   ├── compute/              # launch template, ASG, ALB
│   └── database/             # RDS Multi-AZ / Aurora
├── envs/
│   ├── prod/                 # prod account / prod env
│   │   ├── backend.tf
│   │   ├── providers.tf
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   ├── outputs.tf
│   │   └── prod.tfvars
│   ├── prod-support/         # prod account / prod-support env
│   │   └── prod-support.tfvars
│   ├── uat/                  # non-prod account / uat env
│   │   └── uat.tfvars
│   └── test/                 # non-prod account / test env
│       └── test.tfvars
├── .github/
│   └── workflows/            # CI: fmt, validate, plan on PR
├── .gitignore
├── .terraform-version        # pin via tfenv/tofuenv
├── CODEOWNERS
└── README.md
Why folder-per-env, not workspaces? Each env has its own backend (S3 key), its own state, and can use its own IAM role. A blast in test cannot touch prod's state. Workspaces share the backend — risky for production.

Wiring two AWS accounts: provider aliases + assume-role

Each environment's root assumes a deployment role in the right account. Engineers' local AWS profile only needs permission to assume those roles; they never carry account access keys.

# envs/prod/providers.tf
provider "aws" {
  region = "us-east-1"

  assume_role {
    role_arn     = "arn:aws:iam::111111111111:role/TerraformDeploy"
    session_name = "tf-prod-${terraform.workspace}"
  }

  default_tags {
    tags = {
      Environment = "prod"
      ManagedBy   = "terraform"
    }
  }
}

# Optional second alias to read-only data from non-prod account
# (e.g. peering, AMI sharing)
provider "aws" {
  alias  = "nonprod_ro"
  region = "us-east-1"
  assume_role {
    role_arn = "arn:aws:iam::222222222222:role/TerraformReadOnly"
  }
}

Wiring backend (S3) per environment

# envs/prod/backend.tf
terraform {
  backend "s3" {
    bucket         = "lf-tfstate-prod-111111111111"
    key            = "prod/network.tfstate"
    region         = "us-east-1"
    dynamodb_table = "lf-tfstate-locks"
    encrypt        = true
  }
}

The four envs differ in three places only: backend key, provider role_arn, and the *.tfvars values. Everything else — main.tf, modules, code review — is shared.
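To make those three knobs concrete, here is roughly what the uat root's backend and provider look like next to prod's above; a sketch reusing names that appear elsewhere in this guide (section 11.3 and the section 5 diagrams):

# envs/uat/backend.tf - same shape as prod's, different bucket + key
terraform {
  backend "s3" {
    bucket         = "lf-tfstate-nonprod-222"
    key            = "uat/network.tfstate"
    region         = "us-east-1"
    dynamodb_table = "lf-tfstate-locks"
    encrypt        = true
  }
}

# envs/uat/providers.tf - same provider block, different role_arn (non-prod account)
provider "aws" {
  region = "us-east-1"

  assume_role {
    role_arn     = "arn:aws:iam::222222222222:role/TerraformDeploy"
    session_name = "tf-uat"
  }
}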

Per-env knobs at a glance

Knob | prod | prod-support | uat | test
AWS account | 111111111111 | 111111111111 | 222222222222 | 222222222222
VPC CIDR | 10.10.0.0/16 | 10.11.0.0/16 | 10.20.0.0/16 | 10.30.0.0/16
EC2 size | m6i.large | t3.medium | t3.medium | t3.small
RDS Multi-AZ | true | n/a | true | false
Backups | 35d, PITR | n/a | 14d | 1d
NAT | HA per-AZ | HA per-AZ | 1 NAT | 1 NAT
Approvers | 2 senior | 2 senior | 1 | 1

05 AWS architecture — the picture you'll provision

Two diagrams. The first is the org-wide layout: two AWS accounts hosting four environments, all driven by the same Terraform repo. The second is one VPC zoomed in — the actual networking, compute, and data plane Terraform creates.

5.1 — Two AWS accounts × four environments

[Diagram 5.1: org-wide layout.] An engineer (or CI via an OIDC role) runs the terraform CLI; every apply assumes a TerraformDeploy role in the target account, and state lives in a versioned, KMS-encrypted S3 bucket with a DynamoDB lock table. The prod account (111111111111) hosts the prod VPC 10.10.0.0/16 (public and private subnets across us-east-1a/b with ALB, NAT, IGW, EC2 web/app tiers and RDS Multi-AZ) and the prod-support VPC 10.11.0.0/16 (SSM-only bastion, monitoring tooling, standalone EC2 with gp3 EBS, S3 backups, CloudWatch), peered to prod for ops access. The non-prod account (222222222222) hosts the uat VPC 10.20.0.0/16 (3-tier app behind an ALB, Aurora cluster, 14d backups, uat.tfvars, backend key uat/network.tfstate, 1 approver) and the test VPC 10.30.0.0/16 (sandbox plus SQL cluster: Aurora MySQL writer + 2 readers, smallest instance class, 1d backups, test.tfvars, backend key test/network.tfstate, 1 approver or auto-merge).

5.2 — Anatomy of one VPC (zoomed in)

This is what you actually write Terraform for. Trace the request path: client → ALB SG → web SG → app SG → db SG. Trace the IAM path: instance profile → role → managed + inline policies.

[Diagram 5.2: one VPC, uat, 10.20.0.0/16, us-east-1.] Internet traffic enters via the IGW into public subnets in two AZs (10.20.1.0/24 and 10.20.2.0/24) holding the ALB, per-AZ NAT gateways and an SSM-only bastion (no SSH keys). Private subnets (10.20.11.0/24 and 10.20.12.0/24) hold the web and app EC2 tiers and RDS with a Multi-AZ standby. The security group chain: alb-sg (:80/443 from 0.0.0.0/0) → web-sg (:80 from alb-sg) → app-sg (:8080 from web-sg) → db-sg (:3306 from app-sg); always reference SGs by id, never CIDRs. Each instance gets an IAM instance profile (role ec2-app-role with AmazonSSMManagedInstanceCore plus an inline s3:GetObject policy for artifacts), a KMS-encrypted 100 GiB gp3 EBS volume (aws_ebs_volume), and a break-glass-only key pair (tls_private_key + aws_key_pair).

06 Real Terraform code — VPC, SG, subnets, IAM, EC2, RDS

Below is the kind of code you'd actually write. Each tab is a real, copy-pasteable snippet. The convention: modules own the resources, envs/<env>/main.tf just calls those modules with environment-specific inputs.

# modules/network/main.tf  -  builds VPC + IGW + 2 public + 2 private subnets + NAT
resource "aws_vpc" "this" {
  cidr_block           = var.cidr_block
  enable_dns_hostnames = true
  enable_dns_support   = true
  tags                 = merge(var.tags, { Name = var.name })
}

resource "aws_internet_gateway" "this" {
  vpc_id = aws_vpc.this.id
  tags   = merge(var.tags, { Name = "${var.name}-igw" })
}

data "aws_availability_zones" "available" {
  state = "available"
}

resource "aws_subnet" "public" {
  for_each                = { for idx, az in slice(data.aws_availability_zones.available.names, 0, 2) : az => idx }
  vpc_id                  = aws_vpc.this.id
  availability_zone       = each.key
  cidr_block              = cidrsubnet(var.cidr_block, 8, each.value + 1)
  map_public_ip_on_launch = true
  tags = merge(var.tags, {
    Name = "${var.name}-public-${each.key}"
    Tier = "public"
  })
}

resource "aws_subnet" "private" {
  for_each          = { for idx, az in slice(data.aws_availability_zones.available.names, 0, 2) : az => idx }
  vpc_id            = aws_vpc.this.id
  availability_zone = each.key
  cidr_block        = cidrsubnet(var.cidr_block, 8, each.value + 11)
  tags = merge(var.tags, {
    Name = "${var.name}-private-${each.key}"
    Tier = "private"
  })
}

resource "aws_eip" "nat" {
  for_each = aws_subnet.public
  domain   = "vpc"
}

resource "aws_nat_gateway" "this" {
  for_each      = aws_subnet.public
  allocation_id = aws_eip.nat[each.key].id
  subnet_id     = each.value.id
  tags          = merge(var.tags, { Name = "${var.name}-nat-${each.key}" })
}

# modules/network/outputs.tf
output "vpc_id"             { value = aws_vpc.this.id }
output "public_subnet_ids"  { value = [for s in aws_subnet.public  : s.id] }
output "private_subnet_ids" { value = [for s in aws_subnet.private : s.id] }
# modules/security/main.tf  -  the SG chain ALB -> web -> app -> db
# Use aws_vpc_security_group_*_rule (Terraform AWS provider 5.x) instead of inline rules.
# That way each rule is its own resource - cleaner diffs, no churn.

resource "aws_security_group" "alb" {
  name        = "${var.name_prefix}-alb"
  description = "ALB ingress from internet"
  vpc_id      = var.vpc_id
}

resource "aws_vpc_security_group_ingress_rule" "alb_https" {
  security_group_id = aws_security_group.alb.id
  cidr_ipv4         = "0.0.0.0/0"
  from_port         = 443
  to_port           = 443
  ip_protocol       = "tcp"
}

resource "aws_security_group" "web" {
  name        = "${var.name_prefix}-web"
  description = "Web tier - only ALB can reach it"
  vpc_id      = var.vpc_id
}
resource "aws_vpc_security_group_ingress_rule" "web_from_alb" {
  security_group_id            = aws_security_group.web.id
  referenced_security_group_id = aws_security_group.alb.id   # <- by id, not CIDR
  from_port                    = 80
  to_port                      = 80
  ip_protocol                  = "tcp"
}

resource "aws_security_group" "app" { name = "${var.name_prefix}-app"  vpc_id = var.vpc_id }
resource "aws_vpc_security_group_ingress_rule" "app_from_web" {
  security_group_id            = aws_security_group.app.id
  referenced_security_group_id = aws_security_group.web.id
  from_port = 8080  to_port = 8080  ip_protocol = "tcp"
}

resource "aws_security_group" "db"  { name = "${var.name_prefix}-db"   vpc_id = var.vpc_id }
resource "aws_vpc_security_group_ingress_rule" "db_from_app" {
  security_group_id            = aws_security_group.db.id
  referenced_security_group_id = aws_security_group.app.id
  from_port = 3306  to_port = 3306  ip_protocol = "tcp"
}

# Egress: explicit, not implicit "all"
# Each tier gets exactly what it needs.
resource "aws_vpc_security_group_egress_rule" "app_to_db" {
  security_group_id            = aws_security_group.app.id
  referenced_security_group_id = aws_security_group.db.id
  from_port = 3306  to_port = 3306  ip_protocol = "tcp"
}
Why "by id, not CIDR"? Referencing a SG by its id creates a stable, scalable rule. If you scale the ALB to 12 nodes, the web tier rule still works — no IPs changed. CIDR-based rules are the #1 reason teams accidentally open prod to the world.
# modules/iam/main.tf  -  EC2 instance profile (least privilege)

data "aws_iam_policy_document" "assume_ec2" {
  statement {
    actions = ["sts:AssumeRole"]
    principals {
      type        = "Service"
      identifiers = ["ec2.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "ec2_app" {
  name               = "${var.name_prefix}-ec2-app"
  assume_role_policy = data.aws_iam_policy_document.assume_ec2.json
  tags               = var.tags
}

# Managed policy: SSM Session Manager (no SSH needed, ever)
resource "aws_iam_role_policy_attachment" "ssm" {
  role       = aws_iam_role.ec2_app.name
  policy_arn = "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore"
}

# Inline policy: app-specific permissions
data "aws_iam_policy_document" "app_inline" {
  statement {
    sid     = "ReadAppArtifacts"
    actions = ["s3:GetObject", "s3:ListBucket"]
    resources = [
      "arn:aws:s3:::${var.artifacts_bucket}",
      "arn:aws:s3:::${var.artifacts_bucket}/*",
    ]
  }
  statement {
    sid     = "DecryptWithAppKey"
    actions   = ["kms:Decrypt"]
    resources = [var.app_kms_key_arn]
  }
}

resource "aws_iam_role_policy" "app_inline" {
  name   = "app-inline"
  role   = aws_iam_role.ec2_app.id
  policy = data.aws_iam_policy_document.app_inline.json
}

# Bind role to an instance profile - this is what you attach to EC2
resource "aws_iam_instance_profile" "ec2_app" {
  name = "${var.name_prefix}-ec2-app"
  role = aws_iam_role.ec2_app.name
}

output "instance_profile_name" { value = aws_iam_instance_profile.ec2_app.name }
# Key pair (break-glass only - normal access is via SSM)
resource "tls_private_key" "break_glass" {
  algorithm = "ED25519"
}

resource "aws_key_pair" "break_glass" {
  key_name   = "${var.name_prefix}-break-glass"
  public_key = tls_private_key.break_glass.public_key_openssh
}

# Store the private key in SSM Parameter Store, not on a laptop
resource "aws_ssm_parameter" "break_glass_priv" {
  name  = "/${var.environment}/keys/break-glass"
  type  = "SecureString"
  value = tls_private_key.break_glass.private_key_pem
  tags  = var.tags
}

# Extra EBS volume for app data (separate from root)
resource "aws_ebs_volume" "app_data" {
  availability_zone = var.az
  size              = var.data_volume_size_gb     # e.g. 100
  type              = "gp3"
  iops              = 3000
  throughput        = 125
  encrypted         = true
  kms_key_id        = var.app_kms_key_arn
  tags              = merge(var.tags, { Name = "${var.name_prefix}-data" })
}

resource "aws_volume_attachment" "app_data" {
  device_name = "/dev/xvdf"
  volume_id   = aws_ebs_volume.app_data.id
  instance_id = aws_instance.app.id
}
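The aws_volume_attachment above references aws_instance.app, which is not shown in the snippet. A minimal sketch of that standalone instance, with variable names assumed, might look like:

# Standalone EC2 instance the data volume attaches to (sketch; variables assumed)
resource "aws_instance" "app" {
  ami                    = data.aws_ami.al2023.id
  instance_type          = var.instance_type
  subnet_id              = var.private_subnet_id
  vpc_security_group_ids = [var.app_sg_id]
  iam_instance_profile   = var.instance_profile_name
  key_name               = aws_key_pair.break_glass.key_name   # break-glass only

  root_block_device {
    volume_size = 30
    volume_type = "gp3"
    encrypted   = true
  }

  tags = merge(var.tags, { Name = "${var.name_prefix}-standalone" })
}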
# modules/compute/main.tf - launch template + autoscaling group + ALB target group

data "aws_ami" "al2023" {
  most_recent = true
  owners      = ["amazon"]
  filter {
    name   = "name"
    values = ["al2023-ami-*-x86_64"]
  }
}

resource "aws_launch_template" "app" {
  name_prefix            = "${var.name_prefix}-app-"
  image_id               = data.aws_ami.al2023.id
  instance_type          = var.instance_type
  key_name               = var.key_pair_name
  vpc_security_group_ids = [var.app_sg_id]

  iam_instance_profile {
    name = var.instance_profile_name
  }

  block_device_mappings {
    device_name = "/dev/xvda"
    ebs {
      volume_size = 30
      volume_type = "gp3"
      encrypted   = true
    }
  }

  user_data = base64encode(templatefile("${path.module}/userdata.sh.tpl", {
    environment = var.environment
  }))

  tag_specifications {
    resource_type = "instance"
    tags          = merge(var.tags, { Name = "${var.name_prefix}-app" })
  }
}

resource "aws_autoscaling_group" "app" {
  name                = "${var.name_prefix}-app-asg"
  vpc_zone_identifier = var.private_subnet_ids
  min_size            = var.min_size
  max_size            = var.max_size
  desired_capacity    = var.desired_capacity
  target_group_arns   = [aws_lb_target_group.app.arn]
  health_check_type   = "ELB"

  launch_template {
    id      = aws_launch_template.app.id
    version = "$Latest"
  }

  tag {
    key                 = "Environment"
    value               = var.environment
    propagate_at_launch = true
  }
}
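The ASG above registers with aws_lb_target_group.app, which the snippet doesn't show. A hedged sketch of the ALB side of the module, with variable names (public_subnet_ids, alb_sg_id, certificate_arn) assumed:

# ALB + target group the ASG registers into (sketch; variable names assumed)
resource "aws_lb" "app" {
  name               = "${var.name_prefix}-app"
  load_balancer_type = "application"
  subnets            = var.public_subnet_ids
  security_groups    = [var.alb_sg_id]
  tags               = var.tags
}

resource "aws_lb_target_group" "app" {
  name     = "${var.name_prefix}-app"
  port     = 8080                      # matches the app-sg ingress rule
  protocol = "HTTP"
  vpc_id   = var.vpc_id

  health_check {
    path                = "/healthz"   # assumed health endpoint
    healthy_threshold   = 2
    unhealthy_threshold = 5
  }
}

resource "aws_lb_listener" "https" {
  load_balancer_arn = aws_lb.app.arn
  port              = 443
  protocol          = "HTTPS"
  certificate_arn   = var.certificate_arn

  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.app.arn
  }
}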
# modules/database/main.tf - Aurora MySQL cluster (writer + 2 readers, Multi-AZ)

resource "aws_db_subnet_group" "this" {
  name       = "${var.name_prefix}-db"
  subnet_ids = var.private_subnet_ids
  tags       = var.tags
}

resource "aws_rds_cluster_parameter_group" "this" {
  name   = "${var.name_prefix}-aurora-mysql"
  family = "aurora-mysql8.0"
  parameter {
    name  = "binlog_format"
    value = "ROW"
  }
}

resource "aws_rds_cluster" "this" {
  cluster_identifier           = "${var.name_prefix}-aurora"
  engine                       = "aurora-mysql"
  engine_version               = "8.0.mysql_aurora.3.05.2"
  database_name                = var.db_name
  master_username              = var.db_user
  master_password              = random_password.db.result
  db_subnet_group_name         = aws_db_subnet_group.this.name
  vpc_security_group_ids       = [var.db_sg_id]
  db_cluster_parameter_group_name = aws_rds_cluster_parameter_group.this.name
  storage_encrypted            = true
  kms_key_id                   = var.kms_key_arn
  backup_retention_period      = var.backup_retention_days
  preferred_backup_window      = "03:00-04:00"
  deletion_protection          = var.environment == "prod"
  skip_final_snapshot          = var.environment != "prod"
  tags                          = var.tags
}

resource "aws_rds_cluster_instance" "this" {
  count                = var.cluster_size       # prod=3, uat=2, test=1
  identifier           = "${var.name_prefix}-aurora-${count.index}"
  cluster_identifier   = aws_rds_cluster.this.id
  instance_class       = var.db_instance_class
  engine               = aws_rds_cluster.this.engine
  engine_version       = aws_rds_cluster.this.engine_version
  db_subnet_group_name = aws_db_subnet_group.this.name
  tags                 = var.tags
}

resource "random_password" "db" {
  length  = 32
  special = true
}

resource "aws_secretsmanager_secret" "db" {
  name = "${var.name_prefix}/db/master"
}
resource "aws_secretsmanager_secret_version" "db" {
  secret_id     = aws_secretsmanager_secret.db.id
  secret_string = jsonencode({ username = var.db_user, password = random_password.db.result })
}
# envs/uat/main.tf - the env root just composes modules with uat values

module "network" {
  source     = "../../modules/network"
  name       = "lf-${var.environment}"
  cidr_block = var.vpc_cidr
  tags       = local.tags
}

module "security" {
  source      = "../../modules/security"
  name_prefix = "lf-${var.environment}"
  vpc_id      = module.network.vpc_id
}

module "iam" {
  source           = "../../modules/iam"
  name_prefix      = "lf-${var.environment}"
  artifacts_bucket = "lf-artifacts-${var.environment}"
  app_kms_key_arn  = aws_kms_key.app.arn
  tags             = local.tags
}

module "app" {
  source                = "../../modules/compute"
  name_prefix           = "lf-${var.environment}"
  environment           = var.environment
  private_subnet_ids    = module.network.private_subnet_ids
  app_sg_id             = module.security.app_sg_id
  instance_profile_name = module.iam.instance_profile_name
  instance_type         = var.instance_type
  min_size              = 2
  max_size              = 4
  desired_capacity      = 2
  tags                  = local.tags
}

module "db" {
  source                  = "../../modules/database"
  name_prefix             = "lf-${var.environment}"
  environment             = var.environment
  private_subnet_ids      = module.network.private_subnet_ids
  db_sg_id                = module.security.db_sg_id
  kms_key_arn             = aws_kms_key.app.arn
  cluster_size            = 2
  db_instance_class       = "db.r6g.large"
  backup_retention_days   = 14
  db_name                  = "appdb"
  db_user                  = "appadmin"
  tags                     = local.tags
}
Read this top to bottom. The env file is intentionally boring — just plumbing. All the cleverness lives in modules, which are tested once and reused four times.

07 Interactive walkthrough — provision a real change

Two paths are worth rehearsing against the layout in section 4: the first apply of a brand-new environment, and a ticketed change, which weaves Git in (section 8 covers that flow end to end).

Tip. Whichever path you take, the rhythm is the same: init → fmt → validate → plan → review → apply.

08 Git workflow — from a ticket to merged main

Terraform is only as safe as the change-management process around it. The pattern below is the most boring, most reliable one I've seen work at scale. Trunk-based, short-lived branches, every plan visible on the PR.

The flow

Ticket INFRA-1842 → branch infra-1842-... → edit code (module + tfvars) → local plan on the env folder → push + PR (CI runs fmt/validate/plan) → review (CODEOWNERS approve) → squash merge to main (always deployable) → CD apply, env-by-env.
The branch is short-lived. The CI plan is the source of truth for review. Apply only happens AFTER merge. Promotion order: test → uat → prod-support → prod · same code, different tfvars · abort if any env's plan looks wrong.

Branch naming, commit messages, PR title

Item | Convention | Example
Branch | <ticket-id-lower>-<short-slug> | infra-1842-uat-partner-cidr
Commit | TICKET: imperative summary ≤ 72 chars | INFRA-1842: open uat ALB to partner 203.0.113.0/24
PR title | Same as commit | (GitHub auto-fills it)
PR body | What / why / blast-radius / plan output / rollback | see template below

Useful Git commands for infra work

Task | Command
Start | git checkout main && git pull
Branch | git checkout -b infra-1842-uat-partner-cidr
Stage | git add -A
Commit | git commit -m "INFRA-1842: open uat ALB to partner"
Push | git push -u origin HEAD
Sync with main mid-PR | git fetch origin && git rebase origin/main
Resolve conflict | git status → edit → git add → git rebase --continue
Abort rebase | git rebase --abort
Drop a bad commit | git reset --soft HEAD~1
See remote PRs | gh pr list / gh pr view 482
Re-run CI plan | gh pr comment --body "/replan"
Tag a release | git tag -a v2026.05.06 -m "..."

Files every Terraform repo must have

.gitignore

# Terraform internals - never commit these
.terraform/
# .terraform.lock.hcl is deliberately NOT listed here - commit it (next card)
*.tfstate
*.tfstate.*
*.tfplan
tfplan
crash.log
crash.*.log

# Local overrides
*.auto.tfvars
override.tf
override.tf.json

# IDE / OS
.idea/
.vscode/
.DS_Store
Never commit *.tfstate. It contains secrets in plaintext (RDS passwords, etc.). State lives in S3 + KMS, not in Git.

.terraform.lock.hcl — commit it

Pins the exact provider versions used to apply. Commit it so every engineer + CI runs identical providers. Without it, a new bug-fix release of aws provider can silently change your plan.

CODEOWNERS

# Default owner
*                            @lf/infra-platform

# Modules - any senior engineer can review
/modules/                    @lf/infra-platform

# Per-env: prod requires senior + security
/envs/prod/                  @lf/infra-seniors @lf/security
/envs/prod-support/          @lf/infra-seniors
/envs/uat/                   @lf/infra-platform
/envs/test/                  @lf/infra-platform

PR template (.github/PULL_REQUEST_TEMPLATE.md)

## Ticket
INFRA-####

## What
<1-2 sentences>

## Why
<business or compliance driver>

## Blast radius
- Envs touched:  test / uat / prod-support / prod
- Resources affected: + N, ~ M, - K

## Plan output
<paste or link CI artifact>

## Rollback plan
git revert + apply, OR `terraform apply` of previous tag

Protecting your state bucket

  • S3 bucket: versioning ON, KMS encryption, block public access, MFA Delete on prod bucket.
  • DynamoDB lock table: provisioned, with point-in-time recovery.
  • IAM: only the TerraformDeploy role can write; engineers get read-only.
  • Cross-account: prod state bucket lives in a third "tooling" account — or in the prod account with a tight bucket policy.
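A minimal Terraform sketch of that hardening, assuming the bucket and table names from section 4 and a pre-existing KMS key (MFA Delete still has to be enabled out-of-band with root credentials):

# State bucket hardening (sketch; names reused from section 4, KMS key assumed)
resource "aws_s3_bucket" "tfstate" {
  bucket = "lf-tfstate-prod-111111111111"
}

resource "aws_s3_bucket_versioning" "tfstate" {
  bucket = aws_s3_bucket.tfstate.id
  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_server_side_encryption_configuration" "tfstate" {
  bucket = aws_s3_bucket.tfstate.id
  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm     = "aws:kms"
      kms_master_key_id = var.state_kms_key_arn   # assumed input
    }
  }
}

resource "aws_s3_bucket_public_access_block" "tfstate" {
  bucket                  = aws_s3_bucket.tfstate.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

# Lock table: provisioned throughput, point-in-time recovery on
resource "aws_dynamodb_table" "tfstate_locks" {
  name           = "lf-tfstate-locks"
  billing_mode   = "PROVISIONED"
  read_capacity  = 5
  write_capacity = 5
  hash_key       = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }

  point_in_time_recovery {
    enabled = true
  }
}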

09 When something goes wrong — identify & recover

The hardest part of Terraform is not writing it — it is recovering when reality and state disagree. Here are the most common failure modes and how to think about them.

Symptom | What it means | How to fix
Error acquiring the state lock | Another plan/apply is running, OR a previous one crashed without releasing the lock. | Find who has the lock (the error shows their identity). Wait or talk to them. Only as a last resort: terraform force-unlock <LOCK_ID>.
Plan wants to destroy a resource you didn't change | Either someone removed it from .tf, or the resource address changed (rename, module move). | If it was a rename: terraform state mv old.addr new.addr. If you want to keep but not manage: terraform state rm.
Plan wants to create something that already exists in AWS | Resource was created in the console (drift), or a previous apply lost it from state. | terraform import <addr> <aws-id>, then re-plan to confirm a clean diff.
InvalidParameterValue: VPC has dependencies on destroy | Resources outside Terraform's state (manually created ENIs, peerings) are blocking deletion. | Find them in the console, decide to import or delete out-of-band, then retry.
Plan output is huge / nondeterministic | Often a tag set differs (e.g. AWS auto-injects a tag), or an attribute is computed. | Use ignore_changes in lifecycle for known noise (see the sketch after this table). Don't silence everything — you'll miss real diffs.
Apply fails halfway | Some resources got created, some didn't. State reflects what succeeded. | Re-run plan. Terraform will only create what's missing. Don't init -reconfigure in panic.
Provider version drift across the team | Someone ran without committing .terraform.lock.hcl. | Commit the lock file. Use terraform init -upgrade intentionally, in a dedicated PR.
Cycle in module graph | Two resources reference each other. | Break with a third resource (e.g. an SG rule referencing two SGs by id) or via a data lookup.
Git merge conflict in *.tfvars or main.tf | Two PRs touched the same file. | git rebase origin/main, edit, terraform fmt, terraform validate, force-push the branch. Plan again before merging.
Force-pushed over someone else's commit | Lost work in the branch (not main). | git reflog to find the lost commit, git cherry-pick it back.
terraform plan shows resources you never wrote | You're pointing at the wrong backend / state file. Common when copy-pasting backend.tf. | Check .terraform/terraform.tfstate → it points to the bucket. Stop, fix the backend block, terraform init -reconfigure.
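For the "known noise" row, a lifecycle sketch; the ignored attributes here (a tag another system rewrites, user_data) are assumptions you must tailor to your own noise:

# Narrow ignore_changes: hide known noise, keep real drift visible
resource "aws_instance" "app" {
  ami           = data.aws_ami.al2023.id
  instance_type = var.instance_type
  # ...other arguments as usual...

  lifecycle {
    ignore_changes = [
      tags["LastPatched"],   # hypothetical tag maintained by a patching tool
      user_data,             # example: baked once, never updated in place
    ]
  }
}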

The "is it Terraform or Git?" decision tree

Something is wrong. Did the error happen running git or terraform?

Git side (the real cloud is unaffected):
  • merge conflict → rebase + edit
  • pushed wrong branch → reflog + cherry-pick
  • CI plan stale → rebase & force-push
  • lost commits → git reflog
  • wrong file committed → reset + recommit

Terraform / AWS side (reality and state can diverge):
  • state lock → force-unlock (with care)
  • drift / surprise destroy → import or state rm
  • apply failed mid-way → re-plan, re-apply
  • provider mismatch → init -upgrade in a PR
  • wrong backend → init -reconfigure
Golden rule. If in doubt, run terraform plan from a clean checkout of main first to see what state thinks reality looks like. plan never changes anything.

10 Cheatsheet

Terraform commands you'll use weekly

Task | Command
Initialise dir | terraform init
Refresh providers | terraform init -upgrade
Switch backend | terraform init -reconfigure
Format | terraform fmt -recursive
Validate | terraform validate
Plan (env-aware) | terraform plan -var-file=test.tfvars -out=tfplan
Apply saved plan | terraform apply tfplan
Destroy (test only) | terraform destroy -var-file=test.tfvars
List state | terraform state list
Inspect resource | terraform state show 'module.app.aws_autoscaling_group.app'
Rename in state | terraform state mv module.old module.new
Forget resource | terraform state rm 'aws_security_group.legacy'
Import existing | terraform import 'aws_vpc.this' vpc-0abc123
Re-create on next apply | terraform taint 'module.app.aws_instance.app[0]'
Outputs | terraform output -json
Console (interactive) | terraform console
Force-unlock | terraform force-unlock LOCK_ID
Show plan as JSON | terraform show -json tfplan

Git commands for infra PRs

Task | Command
Sync | git fetch origin && git pull --ff-only
Branch | git checkout -b infra-####-slug
Stage | git add -A
Diff staged | git diff --staged
Commit | git commit -m "INFRA-####: ..."
Push | git push -u origin HEAD
Rebase on main | git fetch origin && git rebase origin/main
Continue rebase | git rebase --continue
Abort rebase | git rebase --abort
Force push (safely) | git push --force-with-lease
Recover lost commit | git reflog && git cherry-pick <sha>
Switch branch w/o losing work | git stash && git checkout main
Open PR (gh CLI) | gh pr create --fill
Squash & merge | gh pr merge --squash --delete-branch

Mental model recap

1. Code is desired state

Your .tf files describe what you want. Terraform's job is to make AWS match.

2. State is recorded reality

S3 + DynamoDB. Never on a laptop. Never edited by hand.

3. Modules > copy-paste

Same module, different tfvars per env. That's how you get four reproducible environments.

4. Plan before apply

Read every line of the plan. If you don't understand a diff, stop.

5. Git is the audit trail

Every infra change is a PR. CI plan in PR comments. CODEOWNERS gate prod.

6. Blast radius via folders

One folder = one state = one env. Test can never break prod by accident.

You are now ready to: stand up a new VPC, write a module, wire two AWS accounts, take a ticket through to merged-and-applied, and recover when the state and reality disagree. The next 200 levels are about scaling these patterns — remote modules, Terragrunt or Terraform Stacks, OPA/Sentinel policy, and drift detection in CI.

11 Files deep-dive — what each one does, line by line

New engineers often look at a Terraform folder and see seven files with confusing names. Here is what each is for, why it exists, and what goes inside it. Terraform reads every .tf file in the current folder and stitches them together as one big config; the order and the filenames are pure convention — but follow the convention because that is what every reviewer expects.

11.1 — versions.tf · the contract

This file pins the Terraform version and provider versions. It is the first file the senior writes and the last one to change.

terraform {
  required_version = ">= 1.6.0, < 2.0.0"     # your binary must be in this range

  required_providers {
    aws = {
      source  = "hashicorp/aws"             # registry namespace
      version = "~> 5.74"                   # 5.74.x ok, 6.x not ok
    }
    random = { source = "hashicorp/random", version = "~> 3.6" }
    tls    = { source = "hashicorp/tls",    version = "~> 4.0" }
  }
}
Why version-pin. Without this, a new bug-fix release of the AWS provider (released yesterday) can change tomorrow's plan in subtle ways. Pin it; bump it intentionally in its own PR.

11.2 — providers.tf · how Terraform talks to AWS

provider "aws" {                          # the default (un-aliased) provider
  region = var.aws_region

  assume_role {                            # engineer's SSO → deploy role
    role_arn     = var.deploy_role_arn
    session_name = "tf-${var.environment}"
  }

  default_tags {                           # tag EVERY resource automatically
    tags = {
      Environment = var.environment
      ManagedBy   = "terraform"
      Repo        = "terraform-aws-platform"
    }
  }
}

provider "aws" {                          # aliased provider, e.g. another region
  alias  = "us_west"
  region = "us-west-2"
}
# Inside a resource use:  provider = aws.us_west

11.3 — backend.tf · where state lives

terraform {
  backend "s3" {
    bucket         = "lf-tfstate-nonprod-222"     # the bucket per account
    key            = "uat/network.tfstate"          # <-- the per-env knob
    region         = "us-east-1"
    dynamodb_table = "lf-tfstate-locks"
    encrypt        = true
  }
}
Backend block cannot use variables. It is read before variables are even parsed. If you need different bucket/key per env, either keep separate backend.tf per env folder (recommended) or use terraform init -backend-config=....
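If you would rather share one backend.tf across envs, the partial-configuration route looks like this; a sketch with a hypothetical uat.s3.tfbackend file passed as terraform init -backend-config=uat.s3.tfbackend:

# backend.tf - partial configuration: only the constants live in code
terraform {
  backend "s3" {
    region         = "us-east-1"
    dynamodb_table = "lf-tfstate-locks"
    encrypt        = true
  }
}

# uat.s3.tfbackend - the per-env values, supplied at init time
bucket = "lf-tfstate-nonprod-222"
key    = "uat/network.tfstate"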

11.4 — variables.tf · declarations only, no values

This file declares what the config accepts as input. It never holds values. Values come from *.tfvars, -var, or TF_VAR_* env vars (covered in section 12).

# variables.tf - declarations

variable "environment" {
  type        = string                       # required type
  description = "Logical env name: prod | prod-support | uat | test"

  validation {                              # enforce shape at plan time
    condition     = contains(["prod","prod-support","uat","test"], var.environment)
    error_message = "environment must be prod, prod-support, uat, or test."
  }
}

variable "vpc_cidr" {
  type        = string
  description = "VPC CIDR block, /16"
  default     = "10.0.0.0/16"             # default = optional input
}

variable "db_password" {
  type        = string
  sensitive   = true                         # hides value in plan output and outputs
  description = "Master DB password (typically supplied by Secrets Manager, not tfvars)"
}

variable "app_servers" {                     # complex types are first-class
  type = list(object({
    name          = string
    instance_type = string
    public        = bool
  }))
  default     = []
  description = "App tier sizing per server"
}

variable "common_tags" {
  type    = map(string)
  default = {}
}
Variable attribute | What it does
type | string, number, bool, list(...), set(...), map(...), object({...}), tuple([...])
description | Shows up in terraform plan hints and module docs. Always write it.
default | Optional. If absent, the value MUST be supplied at plan time.
sensitive | Redacts from plan/apply output. Still saved to state — protect state.
nullable | false means callers cannot pass null.
validation | Reject bad values at plan time with a friendly error.

11.5 — data.tf · read-only lookups

Data sources let you query AWS without managing the resource. Latest AMI, an existing Route53 zone, the caller identity. Output of a data source is fresh on every run; that is good for AMIs (you want the newest) but means you can get unexpected diffs — pin AMIs in production.

# data.tf

data "aws_caller_identity" "current" {}        # who am I? - returns account_id, arn, user_id

data "aws_region" "current" {}                  # the region you're in

data "aws_availability_zones" "available" {       # list of AZs in the region
  state = "available"
}

data "aws_ami" "al2023" {
  most_recent = true
  owners      = ["amazon"]
  filter {
    name   = "name"
    values = ["al2023-ami-*-x86_64"]
  }
}

data "aws_route53_zone" "corp" {                 # reference an existing zone
  name         = "example.internal."
  private_zone = true
}

# Use them anywhere as data.<type>.<name>.<attr>
# e.g. data.aws_caller_identity.current.account_id
Resource vs Data source mental rule. If you'd be sad if it disappeared, make it a resource. If you only want to read it (because someone else owns it), make it a data.
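One hedged way to act on the "pin AMIs in production" advice above while keeping the fresh lookup elsewhere (the variable name is an assumption):

# Optional AMI pin: set pinned_ami_id in prod.tfvars, leave empty elsewhere
variable "pinned_ami_id" {
  type        = string
  default     = ""
  description = "Explicit ami-... id to freeze; empty means use the latest AL2023 lookup"
}

locals {
  ami_id = var.pinned_ami_id != "" ? var.pinned_ami_id : data.aws_ami.al2023.id
}
# Reference local.ami_id instead of data.aws_ami.al2023.id in launch templates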

11.6 — main.tf · the table of contents

Despite the name, main.tf rarely contains the bulk of your code in a real repo. Resources live in modules. The env-level main.tf is just "this env composes these modules".

# envs/uat/main.tf

# 1. local values - computed once, used in many places
locals {
  name_prefix = "lf-${var.environment}"
  account_id  = data.aws_caller_identity.current.account_id
  tags = merge(var.common_tags, {
    Environment = var.environment
    Account     = local.account_id
    ManagedBy   = "terraform"
  })
}

# 2. module composition - the actual stack for this env
module "network" {
  source     = "../../modules/network"
  name       = "${local.name_prefix}-vpc"
  cidr_block = var.vpc_cidr
  tags       = local.tags
}

module "security" {
  source      = "../../modules/security"
  name_prefix = local.name_prefix
  vpc_id      = module.network.vpc_id          # cross-module reference
  tags        = local.tags
}

# 3. one-off resources are fine here too if they're truly env-specific
resource "aws_kms_key" "app" {
  description             = "App KMS key for ${var.environment}"
  deletion_window_in_days = var.environment == "prod" ? 30 : 7
  tags                    = local.tags
}
Splitting main.tf when it grows. When the file gets past ~150 lines, split by concern: network.tf, compute.tf, database.tf, iam.tf. Terraform doesn't care — it concatenates anyway.