Terraform 301 — AWS Infrastructure Engineering

A hands-on training guide for engineers who already know AWS and want to provision it as code. We will go from "what is Terraform?" to a working multi-account, multi-environment layout (prod, prod-support, uat, test) with VPCs, subnets, security groups, IAM, EC2, and an SQL cluster — plus the Git workflow that wraps it.

Terraform 1.6+ · AWS Provider 5.x · 2 accounts × 4 environments · 3-tier + standalone + SQL cluster
Part 2 · SOPs & BAU is in a companion file.
Files deep-dive (line-by-line: main.tf, variables.tf, data.tf, locals, outputs), env-vars & SSO credentials, the SOP for creating a new environment, the BAU SOP for editing existing infra, the new-engineer onboarding checklist, and the senior's 20 unwritten rules are all in terraform-301-bau-sop.html.

01 What is Terraform & how it fits into IaC

Infrastructure as Code (IaC) is the practice of describing your cloud resources — VPCs, subnets, EC2, RDS, IAM — in declarative text files kept in version control, then having a tool reconcile reality with that description. The benefit is not "scripting AWS faster"; it is making infrastructure reviewable, reproducible, and auditable.

Where Terraform sits

Style | Tool | What you write | How it runs
Imperative scripts | AWS CLI, boto3, PowerShell | Steps ("create VPC, then subnet…") | You re-run carefully; no built-in idea of "current state"
Declarative, AWS-native | CloudFormation, CDK | Desired state in YAML/JSON or code | Runs inside AWS as a Stack
Declarative, multi-cloud | Terraform / OpenTofu | Desired state in HCL .tf files | Compares your code to a state file and produces a plan, then applies it
Config management | Ansible, Chef, Puppet | Steps for in-OS config | Runs against running servers (complementary, not a replacement)
Mental model: Terraform is a diff engine. It reads your .tf files (the desired state) and the terraform.tfstate (the recorded state), then asks the cloud provider to make reality match.

Why teams adopt it

  • Peer review. Infra changes go through pull requests just like application code.
  • Reproducibility. The same module deploys an identical VPC in test, uat, prod-support and prod — only the .tfvars differs.
  • Drift detection. If someone clicks-ops a change in the console, the next terraform plan shows it.
  • Blast-radius control. A single environment lives in its own state file and can be destroyed/rebuilt without touching the others.
Terraform vs OpenTofu: OpenTofu is the open-source fork created after HashiCorp's switch to the BSL. Commands and HCL are compatible. If your team uses the terraform binary, everything in this guide applies; substitute tofu if you have switched.

02 Core concepts you must internalise

Provider

A plugin that talks to an API. hashicorp/aws talks to AWS. You configure region and credentials on it. You can have multiple aliased providers — that's how we hit two AWS accounts from one root.

Resource

A managed thing. aws_vpc, aws_subnet, aws_security_group. Each resource has a type and a local name you reference elsewhere as aws_vpc.main.id.

Data source

A read-only lookup. data "aws_ami" "al2023" finds an AMI without managing it. Use these for things you don't own (e.g. an account you only read from).

Variable

Inputs declared in variables.tf. Values come from *.tfvars, -var flags, env vars (TF_VAR_name), or defaults.

Output

What the module exposes after apply — e.g. the VPC id, subnet ids. Other configurations (or humans) consume them.

State

The JSON file that records what Terraform created. Source of truth for the diff engine. In real environments it lives in S3 with a DynamoDB lock, never on a laptop.

Module

A reusable folder of .tf files with inputs and outputs. "Network module", "ec2 module", "rds module". Modules are how you stop copy-pasting between environments.
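To make "inputs and outputs" concrete, here is a minimal, hypothetical module and a call to it (the module name and values are illustrative only):

# modules/tags/variables.tf  - a deliberately tiny, hypothetical module
variable "environment" {
  type = string
}

# modules/tags/outputs.tf
output "tags" {
  value = {
    Environment = var.environment
    ManagedBy   = "terraform"
  }
}

# Any caller can now reuse it:
module "tags" {
  source      = "../modules/tags"
  environment = "test"
}
# ...and consume the result as module.tags.tags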

Backend

Where state is stored. Configured once per root. We use the S3 + DynamoDB backend so multiple engineers can collaborate safely.

Workspace

A named slot inside one backend. Useful for very small setups. For real multi-env work we prefer separate folders + backends per env — clearer blast radius and IAM.

The lifecycle, command by command

terraform init
terraform fmt -check && terraform validate
terraform plan -var-file=test.tfvars -out=tfplan
terraform apply tfplan
terraform destroy -var-file=test.tfvars
terraform state list
Never edit terraform.tfstate by hand. Use terraform state mv, terraform state rm, terraform import. We cover these in Troubleshooting.

03 The files: main.tf, variables.tf, *.tfvars

Terraform doesn't care what you call your files — it concatenates every .tf in a folder. But there is a strong convention every team should follow:

File | What goes in it | Edited per env?
main.tf | The actual resources and module calls (the "what to build"). | No — same code for every env.
variables.tf | Declarations of inputs — name, type, description, default, validation. | No.
outputs.tf | Things to expose after apply (VPC id, subnet ids, ALB DNS). | No.
providers.tf | Provider config — region, alias, assume-role. | Sometimes (account id changes).
backend.tf | Where state lives (S3 bucket, key, DynamoDB lock table). | Yes — the state key is per-env.
versions.tf | Required Terraform & provider versions. | No.
terraform.tfvars | Default variable values (auto-loaded). | Avoid in multi-env — prefer named files.
prod.tfvars, test.tfvars | Values per environment — instance sizes, CIDR blocks, account ids. | Yes — this is the per-env knob.

Tiny example to make it concrete

variable "environment" {
  type        = string
  description = "prod | prod-support | uat | test"
  validation {
    condition     = contains(["prod","prod-support","uat","test"], var.environment)
    error_message = "environment must be one of prod, prod-support, uat, test."
  }
}

variable "vpc_cidr" {
  type    = string
  default = "10.0.0.0/16"
}

variable "instance_type" {
  type    = string
  default = "t3.small"
}

variable "common_tags" {
  type    = map(string)
  default = {}
}
locals {
  name_prefix = "lf-${var.environment}"
  tags = merge(var.common_tags, {
    Environment = var.environment
    ManagedBy   = "terraform"
  })
}

module "network" {
  source      = "../../modules/network"
  name        = "${local.name_prefix}-vpc"
  cidr_block  = var.vpc_cidr
  tags        = local.tags
}
# test.tfvars  - non-prod account, smallest footprint
environment   = "test"
vpc_cidr      = "10.30.0.0/16"
instance_type = "t3.small"
common_tags = {
  CostCenter = "CC-1042"
  Owner      = "infra-platform"
  DataClass  = "internal"
}
# prod.tfvars  - prod account, hardened sizing
environment   = "prod"
vpc_cidr      = "10.10.0.0/16"
instance_type = "m6i.large"
common_tags = {
  CostCenter = "CC-1001"
  Owner      = "infra-platform"
  DataClass  = "confidential"
  Compliance = "sox"
}
output "vpc_id" {
  value       = module.network.vpc_id
  description = "VPC id created in this environment"
}

output "private_subnet_ids" {
  value = module.network.private_subnet_ids
}
Variable precedence (highest wins): -var and -var-file flags on the CLI (evaluated in the order given, later wins) > *.auto.tfvars files (in alphabetical order) > terraform.tfvars > TF_VAR_* env vars > the default in variables.tf. Knowing this lets you override safely in CI.

04 Multi-account × multi-environment layout

The real-world setup this guide provisions: two AWS accounts (a prod account and a non-prod account) hosting four logical environments:

Prod AWS account

  • prod — customer-facing workloads
  • prod-support — jump hosts, monitoring, backup tooling that needs to see prod

Non-prod AWS account

  • uat — user acceptance, prod-shaped data
  • test — integration / dev sandbox, smallest sizing

Recommended directory structure

terraform-aws-platform/
├── modules/                  # reusable building blocks
│   ├── network/              # VPC, subnets, NAT, IGW, routes
│   ├── security/             # SGs, NACLs, KMS keys
│   ├── iam/                  # roles, instance profiles, policies
│   ├── compute/              # launch template, ASG, ALB
│   └── database/             # RDS Multi-AZ / Aurora
├── envs/
│   ├── prod/                 # prod account / prod env
│   │   ├── backend.tf
│   │   ├── providers.tf
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   ├── outputs.tf
│   │   └── prod.tfvars
│   ├── prod-support/         # prod account / prod-support env
│   │   └── prod-support.tfvars
│   ├── uat/                  # non-prod account / uat env
│   │   └── uat.tfvars
│   └── test/                 # non-prod account / test env
│       └── test.tfvars
├── .github/
│   └── workflows/            # CI: fmt, validate, plan on PR
├── .gitignore
├── .terraform-version        # pin via tfenv/tofuenv
├── CODEOWNERS
└── README.md
Why folder-per-env, not workspaces? Each env has its own backend (S3 key), its own state, and can use its own IAM role. A blast in test cannot touch prod's state. Workspaces share the backend — risky for production.

Wiring two AWS accounts: provider aliases + assume-role

Each environment's root assumes a deployment role in the right account. Engineers' local AWS profile only needs permission to assume those roles; they never carry account access keys.

# envs/prod/providers.tf
provider "aws" {
  region = "us-east-1"

  assume_role {
    role_arn     = "arn:aws:iam::111111111111:role/TerraformDeploy"
    session_name = "tf-prod-${terraform.workspace}"
  }

  default_tags {
    tags = {
      Environment = "prod"
      ManagedBy   = "terraform"
    }
  }
}

# Optional second alias to read-only data from non-prod account
# (e.g. peering, AMI sharing)
provider "aws" {
  alias  = "nonprod_ro"
  region = "us-east-1"
  assume_role {
    role_arn = "arn:aws:iam::222222222222:role/TerraformReadOnly"
  }
}

Wiring backend (S3) per environment

# envs/prod/backend.tf
terraform {
  backend "s3" {
    bucket         = "lf-tfstate-prod-111111111111"
    key            = "prod/network.tfstate"
    region         = "us-east-1"
    dynamodb_table = "lf-tfstate-locks"
    encrypt        = true
  }
}

The four envs differ in three places only: backend key, provider role_arn, and the *.tfvars values. Everything else — main.tf, modules, code review — is shared.
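To make those three knobs concrete, here is roughly what the uat root's backend and provider look like next to prod's above; a sketch reusing names that appear elsewhere in this guide (section 11.3 and the section 5 diagrams):

# envs/uat/backend.tf - same shape as prod's, different bucket + key
terraform {
  backend "s3" {
    bucket         = "lf-tfstate-nonprod-222"
    key            = "uat/network.tfstate"
    region         = "us-east-1"
    dynamodb_table = "lf-tfstate-locks"
    encrypt        = true
  }
}

# envs/uat/providers.tf - same provider block, different role_arn (non-prod account)
provider "aws" {
  region = "us-east-1"

  assume_role {
    role_arn     = "arn:aws:iam::222222222222:role/TerraformDeploy"
    session_name = "tf-uat"
  }
}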

Per-env knobs at a glance

Knob | prod | prod-support | uat | test
AWS account | 111111111111 | 111111111111 | 222222222222 | 222222222222
VPC CIDR | 10.10.0.0/16 | 10.11.0.0/16 | 10.20.0.0/16 | 10.30.0.0/16
EC2 size | m6i.large | t3.medium | t3.medium | t3.small
RDS Multi-AZ | true | n/a | true | false
Backups | 35d, PITR | n/a | 14d | 1d
NAT | HA per-AZ | HA per-AZ | 1 NAT | 1 NAT
Approvers | 2 senior | 2 senior | 1 | 1

05 AWS architecture — the picture you'll provision

Two diagrams. The first is the org-wide layout: two AWS accounts hosting four environments, all driven by the same Terraform repo. The second is one VPC zoomed in — the actual networking, compute, and data plane Terraform creates.

5.1 — Two AWS accounts × four environments

[Diagram 5.1: org-wide layout.] An engineer (or CI via an OIDC role) runs the terraform CLI; every apply assumes a TerraformDeploy role in the target account, and state lives in a versioned, KMS-encrypted S3 bucket with a DynamoDB lock table. The prod account (111111111111) hosts the prod VPC 10.10.0.0/16 (public and private subnets across us-east-1a/b with ALB, NAT, IGW, EC2 web/app tiers and RDS Multi-AZ) and the prod-support VPC 10.11.0.0/16 (SSM-only bastion, monitoring tooling, standalone EC2 with gp3 EBS, S3 backups, CloudWatch), peered to prod for ops access. The non-prod account (222222222222) hosts the uat VPC 10.20.0.0/16 (3-tier app behind an ALB, Aurora cluster, 14d backups, uat.tfvars, backend key uat/network.tfstate, 1 approver) and the test VPC 10.30.0.0/16 (sandbox plus SQL cluster: Aurora MySQL writer + 2 readers, smallest instance class, 1d backups, test.tfvars, backend key test/network.tfstate, 1 approver or auto-merge).

5.2 — Anatomy of one VPC (zoomed in)

This is what you actually write Terraform for. Trace the request path: client → ALB SG → web SG → app SG → db SG. Trace the IAM path: instance profile → role → managed + inline policies.

[Diagram 5.2: one VPC, uat, 10.20.0.0/16, us-east-1.] Internet traffic enters via the IGW into public subnets in two AZs (10.20.1.0/24 and 10.20.2.0/24) holding the ALB, per-AZ NAT gateways and an SSM-only bastion (no SSH keys). Private subnets (10.20.11.0/24 and 10.20.12.0/24) hold the web and app EC2 tiers and RDS with a Multi-AZ standby. The security group chain: alb-sg (:80/443 from 0.0.0.0/0) → web-sg (:80 from alb-sg) → app-sg (:8080 from web-sg) → db-sg (:3306 from app-sg); always reference SGs by id, never CIDRs. Each instance gets an IAM instance profile (role ec2-app-role with AmazonSSMManagedInstanceCore plus an inline s3:GetObject policy for artifacts), a KMS-encrypted 100 GiB gp3 EBS volume (aws_ebs_volume), and a break-glass-only key pair (tls_private_key + aws_key_pair).

06 Real Terraform code — VPC, SG, subnets, IAM, EC2, RDS

Below is the kind of code you'd actually write. Each tab is a real, copy-pasteable snippet. The convention: modules own the resources, envs/<env>/main.tf just calls those modules with environment-specific inputs.

# modules/network/main.tf  -  builds VPC + IGW + 2 public + 2 private subnets + NAT
resource "aws_vpc" "this" {
  cidr_block           = var.cidr_block
  enable_dns_hostnames = true
  enable_dns_support   = true
  tags                 = merge(var.tags, { Name = var.name })
}

resource "aws_internet_gateway" "this" {
  vpc_id = aws_vpc.this.id
  tags   = merge(var.tags, { Name = "${var.name}-igw" })
}

data "aws_availability_zones" "available" {
  state = "available"
}

resource "aws_subnet" "public" {
  for_each                = { for idx, az in slice(data.aws_availability_zones.available.names, 0, 2) : az => idx }
  vpc_id                  = aws_vpc.this.id
  availability_zone       = each.key
  cidr_block              = cidrsubnet(var.cidr_block, 8, each.value + 1)
  map_public_ip_on_launch = true
  tags = merge(var.tags, {
    Name = "${var.name}-public-${each.key}"
    Tier = "public"
  })
}

resource "aws_subnet" "private" {
  for_each          = { for idx, az in slice(data.aws_availability_zones.available.names, 0, 2) : az => idx }
  vpc_id            = aws_vpc.this.id
  availability_zone = each.key
  cidr_block        = cidrsubnet(var.cidr_block, 8, each.value + 11)
  tags = merge(var.tags, {
    Name = "${var.name}-private-${each.key}"
    Tier = "private"
  })
}

resource "aws_eip" "nat" {
  for_each = aws_subnet.public
  domain   = "vpc"
}

resource "aws_nat_gateway" "this" {
  for_each      = aws_subnet.public
  allocation_id = aws_eip.nat[each.key].id
  subnet_id     = each.value.id
  tags          = merge(var.tags, { Name = "${var.name}-nat-${each.key}" })
}

# modules/network/outputs.tf
output "vpc_id"             { value = aws_vpc.this.id }
output "public_subnet_ids"  { value = [for s in aws_subnet.public  : s.id] }
output "private_subnet_ids" { value = [for s in aws_subnet.private : s.id] }
# modules/security/main.tf  -  the SG chain ALB -> web -> app -> db
# Use aws_vpc_security_group_*_rule (Terraform AWS provider 5.x) instead of inline rules.
# That way each rule is its own resource - cleaner diffs, no churn.

resource "aws_security_group" "alb" {
  name        = "${var.name_prefix}-alb"
  description = "ALB ingress from internet"
  vpc_id      = var.vpc_id
}

resource "aws_vpc_security_group_ingress_rule" "alb_https" {
  security_group_id = aws_security_group.alb.id
  cidr_ipv4         = "0.0.0.0/0"
  from_port         = 443
  to_port           = 443
  ip_protocol       = "tcp"
}

resource "aws_security_group" "web" {
  name        = "${var.name_prefix}-web"
  description = "Web tier - only ALB can reach it"
  vpc_id      = var.vpc_id
}
resource "aws_vpc_security_group_ingress_rule" "web_from_alb" {
  security_group_id            = aws_security_group.web.id
  referenced_security_group_id = aws_security_group.alb.id   # <- by id, not CIDR
  from_port                    = 80
  to_port                      = 80
  ip_protocol                  = "tcp"
}

resource "aws_security_group" "app" { name = "${var.name_prefix}-app"  vpc_id = var.vpc_id }
resource "aws_vpc_security_group_ingress_rule" "app_from_web" {
  security_group_id            = aws_security_group.app.id
  referenced_security_group_id = aws_security_group.web.id
  from_port = 8080  to_port = 8080  ip_protocol = "tcp"
}

resource "aws_security_group" "db"  { name = "${var.name_prefix}-db"   vpc_id = var.vpc_id }
resource "aws_vpc_security_group_ingress_rule" "db_from_app" {
  security_group_id            = aws_security_group.db.id
  referenced_security_group_id = aws_security_group.app.id
  from_port = 3306  to_port = 3306  ip_protocol = "tcp"
}

# Egress: explicit, not implicit "all"
# Each tier gets exactly what it needs.
resource "aws_vpc_security_group_egress_rule" "app_to_db" {
  security_group_id            = aws_security_group.app.id
  referenced_security_group_id = aws_security_group.db.id
  from_port = 3306  to_port = 3306  ip_protocol = "tcp"
}
Why "by id, not CIDR"? Referencing a SG by its id creates a stable, scalable rule. If you scale the ALB to 12 nodes, the web tier rule still works — no IPs changed. CIDR-based rules are the #1 reason teams accidentally open prod to the world.
# modules/iam/main.tf  -  EC2 instance profile (least privilege)

data "aws_iam_policy_document" "assume_ec2" {
  statement {
    actions = ["sts:AssumeRole"]
    principals {
      type        = "Service"
      identifiers = ["ec2.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "ec2_app" {
  name               = "${var.name_prefix}-ec2-app"
  assume_role_policy = data.aws_iam_policy_document.assume_ec2.json
  tags               = var.tags
}

# Managed policy: SSM Session Manager (no SSH needed, ever)
resource "aws_iam_role_policy_attachment" "ssm" {
  role       = aws_iam_role.ec2_app.name
  policy_arn = "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore"
}

# Inline policy: app-specific permissions
data "aws_iam_policy_document" "app_inline" {
  statement {
    sid     = "ReadAppArtifacts"
    actions = ["s3:GetObject", "s3:ListBucket"]
    resources = [
      "arn:aws:s3:::${var.artifacts_bucket}",
      "arn:aws:s3:::${var.artifacts_bucket}/*",
    ]
  }
  statement {
    sid     = "DecryptWithAppKey"
    actions   = ["kms:Decrypt"]
    resources = [var.app_kms_key_arn]
  }
}

resource "aws_iam_role_policy" "app_inline" {
  name   = "app-inline"
  role   = aws_iam_role.ec2_app.id
  policy = data.aws_iam_policy_document.app_inline.json
}

# Bind role to an instance profile - this is what you attach to EC2
resource "aws_iam_instance_profile" "ec2_app" {
  name = "${var.name_prefix}-ec2-app"
  role = aws_iam_role.ec2_app.name
}

output "instance_profile_name" { value = aws_iam_instance_profile.ec2_app.name }
# Key pair (break-glass only - normal access is via SSM)
resource "tls_private_key" "break_glass" {
  algorithm = "ED25519"
}

resource "aws_key_pair" "break_glass" {
  key_name   = "${var.name_prefix}-break-glass"
  public_key = tls_private_key.break_glass.public_key_openssh
}

# Store the private key in SSM Parameter Store, not on a laptop
resource "aws_ssm_parameter" "break_glass_priv" {
  name  = "/${var.environment}/keys/break-glass"
  type  = "SecureString"
  value = tls_private_key.break_glass.private_key_pem
  tags  = var.tags
}

# Extra EBS volume for app data (separate from root)
resource "aws_ebs_volume" "app_data" {
  availability_zone = var.az
  size              = var.data_volume_size_gb     # e.g. 100
  type              = "gp3"
  iops              = 3000
  throughput        = 125
  encrypted         = true
  kms_key_id        = var.app_kms_key_arn
  tags              = merge(var.tags, { Name = "${var.name_prefix}-data" })
}

resource "aws_volume_attachment" "app_data" {
  device_name = "/dev/xvdf"
  volume_id   = aws_ebs_volume.app_data.id
  instance_id = aws_instance.app.id
}
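The aws_volume_attachment above references aws_instance.app, which is not shown in the snippet. A minimal sketch of that standalone instance, with variable names assumed, might look like:

# Standalone EC2 instance the data volume attaches to (sketch; variables assumed)
resource "aws_instance" "app" {
  ami                    = data.aws_ami.al2023.id
  instance_type          = var.instance_type
  subnet_id              = var.private_subnet_id
  vpc_security_group_ids = [var.app_sg_id]
  iam_instance_profile   = var.instance_profile_name
  key_name               = aws_key_pair.break_glass.key_name   # break-glass only

  root_block_device {
    volume_size = 30
    volume_type = "gp3"
    encrypted   = true
  }

  tags = merge(var.tags, { Name = "${var.name_prefix}-standalone" })
}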
# modules/compute/main.tf - launch template + autoscaling group + ALB target group

data "aws_ami" "al2023" {
  most_recent = true
  owners      = ["amazon"]
  filter {
    name   = "name"
    values = ["al2023-ami-*-x86_64"]
  }
}

resource "aws_launch_template" "app" {
  name_prefix            = "${var.name_prefix}-app-"
  image_id               = data.aws_ami.al2023.id
  instance_type          = var.instance_type
  key_name               = var.key_pair_name
  vpc_security_group_ids = [var.app_sg_id]

  iam_instance_profile {
    name = var.instance_profile_name
  }

  block_device_mappings {
    device_name = "/dev/xvda"
    ebs {
      volume_size = 30
      volume_type = "gp3"
      encrypted   = true
    }
  }

  user_data = base64encode(templatefile("${path.module}/userdata.sh.tpl", {
    environment = var.environment
  }))

  tag_specifications {
    resource_type = "instance"
    tags          = merge(var.tags, { Name = "${var.name_prefix}-app" })
  }
}

resource "aws_autoscaling_group" "app" {
  name                = "${var.name_prefix}-app-asg"
  vpc_zone_identifier = var.private_subnet_ids
  min_size            = var.min_size
  max_size            = var.max_size
  desired_capacity    = var.desired_capacity
  target_group_arns   = [aws_lb_target_group.app.arn]
  health_check_type   = "ELB"

  launch_template {
    id      = aws_launch_template.app.id
    version = "$Latest"
  }

  tag {
    key                 = "Environment"
    value               = var.environment
    propagate_at_launch = true
  }
}
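The ASG above registers with aws_lb_target_group.app, which the snippet doesn't show. A hedged sketch of the ALB side of the module, with variable names (public_subnet_ids, alb_sg_id, certificate_arn) assumed:

# ALB + target group the ASG registers into (sketch; variable names assumed)
resource "aws_lb" "app" {
  name               = "${var.name_prefix}-app"
  load_balancer_type = "application"
  subnets            = var.public_subnet_ids
  security_groups    = [var.alb_sg_id]
  tags               = var.tags
}

resource "aws_lb_target_group" "app" {
  name     = "${var.name_prefix}-app"
  port     = 8080                      # matches the app-sg ingress rule
  protocol = "HTTP"
  vpc_id   = var.vpc_id

  health_check {
    path                = "/healthz"   # assumed health endpoint
    healthy_threshold   = 2
    unhealthy_threshold = 5
  }
}

resource "aws_lb_listener" "https" {
  load_balancer_arn = aws_lb.app.arn
  port              = 443
  protocol          = "HTTPS"
  certificate_arn   = var.certificate_arn

  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.app.arn
  }
}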
# modules/database/main.tf - Aurora MySQL cluster (writer + 2 readers, Multi-AZ)

resource "aws_db_subnet_group" "this" {
  name       = "${var.name_prefix}-db"
  subnet_ids = var.private_subnet_ids
  tags       = var.tags
}

resource "aws_rds_cluster_parameter_group" "this" {
  name   = "${var.name_prefix}-aurora-mysql"
  family = "aurora-mysql8.0"
  parameter {
    name  = "binlog_format"
    value = "ROW"
  }
}

resource "aws_rds_cluster" "this" {
  cluster_identifier           = "${var.name_prefix}-aurora"
  engine                       = "aurora-mysql"
  engine_version               = "8.0.mysql_aurora.3.05.2"
  database_name                = var.db_name
  master_username              = var.db_user
  master_password              = random_password.db.result
  db_subnet_group_name         = aws_db_subnet_group.this.name
  vpc_security_group_ids       = [var.db_sg_id]
  db_cluster_parameter_group_name = aws_rds_cluster_parameter_group.this.name
  storage_encrypted            = true
  kms_key_id                   = var.kms_key_arn
  backup_retention_period      = var.backup_retention_days
  preferred_backup_window      = "03:00-04:00"
  deletion_protection          = var.environment == "prod"
  skip_final_snapshot          = var.environment != "prod"
  tags                          = var.tags
}

resource "aws_rds_cluster_instance" "this" {
  count                = var.cluster_size       # prod=3, uat=2, test=1
  identifier           = "${var.name_prefix}-aurora-${count.index}"
  cluster_identifier   = aws_rds_cluster.this.id
  instance_class       = var.db_instance_class
  engine               = aws_rds_cluster.this.engine
  engine_version       = aws_rds_cluster.this.engine_version
  db_subnet_group_name = aws_db_subnet_group.this.name
  tags                 = var.tags
}

resource "random_password" "db" {
  length  = 32
  special = true
}

resource "aws_secretsmanager_secret" "db" {
  name = "${var.name_prefix}/db/master"
}
resource "aws_secretsmanager_secret_version" "db" {
  secret_id     = aws_secretsmanager_secret.db.id
  secret_string = jsonencode({ username = var.db_user, password = random_password.db.result })
}
# envs/uat/main.tf - the env root just composes modules with uat values

module "network" {
  source     = "../../modules/network"
  name       = "lf-${var.environment}"
  cidr_block = var.vpc_cidr
  tags       = local.tags
}

module "security" {
  source      = "../../modules/security"
  name_prefix = "lf-${var.environment}"
  vpc_id      = module.network.vpc_id
}

module "iam" {
  source           = "../../modules/iam"
  name_prefix      = "lf-${var.environment}"
  artifacts_bucket = "lf-artifacts-${var.environment}"
  app_kms_key_arn  = aws_kms_key.app.arn
  tags             = local.tags
}

module "app" {
  source                = "../../modules/compute"
  name_prefix           = "lf-${var.environment}"
  environment           = var.environment
  private_subnet_ids    = module.network.private_subnet_ids
  app_sg_id             = module.security.app_sg_id
  instance_profile_name = module.iam.instance_profile_name
  instance_type         = var.instance_type
  min_size              = 2
  max_size              = 4
  desired_capacity      = 2
  tags                  = local.tags
}

module "db" {
  source                  = "../../modules/database"
  name_prefix             = "lf-${var.environment}"
  environment             = var.environment
  private_subnet_ids      = module.network.private_subnet_ids
  db_sg_id                = module.security.db_sg_id
  kms_key_arn             = aws_kms_key.app.arn
  cluster_size            = 2
  db_instance_class       = "db.r6g.large"
  backup_retention_days   = 14
  db_name                  = "appdb"
  db_user                  = "appadmin"
  tags                     = local.tags
}
Read this top to bottom. The env file is intentionally boring — just plumbing. All the cleverness lives in modules, which are tested once and reused four times.

07 Interactive walkthrough — provision a real change

Two paths are worth rehearsing against the layout in section 4: the first apply of a brand-new environment, and a ticketed change, which weaves Git in (section 8 covers that flow end to end).

Tip. Whichever path you take, the rhythm is the same: init → fmt → validate → plan → review → apply.

08 Git workflow — from a ticket to merged main

Terraform is only as safe as the change-management process around it. The pattern below is the most boring, most reliable one I've seen work at scale. Trunk-based, short-lived branches, every plan visible on the PR.

The flow

Ticket INFRA-1842 → branch infra-1842-... → edit code (module + tfvars) → local plan on the env folder → push + PR (CI runs fmt/validate/plan) → review (CODEOWNERS approve) → squash merge to main (always deployable) → CD apply, env-by-env.
The branch is short-lived. The CI plan is the source of truth for review. Apply only happens AFTER merge. Promotion order: test → uat → prod-support → prod · same code, different tfvars · abort if any env's plan looks wrong.

Branch naming, commit messages, PR title

Item | Convention | Example
Branch | <ticket-id-lower>-<short-slug> | infra-1842-uat-partner-cidr
Commit | TICKET: imperative summary ≤ 72 chars | INFRA-1842: open uat ALB to partner 203.0.113.0/24
PR title | Same as commit | (GitHub auto-fills it)
PR body | What / why / blast-radius / plan output / rollback | see template below

Useful Git commands for infra work

Task | Command
Start | git checkout main && git pull
Branch | git checkout -b infra-1842-uat-partner-cidr
Stage | git add -A
Commit | git commit -m "INFRA-1842: open uat ALB to partner"
Push | git push -u origin HEAD
Sync with main mid-PR | git fetch origin && git rebase origin/main
Resolve conflict | git status → edit → git add → git rebase --continue
Abort rebase | git rebase --abort
Drop a bad commit | git reset --soft HEAD~1
See remote PRs | gh pr list / gh pr view 482
Re-run CI plan | gh pr comment --body "/replan"
Tag a release | git tag -a v2026.05.06 -m "..."

Files every Terraform repo must have

.gitignore

# Terraform internals - never commit these
.terraform/
# .terraform.lock.hcl is deliberately NOT listed here - commit it (next card)
*.tfstate
*.tfstate.*
*.tfplan
tfplan
crash.log
crash.*.log

# Local overrides
*.auto.tfvars
override.tf
override.tf.json

# IDE / OS
.idea/
.vscode/
.DS_Store
Never commit *.tfstate. It contains secrets in plaintext (RDS passwords, etc.). State lives in S3 + KMS, not in Git.

.terraform.lock.hcl — commit it

Pins the exact provider versions used to apply. Commit it so every engineer + CI runs identical providers. Without it, a new bug-fix release of aws provider can silently change your plan.

CODEOWNERS

# Default owner
*                            @lf/infra-platform

# Modules - any senior engineer can review
/modules/                    @lf/infra-platform

# Per-env: prod requires senior + security
/envs/prod/                  @lf/infra-seniors @lf/security
/envs/prod-support/          @lf/infra-seniors
/envs/uat/                   @lf/infra-platform
/envs/test/                  @lf/infra-platform

PR template (.github/PULL_REQUEST_TEMPLATE.md)

## Ticket
INFRA-####

## What
<1-2 sentences>

## Why
<business or compliance driver>

## Blast radius
- Envs touched:  test / uat / prod-support / prod
- Resources affected: + N, ~ M, - K

## Plan output
<paste or link CI artifact>

## Rollback plan
git revert + apply, OR `terraform apply` of previous tag

Protecting your state bucket

  • S3 bucket: versioning ON, KMS encryption, block public access, MFA Delete on prod bucket.
  • DynamoDB lock table: provisioned, with point-in-time recovery.
  • IAM: only the TerraformDeploy role can write; engineers get read-only.
  • Cross-account: prod state bucket lives in a third "tooling" account — or in the prod account with a tight bucket policy.
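A minimal Terraform sketch of that hardening, assuming the bucket and table names from section 4 and a pre-existing KMS key (MFA Delete still has to be enabled out-of-band with root credentials):

# State bucket hardening (sketch; names reused from section 4, KMS key assumed)
resource "aws_s3_bucket" "tfstate" {
  bucket = "lf-tfstate-prod-111111111111"
}

resource "aws_s3_bucket_versioning" "tfstate" {
  bucket = aws_s3_bucket.tfstate.id
  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_server_side_encryption_configuration" "tfstate" {
  bucket = aws_s3_bucket.tfstate.id
  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm     = "aws:kms"
      kms_master_key_id = var.state_kms_key_arn   # assumed input
    }
  }
}

resource "aws_s3_bucket_public_access_block" "tfstate" {
  bucket                  = aws_s3_bucket.tfstate.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

# Lock table: provisioned throughput, point-in-time recovery on
resource "aws_dynamodb_table" "tfstate_locks" {
  name           = "lf-tfstate-locks"
  billing_mode   = "PROVISIONED"
  read_capacity  = 5
  write_capacity = 5
  hash_key       = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }

  point_in_time_recovery {
    enabled = true
  }
}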

09 When something goes wrong — identify & recover

The hardest part of Terraform is not writing it — it is recovering when reality and state disagree. Here are the most common failure modes and how to think about them.

Symptom | What it means | How to fix
Error acquiring the state lock | Another plan/apply is running, OR a previous one crashed without releasing the lock. | Find who has the lock (the error shows their identity). Wait or talk to them. Only as a last resort: terraform force-unlock <LOCK_ID>.
Plan wants to destroy a resource you didn't change | Either someone removed it from .tf, or the resource address changed (rename, module move). | If it was a rename: terraform state mv old.addr new.addr. If you want to keep but not manage: terraform state rm.
Plan wants to create something that already exists in AWS | Resource was created in the console (drift), or a previous apply lost it from state. | terraform import <addr> <aws-id>, then re-plan to confirm a clean diff.
InvalidParameterValue: VPC has dependencies on destroy | Resources outside Terraform's state (manually created ENIs, peerings) are blocking deletion. | Find them in the console, decide to import or delete out-of-band, then retry.
Plan output is huge / nondeterministic | Often a tag set differs (e.g. AWS auto-injects a tag), or an attribute is computed. | Use ignore_changes in lifecycle for known noise (see the sketch after this table). Don't silence everything — you'll miss real diffs.
Apply fails halfway | Some resources got created, some didn't. State reflects what succeeded. | Re-run plan. Terraform will only create what's missing. Don't init -reconfigure in panic.
Provider version drift across the team | Someone ran without committing .terraform.lock.hcl. | Commit the lock file. Use terraform init -upgrade intentionally, in a dedicated PR.
Cycle in module graph | Two resources reference each other. | Break with a third resource (e.g. an SG rule referencing two SGs by id) or via a data lookup.
Git merge conflict in *.tfvars or main.tf | Two PRs touched the same file. | git rebase origin/main, edit, terraform fmt, terraform validate, force-push the branch. Plan again before merging.
Force-pushed over someone else's commit | Lost work in the branch (not main). | git reflog to find the lost commit, git cherry-pick it back.
terraform plan shows resources you never wrote | You're pointing at the wrong backend / state file. Common when copy-pasting backend.tf. | Check .terraform/terraform.tfstate → it points to the bucket. Stop, fix the backend block, terraform init -reconfigure.
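For the "known noise" row, a lifecycle sketch; the ignored attributes here (a tag another system rewrites, user_data) are assumptions you must tailor to your own noise:

# Narrow ignore_changes: hide known noise, keep real drift visible
resource "aws_instance" "app" {
  ami           = data.aws_ami.al2023.id
  instance_type = var.instance_type
  # ...other arguments as usual...

  lifecycle {
    ignore_changes = [
      tags["LastPatched"],   # hypothetical tag maintained by a patching tool
      user_data,             # example: baked once, never updated in place
    ]
  }
}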

The "is it Terraform or Git?" decision tree

Something is wrong. Did the error happen running git or terraform?

Git side (the real cloud is unaffected):
  • merge conflict → rebase + edit
  • pushed wrong branch → reflog + cherry-pick
  • CI plan stale → rebase & force-push
  • lost commits → git reflog
  • wrong file committed → reset + recommit

Terraform / AWS side (reality and state can diverge):
  • state lock → force-unlock (with care)
  • drift / surprise destroy → import or state rm
  • apply failed mid-way → re-plan, re-apply
  • provider mismatch → init -upgrade in a PR
  • wrong backend → init -reconfigure
Golden rule. If in doubt, run terraform plan from a clean checkout of main first to see what state thinks reality looks like. plan never changes anything.

10 Cheatsheet

Terraform commands you'll use weekly

Task | Command
Initialise dir | terraform init
Refresh providers | terraform init -upgrade
Switch backend | terraform init -reconfigure
Format | terraform fmt -recursive
Validate | terraform validate
Plan (env-aware) | terraform plan -var-file=test.tfvars -out=tfplan
Apply saved plan | terraform apply tfplan
Destroy (test only) | terraform destroy -var-file=test.tfvars
List state | terraform state list
Inspect resource | terraform state show 'module.app.aws_autoscaling_group.app'
Rename in state | terraform state mv module.old module.new
Forget resource | terraform state rm 'aws_security_group.legacy'
Import existing | terraform import 'aws_vpc.this' vpc-0abc123
Re-create on next apply | terraform taint 'module.app.aws_instance.app[0]'
Outputs | terraform output -json
Console (interactive) | terraform console
Force-unlock | terraform force-unlock LOCK_ID
Show plan as JSON | terraform show -json tfplan

Git commands for infra PRs

Task | Command
Sync | git fetch origin && git pull --ff-only
Branch | git checkout -b infra-####-slug
Stage | git add -A
Diff staged | git diff --staged
Commit | git commit -m "INFRA-####: ..."
Push | git push -u origin HEAD
Rebase on main | git fetch origin && git rebase origin/main
Continue rebase | git rebase --continue
Abort rebase | git rebase --abort
Force push (safely) | git push --force-with-lease
Recover lost commit | git reflog && git cherry-pick <sha>
Switch branch w/o losing work | git stash && git checkout main
Open PR (gh CLI) | gh pr create --fill
Squash & merge | gh pr merge --squash --delete-branch

Mental model recap

1. Code is desired state

Your .tf files describe what you want. Terraform's job is to make AWS match.

2. State is recorded reality

S3 + DynamoDB. Never on a laptop. Never edited by hand.

3. Modules > copy-paste

Same module, different tfvars per env. That's how you get four reproducible environments.

4. Plan before apply

Read every line of the plan. If you don't understand a diff, stop.

5. Git is the audit trail

Every infra change is a PR. CI plan in PR comments. CODEOWNERS gate prod.

6. Blast radius via folders

One folder = one state = one env. Test can never break prod by accident.

You are now ready to: stand up a new VPC, write a module, wire two AWS accounts, take a ticket through to merged-and-applied, and recover when the state and reality disagree. The next 200 levels are about scaling these patterns — remote modules, Terragrunt or Terraform Stacks, OPA/Sentinel policy, and drift detection in CI.

11 Files deep-dive — what each one does, line by line

New engineers often look at a Terraform folder and see seven files with confusing names. Here is what each is for, why it exists, and what goes inside it. Terraform reads every .tf file in the current folder and stitches them together as one big config; the order and the filenames are pure convention — but follow the convention because that is what every reviewer expects.

11.1 — versions.tf · the contract

This file pins the Terraform version and provider versions. It is the first file the senior writes and the last one to change.

terraform {
  required_version = ">= 1.6.0, < 2.0.0"     # your binary must be in this range

  required_providers {
    aws = {
      source  = "hashicorp/aws"             # registry namespace
      version = "~> 5.74"                   # 5.74.x ok, 6.x not ok
    }
    random = { source = "hashicorp/random", version = "~> 3.6" }
    tls    = { source = "hashicorp/tls",    version = "~> 4.0" }
  }
}
Why version-pin. Without this, a new bug-fix release of the AWS provider (released yesterday) can change tomorrow's plan in subtle ways. Pin it; bump it intentionally in its own PR.

11.2 — providers.tf · how Terraform talks to AWS

provider "aws" {                          # the default (un-aliased) provider
  region = var.aws_region

  assume_role {                            # engineer's SSO → deploy role
    role_arn     = var.deploy_role_arn
    session_name = "tf-${var.environment}"
  }

  default_tags {                           # tag EVERY resource automatically
    tags = {
      Environment = var.environment
      ManagedBy   = "terraform"
      Repo        = "terraform-aws-platform"
    }
  }
}

provider "aws" {                          # aliased provider, e.g. another region
  alias  = "us_west"
  region = "us-west-2"
}
# Inside a resource use:  provider = aws.us_west

11.3 — backend.tf · where state lives

terraform {
  backend "s3" {
    bucket         = "lf-tfstate-nonprod-222"     # the bucket per account
    key            = "uat/network.tfstate"          # <-- the per-env knob
    region         = "us-east-1"
    dynamodb_table = "lf-tfstate-locks"
    encrypt        = true
  }
}
Backend block cannot use variables. It is read before variables are even parsed. If you need different bucket/key per env, either keep separate backend.tf per env folder (recommended) or use terraform init -backend-config=....
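If you would rather share one backend.tf across envs, the partial-configuration route looks like this; a sketch with a hypothetical uat.s3.tfbackend file passed as terraform init -backend-config=uat.s3.tfbackend:

# backend.tf - partial configuration: only the constants live in code
terraform {
  backend "s3" {
    region         = "us-east-1"
    dynamodb_table = "lf-tfstate-locks"
    encrypt        = true
  }
}

# uat.s3.tfbackend - the per-env values, supplied at init time
bucket = "lf-tfstate-nonprod-222"
key    = "uat/network.tfstate"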

11.4 — variables.tf · declarations only, no values

This file declares what the config accepts as input. It never holds values. Values come from *.tfvars, -var, or TF_VAR_* env vars (covered in section 12).

# variables.tf - declarations

variable "environment" {
  type        = string                       # required type
  description = "Logical env name: prod | prod-support | uat | test"

  validation {                              # enforce shape at plan time
    condition     = contains(["prod","prod-support","uat","test"], var.environment)
    error_message = "environment must be prod, prod-support, uat, or test."
  }
}

variable "vpc_cidr" {
  type        = string
  description = "VPC CIDR block, /16"
  default     = "10.0.0.0/16"             # default = optional input
}

variable "db_password" {
  type        = string
  sensitive   = true                         # hides value in plan output and outputs
  description = "Master DB password (typically supplied by Secrets Manager, not tfvars)"
}

variable "app_servers" {                     # complex types are first-class
  type = list(object({
    name          = string
    instance_type = string
    public        = bool
  }))
  default     = []
  description = "App tier sizing per server"
}

variable "common_tags" {
  type    = map(string)
  default = {}
}
Variable attribute | What it does
type | string, number, bool, list(...), set(...), map(...), object({...}), tuple([...])
description | Shows up in terraform plan hints and module docs. Always write it.
default | Optional. If absent, the value MUST be supplied at plan time.
sensitive | Redacts from plan/apply output. Still saved to state — protect state.
nullable | false means callers cannot pass null.
validation | Reject bad values at plan time with a friendly error.

11.5 — data.tf · read-only lookups

Data sources let you query AWS without managing the resource. Latest AMI, an existing Route53 zone, the caller identity. Output of a data source is fresh on every run; that is good for AMIs (you want the newest) but means you can get unexpected diffs — pin AMIs in production.

# data.tf

data "aws_caller_identity" "current" {}        # who am I? - returns account_id, arn, user_id

data "aws_region" "current" {}                  # the region you're in

data "aws_availability_zones" "available" {       # list of AZs in the region
  state = "available"
}

data "aws_ami" "al2023" {
  most_recent = true
  owners      = ["amazon"]
  filter {
    name   = "name"
    values = ["al2023-ami-*-x86_64"]
  }
}

data "aws_route53_zone" "corp" {                 # reference an existing zone
  name         = "example.internal."
  private_zone = true
}

# Use them anywhere as data.<type>.<name>.<attr>
# e.g. data.aws_caller_identity.current.account_id
Resource vs Data source mental rule. If you'd be sad if it disappeared, make it a resource. If you only want to read it (because someone else owns it), make it a data.
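One hedged way to act on the "pin AMIs in production" advice above while keeping the fresh lookup elsewhere (the variable name is an assumption):

# Optional AMI pin: set pinned_ami_id in prod.tfvars, leave empty elsewhere
variable "pinned_ami_id" {
  type        = string
  default     = ""
  description = "Explicit ami-... id to freeze; empty means use the latest AL2023 lookup"
}

locals {
  ami_id = var.pinned_ami_id != "" ? var.pinned_ami_id : data.aws_ami.al2023.id
}
# Reference local.ami_id instead of data.aws_ami.al2023.id in launch templates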

11.6 — main.tf · the table of contents

Despite the name, main.tf rarely contains the bulk of your code in a real repo. Resources live in modules. The env-level main.tf is just "this env composes these modules".

# envs/uat/main.tf

# 1. local values - computed once, used in many places
locals {
  name_prefix = "lf-${var.environment}"
  account_id  = data.aws_caller_identity.current.account_id
  tags = merge(var.common_tags, {
    Environment = var.environment
    Account     = local.account_id
    ManagedBy   = "terraform"
  })
}

# 2. module composition - the actual stack for this env
module "network" {
  source     = "../../modules/network"
  name       = "${local.name_prefix}-vpc"
  cidr_block = var.vpc_cidr
  tags       = local.tags
}

module "security" {
  source      = "../../modules/security"
  name_prefix = local.name_prefix
  vpc_id      = module.network.vpc_id          # cross-module reference
  tags        = local.tags
}

# 3. one-off resources are fine here too if they're truly env-specific
resource "aws_kms_key" "app" {
  description             = "App KMS key for ${var.environment}"
  deletion_window_in_days = var.environment == "prod" ? 30 : 7
  tags                    = local.tags
}
Splitting main.tf when it grows. When the file gets past ~150 lines, split by concern: network.tf, compute.tf, database.tf, iam.tf. Terraform doesn't care — it concatenates anyway.