You don't just read about a problem. You see it diagrammed, watch it play out, then run the commands yourself in a guided terminal.
| # | Slide | Layer |
|---|---|---|
| 1 | Symptom & business impact | Concept |
| 2 | Architecture diagram (where it's blocking) | Visual |
| 3 | Hypotheses & debug method | Sim |
| 4 | Diagnose — native commands | Lab |
| 5 | Root cause | Sim |
| 6 | Fix — commands | Lab |
| 7 | IaC change (Terraform) | IaC |
| 8 | Cheeky / non-obvious | Concept |
| 9 | Prevent / monitor | Concept |
| 10 | Interactive lab terminal | Lab |
| Marker | Meaning |
|---|---|
| tip | Best practice or non-obvious trick |
| note | Common assumption to verify |
| gotcha | Bites you in production |
| IaC | Terraform / IaC change to lock in the fix |
| Account | Alias | Purpose |
|---|---|---|
| 111111111111 | gc-mgmt | Org root, billing |
| 222222222222 | gc-log-archive | Central CloudTrail/Config logs |
| 333333333333 | gc-audit | Security Hub, GuardDuty admin |
| 444444444444 | gc-network | TGW, R53 Resolver, DX, network firewall |
| 555555555555 | gc-shared-svcs | FSx, AD connector, jump hosts |
| 666666666666 | gc-prod-app | Customer-facing microservices |
| 777777777777 | gc-prod-data | RDS, ElastiCache, FSx for SQL backups |
| 888888888888 | gc-stg-app | Staging mirror |
| 999999999999 | gc-dev-app | Dev workloads |
| 121212121212 | gc-tools-cicd | GitHub Actions OIDC, artifacts |
| 131313131313 | gc-finsub-prod | FinSub subsidiary prod |
| 141414141414 | gc-retailsub-poc | RetailSub PoC |
Every scenario diagram is built from this fixed set: 70+ AWS service icons in standard category colours, 7 container styles (region/AZ/VPC/subnet tiers/account/on-prem), and ~10 composite mini-diagrams. No bespoke geometry per scenario — updates here propagate everywhere.
Linux toolkit: `ip route get <ip>` · `ss -tnp` · `tracepath -n` · `mtr --report-wide -c 50` · `tcpdump -ni any host X and port Y` · `getent hosts` / `resolvectl` · `curl http://169.254.169.254/...` (IMDSv2 token) · `cloud-init-output.log` · `amazon-ssm-agent` logs

Windows toolkit: `Test-NetConnection -CommonTCPPort RDP -ComputerName X` · `Get-NetRoute` / `Get-NetIPConfiguration` · `Resolve-DnsName -Server X` · `nltest /sc_query:corp.globalcorp.local` · `dsregcmd /status` · `klist` / `klist purge` · `w32tm /query /status /verbose` · `Get-WinEvent -LogName System`

| # | Category | Count | Status |
|---|---|---|---|
| 01 | EC2 lifecycle & provisioning | 20 | live (S002–S021) |
| 02 | Security Groups & NACLs | 20 | partial (S022–S025 of 20) |
| 03 | IAM, instance roles, cross-account | 20 | queued |
| 04 | VPC, subnets, route tables | 15 | queued |
| 05 | Transit Gateway & cross-acct routing | 15 | queued |
| 06 | DNS / Route 53 / Resolver | 15 | queued |
| 07 | Active Directory & domain join | 15 | queued |
| 08 | Systems Manager (SSM) | 15 | queued |
| # | Category | Count | Status |
|---|---|---|---|
| 09 | VPC endpoints | 15 | queued |
| 10 | CloudWatch Logs & Metrics | 15 | queued |
| 11 | Load balancers (ALB/NLB) | 15 | queued |
| 12 | Backup & DR | 15 | queued |
| 13 | FSx & storage | 10 | queued |
| 14 | Okta / federation / MFA | 15 | queued |
| 15 | Terraform / IaC operations | 20 | queued |
| 16 | Org / SCP / Landing Zone | 10 | queued |
Symptom: `i-0abc123` reaches `running`, but `Add-Computer` fails with "An Active Directory domain controller for the domain could not be contacted." App team is blocked.

- System > NETLOGON > Event 5719: "This computer was not able to set up a secure session with a domain controller in domain corp.globalcorp.local because of the following: The remote procedure call was cancelled."
- `cloud-init-output.log`: `Add-Computer : An Active Directory domain controller (AD DC) for the domain corp.globalcorp.local could not be contacted.`
| Constraint | Implication |
|---|---|
| SCP requires IMDSv2 | User-data must use IMDSv2 token call |
| SCP denies iam:CreateUser | Domain-join uses a vaulted AD service account, not an IAM user |
| VPC has no public subnet | No direct internet to secretsmanager.amazonaws.com — must use VPCe |
| R53 Resolver outbound rule for corp.globalcorp.local | DNS query must reach Resolver → corp DCs over TGW |
| SG sg-prod-private-windows-ingress | Egress 53/88/389/445/etc. to pl-corp-onprem |
| # | Layer | Hypothesis | Falsify with |
|---|---|---|---|
| H1 | Identity | Instance can't fetch svc-domjoin secret (no role / wrong KMS) | aws sts get-caller-identity · get-secret-value from instance |
| H2 | Reach | SG egress missing port to corp DCs | Reachability Analyzer 53/88/389/445 |
| H3 | Reach | TGW route table missing 10.0.0.0/8 | aws ec2 search-tgw-routes |
| H4 | Reach | Inspection FW dropping RPC dynamic 49152–65535 | FW logs + Test-NetConnection -Port 50000 |
| H5 | DNS | R53 Resolver rule missing/disassociated from VPC | list-resolver-rule-associations |
| H6 | Auth | Time skew > 5 min → Kerberos refuses | w32tm /query /status |
| H7 | Auth | Service account svc-domjoin lacks Add Computer right on target OU | DC sec event 4625 + delegation review |
First checks from the instance: `Resolve-DnsName -Type SRV _ldap._tcp.dc._msdcs.corp.globalcorp.local`, then `Test-NetConnection 10.10.0.10 -Port 389` and `-Port 445`.

```bash
# confirm identity + assume into prod-app
aws sts get-caller-identity
aws sts assume-role --role-arn arn:aws:iam::666666666666:role/FedAppDev \
  --role-session-name s001 --query Credentials

# Reachability Analyzer (config-time check)
aws ec2 create-network-insights-path \
  --source i-0abc123 --destination 10.10.0.10 \
  --protocol TCP --destination-port 445
aws ec2 start-network-insights-analysis --network-insights-path-id nip-...

# run the Windows-side diagnostics via SSM
aws ssm send-command --instance-ids i-0abc123 \
  --document-name AWS-RunPowerShellScript \
  --parameters 'commands=[
    "Resolve-DnsName -Type SRV _ldap._tcp.dc._msdcs.corp.globalcorp.local",
    "Test-NetConnection 10.10.0.10 -Port 389",
    "Test-NetConnection 10.10.0.10 -Port 445",
    "Test-NetConnection 10.10.0.10 -Port 50000",
    "w32tm /query /status",
    "klist"
  ]'
```
```text
Name                                     Type TTL Section
----                                     ---- --- -------
_ldap._tcp.dc._msdcs.corp.globalcorp...  SRV  600 Answer
Priority : 0   Port : 389   Target : dc1-ny.corp...

ComputerName     : 10.10.0.10
RemoteAddress    : 10.10.0.10
RemotePort       : 389
TcpTestSucceeded : True

ComputerName     : 10.10.0.10
RemotePort       : 50000
TcpTestSucceeded : False   # <-- this is our smoking gun
```
```text
# Network Firewall log query in CloudWatch Logs Insights
fields @timestamp, src, dst, dst_port, action
| filter dst = "10.10.0.10"
| filter src like /^10\.20\.10\./
| filter action = "DROP"
| stats count() by dst_port
```
Root cause: the rule group `gc-corp-ad-allow` permits only the well-known AD ports: 53/88/389/636/3268/3269/445. DNS and LDAP therefore work during `Add-Computer`, but the secure channel setup needs a dynamic RPC port → FW drops → client retries → eventually surfaces as "RPC was cancelled".

| Misleading signal | Why |
|---|---|
| NETLOGON 5719 fires for many causes | Same code, 7+ root causes |
| Reachability Analyzer (cfg) passes 445 | Doesn't test stateful FW dynamic ports |
| FW logs in central acct | Devs lack read on log-archive |
| Sometimes works (race) | If RPC happens to negotiate <49152, it passes |
Fix: pin the DCs' dynamic RPC range to 50000–50099 and only allow that subset through the firewall.

```bash
# 1. Patch the Network Firewall rule group to permit
#    the AD RPC dynamic range OR a constrained sub-range.
aws network-firewall describe-rule-group \
  --rule-group-name gc-corp-ad-allow --type STATEFUL \
  --query 'RuleGroup' > rg.json

# 2. Append rule (Suricata syntax) and update.
#    pass tcp $HOME_NET any -> $CORP_DC any (msg:"AD RPC dyn"; \
#      flow:to_server,established; sid:1000201; rev:1; \
#      dst_port:[49152:65535];)
aws network-firewall update-rule-group \
  --rule-group-name gc-corp-ad-allow --type STATEFUL \
  --rule-group file://rg.json --update-token <token>
```
```bash
aws ssm send-command --instance-ids i-0abc123 \
  --document-name AWS-RunPowerShellScript \
  --parameters 'commands=[
    "Test-NetConnection 10.10.0.10 -Port 50000",
    "Add-Computer -DomainName corp.globalcorp.local -Credential $cred -Restart"
  ]'
```
Confirm SG egress still lines up with pl-corp-onprem, and tag the rule group Owner=netsec, References=AD-DC-Ports.

```bash
# in gc-network repo
git checkout -b fix/nf-ad-rpc-dynamic
# edit modules/inspection-fw/rules/ad-allow.suricata
git diff --stat
terraform fmt -recursive
terraform validate
terraform plan -var-file=envs/us-east-1.tfvars \
  -target=module.inspection_fw.aws_networkfirewall_rule_group.ad_allow
```
resource "aws_networkfirewall_rule_group" "ad_allow" { capacity = 200 name = "gc-corp-ad-allow" type = "STATEFUL" rule_group { rules_source { rules_string = file("${path.module}/rules/ad-allow.suricata") } rule_variables { ip_sets { key = "CORP_DC" ip_set { definition = ["10.10.0.10/32", "10.10.0.11/32"] } } } } tags = local.tags }
```text
# existing well-known AD ports...
pass tcp $HOME_NET any -> $CORP_DC [53,88,389,636,3268,3269,445] \
  (msg:"AD well-known"; sid:1000101; rev:2;)

# NEW: AD RPC dynamic range (constrained to 50000-50099)
pass tcp $HOME_NET any -> $CORP_DC 50000:50099 \
  (msg:"AD RPC dyn constrained"; flow:to_server,established; sid:1000201; rev:1;)
```
- CI: checkov + tflint green; CI posts the plan to the PR.
- Review: @netsec-leads, @ad-leads.
- Apply: `apply.yml` assumes `FedTerraformApply` via OIDC, runs `terraform apply`.
- Keep `var.ad_rpc_range = "50000-50099"` with the same value referenced by the per-spoke SG egress modules — one source of truth.

Note you never RDP'd into the host: `aws ssm send-command` with `AWS-RunPowerShellScript` ran the diagnostics with full audit (CloudTrail + SSM Run Command history). For interactive work, `aws ssm start-session --target i-...` is your shell.
Reachability Analyzer evaluates configuration only: SG, NACL, route tables, TGW. It does not evaluate stateful Network Firewall rules. If RA says reachable but traffic isn't, suspect the inspection FW, host firewall, MTU, or asymmetric routes.
Devs often lack read on gc-log-archive. We expose a cross-account CloudTrail Lake datastore + a read-only log-insights view via aws-vault assume-role onto a FedNetTroubleshoot role — so devs can query FW logs without copying data.
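A minimal sketch of that cross-account query path (the aws-vault profile name and log group are illustrative, not from this runbook):

```bash
# assume the read-only troubleshooting role, then query the central FW logs
aws-vault exec fednet-troubleshoot -- \
  aws logs start-query \
    --log-group-name /gc/network-firewall/flow \
    --start-time "$(date -d '-1 hour' +%s)" --end-time "$(date +%s)" \
    --query-string 'fields @timestamp, src, dst, dst_port, action
      | filter action = "DROP" | stats count() by dst_port'

# then poll for results with the returned queryId
aws-vault exec fednet-troubleshoot -- \
  aws logs get-query-results --query-id <query-id>
```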
Default AD RPC dynamic range is huge. We pin DCs to 50000–50099 and document it as the AD-team contract. FW rule shrinks from a 16k-port hole to 100 ports.
SCP enforces IMDSv2. Our user-data fetches the IMDSv2 token first, then the role creds, then the secret. If you script the IMDSv1 way it silently 401s and the domain-join “just” fails.
```powershell
# PowerShell line continuation is a backtick, not a backslash
$tk = Invoke-RestMethod -Headers @{"X-aws-ec2-metadata-token-ttl-seconds"="300"} `
    -Method PUT -Uri "http://169.254.169.254/latest/api/token"
# then present the token on every metadata call, e.g. the role credentials:
Invoke-RestMethod -Headers @{"X-aws-ec2-metadata-token"=$tk} `
    -Uri "http://169.254.169.254/latest/meta-data/iam/security-credentials/"
```
Add DomainJoin=required tag at launch. A maintenance window step waits for that tag, then runs the GC-JoinOnPremAD doc — lets you re-run domain join idempotently after a fix without rebuilding the host.
Prevent / monitor:

- CW alarm on `DroppedPackets` with dimension `StatefulRuleGroup=gc-corp-ad-allow` — non-zero in 5 min → PagerDuty; dashboard `gc-ad-health`.
- Config rule `required-network-firewall-rule-group-tags` — rule groups must carry the `References=AD-DC-Ports` tag (…so they show up in this audit).
- SSM Run Command success/failure for `GC-JoinOnPremAD`; failures trigger a Lambda that posts diagnostics to Slack #ad-domain-join.
- Canary: `Test-NetConnection` matrix to all corp DCs every 5 min, emits a CW custom metric.
- `aws_networkfirewall_rule_group` changes require @ad-leads review (CODEOWNERS).

In the lab terminal, type `hint`, `show`, `reset`, or `list` at any time.

Symptom: `aws ec2 run-instances` returns an InstanceId, but the launch fails — the instance sits `pending` for ~8 min, then transitions to `shutting-down` → `terminated` with `Server.InternalError: Internal error on launch`.

| Item | Detail |
|---|---|
| AMI | shared from gc-tools-cicd (121212121212) |
| Root volume | EBS encrypted with customer KMS key in gc-tools-cicd |
| Launching account | gc-prod-app (666666666666) |
| Default EBS encryption | on, with account-default KMS key in 666... (different key) |
| Service role | AWSServiceRoleForAutoScaling |
Server.InternalError is the polite version of "something on the EC2 side blew up" — almost always EBS attach, ENI attach, or KMS.

| # | Hypothesis | Disprove with |
|---|---|---|
| H1 | EBS attach fails — KMS cross-acct grant missing | describe-instance-attribute --attribute reason |
| H2 | ENI attach fails — subnet/AZ ran out of IPs | describe-subnets AvailableIpAddressCount |
| H3 | AZ capacity (Insufficient) | StateReasonMessage contains Insufficient capacity |
| H4 | Tenancy mismatch (dedicated host expired) | describe-host-reservations |
| H5 | SCP blocking iam:PassRole during launch | CloudTrail event RunInstances errorCode |
```bash
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=ResourceName,AttributeValue=i-0xx \
  --max-results 5 --query 'Events[].CloudTrailEvent' \
  | jq -r '.[]' | jq 'select(.errorCode != null) | {errorCode, errorMessage}'
```
```bash
# 1. Pull the StateReason directly
aws ec2 describe-instances --instance-ids i-0xx \
  --query 'Reservations[].Instances[].{S:State.Name,R:StateReason}'

# 2. Pull instance-status (more granular)
aws ec2 describe-instance-status --instance-ids i-0xx \
  --include-all-instances

# 3. Inspect the snapshot encryption + KMS key
aws ec2 describe-snapshots --snapshot-ids snap-0xx \
  --query 'Snapshots[].{Enc:Encrypted,KMS:KmsKeyId,Owner:OwnerId}'

# 4. Check key policy in the source account
aws --profile gc-tools kms get-key-policy \
  --key-id alias/tools-ami --policy-name default | jq
```
```bash
# 5. List grants on the key (look for our role)
aws --profile gc-tools kms list-grants \
  --key-id alias/tools-ami \
  --query 'Grants[?contains(GranteePrincipal,`666666666666`)]'

# 6. Try the decrypt directly with an exec-role on a test instance
aws ssm send-command --instance-ids i-test \
  --document-name AWS-RunShellScript \
  --parameters 'commands=[
    "aws kms describe-key --key-id arn:aws:kms:us-east-1:121212121212:key/aaa..."
  ]'
```
Root cause — CloudTrail shows AccessDenied on kms:Decrypt for the principal AWSServiceRoleForAutoScaling, not the user/role that called RunInstances:

- ami-prod-base uses an encrypted snapshot backed by KMS key arn:aws:kms:us-east-1:121212121212:key/aaa-tools-ami.
- When the ASG launches in gc-prod-app, the launch goes through the service-linked role AWSServiceRoleForAutoScaling.
- That role needs kms:CreateGrant on the source key on behalf of EC2/EBS.
- The key policy never granted kms:CreateGrant to aws-service-role/autoscaling.amazonaws.com.
- The instance sits pending; EC2 retries the EBS attach for ~8 min, then gives up → Server.InternalError.

Fix, part 1 (in the gc-tools-cicd repo): let the spoke account use the key.

```hcl
data "aws_iam_policy_document" "tools_ami_key" {
  statement {
    sid = "AllowSpokeAccountsToUseKey"
    actions = [
      "kms:Decrypt", "kms:DescribeKey",
      "kms:ReEncrypt*", "kms:GenerateDataKey*",
      "kms:CreateGrant",
    ]
    principals {
      type        = "AWS"
      identifiers = ["arn:aws:iam::666666666666:root"]
    }
    resources = ["*"]
    condition {
      test     = "StringEquals"
      variable = "kms:ViaService"
      values   = ["ec2.us-east-1.amazonaws.com"]
    }
  }
}
```
resource "aws_iam_role_policy" "asg_kms" { role = "AWSServiceRoleForAutoScaling" policy = jsonencode({ Version="2012-10-17", Statement=[{ Effect="Allow", Action=["kms:CreateGrant","kms:Decrypt", "kms:ReEncrypt*","kms:GenerateDataKey*", "kms:DescribeKey"], Resource="arn:aws:kms:us-east-1:121212121212:key/aaa-tools-ami" }] }) }
A shared module takes spoke_account_ids + kms_key_arn and emits both the key policy statement and the spoke-side grant from a single locals.tf source of truth.

Use VPC Reachability Analyzer? No — this is KMS, not network. Use IAM Access Analyzer (cross-account) to surface keys exposed/granted across accounts before the launch even happens.
Pre-flight: aws ec2 run-instances --dry-run only checks the calling principal — not the EBS KMS chain. Bake an explicit kms:DescribeKey probe into your AMI promotion job.
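One possible shape for that probe, as a sketch — it assumes the promotion job has a pre-configured assume-role profile per spoke account (profile naming and account list are illustrative):

```bash
#!/usr/bin/env bash
# Fail AMI promotion early if a spoke can't even describe the source KMS key.
set -euo pipefail
KEY_ARN="arn:aws:kms:us-east-1:121212121212:key/aaa-tools-ami"

for acct in 666666666666 888888888888; do
  profile="spoke-$acct"   # assume-role profile into the spoke account
  if ! aws --profile "$profile" kms describe-key \
       --key-id "$KEY_ARN" >/dev/null 2>&1; then
    echo "FAIL: $acct cannot DescribeKey on $KEY_ARN" >&2
    exit 1
  fi
done
echo "KMS pre-flight OK"
```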
If you can't change the source key policy, copy the AMI into the spoke account and re-encrypt with the local default key. The cost is double-storage; the win is no cross-acct grants to maintain.
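A hedged sketch of that copy-and-re-encrypt path, run from the spoke account once the AMI is shared (name and key alias illustrative):

```bash
# run in gc-prod-app: copy the shared AMI and re-encrypt with the local key
aws ec2 copy-image \
  --source-image-id ami-0abc123def456 \
  --source-region us-east-1 --region us-east-1 \
  --name orders-api-local-copy \
  --encrypted --kms-key-id alias/aws/ebs
```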
Alarm on the EBS metric VolumeAttachFailures (custom via EventBridge on AttachVolume errorCode), routes to #platform-pager.
kms-cmk-not-scheduled-for-deletion + a custom rule that flags any KMS key shared cross-acct that is missing CreateGrant to autoscaling.amazonaws.com.
The AMI promotion product validates: launch perm + snapshot share + KMS grant exist for every spoke account in scope. If not, promotion fails.
Symptom: user-data never ran — cloud-init-output.log is empty on Linux; the EC2Launch log shows "UserData persist disabled" on Windows.

| Item | Detail |
|---|---|
| OS | Linux: AL2023; Windows: Server 2022 EC2Launch v2 |
| Launch source | Launch Template v6 (just promoted) |
| AMI | baked yesterday from custom pipeline |
| User-data | shell script (Linux) / <powershell>...</powershell> (Win) |
Four usual suspects: (1) EC2Launch persist flag, (2) cloud-init disabled in the baked image, (3) MIME multi-part malformed, (4) #cloud-config typo.

| # | Hypothesis | Disproof |
|---|---|---|
| H1 | AMI baked with cloud-init semaphores already present (Linux) | ls /var/lib/cloud/sem/ on baked AMI |
| H2 | AMI baked w/o running EC2Launch SysprepInstance (Win) | EC2Launch.exe sysprep --shutdown log |
| H3 | User-data MIME multi-part missing Content-Type: text/x-shellscript | head -c 500 /var/lib/cloud/instance/user-data.txt |
| H4 | #cloud-config YAML invalid — cloud-init silently no-ops | cloud-init schema --system |
| H5 | Launch Template v6 has empty UserData field | describe-launch-template-versions |
Linux: `cloud-init status --long` and `journalctl -u cloud-final`. Windows: `Get-Content C:\ProgramData\Amazon\EC2Launch\log\agent.log -Tail 200`. Always compare against the rendered user-data at `http://169.254.169.254/latest/user-data`: if it's wrong there, the LT is wrong; if it's right there but didn't run, it's the AMI.

```bash
# IMDSv2 token first
TOKEN=$(curl -s -X PUT http://169.254.169.254/latest/api/token \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 60")

# Rendered user-data
curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
  http://169.254.169.254/latest/user-data | head -40

# cloud-init status + log
cloud-init status --long
sudo journalctl -u cloud-final --no-pager | tail -100

# Look for stale semaphores baked into AMI
ls -la /var/lib/cloud/sem/
ls -la /var/lib/cloud/instance/
```
```powershell
# EC2Launch v2 task state
Get-Service AmazonSSMAgent
Get-Content "C:\ProgramData\Amazon\EC2Launch\log\agent.log" `
  -Tail 200

# Has UserData been marked "run-once"?
Test-Path "C:\ProgramData\Amazon\EC2Launch\state\.run-once"

# Re-arm UserData for next boot
& "C:\Program Files\Amazon\EC2Launch\EC2Launch.exe" reset --schedule

# Check the rendered user-data
Invoke-WebRequest -Headers @{"X-aws-ec2-metadata-token"=$tk} `
  -Uri "http://169.254.169.254/latest/user-data"
```
Root cause: the bake never cleaned /var/lib/cloud/sem/ before aws ec2 create-image, so cloud-init's per-instance scripts-user module thought it had already run; on Windows the .run-once flag survived because the bake skipped EC2Launch.exe sysprep.

```hcl
# last provisioner before snapshot
provisioner "shell" {
  inline = [
    "sudo cloud-init clean --logs",
    "sudo rm -rf /var/lib/cloud/sem/* /var/lib/cloud/instance",
    "sudo rm -f /etc/machine-id && sudo touch /etc/machine-id",
    "sudo truncate -s 0 /etc/hostname",
    "sudo rm -rf /root/.ssh /home/ec2-user/.ssh",
  ]
}
```
provisioner "powershell" { inline = [ "& 'C:\\Program Files\\Amazon\\EC2Launch\\EC2Launch.exe' reset", "& 'C:\\Program Files\\Amazon\\EC2Launch\\EC2Launch.exe' sysprep --shutdown" ] }
```hcl
# in tools-cicd: promotion job
resource "aws_ssm_parameter" "prod_ami_id" {
  name  = "/gc/prod/ami/orders-api"
  type  = "String"
  value = data.aws_ami.candidate.id

  lifecycle {
    precondition {
      condition     = data.aws_ami.candidate.tags["PackerCleanupRun"] == "true"
      error_message = "AMI must be tagged PackerCleanupRun=true."
    }
  }
}
```
The bake tags the AMI PackerCleanupRun=true only after the cleanup step; promotion to prod refuses without that tag.

Force user-data to re-run on next boot via SSM — no console:

```bash
aws ssm send-command --document-name AWS-RunShellScript \
  --parameters 'commands=["sudo cloud-init clean --logs && sudo cloud-init init"]'
```
Switch the bake from “run user-data” to a cfn-init-style metadata pull. Move agent installs into Image Builder components — AMI ships ready, user-data only does instance-specific config.
Test in the bake: add a Packer post-processor that launches the AMI in a sandbox subnet with a probe user-data; if probe doesn't run, fail the build.
EventBridge rule on EC2 Instance State-change Notification with state=running and a Lambda that probes IMDS user-data & cloud-init status; emits CW custom metric UserDataExecuted=0/1.
Config rule: ec2-instance-managed-by-systems-manager (catches the wider problem — if your bake breaks SSM agent registration too).
Bake CI uploads a bake-report.json to a central bucket; the AMI promotion job validates the report contains cloud_init_clean: true.
Symptom: the app calls aws sts get-caller-identity from inside the instance and gets the old role FedAdmin — not the expected orders-api-task-role. S3 calls fail with AccessDenied; reads from a different bucket succeed.

| Item | Detail |
|---|---|
| Launch path | ASG → Launch Template (just bumped to v7) |
| Old profile | ip-orders-api-v1 with role FedAdmin (yes, sloppy) |
| New profile | ip-orders-api-v2 with role orders-api-task-role |
| EC2 metadata cache | credentials cached by SDK for ~6 hr |
| # | Hypothesis | Disprove |
|---|---|---|
| H1 | Profile swap not yet applied to running instances | describe-iam-instance-profile-associations |
| H2 | Profile applied, but SDK cached old creds | IMDS shows new role; SDK shows old |
| H3 | Instance manually overrides creds via env | printenv \| grep AWS_ |
| H4 | App container is using a task role from ECS not EC2 | curl $AWS_CONTAINER_CREDENTIALS_RELATIVE_URI |
| H5 | Profile resource policy denies AssumeRole on new role | CloudTrail AssumeRole error |
`curl http://169.254.169.254/latest/meta-data/iam/security-credentials/` returns the role currently associated with the instance profile. If that's wrong, IMDS hasn't flipped yet.

```bash
# 1. What does the API say is associated?
aws ec2 describe-iam-instance-profile-associations \
  --filters Name=instance-id,Values=i-0xx \
  --query 'IamInstanceProfileAssociations[].{S:State,Arn:IamInstanceProfile.Arn}'

# 2. What does IMDS say?
TK=$(curl -s -X PUT http://169.254.169.254/latest/api/token \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 60")
curl -s -H "X-aws-ec2-metadata-token: $TK" \
  http://169.254.169.254/latest/meta-data/iam/security-credentials/

# 3. SDK says what?
aws sts get-caller-identity
```
```bash
# 4. Force-rotate by re-associating profile
ASSOC=$(aws ec2 describe-iam-instance-profile-associations \
  --filters Name=instance-id,Values=i-0xx \
  --query 'IamInstanceProfileAssociations[0].AssociationId' --output text)
aws ec2 replace-iam-instance-profile-association \
  --association-id $ASSOC \
  --iam-instance-profile Name=ip-orders-api-v2

# 5. Restart the app or SSM agent
sudo systemctl restart orders-api
sudo systemctl restart amazon-ssm-agent

# 6. Confirm
sleep 30 && aws sts get-caller-identity
```
If the app still sees old creds after the flip, check AWS_EC2_METADATA_DISABLED=false and clear the SDK's in-memory cache (boto3: session.get_credentials().refresh()).

Root cause: the LT bump changed aws_iam_instance_profile from ip-orders-api-v1 to ip-orders-api-v2 on the launch template; the team ran terraform apply and assumed the fleet refreshed — but the ASG only refreshes on instance-refresh or scale events. Immediate fix: replace-iam-instance-profile-association across the fleet.

```hcl
resource "aws_launch_template" "orders_api" {
  name_prefix = "orders-api-"

  iam_instance_profile {
    name = aws_iam_instance_profile.orders_v2.name
  }

  user_data = base64encode(templatefile("ud.sh.tftpl", {}))

  metadata_options {
    http_tokens                 = "required"
    http_put_response_hop_limit = 2
  }

  tag_specifications {
    resource_type = "instance"
    tags          = local.tags
  }
}

resource "aws_autoscaling_group" "orders_api" {
  # ...
  launch_template {
    id      = aws_launch_template.orders_api.id
    version = "$Latest"
  }

  instance_refresh {
    strategy = "Rolling"
    preferences { min_healthy_percentage = 90 }
    triggers = ["launch_template"] # <-- key
  }
}
```
triggers = ["launch_template"] matters"launch_template" means any LT version bump (incl. instance profile) auto-rolls the fleet.min_healthy_percentage = 90, the rollout is safe.force_delete; some shops also gate on a checkov rule that requires http_tokens = "required" (IMDSv2) on every launch template — matches the SCP guardrail.Use aws sts get-caller-identity output as the source of truth in app boot logs. If the assumed role doesn't match expected, panic-exit the process — let ASG kill and replace.
One profile, multiple roles? Not possible. Instance profiles take exactly one role. Use STS chain (orders-api-bootstrap-role → orders-api-task-role) for runtime privilege downgrade.
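A sketch of that chain from inside the instance — role names are from this scenario, and it assumes the trust policy on orders-api-task-role allows the bootstrap role to assume it:

```bash
# instance profile carries orders-api-bootstrap-role; hop to the task role
CREDS=$(aws sts assume-role \
  --role-arn arn:aws:iam::666666666666:role/orders-api-task-role \
  --role-session-name runtime-downgrade \
  --query 'Credentials.[AccessKeyId,SecretAccessKey,SessionToken]' \
  --output text)
read -r AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_SESSION_TOKEN <<<"$CREDS"
export AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_SESSION_TOKEN

aws sts get-caller-identity   # now reports the task role
```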
Set SDK AWS_METADATA_SERVICE_TIMEOUT=2 + AWS_METADATA_SERVICE_NUM_ATTEMPTS=3 so credential rotation issues fail loudly, not silently.
Synthetic canary calls get-caller-identity every minute, emits a CW metric InstanceRoleId. Alarm if it diverges from expected for > 10 min.
EventBridge rule on AssociateIamInstanceProfile + ReplaceIamInstanceProfileAssociation → Slack #iam-changes.
Config rule: ec2-instance-profile-attached, plus a custom rule that asserts the profile name matches expected per environment tag.
Symptom: Client.InvalidAMIID.NotFound — "An error occurred (InvalidAMIID.NotFound) when calling the RunInstances operation."

| Item | Value |
|---|---|
| Source acct | gc-tools-cicd (1212...) |
| Spoke acct | gc-prod-app (6666...) |
| AMI ID | ami-0abc123def456 |
| Region scope | us-east-1 only (no copy to eu-west-1) |
| # | Hypothesis | Disprove |
|---|---|---|
| H1 | AMI deregistered in source acct | aws ec2 describe-images --owners 1212 --image-ids ami-0xx |
| H2 | Launch permission revoked for spoke | describe-image-attribute --attribute launchPermission |
| H3 | Wrong region (LT in eu-west-1 referencing us-east-1 AMI) | region in LT vs ASG |
| H4 | Encrypted snapshot share missing (AMI is encrypted) | describe-snapshot-attribute |
Check for DeregisterImage in the source account to disambiguate fast.

```bash
# 1. Does the AMI exist for the source owner?
aws ec2 describe-images --owners 121212121212 \
  --image-ids ami-0abc123def456 \
  --query 'Images[].{ID:ImageId,State:State,Name:Name}' \
  --output table

# 2. Was it deregistered?
aws --profile gc-tools cloudtrail lookup-events \
  --lookup-attributes AttributeKey=ResourceName,AttributeValue=ami-0abc123def456 \
  --max-results 5 --query 'Events[].{T:EventTime,N:EventName,U:Username}'
```
```bash
# 3. Resolve via SSM parameter (what should happen)
aws ssm get-parameter --name /gc/prod/ami/orders-api \
  --query Parameter.Value --output text

# 4. Check launch permission
aws --profile gc-tools ec2 describe-image-attribute \
  --image-id ami-0abc123def456 --attribute launchPermission

# 5. Check snapshot share for encrypted AMI
aws --profile gc-tools ec2 describe-snapshot-attribute \
  --snapshot-id snap-0xx --attribute createVolumePermission
```
Root cause: the AMI was deregistered in the source account while the spoke LT still referenced the raw ID — hence InvalidAMIID.NotFound. Fix: reference an SSM parameter, never a hard-coded AMI ID.

```hcl
data "aws_ssm_parameter" "orders_ami" {
  name = "/gc/prod/ami/orders-api"
}

resource "aws_launch_template" "orders_api" {
  image_id = data.aws_ssm_parameter.orders_ami.value
  # ...
}

# in tools-cicd: write parameter on every promotion
resource "aws_ssm_parameter" "prod_ami" {
  name      = "/gc/prod/ami/orders-api"
  type      = "String"
  data_type = "aws:ec2:image"
  value     = aws_ami_copy.candidate.id
  overwrite = true
}
```
Resolve through the SSM parameter everywhere (handy even as an aws CLI alias). data_type = "aws:ec2:image" makes SSM validate the AMI ID format and existence at write-time — you can't accidentally write a typo. EC2 LaunchTemplate also accepts resolve:ssm:/gc/prod/ami/orders-api directly in image_id — no data source needed.
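For example (a sketch; the LT ID is illustrative), a new LT version can reference the parameter instead of an AMI ID:

```bash
# EC2 resolves the parameter to the current AMI at launch time
aws ec2 create-launch-template-version \
  --launch-template-id lt-0xx --source-version '$Latest' \
  --launch-template-data \
  '{"ImageId":"resolve:ssm:/gc/prod/ami/orders-api"}'
```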
Keep the prior AMI: bake step writes /gc/prod/ami/orders-api/previous. Roll-back is one parameter version flip + ASG instance-refresh.
For DR region, mirror the parameter via Lambda in tools-cicd that runs on parameter change and copies to eu-west-1 with the eu-west-1 AMI ID.
EventBridge on DeregisterImage; if the deregistered AMI is referenced by any LT (search via Config), page the team.
Bake pipeline keeps last 5 AMIs plus any AMI referenced by a non-deleted LT (cross-account introspection).
ASG instance-refresh with auto-rollback on health failure: rolling out a bad AMI auto-reverts.
| Item | Detail |
|---|---|
| Subnet | subnet-dmz-use1a (10.20.0.0/24) |
| Auto-assign IP | was set; SCP recently flipped it off org-wide |
| EIP allocation | requested in user-data via aws ec2 associate-address |
| Instance role | lacks ec2:AssociateAddress |
| # | Hypothesis | Disprove |
|---|---|---|
| H1 | Subnet auto-assign disabled (SCP) | describe-subnets MapPublicIpOnLaunch |
| H2 | EIP not associated — user-data role missing perm | cloud-init log + IAM SimulatePrincipalPolicy |
| H3 | EIP exhausted — account quota | describe-account-attributes |
| H4 | EIP allocated in different region | describe-addresses --region eu-west-1 |
Tip: `aws iam simulate-principal-policy --policy-source-arn <role> --action-names ec2:AssociateAddress --resource-arns '*'` proves the permission without running anything.

```bash
# 1. Subnet flag
aws ec2 describe-subnets --subnet-ids subnet-dmz-use1a \
  --query 'Subnets[].{Auto:MapPublicIpOnLaunch,IPs:AvailableIpAddressCount}'

# 2. Instance state
aws ec2 describe-instances --instance-ids i-0xx \
  --query 'Reservations[].Instances[].{Pub:PublicIpAddress,Priv:PrivateIpAddress}'

# 3. EIP available?
aws ec2 describe-addresses \
  --query 'Addresses[?AssociationId==`null`].PublicIp'
```
```bash
# 4. Did user-data fail silently?
sudo cat /var/log/cloud-init-output.log | grep -i associate

# 5. IAM perm proof
aws iam simulate-principal-policy \
  --policy-source-arn arn:aws:iam::666...:role/bastion-role \
  --action-names ec2:AssociateAddress \
  --resource-arns arn:aws:ec2:us-east-1:666...:elastic-ip/eipalloc-0xx

# 6. Manually associate (fix attempt)
aws ec2 associate-address \
  --instance-id i-0xx --allocation-id eipalloc-0xx
```
Root cause chain:

- The SCP NoAutoPublicIp denies RunInstances with AssociatePublicIpAddress=true on ENIs, so subnet auto-assign no longer applies.
- User-data falls back to associate-address, but the bastion role only had ec2:DescribeAddresses, not ec2:AssociateAddress.
- The user-data script didn't fail the boot (no set -e); the instance reached running with no public IP and no alarm.
- Note associate-address requires permission on the EIP allocation and on the instance ENI — forgetting the ENI ARN is a frequent IAM cause.

```hcl
resource "aws_iam_role_policy" "bastion_eip" {
  role = aws_iam_role.bastion.id
  policy = jsonencode({
    Version = "2012-10-17",
    Statement = [{
      Effect = "Allow",
      Action = [
        "ec2:AssociateAddress", "ec2:DisassociateAddress",
        "ec2:DescribeAddresses",
      ],
      Resource = "*",
      Condition = {
        StringEquals = { "aws:ResourceTag/Role" = "bastion" }
      }
    }]
  })
}
```
```hcl
# user-data hardening
provisioner "file" {
  content = <<-EOT
    #!/usr/bin/env bash
    set -euo pipefail
    TOKEN=$(curl -s -X PUT \
      http://169.254.169.254/latest/api/token \
      -H "X-aws-ec2-metadata-token-ttl-seconds: 60")
    INST=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
      http://169.254.169.254/latest/meta-data/instance-id)
    aws ec2 associate-address \
      --instance-id $INST --allocation-id ${var.eip_alloc_id}
  EOT
}
```
Harden the script with `set -euo pipefail` and probe at the end — `aws ec2 describe-instances --instance-ids $INST --query 'Reservations[].Instances[].PublicIpAddress'` — if blank, exit 1 → ASG kills the instance.

For bastions, prefer SSM Session Manager with port-forwarding — no public IP, no SSH key, fully audited.
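A sketch of the Session Manager port-forward (the local port is arbitrary):

```bash
# forward local 9022 to the bastion's sshd over SSM -- no public IP needed
aws ssm start-session --target i-0xx \
  --document-name AWS-StartPortForwardingSession \
  --parameters '{"portNumber":["22"],"localPortNumber":["9022"]}'

# in another shell
ssh -p 9022 ec2-user@localhost
```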
Tag the EIP with Role=bastion + InstanceTag=bastion-prod. Use IAM aws:ResourceTag condition to scope ec2:AssociateAddress to only EIPs you own.
Avoid auto-assign public IP at the subnet level for any production tier — it's implicit and easy to leak. Always EIP+explicit assoc.
EventBridge on EC2 Instance State-change Notification · running + Lambda asserts PublicIpAddress != null for tagged bastions.
Synthetic canary: every 1 min, attempt nc -vz from external runner to bastion EIP:22; alarm on failure.
Config rule: elastic-ip-required-tags + custom rule that flags any unassociated EIP > 1 day old (cost).
Symptom: an instance stop/start moved the private IP (10.20.10.99 → 10.20.10.121). orders-api.gcaws.internal still resolves to the old IP for ~10 minutes; clients get connection refused.

| Item | Detail |
|---|---|
| DNS | R53 PHZ gcaws.internal (associated to prod-app VPC) |
| Record | A orders-api → literal IP, TTL 60 |
| Update path | manual today; nobody updated the record |
| App | Java app, DNS cached forever (default sec.policy) |
Java's DNS cache is controlled by networkaddress.cache.ttl; with the default security policy the JVM caches successful lookups until process restart. Add -Dsun.net.inetaddr.ttl=60.

| # | Hypothesis | Disprove |
|---|---|---|
| H1 | PHZ record stale | list-resource-record-sets → compare to instance IP |
| H2 | Client DNS cache (Java) | jcmd <pid> VM.system_properties \| grep ttl |
| H3 | Connection pool pinned to old socket | app metric / process restart fixes |
| H4 | NLB cross-zone disabled, target re-registration delayed | describe-target-health |
`getent hosts orders-api.gcaws.internal` on the host queries the resolver directly. If it's right but the app sees the old IP, it's the JVM/SDK cache.

```bash
# 1. Current PHZ record
aws route53 list-resource-record-sets \
  --hosted-zone-id Z0XXX \
  --query "ResourceRecordSets[?Name=='orders-api.gcaws.internal.']"

# 2. Current instance IP
aws ec2 describe-instances --instance-ids i-0xx \
  --query 'Reservations[].Instances[].PrivateIpAddress'

# 3. From inside the host
getent hosts orders-api.gcaws.internal
dig +short orders-api.gcaws.internal
```
```bash
# 4. Update the record (immediate fix)
aws route53 change-resource-record-sets \
  --hosted-zone-id Z0XXX --change-batch file://upsert.json

# 5. Force JVM to re-resolve (cheeky)
sudo systemctl restart orders-api   # cleanest
# or via JMX: jcmd <pid> VM.system_properties | grep -i ttl

# 6. Confirm
ss -tnp | grep orders-api   # new sockets to right IP
```
resource "aws_lb" "orders" { internal=true; load_balancer_type="application"; ... } resource "aws_lb_target_group" "orders" { ... } resource "aws_route53_record" "orders" { zone_id = data.aws_route53_zone.gcaws.zone_id name = "orders-api" type = "A" alias { name = aws_lb.orders.dns_name; zone_id = aws_lb.orders.zone_id; evaluate_target_health = true } }
resource "aws_cloudwatch_event_rule" "ec2_state" { event_pattern = jsonencode({ source = ["aws.ec2"], detail-type = ["EC2 Instance State-change Notification"], detail = { state=["running"] } }) } resource "aws_lambda_function" "phz_updater" { ... } # Lambda reads instance tag DnsName, upserts PHZ record
JVM DNS cache fix without app restart: java.security.Security.setProperty("networkaddress.cache.ttl","60") at boot, or env-level JAVA_OPTS=-Dsun.net.inetaddr.ttl=60.
Avoid stop/start on prod EC2 entirely — replace the instance via ASG instance-refresh. Cattle, not pets.
Need a stable IP without LB? Attach a secondary ENI you provision separately. ENI persists; primary IP is on the ENI; the ENI moves with the instance.
Synthetic canary on every named PHZ entry — periodically validates DNS-vs-target IP. Alarm on divergence > 5 min.
Config rule: route53-records-only-pointing-to-running-resources (custom).
SCP doesn't directly help. Lint rule in Terraform: forbid aws_route53_record with type=A and records=[] — force ALB-alias.
Symptom: InsufficientInstanceCapacity — "We currently do not have sufficient c6i.4xlarge capacity in the AZ you requested (us-east-1a)."

| Item | Detail |
|---|---|
| ASG AZs | us-east-1a only (legacy) |
| Instance type | c6i.4xlarge only |
| Capacity reservation | none |
| SCP region lock | us-east-1, eu-west-1 |
| # | Hypothesis | Disprove |
|---|---|---|
| H1 | Genuine AZ capacity shortage at peak | StateReason + EventBridge ASG events |
| H2 | Account-level on-demand vCPU quota hit | Service Quotas: L-1216C47A |
| H3 | Subnet IP exhausted (looks similar) | describe-subnets AvailableIpAddressCount |
| H4 | SCP denies new types beyond approved list | simulate run-instances |
ICE is usually AZ-local — probe the other AZs; capacity often exists one AZ over.

```bash
# 1. ASG scaling activity
aws autoscaling describe-scaling-activities \
  --auto-scaling-group-name orders-asg --max-records 10 \
  --query 'Activities[].{T:StartTime,S:StatusCode,M:StatusMessage}'

# 2. Quota
aws service-quotas get-service-quota \
  --service-code ec2 --quota-code L-1216C47A

# 3. Subnet IPs
aws ec2 describe-subnets --subnet-ids subnet-priv-use1a \
  --query 'Subnets[].AvailableIpAddressCount'
```
```bash
# 4. Probe other AZs (dry-run)
for az in us-east-1a us-east-1b us-east-1c; do
  echo $az
  aws ec2 run-instances --dry-run --instance-type c6i.4xlarge \
    --image-id ami-0xx --subnet-id $(subnet_for $az) \
    --query Errors --output text 2>&1 | head -2
done

# 5. ODCR check
aws ec2 describe-capacity-reservations \
  --filters Name=state,Values=active \
  --query 'CapacityReservations[].{T:InstanceType,AZ:AvailabilityZone,Avail:AvailableInstanceCount}'
```
Root cause: a genuine AZ-level shortage of c6i.4xlarge at the 9am peak (regional event affecting many tenants) — and the ASG was pinned to one AZ and one instance type.

```hcl
resource "aws_autoscaling_group" "orders" {
  vpc_zone_identifier = local.private_subnets_3az
  min_size            = 4
  desired_capacity    = 8
  max_size            = 40

  mixed_instances_policy {
    launch_template {
      launch_template_specification {
        launch_template_id = aws_launch_template.orders.id
        version            = "$Latest"
      }
      override { instance_type = "c6i.4xlarge" }
      override { instance_type = "c6a.4xlarge" }
      override { instance_type = "c5.4xlarge" }
      override { instance_type = "m6i.4xlarge" }
    }
    instances_distribution {
      on_demand_base_capacity                  = 4
      on_demand_percentage_above_base_capacity = 50
      spot_allocation_strategy                 = "capacity-optimized"
    }
  }
}

# on the launch template: accept open capacity reservations
resource "aws_launch_template" "orders" {
  # ...
  capacity_reservation_specification {
    capacity_reservation_preference = "open"
  }
}
```
resource "aws_ec2_capacity_reservation" "orders_floor" { instance_type = "c6i.4xlarge" instance_platform = "Linux/UNIX" availability_zone = "us-east-1a" instance_count = 4 end_date_type = "unlimited" instance_match_criteria = "open" tags = local.tags }
Use attribute-based instance type selection (InstanceRequirements) instead of explicit type list — AWS picks any matching family; broadest capacity pool.
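To preview what a requirements-based selection would match, a sketch (the vCPU/memory shape is illustrative):

```bash
# list every instance type matching the requirements shape
aws ec2 get-instance-types-from-instance-requirements \
  --architecture-types x86_64 --virtualization-types hvm \
  --instance-requirements 'VCpuCount={Min=16,Max=16},MemoryMiB={Min=32768}'
```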
Spot Placement Score API tells you which region/AZ has best spot capacity right now for your shape — pre-flight checker for big batch jobs.
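A quick pre-flight sketch (instance types from this scenario; regions illustrative):

```bash
# 1-10 score per region/AZ for this capacity shape, right now
aws ec2 get-spot-placement-scores \
  --instance-types c6i.4xlarge c6a.4xlarge \
  --target-capacity 8 \
  --region-names us-east-1 us-east-2 \
  --single-availability-zone
```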
Reserve 4 ODCR seats. ASG burst beyond into on-demand, then spot. ICE in spot tier doesn't kill SLO because the floor is reserved.
CW alarm on ASG metric GroupPendingInstances > 0 for 5 min → PagerDuty.
Config + custom rule: ASGs in prod must specify mixed_instances_policy with at least 3 overrides.
Annual capacity review in Q4 — quotas raised, ODCR sized to next year traffic forecast.
Symptom: RunInstances denied for a missing CostCenter tag — UnauthorizedOperation: ... with an explicit deny in a service control policy.

| Item | Detail |
|---|---|
| SCP | RequireTags on Workloads OU |
| Required tags | CostCenter, Owner, Env |
| Tag enforcement | at RunInstances via aws:RequestTag/CostCenter |
| Bypass | FedAdmin role does not bypass SCP |
| # | Hypothesis | Disprove |
|---|---|---|
| H1 | Tag missing entirely | compare Terraform plan to SCP |
| H2 | Tag on instance but not on volume | review tag_specifications |
| H3 | Case mismatch | SCP aws:RequestTag/CostCenter is case-sensitive |
| H4 | Tag value not in allowed set (tag policy) | describe-organizations-policies |
| H5 | SCP applies to OU; account moved recently | list-parents + list-policies-for-target |
Note: `aws iam simulate-principal-policy` doesn't evaluate SCPs. Use IAM Access Analyzer policy validation + AWS Organizations `list-policies-for-target` to spot which SCPs are in scope before debugging.

```bash
# 1. Show the failing API call from CloudTrail
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=EventName,AttributeValue=RunInstances \
  --max-results 1 --query 'Events[].CloudTrailEvent' \
  | jq '.[0] | fromjson | {errorCode, errorMessage}'

# 2. Pull SCPs in scope
aws --profile gc-mgmt organizations list-policies-for-target \
  --target-id ou-xxx --filter SERVICE_CONTROL_POLICY
```
```bash
# 3. Validate Terraform tag plan
terraform show -json tfplan | jq '.. | objects
  | select(.tag_specifications) | .tag_specifications'

# 4. Test directly
aws ec2 run-instances --image-id ami-0xx --instance-type t3.micro \
  --tag-specifications 'ResourceType=instance,Tags=[{Key=CostCenter,Value=ENG-100},{Key=Owner,Value=alice},{Key=Env,Value=dev}]' \
    'ResourceType=volume,Tags=[{Key=CostCenter,Value=ENG-100}]' \
  --dry-run
```
Root cause: default_tags set CostCenter on the provider, but the older AWS provider didn't propagate it to the volume on RunInstances (only to the instance resource). The RequireTags SCP evaluates each TagSpecification separately — the volume tag was missing → explicit deny. Fix: upgrade the provider so default_tags reach all sub-resources, or set tag_specifications explicitly per resource type.

```hcl
resource "aws_launch_template" "orders" {
  # ...
  tag_specifications {
    resource_type = "instance"
    tags          = local.tags
  }
  tag_specifications {
    resource_type = "volume"
    tags          = local.tags
  }
  tag_specifications {
    resource_type = "network-interface"
    tags          = local.tags
  }
}

provider "aws" {
  region = "us-east-1"
  default_tags { tags = local.tags }
}

locals {
  tags = {
    CostCenter = "ENG-100"
    Owner      = "orders-team"
    Env        = "prod"
  }
}
```
```hcl
# tflint plugin: aws-ruleset
plugin "aws" {
  enabled = true
  version = "0.30.0"
  source  = "github.com/terraform-linters/tflint-ruleset-aws"
}

rule "aws_resource_missing_tags" {
  enabled = true
  tags    = ["CostCenter", "Owner", "Env"]
}
```
Use tag policies (separate from SCP) to enforce case: tag_key: { @@assign: "CostCenter" }. Org standardizes “CostCenter” (not costcenter).
SCP message is generic. Add a Lambda that listens on UnauthorizedOperation CloudTrail events, parses the SCP, posts the missing-tag hint to the developer.
Pre-merge: terraform plan + parse JSON, check every tag_specifications contains required tags. Fail PR with the missing tag named.
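One possible shape for that pre-merge check, as a sketch — it assumes tags land in `change.after.tags`, which varies by resource type, so adapt to your standard:

```bash
# fail the PR if any instance/LT in the plan lacks a required tag
terraform show -json tfplan | jq -e '
  [ .resource_changes[]
    | select(.type == "aws_instance" or .type == "aws_launch_template")
    | .change.after.tags // {}
    | has("CostCenter") and has("Owner") and has("Env") ]
  | all' >/dev/null || { echo "missing required tags"; exit 1; }
```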
Config rule: required-tags across resource types; non-compliance → auto-tag (where allowed) or remediation Lambda.
Service Catalog product abstracts tag handling so app teams can't forget.
Per-account README pre-commit hook: requires CostCenter in locals.tf.
Symptom: the launch template still has HttpTokens=optional (allows IMDSv1), and a new SCP denies RunInstances when ec2:MetadataHttpTokens != required.

| # | Hypothesis | Disprove |
|---|---|---|
| H1 | LT has http_tokens=optional | describe-launch-template-versions |
| H2 | App SDK is too old for IMDSv2 | SDK version table check |
| H3 | Container hop-limit not 2 (ECS/k8s) | http_put_response_hop_limit |
| H4 | SCP not applied to this account — some other deny | list-policies-for-target |
```bash
# 1. LT current setting
aws ec2 describe-launch-template-versions \
  --launch-template-id lt-0xx --versions '$Latest' \
  --query 'LaunchTemplateVersions[].LaunchTemplateData.MetadataOptions'

# 2. Per-instance audit
aws ec2 describe-instances --instance-ids i-0xx \
  --query 'Reservations[].Instances[].MetadataOptions'

# 3. Find IMDSv1 callers across fleet
aws cloudwatch get-metric-statistics --namespace AWS/EC2 \
  --metric-name MetadataNoToken --dimensions Name=InstanceId,Value=i-0xx \
  --start-time -1h --end-time now --period 60 --statistics Sum
```
```bash
# 4. Update LT (new version)
aws ec2 create-launch-template-version \
  --launch-template-id lt-0xx \
  --source-version '$Latest' \
  --launch-template-data '{
    "MetadataOptions": {
      "HttpTokens": "required",
      "HttpEndpoint": "enabled",
      "HttpPutResponseHopLimit": 2
    }
  }'

# 5. Live-modify existing instances
aws ec2 modify-instance-metadata-options --instance-id i-0xx \
  --http-tokens required --http-put-response-hop-limit 2
```
Root cause: the LT predates the guardrail — http_tokens defaulted to optional (IMDSv1+v2). Gotcha: flipping http_tokens=required immediately breaks any app still using IMDSv1. Run the audit first; flip after 7 days of zero MetadataNoToken.

```hcl
resource "aws_launch_template" "orders" {
  metadata_options {
    http_endpoint               = "enabled"
    http_tokens                 = "required" # IMDSv2
    http_put_response_hop_limit = 2
    instance_metadata_tags      = "enabled"
  }
}
```
```yaml
# .checkov.yml
check:
  - CKV_AWS_79   # EC2 should require IMDSv2
  - CKV_AWS_341  # LT hop_limit <= 2
```
Run a fleet-wide modify-instance-metadata-options in a maintenance window via SSM Automation document. No restart needed.
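A sketch of that sweep (the describe-instances filter narrows to instances still on optional tokens):

```bash
for id in $(aws ec2 describe-instances \
    --filters Name=metadata-options.http-tokens,Values=optional \
    --query 'Reservations[].Instances[].InstanceId' --output text); do
  aws ec2 modify-instance-metadata-options --instance-id "$id" \
    --http-tokens required --http-put-response-hop-limit 2
done
```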
For containerized workloads with hop limit issues: set hop limit to 1 if IMDS shouldn't reach pods (most secure), or 2 if needed for ECS task role pickup.
Use instance metadata tags (instance_metadata_tags=enabled) so apps can read tags without IAM perms — great for cost-center decoration in logs.
CW dashboard tracks org-wide MetadataNoToken sum; alarm if any account has >0 over rolling 7 days.
Config rule ec2-imdsv2-check flags non-compliant instances/LTs.
SCP also denies ec2:ModifyInstanceMetadataOptions with HttpTokens=optional in request — can't weaken once enforced.
Symptom: during a DR restore, aws ec2 attach-volume fails with InvalidVolume.ZoneMismatch or just times out. The data lives in gc-prod-data (us-east-1); the restore target is eu-west-1.

| # | Hypothesis | Disprove |
|---|---|---|
| H1 | Volume in different AZ than instance | describe-volumes AZ vs instance AZ |
| H2 | KMS key region mismatch | describe-volume KmsKeyId region |
| H3 | Snapshot not yet completed | describe-snapshots Progress |
| H4 | Snapshot not shared cross-acct | describe-snapshot-attribute create-volume-permission |
```bash
# 1. Volume + instance AZ
aws --region eu-west-1 ec2 describe-volumes \
  --volume-ids vol-0xx --query 'Volumes[].{AZ:AvailabilityZone,KMS:KmsKeyId}'
aws --region eu-west-1 ec2 describe-instances \
  --instance-ids i-0xx --query 'Reservations[].Instances[].Placement.AvailabilityZone'

# 2. Re-create volume in correct AZ
aws --region eu-west-1 ec2 create-volume \
  --snapshot-id snap-dst --availability-zone eu-west-1c \
  --volume-type gp3 --encrypted --kms-key-id alias/eu-data
```
```bash
# 3. Snapshot progress
aws --region eu-west-1 ec2 describe-snapshots \
  --snapshot-ids snap-dst --query 'Snapshots[].{P:Progress,S:State,K:KmsKeyId}'

# 4. Cross-account share check
aws --profile gc-prod-data ec2 describe-snapshot-attribute \
  --snapshot-id snap-src --attribute createVolumePermission

# 5. Attach
aws --region eu-west-1 ec2 attach-volume \
  --volume-id vol-new --instance-id i-0xx --device /dev/sdf
```
Root cause: the volume was created in eu-west-1a (the default) while the instance launched in eu-west-1c (chosen for capacity) → AZ mismatch.

```hcl
# SSM Automation document (Terraform-managed)
resource "aws_ssm_document" "dr_restore_volume" {
  name          = "GC-DR-RestoreEBS"
  document_type = "Automation"
  content       = file("docs/dr-restore-ebs.yaml")
}
```
```yaml
# dr-restore-ebs.yaml (excerpt)
parameters:
  SnapshotId: { type: String }
  TargetAz:   { type: String, default: "eu-west-1c" }
  KmsKeyId:   { type: String, default: "alias/eu-data" }
mainSteps:
  - name: copy
    action: aws:executeAwsApi
    inputs: { Service: ec2, Api: CopySnapshot, ... }
  - name: wait
    action: aws:waitForAwsResourceProperty
  - name: create_volume
    action: aws:executeAwsApi
    inputs: { Api: CreateVolume, AvailabilityZone: "{{ TargetAz }}" }
```
```hcl
# DLM lifecycle policy
resource "aws_dlm_lifecycle_policy" "orders_data" {
  description = "orders-data daily snap + DR copy"
  state       = "ENABLED"

  policy_details {
    schedule {
      cross_region_copy_rule {
        target    = "eu-west-1"
        encrypted = true
        cmk_arn   = aws_kms_alias.eu_data.target_key_arn
        retain_rule {
          interval      = 30
          interval_unit = "DAYS"
        }
      }
    }
  }
}
```
Tag the snapshot with SourceVolumeAz; the DR doc reads it and tries to match the target AZ first.
Use EBS Multi-Attach (io1/io2) only with apps that support distributed locking; otherwise corruption.
Convert old gp2 to gp3 for free baseline IOPS bump — one modify-volume call, no downtime.
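The conversion itself is one call (volume ID illustrative):

```bash
aws ec2 modify-volume --volume-id vol-0xx --volume-type gp3
# track progress (optimizing -> completed)
aws ec2 describe-volumes-modifications --volume-ids vol-0xx
```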
Quarterly DR game-day uses the SSM Automation doc end-to-end. Failure auto-creates Jira ticket.
Config rule: ebs-snapshot-public-restorable-check + custom rule for cross-region copy presence.
Backup & DR tag enforced via SCP — instances missing BackupPolicy tag get denied at launch.
Context: -Xmx not set, so the JVM defaults to 25% of instance memory; the LT has ebs-optimized=false; the account has ebs-encryption-by-default=true.

| # | Hypothesis | Disprove |
|---|---|---|
| H1 | Root volume too small / full | df -h / |
| H2 | JVM heap default too small/large | jcmd VM.flags |
| H3 | EBS burst credits exhausted (gp2) | CW BurstBalance metric |
| H4 | OOM-killer killed app, not Java OOM | dmesg \| grep -i oom |
Gotcha: a kernel OOM kill produces no Java OutOfMemoryError; the kernel logs the oom-kill. Always check dmesg + journalctl -k.

```bash
# 1. Disk + memory
df -h /
free -m
swapon --show

# 2. JVM flags
sudo -u app jps -l
sudo -u app jcmd <pid> VM.flags | grep -E 'MaxHeap|MinHeap|UseG1'
sudo -u app jcmd <pid> VM.system_properties | grep mx
```
```bash
# 3. EBS burst (gp2)
aws cloudwatch get-metric-statistics --namespace AWS/EBS \
  --metric-name BurstBalance --dimensions Name=VolumeId,Value=vol-0xx \
  --start-time -1h --end-time now --period 60 --statistics Minimum

# 4. Linux OOM-kill
sudo dmesg | grep -i 'killed process'
sudo journalctl -k --since "1 hour ago" | grep -i oom
```
Root cause: jvm.log was redirected to /var/log, and journalctl plus log-rotation lag filled /var quickly. Pin a separate volume for /var/log.

```hcl
resource "aws_launch_template" "orders" {
  ebs_optimized = true

  block_device_mappings {
    device_name = "/dev/xvda"
    ebs {
      volume_size = 30
      volume_type = "gp3"
      iops        = 3000
      throughput  = 125
      encrypted   = true
    }
  }

  block_device_mappings {
    device_name = "/dev/sdb" # /var/log
    ebs {
      volume_size = 20
      volume_type = "gp3"
      encrypted   = true
    }
  }

  user_data = base64encode(file("ud.sh")) # mounts + JVM tuning
}
```
```bash
# ud.sh excerpt
mkfs.xfs /dev/nvme1n1
mount /dev/nvme1n1 /var/log
echo "/dev/nvme1n1 /var/log xfs defaults,nofail 0 2" >> /etc/fstab

# swap
fallocate -l 2G /swapfile && chmod 600 /swapfile
mkswap /swapfile && swapon /swapfile

# JVM
echo 'JAVA_OPTS="-Xms4g -Xmx5g -XX:+UseG1GC -XX:+ExitOnOutOfMemoryError"' \
  >> /etc/orders-api.env
```
Use volume_type=gp3 uniformly; gp2's burst credits are a frequent source of mysterious p99 spikes.

For containerized apps: ditch swap; cap memory at the cgroup level. Java 11+ honors cgroup memory automatically.
Use tmpfs for /tmp with size cap — prevents tmp file bombs from filling root.
Pre-warm gp3: 3000 IOPS / 125 MB/s baseline is free; bump to 5000 IOPS at $0.005 per IOP-hour. Cheap p99 win.
CW agent installs diskspace+swap custom metrics; alarm on FilesystemUsedPct > 80 for / and /var/log.
SSM Compliance state pack — instance must report log-rotate active.
Pre-deploy unit test: spin instance with target user-data in sandbox; run app + chaos load; assert no OOM in first 5 min.
| # | Hypothesis | Disprove |
|---|---|---|
| H1 | No ASG lifecycle hook for terminate | describe-lifecycle-hooks |
| H2 | Hook exists but no handler subscribed | EventBridge target wired? |
| H3 | ALB deregistration_delay too long, instance gone before drain | TG attribute |
| H4 | Health check passes but pool keeps dead conns | Keep-alive timeout vs idle |
```bash
# 1. Lifecycle hooks on the ASG
aws autoscaling describe-lifecycle-hooks --auto-scaling-group-name orders-asg

# 2. ALB TG drain
aws elbv2 describe-target-group-attributes \
  --target-group-arn arn:...:targetgroup/orders/...
# expect deregistration_delay.timeout_seconds <= 60 for most apps

# 3. Spot interruption history
aws ec2 describe-spot-instance-requests --filters Name=state,Values=closed
```
```bash
# 4. Listen for interruption from inside instance
TOKEN=$(curl -s -X PUT http://169.254.169.254/latest/api/token \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 60")
curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
  http://169.254.169.254/latest/meta-data/spot/instance-action

# 5. Run NTH locally
sudo systemctl status aws-node-termination-handler
```
Gotcha: the EC2 Spot Instance Interruption Warning fires at T-2min, but the instance dies at T-0 regardless of ASG hook delay — drain what you can inside that window.

```hcl
resource "aws_autoscaling_lifecycle_hook" "terminate" {
  name                    = "orders-terminate"
  autoscaling_group_name  = aws_autoscaling_group.orders.name
  lifecycle_transition    = "autoscaling:EC2_INSTANCE_TERMINATING"
  default_result          = "CONTINUE"
  heartbeat_timeout       = 90
  notification_target_arn = aws_sns_topic.lifecycle.arn
  role_arn                = aws_iam_role.lifecycle.arn
}

resource "aws_lb_target_group" "orders" {
  # ...
  deregistration_delay = 30
}
```
```hcl
# NTH on each instance (DaemonSet for k8s, systemd for plain EC2)
provisioner "file" {
  destination = "/etc/systemd/system/aws-node-termination-handler.service"
  content     = file("nth.service")
}
```
Set connection_termination=true on NLB TGs — existing flows are reset on deregister, faster recovery.
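Setting it via CLI (TG ARN elided as elsewhere in this doc):

```bash
aws elbv2 modify-target-group-attributes \
  --target-group-arn arn:...:targetgroup/orders/... \
  --attributes Key=deregistration_delay.connection_termination.enabled,Value=true
```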
Mix on-demand base + spot for bursty workloads; SLO floor never on spot.
Use capacity-optimized-prioritized spot allocation if order matters; lowers interruption rate.
CW alarm: HTTPCode_ELB_5XX_Count > baseline + 3sigma during spot events.
Chaos game day: trigger fake interruption via describe-spot-fleet-request-history sim; assert 0 5xx.
Spot Placement Score > 7 required by Terraform pre-flight check.
Symptom: new instances launch with stale config — the LT's $Default points to v3 while v5 is latest; the ASG references launch_template.version = "$Default" in Terraform, and a manual set-default-version happened in the console.

| # | Hypothesis | Disprove |
|---|---|---|
| H1 | $Default not bumped to v5 | describe-launch-templates DefaultVersion |
| H2 | ASG uses pinned version, not $Default | describe-auto-scaling-groups LaunchTemplate.Version |
| H3 | Console-edited LT outside Terraform | diff Terraform state |
Prefer $Latest plus ASG instance_refresh.triggers=["launch_template"]: Terraform updates the LT, the ASG auto-rolls.

```bash
# 1. LT versions
aws ec2 describe-launch-template-versions \
  --launch-template-id lt-0xx \
  --query 'LaunchTemplateVersions[].{V:VersionNumber,D:DefaultVersion,I:LaunchTemplateData.ImageId}'

# 2. ASG launch config
aws autoscaling describe-auto-scaling-groups \
  --auto-scaling-group-names orders-asg \
  --query 'AutoScalingGroups[].LaunchTemplate'
```
```bash
# 3. Force ASG to v5 explicit
aws autoscaling update-auto-scaling-group \
  --auto-scaling-group-name orders-asg \
  --launch-template LaunchTemplateId=lt-0xx,Version=5

# 4. Trigger refresh
aws autoscaling start-instance-refresh \
  --auto-scaling-group-name orders-asg \
  --preferences MinHealthyPercentage=90,InstanceWarmup=120
```
Root cause: the ASG used launch_template { version = "$Default" } and someone ran a console set-default-version. Terraform treats $Default as a static string — it doesn't track LT version drift, so manual changes never register as drift.

```hcl
resource "aws_autoscaling_group" "orders" {
  launch_template {
    id      = aws_launch_template.orders.id
    version = aws_launch_template.orders.latest_version # pin explicit
  }

  instance_refresh {
    strategy = "Rolling"
    triggers = ["launch_template"]
    preferences {
      min_healthy_percentage = 90
      instance_warmup        = 120
    }
  }
}
```
The latest_version attribute makes Terraform track every LT bump. Combined with the instance-refresh trigger, every PR rolls the fleet automatically.

```bash
# nightly cron in CI
terraform plan -refresh-only -detailed-exitcode
# exit 2 = drift; raise issue
```
Tag every LT version with PromotedAt; promotion job blocks promotion of versions older than 30 days — forces fresh bakes.
Use checkpoint instance refresh: roll a small percent first, observe metrics, continue.
If you must use $Default, add a Lambda that asserts DefaultVersion == latest_version daily — closes the drift gap.
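The assertion that Lambda needs boils down to one describe call — a sketch, equally runnable as a CLI check (LT ID illustrative):

```bash
read -r DEF LATEST < <(aws ec2 describe-launch-templates \
  --launch-template-ids lt-0xx \
  --query 'LaunchTemplates[0].[DefaultVersionNumber,LatestVersionNumber]' \
  --output text)
[ "$DEF" = "$LATEST" ] || echo "LT drift: default=$DEF latest=$LATEST"
```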
SCP denies ec2:ModifyLaunchTemplate in prod accounts — only CI role can change.
Tags on LT version (BuildSha, BuildAt) so post-mortems can identify which LT version a misbehaving instance came from.
EventBridge rule on ModifyLaunchTemplate outside CI role → alert.
Symptom: user-data calls aws secretsmanager get-secret-value at boot and fails with UnrecognizedClientException: The security token included in the request is invalid.

| # | Hypothesis | Disprove |
|---|---|---|
| H1 | IMDS not yet returning role creds | add wait-for-creds loop, observe |
| H2 | Network not up yet (race with eth0) | cloud-init cloud-init.target ordering |
| H3 | VPCe DNS not yet resolving | getent hosts secretsmanager.us-east-1... |
| H4 | Time skew — SigV4 fails | chronyc sources |
Use `aws sts get-caller-identity` as a probe. Loop until it succeeds; then call Secrets Manager.

```bash
# 1. Confirm the race
sudo grep -E 'UnrecognizedClient|InvalidSignatureException' \
  /var/log/cloud-init-output.log

# 2. Test from instance after boot
for i in 1 2 3; do
  aws sts get-caller-identity || echo retry
  sleep 1
done

# 3. Wait pattern in user-data
until aws sts get-caller-identity >/dev/null 2>&1; do sleep 2; done
SECRET=$(aws secretsmanager get-secret-value --secret-id orders-prod \
  --query SecretString --output text)
```
Root cause: instance-profile credentials appear in IMDS a few seconds after boot; SDK calls that race ahead of them fail with UnrecognizedClient by default.

```bash
# ud.sh template
#!/usr/bin/env bash
set -euo pipefail
TOK=$(curl -s -X PUT http://169.254.169.254/latest/api/token \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 60")
for i in {1..30}; do
  if aws sts get-caller-identity >/dev/null 2>&1; then break; fi
  sleep 2
done
SECRET=$(aws secretsmanager get-secret-value --secret-id ${secret_id} \
  --query SecretString --output text)
```
Ship wait-for-iam.sh as a shared template across all repos — a single source for the wait loop, never re-derived. Alternative pattern: EventBridge on EC2 running → SSM doc → pull secret.
Use SSM Parameter Store for non-secret bootstrap config — same race avoidance, simpler IAM.
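A sketch of that pattern (parameter name illustrative):

```bash
# same wait-for-IAM guard, then a plain parameter read -- no KMS decrypt needed
until aws sts get-caller-identity >/dev/null 2>&1; do sleep 2; done
CONFIG=$(aws ssm get-parameter --name /gc/prod/config/orders-api \
  --query Parameter.Value --output text)
```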
For Windows, the EC2Launch v2 task graph supports dependencies — ensure secret-pull task waits on aws-cli-ready.
Use a CW alarm on user-data failures — metric filter on cloud-init log shipping.
cloud-init unit ordering: After=cloud-init.target + Wants=instance-meta.target.
Bake the wait-for-iam loop into Image Builder component; user-data never repeats it.
Synthetic test: launch test instance every 4h, assert no UnrecognizedClient in logs.
Symptom: resources are tagged Env=Production, but the tag policy wants prod — allowed Env values: [prod, stg, dev].

| # | Hypothesis | Disprove |
|---|---|---|
| H1 | Stale Terraform writes wrong case | grep Production |
| H2 | Tag policy not actually enforced | describe-policy + enforced_for |
| H3 | Auto-tagging Lambda overwrites | CloudTrail TagResource events |
Tag policies merge down the OU tree (@@assign, @@append, @@enforced_for). Always check the effective policy at the OU/account level — not the policy doc.

```bash
# 1. Effective tag policy at account level
aws --profile gc-mgmt organizations describe-effective-policy \
  --policy-type TAG_POLICY --target-id 666666666666

# 2. Find non-compliant resources
aws resourcegroupstaggingapi get-resources \
  --tag-filters Key=Env,Values=Production
# compare to allowed: prod / stg / dev
```
```bash
# 3. Bulk re-tag
aws resourcegroupstaggingapi tag-resources \
  --resource-arn-list arn:aws:ec2:...:instance/i-0xx \
  --tags Env=prod

# 4. Tag policy compliance summary
aws --profile gc-audit config describe-compliance-by-config-rule \
  --config-rule-names required-tags
```
Root cause: the tag policy allows Env values {prod, stg, dev}; Terraform wrote Env=Production. Tag values are case-sensitive, so Production ≠ prod — and without enforced_for the resource is merely flagged non-compliant, not blocked.

```hcl
locals {
  tags = merge({
    Env        = "prod"
    CostCenter = "ENG-100"
    Owner      = "orders-team"
  }, var.extra_tags)
}

provider "aws" {
  default_tags { tags = local.tags }
}
```

Tag policy, strict mode:

```json
{
  "tags": {
    "Env": {
      "tag_key": { "@@assign": "Env" },
      "tag_value": { "@@assign": ["prod", "stg", "dev"] },
      "enforced_for": { "@@assign": ["ec2:instance", "rds:db"] }
    }
  }
}
```
gotcha: enforced_for with a resource-type list converts the tag policy from advisory to enforced — non-compliant tag operations on those types will fail.
Add a Lambda that auto-remediates: on a TagResource event, lowercase the value if it matches the enum.
Use Resource Groups with tag filters as the source of truth for “all prod EC2” — surfaces non-compliant tags fast.
tflint custom rule: deny Env values not in [prod, stg, dev].
Tag policy in enforced_for mode + tflint: catch at PR-time and at API-time.
Quarterly compliance review pulled from the Tag Policy compliance API (sketch below).
One-line Terraform module everyone consumes: module "stdtags".
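A sketch of that quarterly pull, assuming org-level access from gc-mgmt; get-compliance-summary is the Resource Groups Tagging API call behind tag-policy compliance:
# Non-compliance counts grouped by account and region
aws --profile gc-mgmt resourcegroupstaggingapi get-compliance-summary \
  --group-by TARGET_ID REGION \
  --query 'SummaryList[?NonCompliantResources>`0`]' --output table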
Launches fail with InsufficientFreeAddressesInSubnet — the subnet is out of free IPs.
| # | Hypothesis | Disprove |
|---|---|---|
| H1 | Subnet truly full | describe-subnets AvailableIpAddressCount |
| H2 | EKS warm pool grabbing IPs | describe-network-interfaces by Description |
| H3 | Detached ENIs held by Lambda VPC / DLM | describe-network-interfaces Status=available |
| H4 | ECS tasks awaiting ENIs, never deleted | ECS service events |
# 1. IP availability
aws ec2 describe-subnets --filters Name=vpc-id,Values=vpc-0xx \
  --query 'Subnets[].{ID:SubnetId,AZ:AvailabilityZone,Free:AvailableIpAddressCount,CIDR:CidrBlock}' \
  --output table
# 2. Who holds the ENIs?
aws ec2 describe-network-interfaces \
  --filters Name=subnet-id,Values=subnet-priv-use1a \
  --query 'NetworkInterfaces[].{S:Status,D:Description,O:Attachment.InstanceOwnerId}' \
  --output table
# 3. Free orphaned ENIs
aws ec2 delete-network-interface --network-interface-id eni-0xx
# 4. Add secondary CIDR + new subnet
aws ec2 associate-vpc-cidr-block --vpc-id vpc-0xx \
  --cidr-block 100.64.0.0/16
aws ec2 create-subnet --vpc-id vpc-0xx \
  --cidr-block 100.64.10.0/22 --availability-zone us-east-1a
# 5. Switch EKS to prefix delegation (more IPs/instance)
kubectl set env -n kube-system ds aws-node ENABLE_PREFIX_DELEGATION=true
resource "aws_vpc_ipv4_cidr_block_association" "secondary" { vpc_id = aws_vpc.main.id cidr_block = "100.64.0.0/16" } resource "aws_subnet" "private_carrier" { count = 3 vpc_id = aws_vpc.main.id cidr_block = cidrsubnet("100.64.0.0/16", 6, count.index) availability_zone = local.azs[count.index] tags = merge(local.tags, { Tier="private-carrier", KubernetesCarrier="true" }) }
# enable prefix delegation in EKS
resource "aws_eks_addon" "vpc_cni" {
  cluster_name = aws_eks_cluster.main.name
  addon_name   = "vpc-cni"
  configuration_values = jsonencode({
    env = { ENABLE_PREFIX_DELEGATION = "true" }
  })
}
Use VPC IPAM for centralized IP planning — alerts before exhaustion at OU scale.
For Lambda VPC, set EFS_DEPENDENCY_CHECK false + use VPC Lattice to bypass ENIs altogether.
Tag every detached ENI with OrphanCheck=true + Lambda cleans after 1h.
CW alarm on subnet free IPs < 20% (publisher sketch below).
Subnet sizing standard: never /27 in prod for EKS/ECS — minimum /22.
Quarterly IP capacity review per VPC — growth forecast vs IPAM.
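CloudWatch has no native free-IP metric, so the <20% alarm needs a publisher. A cron-style sketch (namespace GC/VPC is an assumption):
#!/usr/bin/env bash
# Publish AvailableIpAddressCount per subnet as a custom metric
set -euo pipefail
aws ec2 describe-subnets --filters Name=vpc-id,Values=vpc-0xx \
  --query 'Subnets[].[SubnetId,AvailableIpAddressCount]' --output text \
| while read -r subnet free; do
    aws cloudwatch put-metric-data --namespace GC/VPC \
      --metric-name FreeIPs --unit Count --value "$free" \
      --dimensions SubnetId="$subnet"
  done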
Instance refresh stalls: terminate fails with OperationNotPermitted: The instance has termination protection. Two separate guards can block it: EC2 DisableApiTermination + ASG protected_from_scale_in.
| # | Hypothesis | Disprove |
|---|---|---|
| H1 | EC2 DisableApiTermination=true | describe-instance-attribute --attribute disableApiTermination |
| H2 | ASG instance protect from scale-in | describe-auto-scaling-instances |
| H3 | Lifecycle hook stuck waiting | describe-lifecycle-hooks |
note: Instance refresh cannot replace an instance with DisableApiTermination; you must clear it on the protected instance OR use --skip-matching if the AMI is identical anyway.
# 1. Check both flags
aws ec2 describe-instance-attribute --instance-id i-0xx \
  --attribute disableApiTermination
aws autoscaling describe-auto-scaling-instances \
  --instance-ids i-0xx \
  --query 'AutoScalingInstances[].ProtectedFromScaleIn'
# 2. Disable EC2 termination protection
aws ec2 modify-instance-attribute --instance-id i-0xx \
  --no-disable-api-termination
# 3. Disable ASG scale-in protect
aws autoscaling set-instance-protection \
  --instance-ids i-0xx \
  --auto-scaling-group-name orders-asg \
  --no-protected-from-scale-in
# 4. Resume refresh
aws autoscaling resume-processes \
  --auto-scaling-group-name orders-asg
tip: During diagnosis, set DisableApiTermination=true on a known-good instance — to ensure the ASG didn't kill it while you debug. Tag it OpsHold=true + run a nightly Lambda that warns and removes the flag after 24h.
# Tag-driven cleanup
resource "aws_lambda_function" "ops_hold_cleanup" {
  function_name = "gc-ops-hold-cleanup"
  ...
}
resource "aws_cloudwatch_event_rule" "daily" {
  schedule_expression = "cron(0 8 * * ? *)"
}
# Lambda body (excerpt) — instances_with_tag / tag / parse_ts / slack are helpers
for inst in instances_with_tag("OpsHold", "true"):
    age = now - parse_ts(tag(inst, "OpsHoldSet"))
    if age > timedelta(hours=24):
        ec2.modify_instance_attribute(
            InstanceId=inst["InstanceId"],
            DisableApiTermination={"Value": False})
        ec2.create_tags(
            Resources=[inst["InstanceId"]],
            Tags=[{"Key": "OpsHold", "Value": "cleared"}])
        slack("cleared OpsHold on " + inst["InstanceId"])
Instance refresh --skip-matching ignores instances already on the right LT version — a workaround when one stuck pet exists.
Use warm pool for fast scale-up; pool members aren't in service so refresh issues isolate.
SSM Automation doc GC-ClearOpsHold — one click clears all flags + tags.
EventBridge on ModifyInstanceAttribute with DisableApiTermination=true → tag instance + Slack.
Pre-deploy gate: describe-auto-scaling-group shows no instance with protection > 0; if any, fail deploy (sketch below).
Runbook: don't use termination protection on cattle. Use ASG protected_from_scale_in for the rare case.
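A sketch of that pre-deploy gate as a CI step (ASG name hardcoded for illustration):
#!/usr/bin/env bash
# Fail the deploy if any instance in the ASG still carries scale-in protection
set -euo pipefail
protected=$(aws autoscaling describe-auto-scaling-groups \
  --auto-scaling-group-names orders-asg \
  --query 'length(AutoScalingGroups[0].Instances[?ProtectedFromScaleIn==`true`])')
if (( protected > 0 )); then
  echo "deploy blocked: $protected protected instance(s) in orders-asg" >&2
  exit 1
fi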
A recovery alarm on StatusCheckFailed_System never fired the recovery action; the alarm sat in INSUFFICIENT_DATA for the last 6h. Recovery action ARN: arn:aws:automate:us-east-1:ec2:recover.
| # | Hypothesis | Disprove |
|---|---|---|
| H1 | treat_missing_data not breaching | describe-alarms |
| H2 | Recovery action wrong ARN | compare to AWS-supplied recovery ARN |
| H3 | Instance type doesn't support recovery | check supported list |
| H4 | Alarm in different region | region check |
# 1. Alarm definition
aws cloudwatch describe-alarms \
  --alarm-names orders-recover-i-0xx \
  --query 'MetricAlarms[].{T:TreatMissingData,A:AlarmActions,P:DatapointsToAlarm,E:EvaluationPeriods}'
# 2. Last 6h metric data
aws cloudwatch get-metric-statistics --namespace AWS/EC2 \
  --metric-name StatusCheckFailed_System \
  --dimensions Name=InstanceId,Value=i-0xx \
  --start-time -6h --end-time now --period 60 --statistics Maximum
# 3. Fix the alarm
aws cloudwatch put-metric-alarm \
  --alarm-name orders-recover-i-0xx \
  --metric-name StatusCheckFailed_System --namespace AWS/EC2 \
  --dimensions Name=InstanceId,Value=i-0xx \
  --statistic Maximum --period 60 --threshold 0 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 5 --datapoints-to-alarm 3 \
  --treat-missing-data breaching \
  --alarm-actions arn:aws:automate:us-east-1:ec2:recover
Root cause: treat_missing_data=missing — the default. When metrics stop, the alarm goes INSUFFICIENT_DATA and takes no action. Recovery only runs from ALARM; it can never get there with missing data treated as missing. Use breaching for failure-detection alarms.
resource "aws_cloudwatch_metric_alarm" "recover" {
  for_each            = toset(var.instance_ids)
  alarm_name          = "recover-${each.value}"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 5
  datapoints_to_alarm = 3
  metric_name         = "StatusCheckFailed_System"
  namespace           = "AWS/EC2"
  period              = 60
  statistic           = "Maximum"
  threshold           = 0
  treat_missing_data  = "breaching"
  alarm_actions       = ["arn:aws:automate:us-east-1:ec2:recover"]
  dimensions          = { InstanceId = each.value }
}
tip: Stamp the alarm per instance with for_each — or use EC2 instance auto-recovery (default behavior), which doesn't require explicit alarms on supported types.
resource "aws_instance" "x" {
  maintenance_options { auto_recovery = "default" }
}
Use maintenance_options.auto_recovery=default — AWS handles it without alarms.
Pair recovery with a CW alarm on StatusCheckFailed_Instance → reboot action; covers OS hangs that aren't host failures.
For ASG, prefer health-check replace: set ASG health_check_type=ELB; ASG kills + replaces unhealthy instances faster than recovery.
CW alarm meta-monitor: alarm on any alarm in INSUFFICIENT_DATA > 30 min (sketch below).
Config rule: cloudwatch-alarm-action-check + custom check on treat_missing_data.
Annual game day: simulate host failure (stop force) and assert recovery completes < 5 min.
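The meta-monitor can start life as a scheduled script; a sketch (the recover- name prefix is an assumption, and the 30-min threshold comes from running it on a 30-min schedule and comparing consecutive outputs):
#!/usr/bin/env bash
# List recovery alarms sitting in INSUFFICIENT_DATA, with when they entered it
set -euo pipefail
aws cloudwatch describe-alarms --state-value INSUFFICIENT_DATA \
  --alarm-name-prefix recover- \
  --query 'MetricAlarms[].[AlarmName,StateUpdatedTimestamp]' --output text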
New instance never registers with SSM: aws ssm describe-instance-information returns empty, and Session Manager reports Target not found — even though the instance role has AmazonSSMManagedInstanceCore.
| # | Hypothesis | Disprove |
|---|---|---|
| H1 | SSM agent not running | systemctl status amazon-ssm-agent |
| H2 | IAM role missing or perm | describe-instance-attribute --attribute iamInstanceProfile |
| H3 | VPC endpoint SG denies 443 from instance | SG ingress rules |
| H4 | VPCe DNS resolution off | private_dns_enabled |
| H5 | Time skew breaks SigV4 | chronyc sources |
tip: curl -v https://ssm.us-east-1.amazonaws.com from the instance — if it resolves to a 10.x address, VPCe is in play. If a public IP, NAT path. Either should give 403 — that's good (TLS works).
# 1. From instance (via console direct connect or get-system-log)
sudo systemctl status amazon-ssm-agent
sudo journalctl -u amazon-ssm-agent --no-pager | tail -50
sudo tail -100 /var/log/amazon/ssm/amazon-ssm-agent.log
# 2. Reach the endpoint
getent hosts ssm.us-east-1.amazonaws.com
curl -v https://ssm.us-east-1.amazonaws.com 2>&1 | head -10
# 3. From console
aws ssm describe-instance-information \
  --filters Key=InstanceIds,Values=i-0xx
# empty -> agent never registered
# 4. Inspect VPCe SG + private DNS
aws ec2 describe-vpc-endpoints --vpc-endpoint-ids vpce-0xx \
  --query 'VpcEndpoints[].{P:PrivateDnsEnabled,SG:Groups}'
Root cause: the SSM interface endpoint had private_dns_enabled=false (someone disabled it for a debug last week). ssm.us-east-1.amazonaws.com resolved to a public IP, the instance had no NAT → the agent couldn't register. Without private DNS you'd have to call the endpoint-specific name (vpce-xxx.ssm.us-east-1.vpce.amazonaws.com); the SSM agent doesn't support that path.
resource "aws_vpc_endpoint" "ssm" {
  vpc_id              = aws_vpc.main.id
  service_name        = "com.amazonaws.us-east-1.ssm"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = local.private_subnets
  security_group_ids  = [aws_security_group.vpce.id]
  private_dns_enabled = true # <-- must be true
  tags                = local.tags
}
# repeat for ssmmessages, ec2messages
# SG for vpce
resource "aws_security_group_rule" "vpce_ingress" {
  type                     = "ingress"
  from_port                = 443
  to_port                  = 443
  protocol                 = "tcp"
  source_security_group_id = aws_security_group.workload.id
  security_group_id        = aws_security_group.vpce.id
}
SSM Fleet Manager can self-heal SSM agent state on managed instances — useful when agents drift.
Use SSM Default Host Management Configuration — auto-attaches the SSM role + agent to all EC2 in the account, no manual setup.
Ship CloudWatch Agent + SSM as a single Image Builder component; consistent across AMIs.
Config rule: ec2-instance-managed-by-systems-manager; non-compliant means missing.
Synthetic launch + SSM ping every hour from sandbox; alarm if registration takes > 5 min (sketch below).
Pre-deploy checklist gates: SSM ping, VPCe health, IAM role attached.
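A sketch of that hourly synthetic (the gc-ssm-canary launch template is a placeholder):
#!/usr/bin/env bash
# Launch a canary, assert SSM registration within 5 min, then clean up
set -euo pipefail
IID=$(aws ec2 run-instances --launch-template LaunchTemplateName=gc-ssm-canary \
  --query 'Instances[0].InstanceId' --output text)
deadline=$((SECONDS + 300))
until aws ssm describe-instance-information \
    --filters "Key=InstanceIds,Values=$IID" \
    --query 'InstanceInformationList[0].PingStatus' --output text 2>/dev/null \
    | grep -q Online; do
  (( SECONDS > deadline )) && { echo "SSM registration > 5 min for $IID" >&2; break; }
  sleep 15
done
aws ec2 terminate-instances --instance-ids "$IID" >/dev/null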
Instances churn: a CW alarm reboots on app failure while the ASG (health_check_type=ELB, grace 60s) terminates the same instance mid-reboot — two healers fighting.
| # | Hypothesis | Disprove |
|---|---|---|
| H1 | ASG terminates while CW-triggered reboot is in flight | CW + ASG events, same instance |
| H2 | Grace period too short | ASG health_check_grace_period |
| H3 | Health check uses /health that goes 503 in shutdown phase | app shutdown logs |
tip: aws autoscaling describe-scaling-activities + CW alarm history side-by-side reveal who killed first.
# 1. ASG events
aws autoscaling describe-scaling-activities \
  --auto-scaling-group-name orders-asg \
  --max-records 5
# 2. CW alarm history
aws cloudwatch describe-alarm-history \
  --alarm-name orders-app-reboot --max-records 10
# 3. Disable the dual healing
aws cloudwatch delete-alarms --alarm-names orders-app-reboot
# 4. Tune ASG grace
aws autoscaling update-auto-scaling-group \
  --auto-scaling-group-name orders-asg \
  --health-check-grace-period 180
# Drop the reboot alarm; rely on ASG ELB health
# removed: aws_cloudwatch_metric_alarm.orders_app_reboot
resource "aws_autoscaling_group" "orders" {
  health_check_type         = "ELB"
  health_check_grace_period = 180
  ...
}
# app /health returns 200 when ready; 503 when draining
# ALB target group draining waits 30s
If reboot is necessary (kernel state), use ASG standby: put instance in standby, reboot, return to service. ASG won't terminate during standby.
Decouple readiness from liveness. ALB health checks readiness; ASG checks liveness via instance status checks. Fewer false positives.
Add CW alarm on the count of InstanceRefresh events; pages if ASG is churning.
Audit: any CW alarm with action ec2:reboot or ec2:terminate on instances inside an ASG → warn (sketch below).
Service runbook: one canonical healer per failure mode.
Game day: simulate hung app; assert single healer fires.
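A sketch of that audit — list alarms whose actions reboot or terminate EC2, then cross-check ASG membership by hand:
#!/usr/bin/env bash
# Alarms with ec2:reboot / ec2:terminate actions are dual-healer candidates
aws cloudwatch describe-alarms \
  --query "MetricAlarms[?AlarmActions[?contains(@, ':ec2:reboot') || contains(@, ':ec2:terminate')]].{Name:AlarmName,Actions:AlarmActions}" \
  --output table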
SSH to the bastion at 10.20.0.42:22 hangs. sg-bastion-ingress allows 22/tcp from pl-corp-onprem.
| Subnet | DMZ subnet-dmz-use1a (10.20.0.0/24) |
| NACL | nacl-dmz applied to subnet |
| NACL outbound rule | recently “hardened”: allow 80/443 to 0/0; deny all else |
| SG | stateful (allows return automatically) |
| NACL | stateless (return must be explicitly allowed) |
| # | Hypothesis | Disprove |
|---|---|---|
| H1 | SG ingress missing 22 | describe-security-groups |
| H2 | NACL inbound 22 missing | describe-network-acls |
| H3 | NACL outbound ephemeral missing — SYN-ACK drops | NACL out rules |
| H4 | Asymmetric routing (TGW back path different) | RT inspection |
| H5 | Host firewall (Defender/iptables) | local netsh / iptables -L |
# 1. Reachability Analyzer (config-time check)
aws ec2 create-network-insights-path \
  --source $(jumphost-eni) --destination i-bastion \
  --protocol TCP --destination-port 22
# 2. NACL outbound rules
aws ec2 describe-network-acls \
  --filters Name=association.subnet-id,Values=subnet-dmz-use1a \
  --query 'NetworkAcls[].Entries[?Egress==`true`]'
# 3. Live capture on bastion
sudo tcpdump -ni any host 10.0.5.10 and port 22 -w /tmp/cap.pcap
# 4. Traffic Mirroring (cheeky)
aws ec2 create-traffic-mirror-session \
  --network-interface-id eni-bastion \
  --traffic-mirror-target-id tmt-0xx \
  --traffic-mirror-filter-id tmf-0xx --session-number 1
# 5. VPC Flow Log query
aws logs filter-log-events --log-group-name /aws/vpc/flow \
  --filter-pattern 'srcaddr=10.20.0.42 dstaddr=10.0.5.10 action=REJECT'
Root cause: NACL egress was “hardened” to 80/443 only in a recent control change. NACLs are stateless, so the bastion's SYN-ACK back to the client's ephemeral port is dropped.
resource "aws_network_acl_rule" "dmz_out_ephem" {
  network_acl_id = aws_network_acl.dmz.id
  egress         = true
  rule_number    = 110
  rule_action    = "allow"
  protocol       = "6"
  cidr_block     = "10.0.0.0/8" # corp space
  from_port      = 1024
  to_port        = 65535
}
resource "aws_network_acl_rule" "dmz_out_https" {
  network_acl_id = aws_network_acl.dmz.id
  egress         = true
  rule_number    = 120
  rule_action    = "allow"
  protocol       = "6"
  cidr_block     = "0.0.0.0/0"
  from_port      = 443
  to_port        = 443
}
IaC: modules/nacl-tier always emits an ephemeral-out rule (1024-65535) to the corp prefix and to 0/0. The “hardening” PR that broke this should have failed the module's test suite.
# tflint custom rule
rule "aws_network_acl_must_have_ephemeral_egress" {
  enabled = true
  message = "NACL egress must include 1024-65535 (ephemeral)"
}
Many shops just don't use NACLs except for broad strokes (block known bad ports/CIDRs at the subnet boundary). Use SGs as the per-resource policy. Smaller footgun surface.
Linux ephemeral range varies. RHEL 8: 32768-60999. Older: 1024-65535. Windows: 49152-65535. NACLs need to cover all 1024-65535 for safety.
VPC Reachability Analyzer evaluates NACL config; it would have caught this before deploy — if anyone had run it.
Pre-deploy: every NACL change runs Reachability Analyzer for representative source/dest pairs. CI fail on REJECT (sketch below).
VPC Flow Log alarm on action=REJECT to subnet-internal IPs over 5-min baseline.
Module-only NACLs — SCP forbids creating aws_network_acl outside the module path.
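A CI sketch of that gate (the path ID nip-0xx is a placeholder; poll because analyses run async):
#!/usr/bin/env bash
set -euo pipefail
AID=$(aws ec2 start-network-insights-analysis \
  --network-insights-path-id nip-0xx \
  --query 'NetworkInsightsAnalysis.NetworkInsightsAnalysisId' --output text)
for i in {1..30}; do
  status=$(aws ec2 describe-network-insights-analyses \
    --network-insights-analysis-ids "$AID" \
    --query 'NetworkInsightsAnalyses[0].Status' --output text)
  [[ "$status" == "running" ]] || break
  sleep 5
done
found=$(aws ec2 describe-network-insights-analyses \
  --network-insights-analysis-ids "$AID" \
  --query 'NetworkInsightsAnalyses[0].NetworkPathFound' --output text)
[[ "$found" == "True" ]] || { echo "unreachable — NACL change rejected" >&2; exit 1; }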
An SG rule references sg-12345 from another VPC (TGW peer) and fails: InvalidGroup.NotFound.
| # | Hypothesis | Disprove |
|---|---|---|
| H1 | SG ID typo | describe-security-groups |
| H2 | SG in different VPC; no RAM share | ram list-resources |
| H3 | VPCs in different regions | compare regions |
| H4 | Provider in Terraform points to wrong account | provider alias check |
# 1. Confirm SG exists where you think
aws ec2 describe-security-groups --group-ids sg-bastion \
  --query 'SecurityGroups[].{V:VpcId,O:OwnerId,N:GroupName}'
# 2. Is it RAM-shared?
aws ram list-resources --resource-owner SELF \
  --resource-type ec2:SecurityGroup
# 3. Switch to prefix-list approach
aws ec2 create-managed-prefix-list \
  --address-family IPv4 --max-entries 50 \
  --prefix-list-name pl-shared-svcs \
  --entries 'Cidr=10.30.0.0/16,Description=shared'
# 4. SG rule using prefix list
aws ec2 authorize-security-group-ingress --group-id sg-orders \
  --ip-permissions 'IpProtocol=tcp,FromPort=22,ToPort=22,PrefixListIds=[{PrefixListId=pl-0xx}]'
# Owner: gc-network repo
resource "aws_ec2_managed_prefix_list" "shared_svcs" {
  name           = "pl-shared-svcs"
  address_family = "IPv4"
  max_entries    = 50
  entry {
    cidr        = "10.30.0.0/16"
    description = "shared-svcs VPC"
  }
  tags = local.tags
}
resource "aws_ram_resource_share" "pl" {
  name = "gc-prefix-lists"
}
# principals attach via association, not on the share itself
resource "aws_ram_principal_association" "pl_ous" {
  for_each           = toset(local.spoke_ous)
  resource_share_arn = aws_ram_resource_share.pl.arn
  principal          = each.value
}
resource "aws_ram_resource_association" "pl_shared" {
  resource_share_arn = aws_ram_resource_share.pl.arn
  resource_arn       = aws_ec2_managed_prefix_list.shared_svcs.arn
}
# Consumer: gc-prod-app repo
data "aws_ec2_managed_prefix_list" "shared" {
  name = "pl-shared-svcs"
}
resource "aws_security_group_rule" "orders_in" {
  type              = "ingress"
  from_port         = 22
  to_port           = 22
  protocol          = "tcp"
  prefix_list_ids   = [data.aws_ec2_managed_prefix_list.shared.id]
  security_group_id = aws_security_group.orders.id
}
gotcha: A customer-managed prefix list referenced in an SG rule counts as its max_entries against the per-SG rule quota — it centralizes maintenance more than it saves quota. Size max_entries deliberately.
SG references work cross-VPC for ALB target groups inside same account — useful for shared-services LBs.
For ECS tasks across services, share an SG via RAM and reference it directly — cleaner than maintaining IPs.
Custom checkov rule: forbid source_security_group_id with hardcoded sg-* across VPCs — force prefix list use.
Spoke account README documents pl-* names + how to consume (verification sketch below).
RAM resource share reviewed quarterly; unused shares removed.
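Consumer-side sanity check, a sketch — confirm the shared list is actually visible before a plan references it:
# From gc-prod-app: RAM-shared prefix lists appear next to local ones
aws ec2 describe-managed-prefix-lists \
  --filters Name=prefix-list-name,Values=pl-shared-svcs \
  --query 'PrefixLists[].{Id:PrefixListId,Owner:OwnerId,State:State}' --output table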
Adding a rule fails with RulesPerSecurityGroupLimitExceeded. Quota L-0EA8095F caps rules per SG.
| # | Hypothesis | Disprove |
|---|---|---|
| H1 | 60-rule cap reached | count rules per SG |
| H2 | Each microservice CIDR added separately | look for repeating /32s |
| H3 | Could be folded into prefix list | check duplicate descriptions |
tip: aws ec2 describe-security-group-rules --filters Name=group-id,Values=sg-x | jq '.SecurityGroupRules | length' tells you exactly how close to the limit you are.
# 1. Count rules
aws ec2 describe-security-group-rules \
  --filters Name=group-id,Values=sg-orders \
  --query 'length(SecurityGroupRules)'
# 2. Find duplicates / mergeable rules
aws ec2 describe-security-group-rules \
  --filters Name=group-id,Values=sg-orders \
  --query 'SecurityGroupRules[].{P:IpProtocol,F:FromPort,T:ToPort,C:CidrIpv4}' \
  | jq 'group_by(.C) | map({C: .[0].C, ports: map([.F, .T])})'
# 3. Quota
aws service-quotas get-service-quota \
  --service-code vpc --quota-code L-0EA8095F
# 4. Request raise
aws service-quotas request-service-quota-increase \
  --service-code vpc --quota-code L-0EA8095F --desired-value 250
resource "aws_ec2_managed_prefix_list" "partners" { name = "pl-orders-partners" address_family = "IPv4" max_entries = 60 dynamic "entry" { for_each = var.partner_ips content { cidr = "${entry.value.cidr}/32" description = entry.value.name } } } resource "aws_security_group_rule" "orders_partners" { type = "ingress"; from_port=443; to_port=443; protocol="tcp" prefix_list_ids = [aws_ec2_managed_prefix_list.partners.id] security_group_id = aws_security_group.orders.id }
# Lambda updates pl-orders-partners from a CSV in S3 daily
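A cron-flavoured sketch of that updater (bucket and CSV layout are assumptions; a real sync would diff against existing entries first):
#!/usr/bin/env bash
set -euo pipefail
PL=pl-0xx
VER=$(aws ec2 describe-managed-prefix-lists --prefix-list-ids "$PL" \
  --query 'PrefixLists[0].Version' --output text)
# CSV columns: ip,name → shorthand entries, one batched call
ENTRIES=$(aws s3 cp s3://gc-partner-ips/partners.csv - \
  | awk -F, '{printf "Cidr=%s/32,Description=%s ", $1, $2}')
# --current-version guards against concurrent edits (optimistic locking);
# $ENTRIES is deliberately unquoted so each entry becomes its own argument
aws ec2 modify-managed-prefix-list --prefix-list-id "$PL" \
  --current-version "$VER" --add-entries $ENTRIES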
Prefix list version increments on every change — SGs auto-reference latest. No SG churn.
Don't over-merge ports. A 22-3389 range looks compact but opens every port in between — RDP included — where you only wanted SSH. Be specific.
Use VPC Lattice for L7 service-to-service auth where possible — SGs only on ingress edge.
CW alarm: per-SG rule count > 50 (warn) / > 58 (alert) via Config rule (sweep sketch below).
Pre-merge tflint: detect >3 individual aws_security_group_rule with same protocol+port → suggest prefix list.
Quarterly SG cleanup: deduplicate, collapse, retire dead apps.
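A sweep sketch for the >50 warn threshold, usable before the Config rule exists:
#!/usr/bin/env bash
# Flag SGs within striking distance of the per-SG rule quota
aws ec2 describe-security-groups --query 'SecurityGroups[].GroupId' --output text \
| tr '\t' '\n' | while read -r sg; do
    n=$(aws ec2 describe-security-group-rules \
      --filters Name=group-id,Values="$sg" \
      --query 'length(SecurityGroupRules)')
    (( n > 50 )) && echo "WARN $sg has $n rules"
  done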
All targets show unhealthy with reason Health checks failed, yet the app is fine (curl localhost:8080/health returns 200). The ALB uses sg-alb-orders; the task SG (sg-orders-task) ingress doesn't allow the ALB SG.
| # | Hypothesis | Disprove |
|---|---|---|
| H1 | Target SG missing ingress from ALB SG | describe-security-groups |
| H2 | Health-check path 404 | app log |
| H3 | Target port mismatch (TG 80 vs app 8080) | describe-target-groups |
| H4 | Target deregistered | describe-target-health |
note: Health checks originate from the ALB's ENIs with the ALB's SG attached (e.g. sg-alb-edge); targets need only one allow rule.
# 1. Target health
aws elbv2 describe-target-health \
  --target-group-arn arn:...:targetgroup/orders \
  --query 'TargetHealthDescriptions[].{T:Target.Id,S:TargetHealth.State,R:TargetHealth.Reason,D:TargetHealth.Description}'
# 2. ALB and target SGs
aws elbv2 describe-load-balancers --names orders-alb \
  --query 'LoadBalancers[].SecurityGroups'
aws ec2 describe-security-groups --group-ids sg-orders-task \
  --query 'SecurityGroups[].IpPermissions'
# 3. Add ingress from ALB SG
aws ec2 authorize-security-group-ingress \
  --group-id sg-orders-task \
  --ip-permissions 'IpProtocol=tcp,FromPort=8080,ToPort=8080,UserIdGroupPairs=[{GroupId=sg-alb-orders}]'
# 4. Re-check health (~30s)
sleep 30
aws elbv2 describe-target-health --target-group-arn ...
Root cause: the task SG still listed a stale amazon-elb/sg-AAA as ingress. sg-alb-orders was never added; ALB-to-target traffic was dropped.
resource "aws_security_group" "alb_orders" {
  vpc_id = aws_vpc.main.id
  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
  tags = local.tags
}
resource "aws_security_group_rule" "orders_task_in_alb" {
  type                     = "ingress"
  from_port                = 8080
  to_port                  = 8080
  protocol                 = "tcp"
  source_security_group_id = aws_security_group.alb_orders.id
  security_group_id        = aws_security_group.orders_task.id
  description              = "ALB orders → task"
}
IaC: ship modules/alb-target-sg-pair that emits both SGs as a unit. Module test: target SG must have ingress from the LB SG.
# Pre-flight: the TG health check hits its configured port —
# make sure the SG allows that exact port (it might differ from the app port)
Set ALB enable_cross_zone_load_balancing + tune deregistration_delay to 30s. Faster blue/green flips.
For NLB targets, SGs apply only when target = instance. Target = IP uses subnet/SG of the IP's ENI — trickier auditing.
NLB has SG support since 2023 — older NLBs may not have one attached. Add via set-security-groups.
Health-check probes from synthetic; alarm if any TG has > 0 unhealthy > 2 min (sketch below).
Module test: deploy ALB+target, assert healthy in < 60s.
Custom Config rule: TG must have at least one ingress rule referencing parent ALB SG.
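A sketch of that sweep (alarm wiring omitted; run on a schedule from the synthetic account):
#!/usr/bin/env bash
# Count unhealthy targets per target group; page if any stay > 0
aws elbv2 describe-target-groups --query 'TargetGroups[].TargetGroupArn' --output text \
| tr '\t' '\n' | while read -r tg; do
    n=$(aws elbv2 describe-target-health --target-group-arn "$tg" \
      --query "length(TargetHealthDescriptions[?TargetHealth.State=='unhealthy'])")
    (( n > 0 )) && echo "UNHEALTHY $tg: $n target(s)"
  done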