
GlobalCorp AWS Troubleshooting Playbook

250 production scenarios · native commands · architecture diagrams · interactive labs · IaC fixes
Identity: IAM + Okta MFA Network: TGW hub-spoke AD: on-prem (corp.globalcorp.local) IaC: Terraform per account
Browser-based · Works offline · Progress saves locally
A vagmin.cloud-style hands-on playbook · 4-layer pedagogy · v1.0
Concept Foundation

How this playbook is taught — four layers, every scenario

You don't just read about a problem. You see it diagrammed, watch it play out, then run the commands yourself in a guided terminal.

1. Concept · Read the symptom in plain language with GlobalCorp business context. The “why” shapes the “how”.
2. Visual · See it diagrammed — architecture layers, network topology, traffic path, allowed vs blocked.
3. Simulation · Walk the hypotheses & root cause as a guided sequence. Trade-offs and alternatives are called out.
4. Lab · Type the real commands in an embedded terminal. Get simulated production output. Hints on demand.

Per-scenario slide map (8–10 slides)

# · Slide · Layer
1 · Symptom & business impact · Concept
2 · Architecture diagram (where it's blocking) · Visual
3 · Hypotheses & debug method · Sim
4 · Diagnose — native commands · Lab
5 · Root cause · Sim
6 · Fix — commands · Lab
7 · IaC change (Terraform) · IaC
8 · Cheeky / non-obvious · Concept
9 · Prevent / monitor · Concept
10 · Interactive lab terminal · Lab

Three habits that turn this into deep practice

  1. Read the scenario first. Skim symptom + impact. The constraints frame the diagnosis.
  2. Follow the traffic. Trace one request edge-to-target. The diagrams highlight the failing hop.
  3. Compare trade-offs. Most fixes have a quick patch and a durable IaC fix. Note when an SCP would block the patch.

Conventions you'll see

Marker · Meaning
tip · Best practice or non-obvious trick
note · Common assumption to verify
gotcha · Bites you in production
IaC · Terraform / IaC change to lock in the fix
Concept Foundation

Fictional company — GlobalCorp Holdings

The business

  • ~6,000 employees, financial services + 2 recent subsidiaries (FinSub, RetailSub)
  • Hybrid: 4 datacenters (NY, LDN, SG, FRA), Direct Connect to AWS in us-east-1, eu-west-1
  • Production: EC2 + ECS + RDS + FSx Windows, microservices in private subnets behind ALB/NLB
  • Identity: Okta for MFA & SSO, federated to per-account IAM via SAML (no IDC — legacy)
  • EC2 fleet domain-joined to on-prem AD (corp.globalcorp.local). DNS via R53 Resolver outbound endpoint
  • Each AWS account has its own Git repo + Terraform state in S3 + DynamoDB lock

Why this matters for troubleshooting

  • Cross-account routing via Transit Gateway with route domains per environment
  • SCP enforces tagging, region restriction, IMDSv2-only
  • RAM-shared resources: PHZ associations, prefix lists, TGW
[Diagram] On-prem datacenters (NY · LDN · SG · FRA; corp.globalcorp.local AD; bastion + monitoring) ↔ Direct Connect ↔ AWS Org root (us-east-1 primary, eu-west-1 DR/EU; Control Tower-style LZ). Okta tenant: SSO + MFA, SAML to IAM, SCIM groups. Subsidiaries: FinSub (own AD trust), RetailSub (PoC, separate VPC), joined via TGW peering. Terraform multirepo: repo per account, S3 backend, DDB lock; CI: GitHub Actions OIDC → per-account role.
Visual Foundation

AWS Organization — OUs & account inventory

[Diagram] Root (Mgmt acct)
  Security OU: log-archive · audit (SecHub) · guardduty-agg
  Infra/Shared OU: network-hub · shared-services · dns-hub · backup-vault
  Workloads OU: prod-app · prod-data · stg-app · dev-app · tools-cicd
  Sandbox OU: sandbox-dev1 · sandbox-dev2
  Subs OU: finsub-prod · retailsub-poc
SCPs applied (highlights):
  • deny actions without tag CostCenter on EC2/RDS/EKS
  • deny region != us-east-1 / eu-west-1 (workloads)
  • deny RunInstances if metadata-options != IMDSv2 required
  • deny iam:CreateUser org-wide (federation only)
  • deny disable of GuardDuty / CloudTrail / Config
  • sandbox: deny TGW attachments, deny VPC peering creation

Account inventory (used throughout)

Account · Alias · Purpose
111111111111 · gc-mgmt · Org root, billing
222222222222 · gc-log-archive · Central CloudTrail/Config logs
333333333333 · gc-audit · Security Hub, GuardDuty admin
444444444444 · gc-network · TGW, R53 Resolver, DX, network firewall
555555555555 · gc-shared-svcs · FSx, AD connector, jump hosts
666666666666 · gc-prod-app · Customer-facing microservices
777777777777 · gc-prod-data · RDS, ElastiCache, FSx for SQL backups
888888888888 · gc-stg-app · Staging mirror
999999999999 · gc-dev-app · Dev workloads
121212121212 · gc-tools-cicd · GitHub Actions OIDC, artifacts
131313131313 · gc-finsub-prod · FinSub subsidiary prod
141414141414 · gc-retailsub-poc · RetailSub PoC
Visual Foundation

Network architecture — TGW hub & spoke

[Diagram] Legend: Region · AZ · VPC · Public · Private · Restricted · On-prem.
External / identity providers at the edge. Customer-managed prefix lists (RAM-shared): 10/8 · 10.20/16 · 10.30/16 · 10.50/16; SCP: only gc-network may modify. User → edge chain (Internet to ALB); SAML → STS AssumeRoleWithSAML. Region us-east-1 (primary): hub account gc-network (444444444444); spokes gc-prod-app (666..., VPC 10.20.0.0/16, us-east-1a + us-east-1b mirror), gc-shared-svcs (555..., VPC 10.30.0.0/16), gc-prod-data (777..., VPC 10.21.0.0/16); subsidiaries / VMC / SD-WAN attach via TGW.
Visual Foundation

Inventory catalog — reusable components behind every diagram

Every scenario diagram is built from this fixed set: 70+ AWS service icons in standard category colours, 7 container styles (region/AZ/VPC/subnet tiers/account/on-prem), and ~10 composite mini-diagrams. No bespoke geometry per scenario — updates here propagate everywhere.

Compute (orange): EC2 · ASG · Launch Template · Lambda · ECS · EKS
Storage (green): S3 · FSx · EBS · Snap · AWS Backup Vault
Database (blue): RDS · Aurora · DynamoDB · ElastiCache · OpenSearch
Mgmt & Integration (pink): CloudWatch · CWLG · SSM · CloudTrail · EventBridge · Step Fns · X-Ray · SQS
Networking (purple): VPC · ALB · NLB · CloudFront · API Gw · Route 53 · R53 HC · IGW · NAT · TGW · VPCe · DX · DX Gw · VPN · Customer Gw · Network FW · ENI · Prefix List · SG · RAM · Lattice · ACM · WAF · Shield
Security / Identity (red + blue): IAM · KMS · Secrets Mgr · AWS Org · SCP · Control Tower · Landing Zone · Identity Ctr · GuardDuty · Config · Managed AD · on-prem AD · Okta · Entra ID (Azure AD) · O365 · JWKS
External / Hybrid / Legacy (gray + brand colors): User · On-prem DC · Internet · Border GW · Palo Alto FW · SD-WAN · VMC on AWS · Oracle DB · COBOL Mainframe · Terraform · Git
Container styles: Region · AZ · VPC · Public subnet · Private subnet · Restricted · Account · On-prem
Composite patterns (drop-in mini-diagrams): pat-edge-chain, pat-domain-join, pat-saml-flow, pat-onprem-stack, pat-dx-stack, pat-3az-vpc, pat-side-id, pat-side-obs, pat-side-dr, pat-control-tower-lz, pat-vmc-sdwan-oracle
Lab Foundation

Tooling inventory — what to reach for first

AWS-side

  • aws cli v2
  • SSM Run Command — PowerShell/bash, no SSH
  • SSM Session Manager — audited interactive shell
  • SSM Port Forwarding — tunnel 3389/1433/anything (example after this list)
  • VPC Reachability Analyzer
  • Network Access Analyzer
  • VPC Flow Logs (rich format)
  • R53 Resolver query logs
  • CloudTrail Lake (SQL across accounts)
  • IAM Access Analyzer
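
The port-forwarding item above is a single call through the SSM session plugin — a sketch; the instance ID and local port are illustrative:

aws ssm start-session --target i-0abc123 \
  --document-name AWS-StartPortForwardingSession \
  --parameters '{"portNumber":["3389"],"localPortNumber":["13389"]}'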

OS-side (Linux)

  • ip route get <ip>
  • ss -tnp
  • tracepath -n
  • mtr --report-wide -c 50
  • tcpdump -ni any host X and port Y
  • getent hosts / resolvectl
  • curl http://169.254.169.254/... (IMDSv2 token)
  • cloud-init-output.log
  • amazon-ssm-agent logs

OS-side (Windows)

  • Test-NetConnection -CommonTCPPort RDP -ComputerName X
  • Get-NetRoute / Get-NetIPConfiguration
  • Resolve-DnsName -Server X
  • nltest /sc_query:corp.globalcorp.local
  • dsregcmd /status
  • klist / klist purge
  • w32tm /query /status /verbose
  • Get-WinEvent -LogName System
Concept Foundation

Scenario index — 250 total, batched 25 at a time

# · Category · Count · Status
01 · EC2 lifecycle & provisioning · 20 · live (S002–S021)
02 · Security Groups & NACLs · 20 · partial (S022–S025 of 20)
03 · IAM, instance roles, cross-account · 20 · queued
04 · VPC, subnets, route tables · 15 · queued
05 · Transit Gateway & cross-acct routing · 15 · queued
06 · DNS / Route 53 / Resolver · 15 · queued
07 · Active Directory & domain join · 15 · queued
08 · Systems Manager (SSM) · 15 · queued
09 · VPC endpoints · 15 · queued
10 · CloudWatch Logs & Metrics · 15 · queued
11 · Load balancers (ALB/NLB) · 15 · queued
12 · Backup & DR · 15 · queued
13 · FSx & storage · 10 · queued
14 · Okta / federation / MFA · 15 · queued
15 · Terraform / IaC operations · 20 · queued
16 · Org / SCP / Landing Zone · 10 · queued
Sample scenario S001 follows — demonstrates the full 10-slide pattern with the embedded lab terminal so you can validate the format before Batch 1 (20 EC2 scenarios).
Scenario S001 · EC2 lifecycle · Active Directory

EC2 launches but fails to domain-join corp.globalcorp.local

A new prod-app Windows instance comes up, user-data runs, but Add-Computer fails with An Active Directory domain controller for the domain could not be contacted. App team is blocked.
Severity: P2 · Layers touched: DNS · Network · AD · 10 slides · lab included
Concept EC2 · AD S001 · 1/10

Symptom & business impact

What the user sees

  • Terraform apply is green; instance i-0abc123 reaches running.
  • App team RDPs in via SSM port-forward, opens Event Viewer:
System > NETLOGON > Event 5719:
This computer was not able to set up a secure session with a domain
controller in domain corp.globalcorp.local because of the following:
The remote procedure call was cancelled.

cloud-init-output.log:
Add-Computer : An Active Directory domain controller (AD DC) for the
domain corp.globalcorp.local could not be contacted.

Business impact

  • Orders microservice deployment paused (needs domain-joined service account)
  • SLO at risk: 30 min of error budget left on the four-nines target before the stakeholder Slack escalation
  • Recurring — happened to 3 instances this week, hand-fixed each time

Constraints (read these first)

Constraint · Implication
SCP requires IMDSv2 · User-data must use the IMDSv2 token call
SCP denies iam:CreateUser · Domain-join uses a vaulted AD service account, not an IAM user
VPC has no public subnet · No direct internet to secretsmanager.amazonaws.com — must use a VPCe
R53 Resolver outbound rule for corp.globalcorp.local · DNS query must reach Resolver → corp DCs over TGW
SG sg-prod-private-windows-ingress · Egress 53/88/389/445/etc to pl-corp-onprem
Note: the same NETLOGON 5719 + RPC cancelled string covers seven different root causes — SG, NACL, TGW route, DNS, secret access, time skew, OU permission. We'll narrow it.
Visual EC2 · AD S001 · 2/10

Where the traffic should go (and where it's probably blocked)

[Diagram] Legend: allowed hop vs blocked/dropped; region · VPC · AZ · private subnet.
Expected path: EC2 → .2 resolver → R53 Resolver outbound → TGW → NFW → DX → corp DC (reverse session takes the same path). The NFW DROPS 49152–65535, the RPC dynamic range (LSAD/SAM/SCM).
Region us-east-1, gc-prod-app account: VPC 10.20.0.0/16 · us-east-1a · private subnet 10.20.10.0/24 · sg-prod-windows. VPCe (interface) subnet with private DNS enabled — all required for the SSM session, secret pull, and IAM signing. If NFW blocks the VPCe data path, the user-data fetch fails before domain-join even starts. SCP requires IMDSv2 (HttpTokens=required); user-data must PUT a token first. Domain-join is executed by SSM doc GC-JoinOnPremAD with svc-domjoin from Secrets Mgr.
Network Firewall rule group gc-corp-ad-allow — stateful (Suricata); rule order matters, first match wins:
  pass tcp $HOME_NET any -> $CORP_DC [53,88,389,636,3268,3269,445]
  pass tcp $HOME_NET any -> $CORP_DC [123,464]   (NTP, KPasswd)
  drop tcp $HOME_NET any -> $CORP_DC any          (catch-all) <-- our culprit
  FIX: pass tcp $HOME_NET any -> $CORP_DC 50000:50099 — AD team pinned the RPC dynamic range; NFW now permits the constrained subset.
Required AD ports (the Microsoft contract): 53 DNS · 88/464 Kerberos · 389/636 LDAP/S · 3268/3269 GC · 445 SMB · 135 EPM + 49152–65535 RPC dynamic (Win 2008+) — the most-missed entry · 123 NTP — clock drift >5 min and Kerberos refuses (KRB_AP_ERR_SKEW) · 2049 NFS / 3268 GC / 9389 ADWS — situational extras.
Validation after fix:
  Test-NetConnection 10.10.0.10 -Port 50050  =>  TcpTestSucceeded : True
  Add-Computer -DomainName corp.globalcorp.local -Restart  =>  success
Sim EC2 · AD S001 · 3/10

Hypotheses & debug method — narrow before you bisect

Hypothesis tree (seven hypotheses; five-step bisection below)

# · Layer · Hypothesis · Falsify with
H1 · Identity · Instance can't fetch svc-domjoin secret (no role / wrong KMS) · aws sts get-caller-identity, get-secret-value from instance
H2 · Reach · SG egress missing port to corp DCs · Reachability Analyzer 53/88/389/445
H3 · Reach · TGW route table missing 10.0.0.0/8 · aws ec2 search-transit-gateway-routes
H4 · Reach · Inspection FW dropping RPC dynamic 49152–65535 · FW logs + Test-NetConnection -Port 50000
H5 · DNS · R53 Resolver rule missing/disassociated from VPC · list-resolver-rule-associations
H6 · Auth · Time skew > 5 min → Kerberos refuses · w32tm /query /status
H7 · Auth · Service account svc-domjoin lacks Add Computer right on target OU · DC sec event 4625 + delegation review

Bisection plan

  1. DNS first. If DNS fails everything else looks broken. Resolve-DnsName -Type SRV _ldap._tcp.dc._msdcs.corp.globalcorp.local
  2. L4 reach to a DC. Test-NetConnection 10.10.0.10 -Port 389 & -Port 445.
  3. RPC dynamic. Most cheeky — FW rule passes well-known but drops 49152–65535. Pick a random port, test.
  4. Identity. Pull the secret from instance to confirm role + KMS pass.
  5. Auth. Time + OU rights are the last 5%.
Cheeky: use VPC Reachability Analyzer on a single representative port (445) to prove TGW + inspection paths in seconds — before opening a console session.
Lab EC2 · AD S001 · 4/10

Diagnose — native commands you'll run

From your laptop (cross-account)

# confirm identity + assume into prod-app
aws sts get-caller-identity
aws sts assume-role --role-arn arn:aws:iam::666666666666:role/FedAppDev \
    --role-session-name s001 --query Credentials

# Reachability Analyzer (config-time check)
aws ec2 create-network-insights-path \
    --source i-0abc123 --destination 10.10.0.10 \
    --protocol TCP --destination-port 445
aws ec2 start-network-insights-analysis --network-insights-path-id nip-...

Inside the instance via SSM Run Command (cheeky)

aws ssm send-command --instance-ids i-0abc123 \
  --document-name AWS-RunPowerShellScript \
  --parameters 'commands=[
    "Resolve-DnsName -Type SRV _ldap._tcp.dc._msdcs.corp.globalcorp.local",
    "Test-NetConnection 10.10.0.10 -Port 389",
    "Test-NetConnection 10.10.0.10 -Port 445",
    "Test-NetConnection 10.10.0.10 -Port 50000",
    "w32tm /query /status",
    "klist"
  ]'
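
send-command is asynchronous; to pull the output back without opening the console, something like this works (the CommandId comes from the send-command response):

aws ssm list-command-invocations --command-id <command-id> \
  --details --query 'CommandInvocations[].CommandPlugins[].Output' \
  --output text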

What “good” looks like

Name                                    Type   TTL  Section
----                                    ----   ---  -------
_ldap._tcp.dc._msdcs.corp.globalcorp... SRV    600  Answer
  Priority : 0  Port : 389  Target : dc1-ny.corp...

ComputerName     : 10.10.0.10
RemoteAddress    : 10.10.0.10
RemotePort       : 389
TcpTestSucceeded : True

ComputerName     : 10.10.0.10
RemotePort       : 50000
TcpTestSucceeded : False   # <-- this is our smoking gun

Cross-check on the firewall side

# Network Firewall log query in CloudWatch Logs Insights
fields @timestamp, src, dst, dst_port, action
| filter dst = "10.10.0.10"
| filter src like /^10\.20\.10\./
| filter action = "DROP"
| stats count() by dst_port
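
To run that query from your laptop rather than the console, a sketch — the log group name is an assumption for this environment:

LG=/gc/nfw/flow   # assumed Network Firewall log group in gc-log-archive
QID=$(aws logs start-query --log-group-name "$LG" \
  --start-time $(date -d '1 hour ago' +%s) --end-time $(date +%s) \
  --query-string 'fields @timestamp, src, dst, dst_port, action
| filter dst = "10.10.0.10" and action = "DROP"
| stats count() by dst_port' \
  --query queryId --output text)
aws logs get-query-results --query-id "$QID"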
Press Next: the lab terminal at slide 10/10 lets you type these commands and see simulated production output.
Sim EC2 · AD S001 · 5/10

Root cause — RPC dynamic range dropped at Inspection FW

Story

  • The Network Firewall stateful rule group gc-corp-ad-allow permits the well-known AD ports: 53/88/389/636/3268/3269/445.
  • It does not permit the RPC dynamic range 49152–65535 (Win Server 2008+ default; AD uses these for SAM/LSAD/NetLogon RPCs).
  • The first phase of Add-Computer succeeds (DNS & LDAP work), but the secure channel setup needs a dynamic RPC port → FW drops → client retries → eventually surfaces as RPC was cancelled.
  • This was masked because the FW was deployed by netsec team using an upstream “AD allow” rule template that pre-dates dynamic RPC.

Why this took a week to find

Misleading signal · Why
NETLOGON 5719 fires for many causes · Same code, 7+ root causes
Reachability Analyzer (cfg) passes 445 · Doesn't test stateful FW dynamic ports
FW logs in central acct · Devs lack read on log-archive
Sometimes works (race) · If RPC happens to negotiate <49152, it passes
Gotcha: AWS docs list the “ports for AD” without the dynamic range. The Microsoft docs explicitly require 49152–65535/tcp for outbound RPC dynamic. Always check both.
Cheeky: on the DC side you can restrict the dynamic range via registry — many shops set it to 50000-50099 and only allow that subset through the firewall.
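
A sketch of that DC-side change (run by the AD team on each DC; Microsoft also documents a registry-key approach under HKLM\SOFTWARE\Microsoft\Rpc for RPC-specific ranges). Note this netsh form pins the machine-wide dynamic range, so size it for DC load:

netsh int ipv4 set dynamicport tcp start=50000 num=100
netsh int ipv4 show dynamicport tcp    # verify the new range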
Lab EC2 · AD S001 · 6/10

Fix — immediate unblock + correct fix

Immediate (PR + console — gc-network)

# 1. Patch the Network Firewall rule group to permit
#    the AD RPC dynamic range OR a constrained sub-range.
aws network-firewall describe-rule-group \
    --rule-group-name gc-corp-ad-allow --type STATEFUL --query 'RuleGroup' > rg.json

# 2. Append rule (Suricata syntax) and update.
# pass tcp $HOME_NET any -> $CORP_DC 49152:65535 (msg:"AD RPC dyn"; \
#   flow:to_server,established; sid:1000201; rev:1;)
aws network-firewall update-rule-group \
    --rule-group-name gc-corp-ad-allow --type STATEFUL \
    --rule-group file://rg.json --update-token <token>

Validate

aws ssm send-command --instance-ids i-0abc123 \
  --document-name AWS-RunPowerShellScript \
  --parameters 'commands=[
    "Test-NetConnection 10.10.0.10 -Port 50000",
    "Add-Computer -DomainName corp.globalcorp.local -Credential $cred -Restart"
  ]'

Correct fix — constrain the range, codify it

  1. Ask the AD team to fix the dynamic port range to 50000–50099 on all DCs (registry) and publish it as a documented contract.
  2. Update Network Firewall rule to permit only that constrained range.
  3. Update the SG egress on Windows EC2 to allow the same subset to pl-corp-onprem.
  4. Document in the AD SRE runbook + tag the FW rule group with Owner=netsec, References=AD-DC-Ports.
Cheeky: if AD team can't change DCs quickly, you can require Kerberos-only RPC on the client (KrbtgtFullPacSignature + RestrictNTLM) which sidesteps some legacy SAM RPC paths — only do this with AD's blessing.
IaC EC2 · AD S001 · 7/10

Lock the fix into Terraform — repo: gc-network

Branch + plan

# in gc-network repo
git checkout -b fix/nf-ad-rpc-dynamic
# edit modules/inspection-fw/rules/ad-allow.suricata
git diff --stat
terraform fmt -recursive
terraform validate
terraform plan -var-file=envs/us-east-1.tfvars \
   -target=module.inspection_fw.aws_networkfirewall_rule_group.ad_allow

Snippet of the change (HCL + Suricata)

resource "aws_networkfirewall_rule_group" "ad_allow" {
  capacity = 200
  name     = "gc-corp-ad-allow"
  type     = "STATEFUL"
  rule_group {
    rules_source {
      rules_string = file("${path.module}/rules/ad-allow.suricata")
    }
    rule_variables {
      ip_sets {
        key = "CORP_DC"
        ip_set { definition = ["10.10.0.10/32", "10.10.0.11/32"] }
      }
    }
  }
  tags = local.tags
}

ad-allow.suricata (added rule)

# existing well-known AD ports...
pass tcp $HOME_NET any -> $CORP_DC [53,88,389,636,3268,3269,445] \
  (msg:"AD well-known"; sid:1000101; rev:2;)

# NEW: AD RPC dynamic range (constrained to 50000-50099)
pass tcp $HOME_NET any -> $CORP_DC 50000:50099 \
  (msg:"AD RPC dyn constrained"; flow:to_server,established;
   sid:1000201; rev:1;)

PR + apply

  1. Open PR; checkov + tflint green; CI posts plan to PR.
  2. Reviewers: @netsec-leads, @ad-leads.
  3. Merge → apply.yml assumes FedTerraformApply via OIDC, runs terraform apply.
  4. Drift-check job nightly; if FW console-edited again, drift opens an issue.
IaC note: the dynamic range is now an input variable var.ad_rpc_range = "50000-50099" with the same value referenced by the per-spoke SG egress modules — one source of truth.
Concept EC2 · AD S001 · 8/10

Cheeky / non-obvious tricks pulled in this scenario

SSM-as-PowerShell

You never RDP'd into the host. aws ssm send-command with AWS-RunPowerShellScript ran the diagnostics with full audit (CloudTrail + SSM Run Command history). For interactive work, aws ssm start-session --target i-... is your shell.

Reachability Analyzer's blind spot

It evaluates config: SG, NACL, route table, TGW. It does not evaluate stateful Network Firewall rules. If RA says reachable but it's not, suspect inspection FW, host firewall, MTU, or asymmetric routes.

FW log query without log-archive read

Devs often lack read on gc-log-archive. We expose a cross-account CloudTrail Lake datastore + a read-only log-insights view via aws-vault assume-role onto a FedNetTroubleshoot role — can query FW logs without copying data.

Constrain dynamic ports

Default AD RPC dynamic range is huge. We pin DCs to 50000–50099 and document it as the AD-team contract. FW rule shrinks from a 16k-port hole to 100 ports.

IMDSv2 + secret pull from user-data

SCP enforces IMDSv2. Our user-data fetches the IMDSv2 token first, then the role creds, then the secret. If you script the IMDSv1 way it silently 401s and the domain-join “just” fails.

$tk = Invoke-RestMethod -Headers @{"X-aws-ec2-metadata-token-ttl-seconds"="300"} `
   -Method PUT -Uri "http://169.254.169.254/latest/api/token"
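
The equivalent chain on the Linux side of the fleet, as a sketch (same secret name as this scenario):

TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 300")
ROLE=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
  http://169.254.169.254/latest/meta-data/iam/security-credentials/)
curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
  "http://169.254.169.254/latest/meta-data/iam/security-credentials/$ROLE" >/dev/null
# with instance creds confirmed, pull the secret over the VPCe
aws secretsmanager get-secret-value --secret-id svc-domjoin \
  --query SecretString --output text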

Tagging hint

Add DomainJoin=required tag at launch. A maintenance window step waits for that tag, then runs the GC-JoinOnPremAD doc — lets you re-run domain join idempotently after a fix without rebuilding the host.

Concept EC2 · AD S001 · 9/10

Prevention & monitoring — never see this again

Detective controls (CloudWatch & Config)

  • CW alarm on Network Firewall metric DroppedPackets with dimension StatefulRuleGroup=gc-corp-ad-allow — non-zero in 5 min → PagerDuty (CLI sketch after this list).
  • CW Logs Insights saved query: “AD RPC drops to corp DCs” bookmarked in dashboard gc-ad-health.
  • Config rule: required-network-firewall-rule-group-tags — rule groups must carry References=AD-DC-Ports tag (…so they show up in this audit).
  • EventBridge on SSMRunCommand success/failure for GC-JoinOnPremAD; failures trigger a Lambda that posts diagnostics to Slack #ad-domain-join.
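
A sketch of the first alarm above via the CLI; the firewall name, dimension set, and SNS topic are assumptions — check which dimensions your firewall actually emits before wiring this up:

aws cloudwatch put-metric-alarm \
  --alarm-name gc-nfw-ad-rpc-drops \
  --namespace AWS/NetworkFirewall --metric-name DroppedPackets \
  --dimensions Name=FirewallName,Value=gc-inspection-fw Name=Engine,Value=Stateful \
  --statistic Sum --period 300 --evaluation-periods 1 \
  --threshold 0 --comparison-operator GreaterThanThreshold \
  --treat-missing-data notBreaching \
  --alarm-actions arn:aws:sns:us-east-1:444444444444:gc-pagerduty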

Preventive controls

  • Synthetic canary: a small Windows EC2 in shared-svcs runs a Test-NetConnection matrix to all corp DCs every 5 min, emits CW custom metric.
  • Pull request CI: any change to aws_networkfirewall_rule_group requires @ad-leads review (CODEOWNERS).
  • SCP exception watcher: if anyone tries to disable IMDSv2 enforcement, audit acct alerts.
  • Runbook link on every NETLOGON 5719 alert that points to this scenario's slide deck.
SLO bookkeeping: add “domain-join success within 5 min of launch” to your platform SLO. The synthetic canary measures it.
Lab EC2 · AD S001 · 10/10

Interactive lab — type the commands, see production output

Lab: Diagnose AD domain-join failure for i-0abc123 · simulated · offline
Objectives — complete in any order. Type hint, show, reset, or list at any time.
    Batch 1 · 20 scenarios · EC2 lifecycle

    EC2 lifecycle & provisioning failures

    Twenty production scenarios — pending instances, failed user-data, cross-account KMS, ASG drift, SCP blocks, ENI exhaustion, SSM-agent gaps, spot/recovery races. Every scenario follows the 8-slide pattern with an interactive lab terminal.
S002 → S021 · 160 slides · 20 lab terminals
Concept EC2 lifecycle S002 · 1/8

    Instance stuck in pending for 8 min then failed

    What happened

    • Terraform apply succeeds. aws ec2 run-instances returns an InstanceId.
    • Instance shows pending for ~8 min then transitions to shutting-down → terminated.
    • StateReason: Server.InternalError: Internal error on launch.

    Business impact

    • Blue/green deploy can't expand green fleet. Stuck behind quota of 1 deploy in flight.
    • Repeats for 4/10 instances, randomly — not deterministic by AZ or instance type.

    Constraints

Item · Detail
AMI · shared from gc-tools-cicd (121212121212)
Root volume · EBS encrypted with customer KMS key in gc-tools-cicd
Launching account · gc-prod-app (666666666666)
Default EBS encryption · on, with account-default KMS key in 666... (a different key)
Service role · AWSServiceRoleForAutoScaling
    Note: Server.InternalError is the polite version of “something on the EC2 side blew up” — almost always EBS attach, ENI attach, or KMS.
Visual EC2 lifecycle S002 · 2/8

    The cross-account KMS path that breaks

[Diagram] gc-tools-cicd (121212121212): AMI → snapshot → KMS CMK. gc-prod-app (666666666666): ASG → RunInstances → LT → EBS attach FAILS: kms:CreateGrant denied, principal AWSServiceRoleForAutoScaling (default EBS key · orders-task-role · root EBS · CW alarm).
Required IAM / KMS chain:
  1. KMS key policy in 1212... grants 666... root
  2. IAM role in 666... has kms:Decrypt + CreateGrant on the key (most-missed step in cross-acct shares)
  3. AWSServiceRoleForAutoScaling has CreateGrant
  4. Snapshot shared with 666... (modify-snapshot-attribute)
  5. AMI has launch permission for 666...
Verify with: kms get-key-policy · kms list-grants · cloudtrail lookup-events ResourceName=ami-... · ec2 describe-snapshot-attribute
Alternative: re-encrypt copy (copy-image → snapshot re-encrypted with spoke KMS). Trade-offs:
  + no cross-acct grants to maintain · blast radius isolated to spoke · simpler IAM to audit
  − double EBS snapshot storage cost · AMI promotion job needs spoke role + KMS · multi-region: copy per region
Pick by org shape: few spokes → cross-acct grants OK · many spokes / strict isolation → copy-encrypt · compliance requires a single key per acct → copy-encrypt.
Service Catalog gate: the promotion product validates launch-perm, snapshot-share, and KMS grant per spoke in scope; if any is missing, promotion fails before deploy. Result: cross-acct AMI promotions are auditable.
Sim EC2 lifecycle S002 · 3/8

    Hypotheses & quickest disproof

# · Hypothesis · Disprove with
H1 · EBS attach fails — KMS cross-acct grant missing · describe-instances StateReason
H2 · ENI attach fails — subnet/AZ ran out of IPs · describe-subnets AvailableIpAddressCount
H3 · AZ capacity (Insufficient) · StateReasonMessage contains Insufficient capacity
H4 · Tenancy mismatch (dedicated host expired) · describe-host-reservations
H5 · SCP blocking iam:PassRole during launch · CloudTrail event RunInstances errorCode

    CloudTrail look-up

aws cloudtrail lookup-events \
   --lookup-attributes AttributeKey=ResourceName,AttributeValue=i-0xx \
   --max-results 5 --query 'Events[].CloudTrailEvent' --output json \
   | jq -r '.[]' | jq 'select(.errorCode!=null) | {errorCode,errorMessage}'
    Cheeky: RunInstances is async — the API call is fine, but the failure surfaces in EBS and EC2 events that come later. Lookup by ResourceName, not eventName.
Lab EC2 lifecycle S002 · 4/8

    Diagnose — commands you'll run

    # 1. Pull the StateReason directly
    aws ec2 describe-instances --instance-ids i-0xx \
      --query 'Reservations[].Instances[].{S:State.Name,R:StateReason}'
    
    # 2. Pull instance-status (more granular)
    aws ec2 describe-instance-status --instance-ids i-0xx \
      --include-all-instances
    
    # 3. Inspect the snapshot encryption + KMS key
    aws ec2 describe-snapshots --snapshot-ids snap-0xx \
      --query 'Snapshots[].{Enc:Encrypted,KMS:KmsKeyId,Owner:OwnerId}'
    
    # 4. Check key policy in the source account
    aws --profile gc-tools kms get-key-policy \
       --key-id alias/tools-ami --policy-name default | jq
    # 5. List grants on the key (look for our role)
    aws --profile gc-tools kms list-grants \
       --key-id alias/tools-ami \
       --query 'Grants[?contains(GranteePrincipal,`666666666666`)]'
    
    # 6. Try the decrypt directly with an exec-role on a test instance
    aws ssm send-command --instance-ids i-test \
      --document-name AWS-RunShellScript \
      --parameters 'commands=[
        "aws kms describe-key --key-id arn:aws:kms:us-east-1:121212121212:key/aaa..."
      ]'
Gotcha: the error in CloudTrail will be AccessDenied on kms:Decrypt with the principal AWSServiceRoleForAutoScaling, not the user/role that called RunInstances.
Sim EC2 lifecycle S002 · 5/8

    Root cause

    The chain of events

    1. AMI ami-prod-base uses an encrypted snapshot backed by KMS key arn:aws:kms:us-east-1:121212121212:key/aaa-tools-ami.
    2. When ASG scales up in gc-prod-app, the launch goes through the service-linked role AWSServiceRoleForAutoScaling.
    3. That role calls kms:CreateGrant on the source key on behalf of EC2/EBS.
    4. The KMS key policy grants 666... root but does not grant kms:CreateGrant to aws-service-role/autoscaling.amazonaws.com.
    5. EBS attach silently fails after instance moves to pending; EC2 retries the EBS detach/re-attach for ~8 min, then gives up → Server.InternalError.
    Gotcha: KMS errors during EC2/ASG launch are not surfaced as KMS errors in the EC2 console. You must look at CloudTrail in the source account (where the key lives), not the launching account.
IaC EC2 lifecycle S002 · 6/8

    Fix — key policy + Terraform

    Key policy patch (in gc-tools-cicd repo)

    data "aws_iam_policy_document" "tools_ami_key" {
      statement {
        sid = "AllowSpokeAccountsToUseKey"
        actions = ["kms:Decrypt","kms:DescribeKey",
                   "kms:ReEncrypt*","kms:GenerateDataKey*",
                   "kms:CreateGrant"]
    principals {
      type        = "AWS"
      identifiers = ["arn:aws:iam::666666666666:root"]
    }
        resources = ["*"]
        condition {
          test     = "StringEquals"
          variable = "kms:ViaService"
          values   = ["ec2.us-east-1.amazonaws.com"]
        }
      }
    }

Spoke side — KMS grant for the ASG service-linked role

# AWSServiceRoleForAutoScaling is a service-linked role: you can't attach
# policies to it. Instead, cut a KMS grant on the source key naming the
# SLR as grantee (a sketch; the resource name is illustrative):
resource "aws_kms_grant" "asg_tools_ami" {
  name              = "asg-use-tools-ami-key"
  key_id            = "arn:aws:kms:us-east-1:121212121212:key/aaa-tools-ami"
  grantee_principal = "arn:aws:iam::666666666666:role/aws-service-role/autoscaling.amazonaws.com/AWSServiceRoleForAutoScaling"
  operations        = ["Decrypt", "DescribeKey", "GenerateDataKeyWithoutPlaintext",
                       "ReEncryptFrom", "ReEncryptTo", "CreateGrant"]
}
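
For a one-off unblock before the Terraform lands, the same grant can be cut by hand from the spoke account (permitted by the kms:CreateGrant in the key policy above):

aws kms create-grant \
  --key-id arn:aws:kms:us-east-1:121212121212:key/aaa-tools-ami \
  --grantee-principal arn:aws:iam::666666666666:role/aws-service-role/autoscaling.amazonaws.com/AWSServiceRoleForAutoScaling \
  --operations Decrypt DescribeKey GenerateDataKeyWithoutPlaintext ReEncryptFrom ReEncryptTo CreateGrant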
IaC note: the cleaner pattern is to wrap this in a module that takes spoke_account_ids + kms_key_arn and emits both the key policy statement and the spoke-side grant from a single locals.tf source of truth.
Concept EC2 lifecycle S002 · 7/8

    Cheeky & prevention

    Cheeky #1

    Use VPC Reachability Analyzer? No — this is KMS, not network. Use IAM Access Analyzer (cross-account) to surface keys exposed/granted across accounts before the launch even happens.

    Cheeky #2

    Pre-flight: aws ec2 run-instances --dry-run only checks the calling principal — not the EBS KMS chain. Bake an explicit kms:DescribeKey probe into your AMI promotion job.
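
A sketch of that probe as a pipeline step; the spoke role name is illustrative:

KEY=arn:aws:kms:us-east-1:121212121212:key/aaa-tools-ami
for ACCT in 666666666666 888888888888; do
  # assume a probe role in each spoke, then confirm the key is visible
  eval $(aws sts assume-role \
    --role-arn "arn:aws:iam::$ACCT:role/ami-promotion-probe" \
    --role-session-name kms-preflight \
    --query 'Credentials.[AccessKeyId,SecretAccessKey,SessionToken]' \
    --output text | awk '{print "export AWS_ACCESS_KEY_ID="$1" AWS_SECRET_ACCESS_KEY="$2" AWS_SESSION_TOKEN="$3}')
  aws kms describe-key --key-id "$KEY" >/dev/null \
    || { echo "KMS key not visible from $ACCT"; exit 1; }
done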

    Cheeky #3

    If you can't change the source key policy, copy the AMI into the spoke account and re-encrypt with the local default key. The cost is double-storage; the win is no cross-acct grants to maintain.

    Prevent — CW alarm

    Alarm on the EBS metric VolumeAttachFailures (custom via EventBridge on AttachVolume errorCode), routes to #platform-pager.

    Prevent — Config rule

    kms-cmk-not-scheduled-for-deletion + a custom rule that flags any KMS key shared cross-acct that is missing CreateGrant to autoscaling.amazonaws.com.

    Prevent — Service Catalog

    The AMI promotion product validates: launch perm + snapshot share + KMS grant exist for every spoke account in scope. If not, promotion fails.

Lab EC2 lifecycle S002 · 8/8

    Interactive lab

Lab S002: Find the KMS cross-account grant gap · simulated
Concept EC2 lifecycle S003 · 1/8

      Symptom — user-data didn't run on first boot

      Observed

      • Instance boots; SSM Session is fine; but the orders-api service isn't installed.
      • cloud-init-output.log is empty on Linux; EC2Launch log shows “UserData persist disabled” on Windows.
      • Manually re-running the script works.

      Impact

      • Auto-built fleet has 30% no-op instances; canary fails to flip green.

      Constraints

Item · Detail
OS · Linux: AL2023; Windows: Server 2022, EC2Launch v2
Launch source · Launch Template v6 (just promoted)
AMI · baked yesterday from custom pipeline
User-data · shell script (Linux) / <powershell>...</powershell> (Win)
      Note: “User-data didn't run” has 4 distinct root causes by frequency: (1) AMI was sysprep'd w/o EC2Launch persist, (2) cloud-init disabled in baked image, (3) MIME multi-part malformed, (4) #cloud-config typo.
Visual EC2 lifecycle S003 · 2/8

      Boot stages where user-data is supposed to run

[Diagram] Linux boot path (AL2023, cloud-init): init → modules-config → modules-final; state under /var/lib/cloud/instance/user-data.txt, sem/config-scripts-user, scripts/per-instance/*.sh. Sealed-AMI bug: /var/lib/cloud/sem/config-scripts-user already exists, so cloud-init decides "already done" and skips modules-final. Fix: cloud-init clean --logs + rm -rf /var/lib/cloud/sem/* in Packer.
Windows boot path (Server 2022, EC2Launch v2): service → ExecuteUserData; state under C:\ProgramData\Amazon\EC2Launch\state\ (.run-once, .previous-user-data). Bug: .run-once survived in the AMI, so UserData is skipped on next boot; EC2Launch v2 logs "UserData persist disabled". Fix: EC2Launch.exe sysprep --shutdown OR reset --schedule as the last Packer step.
Packer hardening — the cleanup step that prevents both variants:
Linux (last shell provisioner):
  sudo cloud-init clean --logs
  sudo rm -rf /var/lib/cloud/sem/* /var/lib/cloud/instance
  sudo rm -f /etc/machine-id && sudo touch /etc/machine-id
  sudo truncate -s 0 /etc/hostname
  sudo rm -rf /root/.ssh /home/ec2-user/.ssh
Windows (last PowerShell provisioner):
  & "C:\Program Files\Amazon\EC2Launch\EC2Launch.exe" reset
  & "C:\Program Files\Amazon\EC2Launch\EC2Launch.exe" sysprep --shutdown
Validation post-bake (sandbox launch):
  curl http://169.254.169.254/latest/user-data   # rendered ok?
  cloud-init status --long                       # should show 'done' on first boot
  ls /var/lib/cloud/sem/                         # should be empty
  probe /tmp/userdata-marker exists              # custom canary
Image promotion gate (Terraform): the AMI must carry tag PackerCleanupRun=true, else the SSM parameter refuses promotion; a lifecycle precondition prevents bad bakes from reaching prod ASGs.
Sim EC2 lifecycle S003 · 3/8

      Hypotheses

# · Hypothesis · Disproof
H1 · AMI baked with cloud-init semaphores already present (Linux) · ls /var/lib/cloud/sem/ on baked AMI
H2 · AMI baked w/o running EC2Launch sysprep (Win) · EC2Launch.exe sysprep --shutdown log
H3 · User-data MIME multi-part missing Content-Type: text/x-shellscript · head -c 500 /var/lib/cloud/instance/user-data.txt
H4 · #cloud-config YAML invalid — cloud-init silently no-ops · cloud-init schema --system
H5 · Launch Template v6 has empty UserData field · describe-launch-template-versions

      Quick path

      1. Pull the rendered user-data from IMDS — if it's empty, the LT is the bug.
      2. If it's present, check cloud-init status --long and journalctl -u cloud-final.
      3. Windows: Get-Content C:\ProgramData\Amazon\EC2Launch\log\agent.log -Tail 200
      Cheeky: IMDS exposes the rendered user-data at http://169.254.169.254/latest/user-data. If it's wrong there, the LT is wrong. If it's right there but didn't run, it's the AMI.
Lab EC2 lifecycle S003 · 4/8

      Diagnose

      Linux

      # IMDSv2 token first
      TOKEN=$(curl -s -X PUT http://169.254.169.254/latest/api/token \
        -H "X-aws-ec2-metadata-token-ttl-seconds: 60")
      
      # Rendered user-data
      curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
         http://169.254.169.254/latest/user-data | head -40
      
      # cloud-init status + log
      cloud-init status --long
      sudo journalctl -u cloud-final --no-pager | tail -100
      
      # Look for stale semaphores baked into AMI
      ls -la /var/lib/cloud/sem/
      ls -la /var/lib/cloud/instance/
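
If the rendered user-data is a #cloud-config payload, validate it before blaming the AMI — the schema subcommand is available on recent cloud-init versions:

sudo cloud-init schema --system --annotate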

      Windows

      # EC2Launch v2 task state
      Get-Service AmazonSSMAgent
      Get-Content "C:\ProgramData\Amazon\EC2Launch\log\agent.log" `
         -Tail 200
      
      # Has UserData been marked “run-once”?
      Test-Path "C:\ProgramData\Amazon\EC2Launch\state\.run-once"
      
      # Re-arm UserData for next boot
      & "C:\Program Files\Amazon\EC2Launch\EC2Launch.exe" reset --schedule
      
# Check the rendered user-data (fetch an IMDSv2 token first)
$tk = Invoke-RestMethod -Method PUT `
   -Headers @{"X-aws-ec2-metadata-token-ttl-seconds"="60"} `
   -Uri "http://169.254.169.254/latest/api/token"
Invoke-WebRequest -Headers @{"X-aws-ec2-metadata-token"=$tk} `
   -Uri "http://169.254.169.254/latest/user-data"
Sim EC2 lifecycle S003 · 5/8

      Root cause

      What we found

      • The AMI was baked with Packer. The bake step ran the user-data during the bake (to pre-install agents). Packer didn't clean /var/lib/cloud/sem/ before aws ec2 create-image.
      • cloud-init on next boot saw the semaphore for “already-ran-this-instance” and skipped scripts-user.
      • Windows side: same idea but EC2Launch v2's .run-once flag survived because the bake skipped EC2Launch.exe sysprep.
      Gotcha: custom AMI bakes that run user-data during the bake (a common “burn-in” pattern) must clean cloud-init state and Windows EC2Launch state before the image snapshot, or no instance launched from the AMI ever runs user-data again.
IaC EC2 lifecycle S003 · 6/8

      Fix — Packer cleanup + Terraform AMI gate

      Packer provisioner (Linux)

      # last provisioner before snapshot
      provisioner "shell" {
        inline = [
          "sudo cloud-init clean --logs",
          "sudo rm -rf /var/lib/cloud/sem/* /var/lib/cloud/instance",
          "sudo rm -f /etc/machine-id && sudo touch /etc/machine-id",
          "sudo truncate -s 0 /etc/hostname",
          "sudo rm -rf /root/.ssh /home/ec2-user/.ssh"
        ]
      }

      Packer provisioner (Windows)

      provisioner "powershell" {
        inline = [
          "& 'C:\\Program Files\\Amazon\\EC2Launch\\EC2Launch.exe' reset",
          "& 'C:\\Program Files\\Amazon\\EC2Launch\\EC2Launch.exe' sysprep --shutdown"
        ]
      }

      Terraform AMI promotion gate

      # in tools-cicd: promotion job
      resource "aws_ssm_parameter" "prod_ami_id" {
        name  = "/gc/prod/ami/orders-api"
        type  = "String"
        value = data.aws_ami.candidate.id
  lifecycle {
    precondition {
      condition     = data.aws_ami.candidate.tags["PackerCleanupRun"] == "true"
      error_message = "AMI must be tagged PackerCleanupRun=true."
    }
  }
      }
      IaC gate: the bake job tags the AMI with PackerCleanupRun=true only after the cleanup step. Promotion to prod refuses without that tag.
Concept EC2 lifecycle S003 · 7/8

      Cheeky & prevention

      Cheeky #1

      Force user-data to re-run on next boot via SSM — no console:
      aws ssm send-command --document-name AWS-RunShellScript --parameters 'commands=["sudo cloud-init clean --logs && sudo cloud-init init"]'

      Cheeky #2

      Switch the bake from “run user-data” to a cfn-init-style metadata pull. Move agent installs into Image Builder components — AMI ships ready, user-data only does instance-specific config.

      Cheeky #3

      Test in the bake: add a Packer post-processor that launches the AMI in a sandbox subnet with a probe user-data; if probe doesn't run, fail the build.

      Prevent #1

      EventBridge rule on EC2 Instance State-change Notification with state=running and a Lambda that probes IMDS user-data & cloud-init status; emits CW custom metric UserDataExecuted=0/1.

      Prevent #2

      Config rule: ec2-instance-managed-by-systems-manager (catches the wider problem — if your bake breaks SSM agent registration too).

      Prevent #3

      Bake CI uploads a bake-report.json to a central bucket; the AMI promotion job validates the report contains cloud_init_clean: true.

Lab EC2 lifecycle S003 · 8/8

      Interactive lab

Lab S003: Diagnose user-data didn't run · simulated
Concept EC2 lifecycle S004 · 1/8

        Symptom — instance has the wrong identity from IMDS

        Observed

        • App calls aws sts get-caller-identity from inside the instance and gets old role FedAdmin — not the expected orders-api-task-role.
        • Result: writes to S3 fail with AccessDenied; reads from a different bucket succeed.
        • Started after the team replaced the instance profile via Terraform.

        Constraints

Launch path · ASG → Launch Template (just bumped to v7)
Old profile · ip-orders-api-v1 with role FedAdmin (yes, sloppy)
New profile · ip-orders-api-v2 with role orders-api-task-role
EC2 metadata cache · credentials cached by SDK for ~6 hr
Note: role creds vended through IMDS stay valid until they expire (sessions run ~6 h), so after a profile swap the old role's creds can keep working until the next rotation OR until the app/agent is restarted.
Visual EC2 lifecycle S004 · 2/8

        Profile vs role vs cached credentials

[Diagram] IAM control plane (gc-prod-app): profile v2 → orders-task-role; old profile still associated. Why instances kept the old role: Terraform changes the LT, which affects new launches only; existing instances retain the prior association. You need replace-iam-instance-profile-association per instance OR an ASG instance refresh triggered by the LT change. Plus: the SDK in-process cache must invalidate.
EC2 i-0xx + IMDS facts: role name = the profile's only role (1:1); creds rotate ~6 h before expiry; a profile flip propagates to IMDS in <2 s typically; SCP requires HttpTokens=required (IMDSv2), so any client that does GET-only fails after the SCP flip.
SDK / app process: IMDS ≠ SDK cached creds. IMDS says orders-task-role; the SDK still says FedAdmin (cached). Fix: restart the process OR force-rotate via re-association + ASG triggers=["launch_template"].
Three remediation patterns — pick by service shape:
  Pattern A · ASG instance refresh on LT trigger. Terraform: instance_refresh.triggers = ["launch_template"]. Profile change in LT → ASG rolls fleet → new launches use the new profile. Cattle, not pets; safest in prod.
  Pattern B · replace-association + restart. aws ec2 replace-iam-instance-profile-association ... then SSM Run Command: systemctl restart orders-api. Use when the fleet is small or a refresh is too disruptive.
  Pattern C · SDK self-validation. App on boot runs aws sts get-caller-identity, compares the ARN to expected, and panic-exits if divergent; ASG kills + replaces → aligns SDK with IMDS.
Synthetic monitor: Lambda calls get-caller-identity on every host every 1 min, emits CloudWatch metric InstanceRoleId per host; alarm on divergence from expected for >10 min.
EventBridge audit: rule on AssociateIamInstanceProfile + ReplaceIamInstanceProfileAssociation → Slack #iam-changes with caller + diff. Config rule: ec2-instance-profile-attached + name regex.
Sim EC2 lifecycle S004 · 3/8

        Hypotheses

# · Hypothesis · Disprove
H1 · Profile swap not yet applied to running instances · describe-iam-instance-profile-associations
H2 · Profile applied, but SDK cached old creds · IMDS shows new role; SDK shows old
H3 · Instance manually overrides creds via env · printenv | grep AWS_
H4 · App container is using an ECS task role, not the EC2 role · curl $AWS_CONTAINER_CREDENTIALS_RELATIVE_URI
H5 · Trust policy on the new role denies the AssumeRole · CloudTrail AssumeRole error
        Cheeky: the name of the role in IMDS is the source of truth. curl http://169.254.169.254/latest/meta-data/iam/security-credentials/ returns the role currently associated with the instance profile. If that's wrong, IMDS hasn't flipped yet.
        Gotcha: SDKs (especially older boto3, AWS Java v1) cache credentials in-process for the entire SDK lifetime once obtained, ignoring TTL refresh in some configs. Restart the app process or rotate the SDK cred provider.
Lab EC2 lifecycle S004 · 4/8

        Diagnose

        # 1. What does the API say is associated?
        aws ec2 describe-iam-instance-profile-associations \
          --filters Name=instance-id,Values=i-0xx \
          --query 'IamInstanceProfileAssociations[].{S:State,Arn:IamInstanceProfile.Arn}'
        
        # 2. What does IMDS say?
        TK=$(curl -s -X PUT http://169.254.169.254/latest/api/token \
           -H "X-aws-ec2-metadata-token-ttl-seconds: 60")
        curl -s -H "X-aws-ec2-metadata-token: $TK" \
          http://169.254.169.254/latest/meta-data/iam/security-credentials/
        
        # 3. SDK says what?
        aws sts get-caller-identity
        # 4. Force-rotate by re-associating profile
        ASSOC=$(aws ec2 describe-iam-instance-profile-associations \
          --filters Name=instance-id,Values=i-0xx \
          --query 'IamInstanceProfileAssociations[0].AssociationId' --output text)
        
        aws ec2 replace-iam-instance-profile-association \
          --association-id $ASSOC \
          --iam-instance-profile Name=ip-orders-api-v2
        
        # 5. Restart the app or SSM agent
        sudo systemctl restart orders-api
        sudo systemctl restart amazon-ssm-agent
        
        # 6. Confirm
        sleep 30 && aws sts get-caller-identity
Cheeky: if you must keep the app process alive, force a credential refresh by constructing a fresh SDK session/client — boto3 caches credentials on the Session object, and there's no reliably public API to refresh them in place.
Sim EC2 lifecycle S004 · 5/8

        Root cause

        Why the wrong creds persisted

        1. Terraform changed aws_iam_instance_profile from ip-orders-api-v1 to ip-orders-api-v2 on the launch template.
        2. Existing instances kept the old association — LT changes affect new launches only.
        3. The team did terraform apply and assumed the fleet refreshed. ASG only refreshes on instance-refresh or scale events.
        4. The SDK in the long-running app process had cached creds for the old role for hours.
        Gotcha: changing IAM roles via Terraform does not re-associate already-running instances. Either trigger an ASG instance-refresh, OR script replace-iam-instance-profile-association across the fleet.
IaC EC2 lifecycle S004 · 6/8

        Fix — force fleet refresh on profile change

        resource "aws_launch_template" "orders_api" {
          name_prefix = "orders-api-"
          iam_instance_profile { name = aws_iam_instance_profile.orders_v2.name }
          user_data = base64encode(templatefile("ud.sh.tftpl",{}))
          metadata_options {
            http_tokens                 = "required"
            http_put_response_hop_limit = 2
          }
          tag_specifications { resource_type="instance"; tags=local.tags }
        }
        
        resource "aws_autoscaling_group" "orders_api" {
          ...
          launch_template { id=aws_launch_template.orders_api.id; version="$Latest" }
          instance_refresh {
            strategy = "Rolling"
            preferences { min_healthy_percentage = 90 }
            triggers = ["launch_template"]   # <-- key
          }
        }

Why the version pin + refresh triggers matter

• With version = "$Latest", bumping the LT doesn't change the ASG resource at all, so Terraform never starts a refresh — pin version to the LT's latest_version so the bump registers as a change.
• instance_refresh then rolls the fleet on any LT version bump (incl. an instance profile swap); the triggers list lets you add further properties as refresh triggers.
• Combined with min_healthy_percentage = 90, the rollout is safe.
        IaC note: remember to lifecycle-block force_delete; some shops also gate on a checkov rule that requires http_tokens = "required" (IMDSv2) on every launch template — matches the SCP guardrail.
Concept EC2 lifecycle S004 · 7/8

        Cheeky & prevention

        Cheeky #1

        Use aws sts get-caller-identity output as the source of truth in app boot logs. If the assumed role doesn't match expected, panic-exit the process — let ASG kill and replace.

        Cheeky #2

One profile, multiple roles? Not possible. Instance profiles take exactly one role. Use an STS chain (orders-api-bootstrap-role → orders-api-task-role) for runtime privilege downgrade.

        Cheeky #3

        Set SDK AWS_METADATA_SERVICE_TIMEOUT=2 + AWS_METADATA_SERVICE_NUM_ATTEMPTS=3 so credential rotation issues fail loudly, not silently.

        Prevent

        Synthetic canary calls get-caller-identity every minute, emits a CW metric InstanceRoleId. Alarm if it diverges from expected for > 10 min.
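
A minimal sketch of that canary body, assuming the expected role name; the metric namespace and names are illustrative:

EXPECTED=orders-api-task-role
# assumed-role ARN looks like arn:aws:sts::666...:assumed-role/<role>/<session>
ACTUAL=$(aws sts get-caller-identity --query Arn --output text | cut -d/ -f2)
aws cloudwatch put-metric-data --namespace GC/Identity \
  --metric-name RoleMatchesExpected \
  --dimensions InstanceRole="$ACTUAL" \
  --value $([ "$ACTUAL" = "$EXPECTED" ] && echo 1 || echo 0)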

        Prevent

        EventBridge rule on AssociateIamInstanceProfile + ReplaceIamInstanceProfileAssociation → Slack #iam-changes.

        Prevent

        Config rule: ec2-instance-profile-attached, plus a custom rule that asserts the profile name matches expected per environment tag.

Lab EC2 lifecycle S004 · 8/8

        Interactive lab

Lab S004: Find the stuck instance profile · simulated
Concept EC2 lifecycle S005 · 1/8

          Symptom — Client.InvalidAMIID.NotFound

          Observed

          • ASG can't scale: An error occurred (InvalidAMIID.NotFound) when calling the RunInstances operation.
          • Same AMI ID worked yesterday. Console UI shows AMI in source acct but not in spoke.
          • Recent change: tools-cicd team rotated AMI bake pipeline; old AMIs deregistered.

          Constraints

Item · Value
Source acct · gc-tools-cicd (1212...)
Spoke acct · gc-prod-app (6666...)
AMI ID · ami-0abc123def456
Region scope · us-east-1 only (no copy to eu-west-1)
Note: AMI IDs are region-scoped. The same AMI copied to eu-west-1 will have a different ID.
Visual EC2 lifecycle S005 · 2/8

          AMI sharing topology

[Diagram] Source AMI lifecycle · spoke ASG (consumer) · SSM Parameter Store (the fix) · deregistered AMI (the bug).
gc-tools-cicd (121212121212) · us-east-1: CURRENT ami-0newxxx (AMI + snap + launch-perm + KMS); DEREGISTERED ami-0abc123def456 (orphan; audit trail in pipeline).
Pipeline retention contract: keep the latest 5 AMIs, plus any AMI referenced by a non-deleted LT in any spoke; deregister + tag-archive metadata to S3. Without the 'referenced-by-LT' rule, this scenario repeats.
Symptom signature: "An error occurred (InvalidAMIID.NotFound) when calling the RunInstances operation" — the same error for never-existed, deregistered, and no-permission. Disambiguate via CloudTrail in the source acct (DeregisterImage event).
SSM Parameter Store (gc-tools-cicd): /gc/prod/ami/orders-api and .../previous, RAM-shared, with EventBridge on change. Terraform reference (spoke side):
  data "aws_ssm_parameter" "ami" { name = "/gc/prod/ami/orders-api" }
  resource "aws_launch_template" "x" { image_id = data.aws_ssm_parameter.ami.value }
Bonus, native LT shorthand: image_id = "resolve:ssm:/gc/prod/ami/orders-api" (no data source needed; resolved at launch time).
DR region mirror: a Lambda subscribes to the ParameterChange event, copies the AMI to eu-west-1 with a cross-region snapshot copy, and writes /gc/prod/ami/orders-api in eu-west-1; spokes in eu-west-1 see the eu AMI ID transparently.
gc-prod-app (666...): ASG → LT → EC2.
Roll-forward / roll-back: roll-forward, the bake pipeline writes the new AMI to the param and the spoke Tf plan picks it up on the next apply (or daily refresh); roll-back, swap the value with /previous and start an instance refresh so the ASG kills new instances and replaces them with the prior AMI.
Health gate: instance_refresh with auto_rollback = true checks ELB target health during refresh; if < min_healthy_percentage, revert.
Sim EC2 lifecycle S005 · 3/8

          Hypotheses

# · Hypothesis · Disprove
H1 · AMI deregistered in source acct · aws ec2 describe-images --owners 1212... --image-ids ami-0xx
H2 · Launch permission revoked for spoke · describe-image-attribute --attribute launchPermission
H3 · Wrong region (LT in eu-west-1 referencing us-east-1 AMI) · region in LT vs ASG
H4 · Encrypted snapshot share missing (AMI is encrypted) · describe-snapshot-attribute
          Cheeky: the InvalidAMIID error message is the same whether the AMI never existed, was deregistered, or your account lacks permission. Add CloudTrail lookup of DeregisterImage in the source account to disambiguate fast.
Lab EC2 lifecycle S005 · 4/8

          Diagnose

          # 1. Does the AMI exist for the source owner?
          aws ec2 describe-images --owners 121212121212 \
            --image-ids ami-0abc123def456 \
            --query 'Images[].{ID:ImageId,State:State,Name:Name}' \
            --output table
          
          # 2. Was it deregistered?
          aws --profile gc-tools cloudtrail lookup-events \
            --lookup-attributes AttributeKey=ResourceName,AttributeValue=ami-0abc123def456 \
            --max-results 5 --query 'Events[].{T:EventTime,N:EventName,U:Username}'
          # 3. Resolve via SSM parameter (what should happen)
          aws ssm get-parameter --name /gc/prod/ami/orders-api \
            --query Parameter.Value --output text
          
          # 4. Check launch permission
          aws --profile gc-tools ec2 describe-image-attribute \
            --image-id ami-0abc123def456 --attribute launchPermission
          
          # 5. Check snapshot share for encrypted AMI
          aws --profile gc-tools ec2 describe-snapshot-attribute \
            --snapshot-id snap-0xx --attribute createVolumePermission
Sim EC2 lifecycle S005 · 5/8

          Root cause

          Story

          1. The bake pipeline retains the latest 5 AMIs and deregisters the rest.
          2. An ASG launch template was pinned (hard-coded) to the AMI ID, not an SSM parameter.
          3. When the pipeline rotated, the pinned AMI was deregistered.
          4. ASG tried to scale; RunInstances exploded with InvalidAMIID.NotFound.
Gotcha: “the AMI exists, just check the console” — if you're looking at the source account you may be misled: a deregistered AMI vanishes from the spoke view first while still showing in the owning account for a while.
IaC EC2 lifecycle S005 · 6/8

          Fix — SSM-parameter-driven AMI ID

          data "aws_ssm_parameter" "orders_ami" {
            name = "/gc/prod/ami/orders-api"
          }
          
          resource "aws_launch_template" "orders_api" {
            image_id = data.aws_ssm_parameter.orders_ami.value
            ...
          }
          
          # in tools-cicd: write parameter on every promotion
          resource "aws_ssm_parameter" "prod_ami" {
            name      = "/gc/prod/ami/orders-api"
            type      = "String"
            data_type = "aws:ec2:image"
            value     = aws_ami_copy.candidate.id
            overwrite = true
          }

          Cross-account read of the parameter

          • Use RAM to share the parameter to spoke accounts (or cross-account permissions on the SSM parameter for SSM Advanced — not available for Standard).
          • Spokes reference via shared parameter; Terraform provider supports this with a cross-account aws alias.
          IaC note: data_type = "aws:ec2:image" makes SSM validate the AMI ID format and existence at write-time — you can't accidentally write a typo.
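
The same write as a one-off CLI call, for reference (the data_type validation applies here too; the AMI ID is illustrative):

aws ssm put-parameter --name /gc/prod/ami/orders-api \
  --type String --data-type aws:ec2:image \
  --value ami-0newxxx --overwrite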
Concept EC2 lifecycle S005 · 7/8

          Cheeky & prevention

          Cheeky #1

          EC2 LaunchTemplate accepts resolve:ssm:/gc/prod/ami/orders-api directly in image_id — no data source needed.

          Cheeky #2

          Keep the prior AMI: bake step writes /gc/prod/ami/orders-api/previous. Roll-back is one parameter version flip + ASG instance-refresh.
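
A sketch of that roll-back; the ASG name is an assumption:

PREV=$(aws ssm get-parameter --name /gc/prod/ami/orders-api/previous \
  --query Parameter.Value --output text)
aws ssm put-parameter --name /gc/prod/ami/orders-api \
  --type String --data-type aws:ec2:image --value "$PREV" --overwrite
aws autoscaling start-instance-refresh \
  --auto-scaling-group-name orders-api-asg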

          Cheeky #3

          For DR region, mirror the parameter via Lambda in tools-cicd that runs on parameter change and copies to eu-west-1 with the eu-west-1 AMI ID.

          Prevent

          EventBridge on DeregisterImage; if the deregistered AMI is referenced by any LT (search via Config), page the team.

          Prevent

          Bake pipeline keeps last 5 AMIs plus any AMI referenced by a non-deleted LT (cross-account introspection).

          Prevent

          ASG instance-refresh with auto-rollback on health failure: rolling out a bad AMI auto-reverts.

Lab EC2 lifecycle S005 · 8/8

          Interactive lab

Lab S005: Trace the missing AMI · simulated
Concept EC2 lifecycle S006 · 1/8

            Symptom — instance has no public IP

            Observed

            • Bastion in DMZ subnet launches but only has private IP. No public IP assigned.
            • Vendor partner can't SSH from internet via the EIP that was supposed to be auto-attached.

            Constraints

Subnet · subnet-dmz-use1a (10.20.0.0/24)
Auto-assign IP · was set; SCP recently flipped it off org-wide
EIP allocation · requested in user-data via aws ec2 associate-address
Instance role · lacks ec2:AssociateAddress
Visual EC2 lifecycle S006 · 2/8

            How “public IP” actually works

[Diagram] Public IP options — pick by lifecycle:
  A · Subnet auto-assign: transient, changes on stop/start; SCP-blocked in prod
  B · Elastic IP: durable across stop/start; billable when not associated
  C · ENI + EIP: most stable for vendor IPs; survives ASG instance refresh
SCP guardrail (org-wide): NoAutoPublicIp. Implications: public IPs only via explicit EIP association; auditable (CloudTrail AssociateAddress event); predictable IP across stop/start; user-data must run associate-address itself; the instance role needs ec2:AssociateAddress.
Where it fails — sequence: EC2 bastion → role missing perm → EIP unassociated → vendor blocked.
Failure invariants in this org:
  1. SCP forces the explicit-EIP path
  2. user-data must call associate-address itself
  3. role permission must include ec2:AssociateAddress
  4. user-data must set -euo pipefail + probe at the end
  5. EventBridge alarm if PublicIpAddress is null after running
Recommended: replace the public bastion with SSM Session Manager + port forwarding (vendor → SAML → SSM tunnel → bastion → target). Why this is the right call:
  + no public IP → SCP & security review trivial
  + no SSH key distribution to vendors
  + every session logged in CloudTrail
  + sessions can be terminated centrally
  + no EIP quota issues
  + works across regions transparently
  − vendor needs aws-cli + Okta access
  − partner integrations may not support SSM (then use EIP+ENI)
            SimEC2 lifecycleS006 · 3/8

            Hypotheses

H1 · Subnet auto-assign disabled (SCP) · disprove: describe-subnets MapPublicIpOnLaunch
H2 · EIP not associated; user-data role missing the perm · disprove: cloud-init log + IAM SimulatePrincipalPolicy
H3 · EIPs exhausted (account quota) · disprove: describe-account-attributes
H4 · EIP allocated in a different region · disprove: describe-addresses --region eu-west-1
Cheeky: use aws iam simulate-principal-policy --policy-source-arn <role> --action-names ec2:AssociateAddress --resource-arns '*' to prove the permission without running anything.
Lab · EC2 lifecycle · S006 · 4/8

            Diagnose

            # 1. Subnet flag
            aws ec2 describe-subnets --subnet-ids subnet-dmz-use1a \
              --query 'Subnets[].{Auto:MapPublicIpOnLaunch,IPs:AvailableIpAddressCount}'
            
            # 2. Instance state
            aws ec2 describe-instances --instance-ids i-0xx \
              --query 'Reservations[].Instances[].{Pub:PublicIpAddress,Priv:PrivateIpAddress}'
            
            # 3. EIP available?
            aws ec2 describe-addresses \
              --query 'Addresses[?AssociationId==`null`].PublicIp'
            # 4. Did user-data fail silently?
sudo grep -i associate /var/log/cloud-init-output.log
            
            # 5. IAM perm proof
            aws iam simulate-principal-policy \
              --policy-source-arn arn:aws:iam::666...:role/bastion-role \
              --action-names ec2:AssociateAddress \
              --resource-arns arn:aws:ec2:us-east-1:666...:elastic-ip/eipalloc-0xx
            
            # 6. Manually associate (fix attempt)
            aws ec2 associate-address \
              --instance-id i-0xx --allocation-id eipalloc-0xx
Sim · EC2 lifecycle · S006 · 5/8

            Root cause

            1. Org SCP NoAutoPublicIp denies RunInstances with AssociatePublicIpAddress=true on ENIs.
            2. User-data tried to associate-address but the bastion role only had ec2:DescribeAddresses, not ec2:AssociateAddress.
            3. User-data didn't fail-fast on error (no set -e); the instance reached running with no public IP and no alarm.
            Gotcha: associate-address requires perm on the EIP allocation and on the instance ENI. Forgetting the ENI ARN is a frequent IAM cause.
IaC · EC2 lifecycle · S006 · 6/8

            Fix

            resource "aws_iam_role_policy" "bastion_eip" {
              role = aws_iam_role.bastion.id
              policy = jsonencode({
                Version="2012-10-17",
                Statement=[{
                  Effect="Allow",
                  Action=["ec2:AssociateAddress","ec2:DisassociateAddress",
                          "ec2:DescribeAddresses"],
                  Resource="*",
                  Condition={
                    StringEquals={"aws:ResourceTag/Role"="bastion"}
                  }
                }]
              })
            }
# user-data hardening — belongs on the launch template (a file provisioner
# only works inside a resource with a connection; user_data is the right home)
resource "aws_launch_template" "bastion" {
  # ...
  user_data = base64encode(<<-EOT
    #!/usr/bin/env bash
    set -euo pipefail
    TOKEN=$(curl -s -X PUT \
      http://169.254.169.254/latest/api/token \
      -H "X-aws-ec2-metadata-token-ttl-seconds: 60")
    INST=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
      http://169.254.169.254/latest/meta-data/instance-id)
    aws ec2 associate-address \
      --instance-id "$INST" --allocation-id ${var.eip_alloc_id}
  EOT
  )
}
            IaC note: use set -euo pipefail and probe at end — aws ec2 describe-instances --instance-ids $INST --query 'Reservations[].Instances[].PublicIpAddress' — if blank, exit 1 → ASG kills.
Concept · EC2 lifecycle · S006 · 7/8

            Cheeky & prevention

            Cheeky #1

            For bastions, prefer SSM Session Manager with port-forwarding — no public IP, no SSH key, fully audited.

            Cheeky #2

            Tag the EIP with Role=bastion + InstanceTag=bastion-prod. Use IAM aws:ResourceTag condition to scope ec2:AssociateAddress to only EIPs you own.

            Cheeky #3

            Avoid auto-assign public IP at the subnet level for any production tier — it's implicit and easy to leak. Always EIP+explicit assoc.

            Prevent

            EventBridge on EC2 Instance State-change Notification · running + Lambda asserts PublicIpAddress != null for tagged bastions.

            Prevent

            Synthetic canary: every 1 min, attempt nc -vz from external runner to bastion EIP:22; alarm on failure.
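A minimal canary sketch (the namespace, metric name, and EIP are illustrative; alarm on Sum >= 3 over 5 minutes):

#!/usr/bin/env bash
# cron on the external runner, every minute
EIP=203.0.113.10   # placeholder bastion EIP
if ! nc -vz -w 5 "$EIP" 22 >/dev/null 2>&1; then
  aws cloudwatch put-metric-data --namespace GC/Canary \
    --metric-name BastionUnreachable --value 1
fi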

            Prevent

            Config rule: elastic-ip-required-tags + custom rule that flags any unassociated EIP > 1 day old (cost).

Lab · EC2 lifecycle · S006 · 8/8

            Interactive lab

Lab S006: Make the bastion reachable (simulated)
Concept · EC2 lifecycle · S007 · 1/8

              Symptom — stop/start broke internal DNS

              Observed

• App was patched: stop → start. It came back with a new private IP (10.20.10.99 → 10.20.10.121).
              • Microservice orders-api.gcaws.internal still resolves to old IP for ~10 minutes.
              • Downstream services 5xx with connection refused.

              Constraints

DNS: R53 PHZ gcaws.internal (associated to the prod-app VPC)
Record: A orders-api → literal IP, TTL 60
Update path: manual today; nobody updated the record
App: Java app, DNS cached forever (default security policy)
Note: Java caches DNS per networkaddress.cache.ttl; with a security manager the default is to cache forever, until process restart. Add -Dnetworkaddress.cache.ttl=60 (legacy alias: -Dsun.net.inetaddr.ttl=60).
Visual · EC2 lifecycle · S007 · 2/8

              Why stop/start gets a new IP

Stop → start — the IP changes: running (.99) → stopped → running (.121), leaving the PHZ record and the JVM cache stale.

Why this is double trouble

1. The PHZ record is stale, so the resolver returns the old IP for ~minutes.
2. The JVM caches DNS for the lifetime of the process.
3. Even fixing the PHZ doesn't help apps that already cached; a restart is needed.

Better patterns — choose by shape

• Path A · ALB/NLB target group: orders-api.gcaws.internal → ALB alias.
• Path B · secondary ENI: PHZ → the ENI's stable private IP.
• Path C · auto-update via EventBridge: the single-instance escape hatch.
• Path D · ECS service discovery: the native ECS pattern.

JVM DNS cache — fix the second half of the bug

• JVM property at boot: -Dnetworkaddress.cache.ttl=60 -Dnetworkaddress.cache.negative.ttl=10, set in /etc/orders-api.env or the systemd unit.
• SDK/runtime: the AWS Java SDK v2 has its own caching layer; override via system property at JVM start. Node/Python typically use the OS resolver only.

Anti-patterns to avoid: an A-record to a literal private IP (we hit this); a CNAME chain to an instance hostname; stop/start in prod (replace via ASG instead).

Detection: a synthetic canary on every named PHZ entry. Layered: prevent (tflint) → detect (canary) → alert (EventBridge) → remediate (Config rule). Outcome: stop/start no longer triggers a 30-min outage on dependent services. SLO: DNS-target consistency < 60s, tracked alongside the availability SLO.
Sim · EC2 lifecycle · S007 · 3/8

              Hypotheses

H1 · PHZ record stale · disprove: list-resource-record-sets → compare to the instance IP
H2 · Client DNS cache (Java) · disprove: jcmd <pid> VM.system_properties | grep ttl
H3 · Connection pool pinned to an old socket · disprove: app metric / process restart fixes it
H4 · NLB cross-zone disabled, target re-registration delayed · disprove: describe-target-health
Cheeky: for any “DNS cache” suspicion, getent hosts orders-api.gcaws.internal on the host queries the resolver directly. If it's right but the app sees the old IP, it's the JVM/SDK cache.
Lab · EC2 lifecycle · S007 · 4/8

              Diagnose

              # 1. Current PHZ record
              aws route53 list-resource-record-sets \
                --hosted-zone-id Z0XXX \
                --query "ResourceRecordSets[?Name=='orders-api.gcaws.internal.']"
              
              # 2. Current instance IP
              aws ec2 describe-instances --instance-ids i-0xx \
                --query 'Reservations[].Instances[].PrivateIpAddress'
              
              # 3. From inside the host
              getent hosts orders-api.gcaws.internal
              dig +short orders-api.gcaws.internal
              # 4. Update the record (immediate fix)
              aws route53 change-resource-record-sets \
                --hosted-zone-id Z0XXX --change-batch file://upsert.json
              
              # 5. Force JVM to re-resolve (cheeky)
              sudo systemctl restart orders-api    # cleanest
              # or via JMX:
              jcmd <pid> VM.system_properties | grep -i ttl
              
              # 6. Confirm
              ss -tnp | grep orders-api    # new sockets to right IP
Sim · EC2 lifecycle · S007 · 5/8

              Root cause

              1. The team used a hand-managed PHZ A record pointing at a literal private IP.
              2. Stop/start changed the IP. Nobody updated DNS.
              3. Even after updating, the Java app cached the old DNS for the JVM lifetime.
              Gotcha: A-records to instance IPs are an anti-pattern at any scale. Service discovery (ECS / Cloud Map / R53 ARC), an ALB target group, or a secondary ENI is the right primitive.
IaC · EC2 lifecycle · S007 · 6/8

              Fix — ALB or auto-update Lambda

              Path A: front with ALB / NLB

              resource "aws_lb" "orders" { internal=true; load_balancer_type="application"; ... }
              resource "aws_lb_target_group" "orders" { ... }
              resource "aws_route53_record" "orders" {
                zone_id = data.aws_route53_zone.gcaws.zone_id
                name    = "orders-api"
                type    = "A"
                alias { name = aws_lb.orders.dns_name; zone_id = aws_lb.orders.zone_id; evaluate_target_health = true }
              }

              Path B: auto-update via EventBridge + Lambda

              resource "aws_cloudwatch_event_rule" "ec2_state" {
                event_pattern = jsonencode({
                  source      = ["aws.ec2"],
                  detail-type = ["EC2 Instance State-change Notification"],
                  detail      = { state=["running"] }
                })
              }
              resource "aws_lambda_function" "phz_updater" { ... }
              # Lambda reads instance tag DnsName, upserts PHZ record
              IaC note: Path A is preferred. Path B is the “single instance, no LB” escape hatch (e.g., bastion-style services). Both are codified.
Concept · EC2 lifecycle · S007 · 7/8

              Cheeky & prevention

              Cheeky #1

JVM DNS cache fix without a code change: call java.security.Security.setProperty("networkaddress.cache.ttl","60") at startup, or set it env-level via JAVA_OPTS=-Dsun.net.inetaddr.ttl=60.

              Cheeky #2

              Avoid stop/start on prod EC2 entirely — replace the instance via ASG instance-refresh. Cattle, not pets.

              Cheeky #3

              Need a stable IP without LB? Attach a secondary ENI you provision separately. ENI persists; primary IP is on the ENI; the ENI moves with the instance.
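A sketch of the provisioning side (subnet ID follows this playbook's naming; i-0xx is the usual placeholder):

# create the long-lived ENI once, then attach it to whichever instance is live
ENI=$(aws ec2 create-network-interface \
  --subnet-id subnet-priv-use1a --description "orders-api stable IP" \
  --query NetworkInterface.NetworkInterfaceId --output text)
aws ec2 attach-network-interface \
  --network-interface-id "$ENI" --instance-id i-0xx --device-index 1
# the PHZ A record targets the ENI's primary private IP, which never changes
aws ec2 describe-network-interfaces --network-interface-ids "$ENI" \
  --query 'NetworkInterfaces[].PrivateIpAddress'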

              Prevent

              Synthetic canary on every named PHZ entry — periodically validates DNS-vs-target IP. Alarm on divergence > 5 min.
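The divergence check itself is a few lines of bash — a sketch run from a host in the VPC, assuming one running instance per name and the DnsName tag convention from Path B:

DNS_IP=$(dig +short orders-api.gcaws.internal)
REAL_IP=$(aws ec2 describe-instances \
  --filters Name=tag:DnsName,Values=orders-api Name=instance-state-name,Values=running \
  --query 'Reservations[].Instances[].PrivateIpAddress' --output text)
[ "$DNS_IP" = "$REAL_IP" ] || echo "DIVERGED: PHZ=$DNS_IP instance=$REAL_IP"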

              Prevent

              Config rule: route53-records-only-pointing-to-running-resources (custom).

              Prevent

SCP doesn't directly help here. Lint rule in Terraform instead: forbid aws_route53_record resources with type=A and a literal records list — force the ALB-alias form.

Lab · EC2 lifecycle · S007 · 8/8

              Interactive lab

Lab S007: Find the stale DNS, fix it, prove cache (simulated)
Concept · EC2 lifecycle · S008 · 1/8

                Symptom — InsufficientInstanceCapacity

                Observed

• ASG can't scale: “We currently do not have sufficient c6i.4xlarge capacity in the AZ you requested (us-east-1a).”
                • ASG configured with single AZ + single instance type (legacy).
                • Happens at 9am traffic peak on weekdays.

                Constraints

ASG AZs: us-east-1a only (legacy)
Instance type: c6i.4xlarge only
Capacity reservation: none
SCP region lock: us-east-1, eu-west-1
Note: AWS doesn't guarantee on-demand capacity per AZ per type. The fix is diversification (mixed-instances) or an ODCR (On-Demand Capacity Reservation).
Visual · EC2 lifecycle · S008 · 2/8

                Capacity diversification topology

Today: pinned (capacity roulette) — ASG · 1 AZ · 1 type → ICE alarm.

Why bigger types are MORE constrained: AWS bin-packs hosts, and large types claim whole hosts, so "use larger types for headroom" backfires. The 9am peak hits the same shape across many tenants and the AZ saturates. Diversify down (more, smaller) or across families.

Diversified: mixed-instances policy — 3 AZs × {c6i, c6a, c5, m6i} = 12 capacity pools.

• Spot capacity-optimized: AWS picks the AZ+type with the most headroom.
• On-demand base of 4: the SLO floor is never on spot.
• Burst (4 → 40) flows into the best-available pool.
• Bonus: attribute-based selection (16 vCPU, 32 GiB) opens even more pools.

ODCR floor: floor + diversification = SLO defense. 4 ODCR seats form the SLO floor; burst on-demand flows into the 12-pool space; spot sits above on-demand for cost. ICE in the spot tier doesn't kill the SLO.

Detection & escalation: ASG event → EventBridge rule → enricher → page.

Annual capacity review (Q4): raise Service Quotas L-1216C47A (standard on-demand vCPUs); size the ODCR to next year's traffic forecast; review mixed-instances overrides for new types (e.g. c7i, m7i).

Pre-flight check (CI): aws ec2 get-spot-placement-scores --instance-types c6i.4xlarge ... --target-capacity 8 --region-names us-east-1 — a score < 7 in CI warns; choose an alternate region.

Custom Config rule (governance): prod ASGs must specify mixed_instances_policy with at least 3 overrides and at least 2 AZs in vpc_zone_identifier. Non-compliant resources are flagged in a dashboard and auto-ticketed.
Sim · EC2 lifecycle · S008 · 3/8

                Hypotheses

H1 · Genuine AZ capacity shortage at peak · disprove: StateReason + EventBridge ASG events
H2 · Account-level on-demand vCPU quota hit · disprove: Service Quotas L-1216C47A
H3 · Subnet IPs exhausted (looks similar) · disprove: describe-subnets AvailableIpAddressCount
H4 · SCP denies new types beyond the approved list · disprove: simulate run-instances
Cheeky: ICE errors are AZ-specific. Try the same instance type via aws ec2 run-instances --dry-run in each AZ — you'll see which AZ has capacity right now.
Lab · EC2 lifecycle · S008 · 4/8

                Diagnose

                # 1. ASG scaling activity
                aws autoscaling describe-scaling-activities \
                  --auto-scaling-group-name orders-asg --max-records 10 \
                  --query 'Activities[].{T:StartTime,S:StatusCode,M:StatusMessage}'
                
                # 2. Quota
                aws service-quotas get-service-quota \
                  --service-code ec2 --quota-code L-1216C47A
                
                # 3. Subnet IPs
                aws ec2 describe-subnets --subnet-ids subnet-priv-use1a \
                  --query 'Subnets[].AvailableIpAddressCount'
                # 4. Probe other AZs (dry-run)
                for az in us-east-1a us-east-1b us-east-1c; do
                  echo $az
                  aws ec2 run-instances --dry-run --instance-type c6i.4xlarge \
                    --image-id ami-0xx --subnet-id $(subnet_for $az) \
                    --query Errors --output text 2>&1 | head -2
                done
                
                # 5. ODCR check
                aws ec2 describe-capacity-reservations \
                  --filters Name=state,Values=active \
                  --query 'CapacityReservations[].{T:InstanceType,AZ:AvailabilityZone,Avail:AvailableInstanceCount}'
Sim · EC2 lifecycle · S008 · 5/8

                Root cause

                1. Genuine AWS capacity shortage in us-east-1a for c6i.4xlarge at 9am peak (regional event affecting many tenants).
                2. ASG was pinned to 1 AZ + 1 type for “deterministic placement” (legacy, no longer needed).
                3. No fallback type, no ODCR, no spot.
                Gotcha: “Use larger types for headroom” backfires — bigger types are more capacity-constrained. Diversify down (more, smaller) or across families.
IaC · EC2 lifecycle · S008 · 6/8

                Fix — mixed-instances + ODCR

                resource "aws_autoscaling_group" "orders" {
                  vpc_zone_identifier = local.private_subnets_3az
                  min_size=4; desired=8; max_size=40
                  mixed_instances_policy {
                    launch_template { launch_template_specification {
                      launch_template_id = aws_launch_template.orders.id; version="$Latest"
                    } }
                    instances_distribution {
                      on_demand_base_capacity                  = 4
                      on_demand_percentage_above_base_capacity = 50
                      spot_allocation_strategy                 = "capacity-optimized"
                    }
                    override { instance_type = "c6i.4xlarge" }
                    override { instance_type = "c6a.4xlarge" }
                    override { instance_type = "c5.4xlarge" }
                    override { instance_type = "m6i.4xlarge" }
                  }
                  capacity_reservation_specification {
                    capacity_reservation_preference = "open"
                  }
                }
                resource "aws_ec2_capacity_reservation" "orders_floor" {
                  instance_type = "c6i.4xlarge"
                  instance_platform = "Linux/UNIX"
                  availability_zone = "us-east-1a"
                  instance_count    = 4
                  end_date_type     = "unlimited"
                  instance_match_criteria = "open"
                  tags = local.tags
                }
IaC note: mixed-instances + ODCR is the classic “floor + diversification” pattern — the ODCR is the SLO floor; mixed-instances handles the burst.
Concept · EC2 lifecycle · S008 · 7/8

                Cheeky & prevention

                Cheeky #1

                Use attribute-based instance type selection (InstanceRequirements) instead of explicit type list — AWS picks any matching family; broadest capacity pool.
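To preview what a requirement spec opens up (16 vCPU / 32 GiB matches the c6i.4xlarge shape), one sketch:

# lists every type matching the attribute spec before you put it in the ASG
aws ec2 get-instance-types-from-instance-requirements \
  --architecture-types x86_64 --virtualization-types hvm \
  --instance-requirements 'VCpuCount={Min=16,Max=16},MemoryMiB={Min=32768}' \
  --query 'InstanceTypes[].InstanceType'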

                Cheeky #2

                Spot Placement Score API tells you which region/AZ has best spot capacity right now for your shape — pre-flight checker for big batch jobs.
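For example, scoring this scenario's shape across both approved regions:

aws ec2 get-spot-placement-scores \
  --instance-types c6i.4xlarge c6a.4xlarge c5.4xlarge m6i.4xlarge \
  --target-capacity 8 --single-availability-zone \
  --region-names us-east-1 eu-west-1 \
  --query 'SpotPlacementScores[].{AZ:AvailabilityZoneId,Score:Score}'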

                Cheeky #3

                Reserve 4 ODCR seats. ASG burst beyond into on-demand, then spot. ICE in spot tier doesn't kill SLO because the floor is reserved.

                Prevent

                CW alarm on ASG metric GroupPendingInstances > 0 for 5 min → PagerDuty.
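Wiring it up, as a sketch (alarm name and SNS topic are placeholders; group metrics must be enabled first):

# group metrics are off by default — enable, then alarm
aws autoscaling enable-metrics-collection \
  --auto-scaling-group-name orders-asg --granularity 1Minute \
  --metrics GroupPendingInstances
aws cloudwatch put-metric-alarm --alarm-name orders-asg-pending \
  --namespace AWS/AutoScaling --metric-name GroupPendingInstances \
  --dimensions Name=AutoScalingGroupName,Value=orders-asg \
  --statistic Maximum --period 60 --evaluation-periods 5 \
  --threshold 0 --comparison-operator GreaterThanThreshold \
  --alarm-actions arn:aws:sns:us-east-1:666...:pagerduty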

                Prevent

                Config + custom rule: ASGs in prod must specify mixed_instances_policy with at least 3 overrides.

                Prevent

                Annual capacity review in Q4 — quotas raised, ODCR sized to next year traffic forecast.

Lab · EC2 lifecycle · S008 · 8/8

                Interactive lab

Lab S008: Trace ICE and propose fix (simulated)
Concept · EC2 lifecycle · S009 · 1/8

                  Symptom — SCP denies RunInstances for missing CostCenter

                  Observed

• Terraform apply fails with UnauthorizedOperation — “explicit deny in a service control policy”.
                  • Console launch by FedAdmin also fails — same SCP applies.
                  • Other instances in the account work fine — only new launches fail.

                  Constraints

SCP: RequireTags on the Workloads OU
Required tags: CostCenter, Owner, Env
Tag enforcement: at RunInstances via aws:RequestTag/CostCenter
Bypass: none — the FedAdmin role does not bypass SCPs
Note: SCPs apply to all identities in member accounts, including the account root and admins. The only way around an SCP is to remove or change it at the org level.
Visual · EC2 lifecycle · S009 · 2/8

                  How SCP tag enforcement evaluates

RunInstances API call structure — the TagSpecifications array carries tags per resource type:
[{ ResourceType: instance, Tags: [...] }] · [{ ResourceType: volume, Tags: [...] }] ← often missing · [{ ResourceType: network-interface, Tags: [...] }]

SCP RequireTags statement (IAM policy JSON):
"Effect": "Deny", "Action": ["ec2:RunInstances", "rds:CreateDBInstance"], "Resource": "*",
"Condition": { "Null": { "aws:RequestTag/CostCenter": "true" } }

Where it fails: Terraform plan → RunInstances → deny on the volume TagSpecification → CloudTrail.

Common pitfalls

1. Tag present in the tags block but not on the 'instance' resource type.
2. Case mismatch: costcenter vs CostCenter.
3. Tag value disallowed by a tag policy's enforced_for.
4. The SCP fires on the volume too — the volume tag is missing.
5. Terraform default_tags don't propagate to Volume in older providers (< v5).

Fix — explicit tag_specifications for instance, volume, and network-interface, all sourced from a shared locals.tags module, plus a tflint custom rule (aws_resource_missing_tags with CostCenter/Owner/Env) as a CI gate: the PR fails before the SCP can deny in prod — faster feedback.

Bonus — Lambda enrichment for SCP errors: EventBridge listens for UnauthorizedOperation; a Lambda parses the resource ARN, looks up the SCP, and posts “you missed CostCenter on volume” to the dev's Slack — turning a generic deny into an actionable hint.
Sim · EC2 lifecycle · S009 · 3/8

                  Hypotheses

H1 · Tag missing entirely · disprove: compare the Terraform plan to the SCP
H2 · Tag on instance but not on volume · disprove: review tag_specifications
H3 · Case mismatch · disprove: SCP aws:RequestTag/CostCenter is case-sensitive
H4 · Tag value not in the allowed set (tag policy) · disprove: organizations describe-policy on the tag policy
H5 · SCP applies to the OU; account moved recently · disprove: list-parents + list-policies-for-target
Cheeky: aws iam simulate-principal-policy doesn't evaluate SCPs. Use IAM Access Analyzer policy validation + AWS Organizations list-policies-for-target to spot which SCPs are in scope before debugging.
Lab · EC2 lifecycle · S009 · 4/8

                  Diagnose

                  # 1. Show the failing API call from CloudTrail
                  aws cloudtrail lookup-events \
                    --lookup-attributes AttributeKey=EventName,AttributeValue=RunInstances \
                    --max-results 1 --query 'Events[].CloudTrailEvent' \
                    | jq '.[0] | fromjson | {errorCode, errorMessage}'
                  
                  # 2. Pull SCPs in scope
                  aws --profile gc-mgmt organizations list-policies-for-target \
                    --target-id ou-xxx --filter SERVICE_CONTROL_POLICY
                  # 3. Validate Terraform tag plan
                  terraform show -json tfplan | jq '.. | objects | select(.tag_specifications) | .tag_specifications'
                  
                  # 4. Test directly
                  aws ec2 run-instances --image-id ami-0xx --instance-type t3.micro \
                    --tag-specifications 'ResourceType=instance,Tags=[{Key=CostCenter,Value=ENG-100},{Key=Owner,Value=alice},{Key=Env,Value=dev}]' \
                    'ResourceType=volume,Tags=[{Key=CostCenter,Value=ENG-100}]' \
                    --dry-run
Sim · EC2 lifecycle · S009 · 5/8

                  Root cause

                  1. Terraform default_tags set CostCenter on the provider but the older AWS provider didn't propagate to Volume on RunInstances (only on the instance resource).
                  2. SCP RequireTags evaluates each TagSpecification separately — volume tag was missing → explicit deny.
                  3. Error message says “explicit deny in SCP” without naming the missing tag — misleading.
                  Gotcha: Terraform AWS provider < v5.0 doesn't propagate default_tags to all sub-resources. Upgrade or set tag_specifications explicitly per resource type.
IaC · EC2 lifecycle · S009 · 6/8

                  Fix — explicit tags everywhere

                  resource "aws_launch_template" "orders" {
                    ...
                    tag_specifications { resource_type="instance"; tags=local.tags }
                    tag_specifications { resource_type="volume";   tags=local.tags }
                    tag_specifications { resource_type="network-interface"; tags=local.tags }
                  }
                  
                  provider "aws" {
                    region = "us-east-1"
                    default_tags { tags = local.tags }
                  }
                  
                  locals {
                    tags = {
                      CostCenter = "ENG-100"
                      Owner      = "orders-team"
                      Env        = "prod"
                    }
                  }

                  Lint as a guardrail

                  # tflint plugin: aws-ruleset
                  plugin "aws" { enabled=true; version="0.30.0"; source="github.com/terraform-linters/tflint-ruleset-aws" }
                  rule "aws_resource_missing_tags" {
                    enabled = true
                    tags    = ["CostCenter", "Owner", "Env"]
                  }
                  IaC note: tflint runs in CI, fails the PR before SCP can deny in prod. Same tag list lives in tflint config + SCP — consider generating both from one HCL file.
Concept · EC2 lifecycle · S009 · 7/8

                  Cheeky & prevention

                  Cheeky #1

                  Use tag policies (separate from SCP) to enforce case: tag_key: { @@assign: "CostCenter" }. Org standardizes “CostCenter” (not costcenter).
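A sketch of the org-side wiring (policy name is hypothetical; the gc-mgmt profile is the one used on the diagnose slide):

# run from the management account; content uses tag-policy syntax
aws --profile gc-mgmt organizations create-policy \
  --type TAG_POLICY --name costcenter-casing \
  --description "Standardize the CostCenter key casing" \
  --content '{
    "tags": {
      "costcenter": {
        "tag_key": { "@@assign": "CostCenter" }
      }
    }
  }'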

                  Cheeky #2

                  SCP message is generic. Add a Lambda that listens on UnauthorizedOperation CloudTrail events, parses the SCP, posts the missing-tag hint to the developer.

                  Cheeky #3

                  Pre-merge: terraform plan + parse JSON, check every tag_specifications contains required tags. Fail PR with the missing tag named.
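One possible gate, as a jq sketch (assumes the standard plan-JSON layout for launch templates):

# every tag_specifications block must carry the three required keys
terraform show -json tfplan | jq -e '
  [ .resource_changes[]
    | select(.type == "aws_launch_template")
    | .change.after.tag_specifications[]?
    | select((["CostCenter","Owner","Env"] - ((.tags // {}) | keys)) | length > 0)
  ] | length == 0' >/dev/null || { echo "FAIL: required tag missing"; exit 1; }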

                  Prevent

                  Config rule: required-tags across resource types; non-compliance → auto-tag (where allowed) or remediation Lambda.

                  Prevent

                  Service Catalog product abstracts tag handling so app teams can't forget.

                  Prevent

Per-account repo pre-commit hook: requires CostCenter in locals.tf.

Lab · EC2 lifecycle · S009 · 8/8

                  Interactive lab

Lab S009: Find which tag the SCP wants (simulated)
Concept · EC2 lifecycle · S010 · 1/8

                    Symptom — SCP denies launch unless IMDSv2 required

                    Observed

                    • Legacy launch template still has HttpTokens=optional (allows IMDSv1).
                    • SCP denies RunInstances when ec2:MetadataHttpTokens != required.
                    • Old AMI also has SDK code that uses IMDSv1 only.

                    Constraints

                    • SCP applied to all Workloads OU accounts.
                    • Org-wide policy: IMDSv2 required from 2024-09.
                    • App written in 2017 with old AWS SDK.
                    Note: IMDSv2 = session-token + TTL. Mitigates SSRF that pulls credentials.
Visual · EC2 lifecycle · S010 · 2/8

                    IMDSv1 vs IMDSv2 flow

IMDSv1 — vulnerable: app → IMDSv1 → creds stolen.

• Default until 2019; legacy AMIs/SDKs require it.
• SSRF in a container app pivots to the host IMDS; with no proof-of-presence, localhost and SSRF are indistinguishable.
• Boto3 < 1.26 may not gracefully fall back to v2.

IMDSv2 — secure (session token + TTL). Hop-limit math (HttpPutResponseHopLimit):

• 1: the token never leaves the host.
• 2: a container on the host can read it (one hop); the org caps at 2 by default.
• 3+: pods on top of EKS may need this. Lower is better.

SCP RequireIMDSv2 (org-wide) statements: deny ec2:RunInstances if MetadataHttpTokens != required; deny ec2:ModifyInstanceMetadataOptions if the request weakens HttpTokens; cap HopLimit <= 2 across all paths.

Safe rollout: audit first (MetadataNoToken dashboard); flip via SSM after 7 days clean.

checkov pre-merge: CKV_AWS_79 (LT must require IMDSv2) and CKV_AWS_341 (HopLimit <= 2) fail the PR before the SCP can deny — faster feedback.

Default Host Management Configuration: SSM auto-attaches the agent + role org-wide. Set once in gc-mgmt; applies to all accounts; new accounts are auto-compliant, reducing drift.

Recovery if the app breaks: bake-time fix (upgrade the SDK in the golden AMI); runtime fix (aws ec2 modify-instance-metadata-options); roll back per instance if user-data still needs v1.

Bonus — instance metadata tags: metadata_options { instance_metadata_tags = "enabled" } lets the app read tags via IMDS without any IAM permission (useful for CostCenter / Env log decoration).

Probe from inside the instance:
TOKEN=$(curl -X PUT http://169.254.169.254/latest/api/token -H 'X-aws-ec2-metadata-token-ttl-seconds: 60')
curl -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/meta-data/
Sim · EC2 lifecycle · S010 · 3/8

                    Hypotheses

H1 · LT has http_tokens=optional · disprove: describe-launch-template-versions
H2 · App SDK too old for IMDSv2 · disprove: SDK version table check
H3 · Container hop limit not 2 (ECS/k8s) · disprove: http_put_response_hop_limit
H4 · SCP not applied to this account; some other deny · disprove: list-policies-for-target
Cheeky: AWS publishes a CloudWatch metric MetadataNoToken per instance — non-zero means something is still doing IMDSv1. Hunt those before flipping the SCP.
Lab · EC2 lifecycle · S010 · 4/8

                    Diagnose

                    # 1. LT current setting
                    aws ec2 describe-launch-template-versions \
                      --launch-template-id lt-0xx --versions '$Latest' \
                      --query 'LaunchTemplateVersions[].LaunchTemplateData.MetadataOptions'
                    
                    # 2. Per-instance audit
                    aws ec2 describe-instances --instance-ids i-0xx \
                      --query 'Reservations[].Instances[].MetadataOptions'
                    
                    # 3. Find IMDSv1 callers across fleet
aws cloudwatch get-metric-statistics --namespace AWS/EC2 \
  --metric-name MetadataNoToken --dimensions Name=InstanceId,Value=i-0xx \
  --start-time "$(date -u -d '1 hour ago' +%FT%TZ)" \
  --end-time "$(date -u +%FT%TZ)" --period 60 --statistics Sum
                    # 4. Update LT (new version)
                    aws ec2 create-launch-template-version \
                      --launch-template-id lt-0xx \
                      --source-version '$Latest' \
                      --launch-template-data '{
                        "MetadataOptions":{
                          "HttpTokens":"required",
                          "HttpEndpoint":"enabled",
                          "HttpPutResponseHopLimit":2
                        }
                      }'
                    
                    # 5. Live-modify existing instances
                    aws ec2 modify-instance-metadata-options --instance-id i-0xx \
                      --http-tokens required --http-put-response-hop-limit 2
Sim · EC2 lifecycle · S010 · 5/8

                    Root cause

                    1. Legacy LT was forked from a 2019 base; http_tokens defaulted to optional (IMDSv1+v2).
                    2. SCP added 2024-09 enforces required → new launches fail.
3. The fix has two parts: the LT setting, and an in-app SDK upgrade so IMDSv2 works.
                    Gotcha: setting http_tokens=required immediately breaks any app still using IMDSv1. Run the audit first; flip after no MetadataNoToken metric for 7 days.
IaC · EC2 lifecycle · S010 · 6/8

                    Fix

                    resource "aws_launch_template" "orders" {
                      metadata_options {
                        http_endpoint               = "enabled"
                        http_tokens                 = "required"   # IMDSv2
                        http_put_response_hop_limit = 2
                        instance_metadata_tags      = "enabled"
                      }
                    }

                    Org-wide checkov rule

                    # .checkov.yml
                    check:
                      - CKV_AWS_79  # EC2 should require IMDSv2
                      - CKV_AWS_341 # LT hop_limit <= 2
                    IaC note: shift left — checkov in PR catches the violation before SCP does. Faster feedback to devs.
Concept · EC2 lifecycle · S010 · 7/8

                    Cheeky & prevention

                    Cheeky #1

                    Run a fleet-wide modify-instance-metadata-options in a maintenance window via SSM Automation document. No restart needed.
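The loop the Automation document would wrap, as a sketch:

# find running instances still on optional tokens, then flip them in place
for id in $(aws ec2 describe-instances \
    --filters Name=metadata-options.http-tokens,Values=optional \
              Name=instance-state-name,Values=running \
    --query 'Reservations[].Instances[].InstanceId' --output text); do
  aws ec2 modify-instance-metadata-options --instance-id "$id" \
    --http-tokens required --http-put-response-hop-limit 2
done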

                    Cheeky #2

                    For containerized workloads with hop limit issues: set hop limit to 1 if IMDS shouldn't reach pods (most secure), or 2 if needed for ECS task role pickup.

                    Cheeky #3

                    Use instance metadata tags (instance_metadata_tags=enabled) so apps can read tags without IAM perms — great for cost-center decoration in logs.

                    Prevent

                    CW dashboard tracks org-wide MetadataNoToken sum; alarm if any account has >0 over rolling 7 days.

                    Prevent

                    Config rule ec2-imdsv2-check flags non-compliant instances/LTs.

                    Prevent

                    SCP also denies ec2:ModifyInstanceMetadataOptions with HttpTokens=optional in request — can't weaken once enforced.

Lab · EC2 lifecycle · S010 · 8/8

                    Interactive lab

Lab S010: Audit + flip IMDSv2 (simulated)
Concept · EC2 lifecycle · S011 · 1/8

                      Symptom — restored EBS volume won't attach

                      Observed

• DR drill: restoring RDS isn't feasible; instead, restore an EBS snapshot and attach it to a new instance.
                      • aws ec2 attach-volume fails: InvalidVolume.ZoneMismatch or just times out.
                      • Sometimes AccessDenied on KMS for the volume created in another region.

                      Constraints

                      • Snapshot owned by gc-prod-data (us-east-1).
                      • DR region: eu-west-1.
                      • Volume must attach to instance in eu-west-1c.
                      • Encrypted with regional KMS CMK (us-east-1 only).
Visual · EC2 lifecycle · S011 · 2/8

                      Snapshot copy + KMS re-encrypt path

Source region (us-east-1): source DB → snap-src (us KMS), via DLM / AWS Backup. The DLM cross-region copy rule: target_region = eu-west-1, encrypted = true, cmk_arn = arn:aws:kms:eu-west-1:777...:key/orders-data-dr, retain_rule = { interval=30, interval_unit=DAYS }. copy-snapshot is async, re-encrypts with the target key, and preserves tags including SourceVolumeAz.

DR region (eu-west-1): snap-dst (eu KMS) → vol (1c) → EC2 (1c) → recovery plan.

AZ matching contract

• EBS volumes are AZ-locked; they cannot move between AZs.
• create-volume picks the AZ; the default is the region's first AZ if not specified.
• The instance must launch in the same AZ — no exceptions.
• The SourceVolumeAz tag on the snapshot tells the DR runbook the target AZ.
• The default AZ choice in a DR runbook is the most common gotcha.

AZ mismatch — the failure: vol in 1a, EC2 in 1c. Fix sequence:

1. Read the SourceVolumeAz tag from the snapshot.
2. Check capacity in the target AZ via Spot Placement Score.
3. create-volume --availability-zone eu-west-1c.
4. Launch the instance in the same AZ.
5. Attach — success in seconds.

DR runbook — SSM Automation document GC-DR-RestoreEBS: start → read tags → SPS check → create-vol → launch → attach → validate. A quarterly game-day drill runs the same doc; failure auto-creates a Jira ticket; the RTO metric is tracked in a CW dashboard.

Cheeky tricks: Fast Snapshot Restore (FSR) pre-warms the volume so first reads aren't throttled — great for DR drills; it's billed per AZ, so enable just before the drill and disable after. A Service Catalog promotion product validates snapshot shared, KMS shared, and the AZ tag, and refuses to promote a snapshot without SourceVolumeAz — catching the AZ-mismatch class at promote time. For hot-standby DR, KMS multi-region replica keys put the same key material in source + DR with no re-encrypt step: faster DR, but a wider key blast radius.
Sim · EC2 lifecycle · S011 · 3/8

                      Hypotheses

H1 · Volume in a different AZ than the instance · disprove: describe-volumes AZ vs instance AZ
H2 · KMS key region mismatch · disprove: describe-volumes KmsKeyId region
H3 · Snapshot not yet completed · disprove: describe-snapshots Progress
H4 · Snapshot not shared cross-account · disprove: describe-snapshot-attribute createVolumePermission
Cheeky: use EBS Fast Snapshot Restore when DR drilling — it pre-warms the volume so first reads aren't I/O-throttled.
Lab · EC2 lifecycle · S011 · 4/8

                      Diagnose

                      # 1. Volume + instance AZ
                      aws --region eu-west-1 ec2 describe-volumes \
                        --volume-ids vol-0xx --query 'Volumes[].{AZ:AvailabilityZone,KMS:KmsKeyId}'
                      aws --region eu-west-1 ec2 describe-instances \
                        --instance-ids i-0xx --query 'Reservations[].Instances[].Placement.AvailabilityZone'
                      
                      # 2. Re-create volume in correct AZ
                      aws --region eu-west-1 ec2 create-volume \
                        --snapshot-id snap-dst --availability-zone eu-west-1c \
                        --volume-type gp3 --encrypted --kms-key-id alias/eu-data
                      # 3. Snapshot progress
                      aws --region eu-west-1 ec2 describe-snapshots \
                        --snapshot-ids snap-dst --query 'Snapshots[].{P:Progress,S:State,K:KmsKeyId}'
                      
                      # 4. Cross-account share check
                      aws --profile gc-prod-data ec2 describe-snapshot-attribute \
                        --snapshot-id snap-src --attribute createVolumePermission
                      
                      # 5. Attach
                      aws --region eu-west-1 ec2 attach-volume \
                        --volume-id vol-new --instance-id i-0xx --device /dev/sdf
Sim · EC2 lifecycle · S011 · 5/8

                      Root cause

                      1. Snapshot copy from us-east-1 → eu-west-1 used account default CMK in eu-west-1, not the team's named key.
                      2. Volume created in eu-west-1a (default), instance launched in eu-west-1c (chosen for capacity) → AZ mismatch.
                      3. The DR runbook didn't pin AZ explicitly.
                      Gotcha: EBS volumes are AZ-locked. Always create the volume in the same AZ as the target instance, or move the instance to where the volume lives.
IaC · EC2 lifecycle · S011 · 6/8

                      Fix — codify the DR restore

                      # SSM Automation document (Terraform-managed)
                      resource "aws_ssm_document" "dr_restore_volume" {
                        name          = "GC-DR-RestoreEBS"
                        document_type = "Automation"
                        content       = file("docs/dr-restore-ebs.yaml")
                      }
                      # dr-restore-ebs.yaml (excerpt)
                      parameters:
                        SnapshotId: { type: String }
                        TargetAz:   { type: String, default: "eu-west-1c" }
                        KmsKeyId:   { type: String, default: "alias/eu-data" }
                      mainSteps:
                        - name: copy
                          action: aws:executeAwsApi
                          inputs: { Service: ec2, Api: CopySnapshot, ... }
                        - name: wait
                          action: aws:waitForAwsResourceProperty
                        - name: create_volume
                          action: aws:executeAwsApi
                          inputs: { Api: CreateVolume,
                            AvailabilityZone: "{{ TargetAz }}" }
                      # DLM lifecycle policy
                      resource "aws_dlm_lifecycle_policy" "orders_data" {
                        description = "orders-data daily snap + DR copy"
                        state       = "ENABLED"
                        policy_details {
                          schedule {
                            cross_region_copy_rule {
                              target = "eu-west-1"
                              encrypted = true
                              cmk_arn   = aws_kms_alias.eu_data.target_key_arn
retain_rule {
  interval      = 30
  interval_unit = "DAYS"
}
                            }
                          }
                        }
                      }
                      IaC note: DLM with cross-region copy + named KMS key + DR restore SSM doc — the whole DR path is reproducible. Annual game-day exercises validate.
Concept · EC2 lifecycle · S011 · 7/8

                      Cheeky & prevention

                      Cheeky #1

                      Tag the snapshot with SourceVolumeAz; the DR doc reads it and tries to match the target AZ first.

                      Cheeky #2

                      Use EBS Multi-Attach (io1/io2) only with apps that support distributed locking; otherwise corruption.

                      Cheeky #3

                      Convert old gp2 to gp3 for free baseline IOPS bump — one modify-volume call, no downtime.
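The call itself, with the usual vol-0xx placeholder:

# one call; the volume stays attached and online during the modification
aws ec2 modify-volume --volume-id vol-0xx --volume-type gp3
aws ec2 describe-volumes-modifications --volume-ids vol-0xx \
  --query 'VolumesModifications[].{State:ModificationState,Pct:Progress}'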

                      Prevent

                      Quarterly DR game-day uses the SSM Automation doc end-to-end. Failure auto-creates Jira ticket.

                      Prevent

                      Config rule: ebs-snapshot-public-restorable-check + custom rule for cross-region copy presence.

                      Prevent

                      Backup & DR tag enforced via SCP — instances missing BackupPolicy tag get denied at launch.

Lab · EC2 lifecycle · S011 · 8/8

                      Interactive lab

Lab S011: Restore EBS in DR region (simulated)
Concept · EC2 lifecycle · S012 · 1/8

                        Symptom — app OOM seconds after launch

                        Observed

                        • Java app crashes with OOM 30s after instance running.
                        • EC2 type c6i.xlarge (4 vCPU, 8 GiB).
                        • Root volume 8 GiB — full after agent installs.
                        • No swap configured.

                        Constraints

                        • Java -Xmx not set; defaults to 25% of memory.
                        • SCP forbids ebs-optimized=false.
                        • Org default ebs-encryption-by-default=true.
Visual · EC2 lifecycle · S012 · 2/8

                        Memory + disk picture

Today — 8 GiB root, no swap, JVM defaults.

Failure timeline

• T+0: instance running, app starting.
• T+10s: CW agent + SSM agent loaded (~600 MiB).
• T+15s: Java loads, heap grows toward Xmx.
• T+20s: RAM saturated, kernel paging blocked (no swap).
• T+30s: the Linux OOM-killer kills java — not a Java OOM.

Recommended — sized + bounded + observed: 30 GiB root, 20 GiB log volume, JVM Xmx 5g, 2G swap.

• 30 GiB root: 4× headroom for agents + tmp + crash dumps.
• /var/log on its own volume: it can never fill root.
• JVM bounded with ExitOnOutOfMemoryError: the app exits cleanly and the ASG replaces it.
• Swap as a soft buffer: absorbs transient spikes without the OOM-killer.
• gp3 baseline IOPS prevents the 'mysterious p99 spike' class.

Root volume budget: SSM agent + CW agent 600 MiB · OS base (AL2023) 2 GiB · app jar + libs 800 MiB · /tmp 1 GiB (use tmpfs!) · /var/cache + /var/lib 2 GiB — total ~6 GiB, so an 8 GiB root at a 25% margin is always full.

gp2 burst trap: 8 GiB gp2 = 100 IOPS baseline. It bursts to 3000 IOPS but earns credits at only 100/s, so sustained I/O exhausts BurstBalance → throttle → mystery p99.

Linux OOM-killer vs Java OutOfMemoryError — identify which fired: kernel log (oom-kill + pid) vs app log + .hprof dump.

CW agent metrics: mem_used_percent, mem_available, disk_used_percent (per filesystem), swap_used_percent. Alarms: FilesystemUsedPct > 80 for / and /var/log; mem_used_percent > 90 for 2 min; EBS BurstBalance < 10% (gp2 only).

Pre-deploy unit test: spin up in a sandbox subnet, run the app + chaos load for 5 min, assert no OOM and no filesystem > 80%.

tmpfs trick: tmpfs /tmp tmpfs size=512m,nodev,nosuid 0 0 — /tmp lives in RAM with a cap, so tmp-file bombs can't fill root.

cgroup-aware Java (containerized): Java 11+ honors the cgroup memory limit automatically; for ECS/EKS, ditch swap and cap at the task-definition memory — the JVM derives Xmx from the cgroup limit.

gp3 IOPS bump: baseline 3000 IOPS / 125 MiB/s is free; bump to 5000 IOPS at $0.005/IOPS-hour — a cheap p99 win for IO-heavy workloads.
Sim · EC2 lifecycle · S012 · 3/8

                        Hypotheses

H1 · Root volume too small / full · disprove: df -h /
H2 · JVM heap default too small or too large · disprove: jcmd VM.flags
H3 · EBS burst credits exhausted (gp2) · disprove: CW BurstBalance metric
H4 · OOM-killer killed the app, not a Java OOM · disprove: dmesg | grep -i oom
Cheeky: “Java OOM” vs “Linux OOM” are different. The JVM throws OutOfMemoryError; the kernel logs oom-kill. Always check dmesg + journalctl -k.
Lab · EC2 lifecycle · S012 · 4/8

                        Diagnose

                        # 1. Disk + memory
                        df -h /
                        free -m
                        swapon --show
                        
                        # 2. JVM flags
                        sudo -u app jps -l
                        sudo -u app jcmd <pid> VM.flags | grep -E 'MaxHeap|MinHeap|UseG1'
                        sudo -u app jcmd <pid> VM.system_properties | grep mx
                        # 3. EBS burst (gp2)
aws cloudwatch get-metric-statistics --namespace AWS/EBS \
  --metric-name BurstBalance --dimensions Name=VolumeId,Value=vol-0xx \
  --start-time "$(date -u -d '1 hour ago' +%FT%TZ)" \
  --end-time "$(date -u +%FT%TZ)" --period 60 --statistics Minimum
                        
                        # 4. Linux OOM-kill
                        sudo dmesg | grep -i 'killed process'
                        sudo journalctl -k --since "1 hour ago" | grep -i oom
Sim · EC2 lifecycle · S012 · 5/8

                        Root cause

                        1. Root EBS 8 GiB full after CW Agent + SSM + Java + tmp.
                        2. Logs filled remaining space within minutes; jvm.log redirected to /var/log.
                        3. JVM tried to allocate heap; fork/native allocation hit ENOMEM → Linux OOM-kill (not Java OOM).
                        Gotcha: on small root volumes, journalctl + log rotation lag means /var fills quickly. Pin a separate volume for /var/log.
IaC · EC2 lifecycle · S012 · 6/8

                        Fix

                        resource "aws_launch_template" "orders" {
                          ebs_optimized = true
                          block_device_mappings {
                            device_name = "/dev/xvda"
                            ebs { volume_size=30; volume_type="gp3"; iops=3000; throughput=125; encrypted=true }
                          }
                          block_device_mappings {
                            device_name = "/dev/sdb"
                            ebs { volume_size=20; volume_type="gp3"; encrypted=true }   # /var/log
                          }
                          user_data = base64encode(file("ud.sh"))   # mounts + JVM tuning
                        }
                        # ud.sh excerpt
                        mkfs.xfs /dev/nvme1n1
                        mount /dev/nvme1n1 /var/log
                        echo "/dev/nvme1n1 /var/log xfs defaults,nofail 0 2" >> /etc/fstab
                        
                        # swap
                        fallocate -l 2G /swapfile && chmod 600 /swapfile
                        mkswap /swapfile && swapon /swapfile
                        
                        # JVM
                        echo 'JAVA_OPTS="-Xms4g -Xmx5g -XX:+UseG1GC -XX:+ExitOnOutOfMemoryError"' \
                          >> /etc/orders-api.env
                        IaC note: use volume_type=gp3 uniformly; gp2's burst credits are a frequent source of mysterious p99 spikes.
Concept · EC2 lifecycle · S012 · 7/8

                        Cheeky & prevention

                        Cheeky #1

                        For containerized apps: ditch swap; cap memory at the cgroup level. Java 11+ honors cgroup memory automatically.

                        Cheeky #2

                        Use tmpfs for /tmp with size cap — prevents tmp file bombs from filling root.

                        Cheeky #3

                        Pre-warm gp3: 3000 IOPS / 125 MB/s baseline is free; bump to 5000 IOPS at $0.005 per IOP-hour. Cheap p99 win.

                        Prevent

                        CW agent installs diskspace+swap custom metrics; alarm on FilesystemUsedPct > 80 for / and /var/log.

                        Prevent

                        SSM Compliance state pack — instance must report log-rotate active.

                        Prevent

                        Pre-deploy unit test: spin instance with target user-data in sandbox; run app + chaos load; assert no OOM in first 5 min.

Lab · EC2 lifecycle · S012 · 8/8

                        Interactive lab

Lab S012: Find what killed the app (simulated)
Concept · EC2 lifecycle · S013 · 1/8

                          Symptom — spot interruption causes ALB 5xx burst

                          Observed

                          • ALB target group sees ~30s of 5xx during spot interruption.
                          • Instance vanishes with no graceful drain.
                          • Customer error rate breached SLO.

                          Constraints

                          • Spot fleet 60% of ASG.
                          • 2-min interruption notice via IMDS.
                          • ALB deregistration delay default 300s.
Visual · EC2 lifecycle · S013 · 2/8

                          Spot interruption flow vs graceful drain

Without the hook — the 5xx pattern: spot EC2 → interruption notice → instance yanked → 5xx burst → user errors.

Why the ALB keeps routing

• Health-check probes run on intervals (10-30s); a target is marked unhealthy only after N consecutive failures.
• Cross-zone LB keeps sending traffic until the target is removed.
• The default deregistration_delay of 300s is far longer than the 2-min spot notice.

With hook + NTH (Node Termination Handler): spot EC2 → NTH → hook → drain ~30s → clean.

• aws/aws-node-termination-handler: OSS, AWS-maintained. EC2: systemd service; k8s: DaemonSet.
• Two modes: IMDS polling, or EventBridge + SQS queue. Queue mode survives instance loss and is preferred at scale.

Spot interruption signals — layered detection: IMDS poll → EventBridge → SQS consumer → drain.

ASG capacity rebalance: capacity_rebalance = true makes AWS proactively replace at-risk spot instances, reducing interruption frequency versus replace-on-kill.

Allocation strategies: capacity-optimized picks the AZ/type with the most headroom; capacity-optimized-prioritized honors override priority. Lowest interruption rate — recommended over price-capacity-optimized.

Mix base + spot: on_demand_base is the SLO floor; spot rides above it for cost. The SLO is never on spot alone.

Game-day chaos test: FIS action aws:ec2:send-spot-instance-interruptions triggers a real interruption notice on a real instance. Validate: 0 5xx, drain < 90s, ALB target healthy fast.

CW alarm wiring: HTTPCode_ELB_5XX_Count > baseline + 3σ, per AZ (spot interruptions cluster by AZ); compare against the spot interruption metric.

Pre-deploy gate (CI): a Spot Placement Score > 7 is required for prod; if lower, fail the PR or constrain to a lower spot ratio.
Sim · EC2 lifecycle · S013 · 3/8

                          Hypotheses

H1 · No ASG lifecycle hook for terminate · disprove: describe-lifecycle-hooks
H2 · Hook exists but no handler subscribed · disprove: is an EventBridge target wired?
H3 · ALB deregistration_delay too long; instance gone before drain · disprove: TG attribute
H4 · Health check passes but the pool keeps dead conns · disprove: keep-alive timeout vs idle timeout
Cheeky: run the AWS Node Termination Handler (systemd service on plain EC2, DaemonSet on k8s) — it polls IMDS for the interruption notice and drains for you.
Lab · EC2 lifecycle · S013 · 4/8

                          Diagnose

                          # 1. Lifecycle hooks on the ASG
                          aws autoscaling describe-lifecycle-hooks --auto-scaling-group-name orders-asg
                          
                          # 2. ALB TG drain
                          aws elbv2 describe-target-group-attributes --target-group-arn arn:...:targetgroup/orders/...
                          # expect deregistration_delay.timeout_seconds <= 60 for most apps
                          
                          # 3. Spot interruption history
                          aws ec2 describe-spot-instance-requests --filters Name=state,Values=closed
                          # 4. Listen for interruption from inside instance
                          TOKEN=$(curl -s -X PUT http://169.254.169.254/latest/api/token \
                            -H "X-aws-ec2-metadata-token-ttl-seconds: 60")
                          curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
                            http://169.254.169.254/latest/meta-data/spot/instance-action
                          
                          # 5. Run NTH locally
                          sudo systemctl status aws-node-termination-handler
Sim · EC2 lifecycle · S013 · 5/8

                          Root cause

                          1. ASG had no terminate lifecycle hook for spot interruption.
2. Spot notice fired at T-2min; the instance was reclaimed at T-0 while the ALB kept routing to it for ~30s.
                          3. App didn't shed connections proactively.
                          Gotcha: spot interruption is faster than ASG termination — EC2 Spot Instance Interruption Warning fires at T-2min, but the instance dies at T-0 regardless of ASG hook delay.
IaC · EC2 lifecycle · S013 · 6/8

                          Fix — ASG hook + NTH + ALB tune

                          resource "aws_autoscaling_lifecycle_hook" "terminate" {
                            name                   = "orders-terminate"
                            autoscaling_group_name = aws_autoscaling_group.orders.name
                            lifecycle_transition   = "autoscaling:EC2_INSTANCE_TERMINATING"
                            default_result         = "CONTINUE"
                            heartbeat_timeout      = 90
                            notification_target_arn = aws_sns_topic.lifecycle.arn
                            role_arn               = aws_iam_role.lifecycle.arn
                          }
                          resource "aws_lb_target_group" "orders" {
                            ...
                            deregistration_delay = 30
                          }
                          # NTH on each instance (DaemonSet for k8s, systemd for plain EC2)
                          provisioner "file" {
                            destination = "/etc/systemd/system/aws-node-termination-handler.service"
                            content     = file("nth.service")
                          }
                          IaC note: NTH listens on IMDS for interruption + ASG lifecycle SQS; calls deregister + waits drain; completes lifecycle action.
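For orientation, here is a minimal sketch of the sequence a hook-driven drain handler performs — NTH automates the same steps. The TG ARN, hook name, and timings are illustrative, not prescribed values:

#!/usr/bin/env bash
# drain.sh — sketch of the drain sequence on spot notice / lifecycle hook
set -euo pipefail
TOKEN=$(curl -s -X PUT http://169.254.169.254/latest/api/token \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 60")            # IMDSv2 (SCP requires it)
ID=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
  http://169.254.169.254/latest/meta-data/instance-id)
TG=arn:aws:elasticloadbalancing:us-east-1:111111111111:targetgroup/orders/abc123  # illustrative

# 1. Pull ourselves out of the ALB first — stops new connections immediately
aws elbv2 deregister-targets --target-group-arn "$TG" --targets Id="$ID"

# 2. Wait for in-flight requests to drain (bounded: spot gives ~120s total)
for i in {1..18}; do
  STATE=$(aws elbv2 describe-target-health --target-group-arn "$TG" \
    --targets Id="$ID" \
    --query 'TargetHealthDescriptions[0].TargetHealth.State' --output text)
  [ "$STATE" = "unused" ] && break
  sleep 5
done

# 3. Tell the ASG the drain is done so it doesn't wait out heartbeat_timeout
aws autoscaling complete-lifecycle-action \
  --lifecycle-hook-name orders-terminate \
  --auto-scaling-group-name orders-asg \
  --instance-id "$ID" \
  --lifecycle-action-result CONTINUE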
Concept · EC2 lifecycle · S013 · 7/8

                          Cheeky & prevention

                          Cheeky #1

                          Set connection_termination=true on NLB TGs — existing flows are reset on deregister, faster recovery.
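connection_termination is a target-group attribute; the full key is longer than the shorthand above. A sketch with an assumed TG ARN:

# NLB TG: reset existing flows at deregistration instead of letting them linger
aws elbv2 modify-target-group-attributes \
  --target-group-arn arn:aws:elasticloadbalancing:us-east-1:111111111111:targetgroup/orders-nlb/abc123 \
  --attributes Key=deregistration_delay.connection_termination.enabled,Value=true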

                          Cheeky #2

                          Mix on-demand base + spot for bursty workloads; SLO floor never on spot.

                          Cheeky #3

                          Use capacity-optimized-prioritized spot allocation if order matters; lowers interruption rate.
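Cheeky #2 and #3 combine into one mixed-instances policy. A sketch — the LT ID, types, and base of 2 are example values for this scenario:

# on-demand floor of 2 (the SLO floor), everything above it on spot,
# capacity-optimized-prioritized across the override order
aws autoscaling update-auto-scaling-group \
  --auto-scaling-group-name orders-asg \
  --mixed-instances-policy '{
    "LaunchTemplate": {
      "LaunchTemplateSpecification": {"LaunchTemplateId": "lt-0xx", "Version": "$Latest"},
      "Overrides": [{"InstanceType": "m5.xlarge"}, {"InstanceType": "m5a.xlarge"}]
    },
    "InstancesDistribution": {
      "OnDemandBaseCapacity": 2,
      "OnDemandPercentageAboveBaseCapacity": 0,
      "SpotAllocationStrategy": "capacity-optimized-prioritized"
    }
  }'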

                          Prevent

                          CW alarm: HTTPCode_ELB_5XX_Count > baseline + 3sigma during spot events.

                          Prevent

Chaos game day: trigger a fake interruption with the FIS action aws:ec2:send-spot-instance-interruptions; assert 0 5xx.
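Kicking the game day off from the CLI might look like this — the template ID is hypothetical and must reference a pre-created FIS template that targets the spot instances with that action:

# template pre-created with action aws:ec2:send-spot-instance-interruptions
aws fis start-experiment --experiment-template-id EXT1234567890abcdef   # hypothetical ID
# then watch the ALB: expect zero 5xx while NTH drains the interrupted target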

                          Prevent

                          Spot Placement Score > 7 required by Terraform pre-flight check.
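The score can be fetched pre-deploy; a sketch of the CI gate, with example instance types:

# fail the pipeline if the placement score drops to the floor or below
SCORE=$(aws ec2 get-spot-placement-scores \
  --instance-types m5.xlarge m5a.xlarge \
  --target-capacity 10 \
  --region-names us-east-1 \
  --query 'SpotPlacementScores[0].Score' --output text)
[ "$SCORE" -gt 7 ] || { echo "spot placement score $SCORE <= 7 — blocking"; exit 1; }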

Lab · EC2 lifecycle · S013 · 8/8

                          Interactive lab

Lab S013: Detect interruption + drain (simulated)
Concept · EC2 lifecycle · S014 · 1/8

                            Symptom — ASG using stale LT version

                            Observed

                            • Old AMI behavior in fresh instances even after promotion.
                            • ASG $Default points to v3; v5 is latest.
• Console shows v5 as the newest version, but actual launches still use v3.

                            Constraints

                            • ASG launch_template.version = "$Default" in Terraform.
                            • Manual set-default-version happened in console.
                            • Drift between code and live state.
Visual · EC2 lifecycle · S014 · 2/8

                            $Latest vs $Default vs explicit version

$Latest — always the newest LT ($Latest → ASG ref).
• Trade-offs: + always the latest AMI/config; - plan churn on every Terraform run; - hard to gate promotion via Terraform. Use only if the pipeline writes the new version and Terraform refreshes within the same minute.

$Default — the drift trap (LT $Default=3 → ASG launches v3).
• Why this is hidden: Terraform treats "$Default" as a static string, so no drift is detected. A console set-default-version doesn't propagate to code, and the promotion job creates a new version without bumping the default. The ASG silently runs an ancient AMI/config.

Explicit version — deterministic (LT v=5 → ASG v=5). Recommended.
• Plus an instance refresh trigger: instance_refresh.triggers = ["launch_template"], so an LT version bump auto-rolls the fleet with min_healthy_percentage = 90. No console-edit drift; Terraform is the source of truth.
• Nightly drift check: cron → terraform plan → Jira. Drift looks like: ~ aws_launch_template.x default_version: 3 -> 5 (console-set; reverted on next apply). Use this pattern if console edits are expected.

Prevent console drift — layered controls:
• SCP guard: console user denied; EventBridge audit catches non-LT changes.
• Service Catalog AMI promotion: the promotion product creates the new LT version, tags it with PromotedAt and BuildSha, and refuses to promote an LT older than 30 days — forcing a fresh bake every release cycle.
• Checkpoint instance refresh: roll a small percentage first (canary), pause, observe metrics, continue; auto_rollback on health failure. Safer than all-at-once even with correct pinning.
• Tags on each LT version (BuildSha, BuildAt, Promoter) so a post-mortem can identify which version a misbehaving instance came from.
• If you must use $Default: add a Lambda that asserts DefaultVersion == latest_version daily; if divergent, page or auto-fix via set-default-version. Closes the drift gap without giving up the convenience.
Sim · EC2 lifecycle · S014 · 3/8

                            Hypotheses

# | Hypothesis | Disprove
H1 | $Default not bumped to v5 | describe-launch-templates DefaultVersion
H2 | ASG uses a pinned version, not $Default | describe-auto-scaling-groups LaunchTemplate.Version
H3 | Console-edited LT outside Terraform | diff against Terraform state
Cheeky: pin to $Latest + use ASG instance_refresh.triggers = ["launch_template"]. Terraform updates the LT, the ASG auto-rolls.
Lab · EC2 lifecycle · S014 · 4/8

                            Diagnose

                            # 1. LT versions
                            aws ec2 describe-launch-template-versions \
                              --launch-template-id lt-0xx \
                              --query 'LaunchTemplateVersions[].{V:VersionNumber,D:DefaultVersion,I:LaunchTemplateData.ImageId}'
                            
                            # 2. ASG launch config
                            aws autoscaling describe-auto-scaling-groups \
                              --auto-scaling-group-names orders-asg \
                              --query 'AutoScalingGroups[].LaunchTemplate'
                            # 3. Force ASG to v5 explicit
                            aws autoscaling update-auto-scaling-group \
                              --auto-scaling-group-name orders-asg \
                              --launch-template LaunchTemplateId=lt-0xx,Version=5
                            
                            # 4. Trigger refresh
                            aws autoscaling start-instance-refresh \
                              --auto-scaling-group-name orders-asg \
                              --preferences MinHealthyPercentage=90,InstanceWarmup=120
Sim · EC2 lifecycle · S014 · 5/8

                            Root cause

                            1. Terraform set launch_template { version = "$Default" }.
                            2. Promotion script created v5 but didn't call set-default-version.
                            3. $Default still v3; ASG still launches v3.
                            Gotcha: Terraform considers $Default a static string, doesn't track LT version drift — manual changes don't register as drift.
IaC · EC2 lifecycle · S014 · 6/8

                            Fix

                            resource "aws_autoscaling_group" "orders" {
                              launch_template {
                                id      = aws_launch_template.orders.id
                                version = aws_launch_template.orders.latest_version  # pin explicit
                              }
                              instance_refresh {
                                strategy = "Rolling"
                                triggers = ["launch_template"]
    preferences {
      min_healthy_percentage = 90
      instance_warmup        = 120
    }
                              }
                            }
                            IaC note: using latest_version attribute makes Terraform track every LT bump. Combined with instance refresh trigger, every PR rolls the fleet automatically.

                            CI drift check

                            # nightly cron in CI
                            terraform plan -refresh-only -detailed-exitcode
                            # exit 2 = drift; raise issue
Concept · EC2 lifecycle · S014 · 7/8

                            Cheeky & prevention

                            Cheeky #1

                            Tag every LT version with PromotedAt; promotion job blocks promotion of versions older than 30 days — forces fresh bakes.

                            Cheeky #2

                            Use checkpoint instance refresh: roll a small percent first, observe metrics, continue.

                            Cheeky #3

                            If you must use $Default, add a Lambda that asserts DefaultVersion == latest_version daily — closes the drift gap.
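The daily assertion fits in a few lines of CLI; a sketch, with the LT ID assumed:

# assert DefaultVersion == LatestVersion; auto-fix and page if not
read -r DEF LAT < <(aws ec2 describe-launch-templates --launch-template-ids lt-0xx \
  --query 'LaunchTemplates[0].[DefaultVersionNumber,LatestVersionNumber]' --output text)
if [ "$DEF" != "$LAT" ]; then
  aws ec2 modify-launch-template --launch-template-id lt-0xx --default-version "$LAT"
  echo "drift: \$Default was v$DEF, latest is v$LAT — bumped"   # route to Slack/pager here
fi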

                            Prevent

                            SCP denies ec2:ModifyLaunchTemplate in prod accounts — only CI role can change.

                            Prevent

                            Tags on LT version (BuildSha, BuildAt) so post-mortems can identify which LT version a misbehaving instance came from.

                            Prevent

                            EventBridge rule on ModifyLaunchTemplate outside CI role → alert.
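The rule matches the CloudTrail event; a sketch — the rule name and CI role name (gha-ci-role) are placeholders, and the "outside CI role" filter can live in the pattern via anything-but or in the target instead:

aws events put-rule --name lt-modified-outside-ci --event-pattern '{
  "source": ["aws.ec2"],
  "detail-type": ["AWS API Call via CloudTrail"],
  "detail": {
    "eventName": ["ModifyLaunchTemplate", "CreateLaunchTemplateVersion"],
    "userIdentity": {"sessionContext": {"sessionIssuer": {
      "userName": [{"anything-but": "gha-ci-role"}]}}}
  }
}'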

Lab · EC2 lifecycle · S014 · 8/8

                            Interactive lab

Lab S014: Find which LT version the ASG actually uses (simulated)
Concept · EC2 lifecycle · S015 · 1/8

                              Symptom — user-data fetch races with role attach

                              Observed

                              • User-data calls aws secretsmanager get-secret-value.
                              • Random failures: ~10% of launches return UnrecognizedClientException: The security token included in the request is invalid.
                              • Retrying the same instance 30s later succeeds.

                              Constraints

                              • SCP requires IMDSv2.
                              • Role attached at LT.
                              • Instance profile propagation eventually consistent (1–2 sec usually).
Visual · EC2 lifecycle · S015 · 2/8

                              Boot order race

Boot order race — the ~10% failure window: T+0 instance running → T+1s IMDS returns 404 for role creds → Secrets Manager call fails → cloud-init errors.
• Why this is hard to find: only ~10% of launches are affected — intermittent. Retry the same instance 30s later and it works. The SDK retries throttling but NOT UnrecognizedClient. With no probe in user-data, the failure mode is silent. Worse with VPC endpoints plus bootstrap density at peak hours.

wait-for-iam pattern: shared script → STS probe → secrets → ok. User-data snippet:
  set -euo pipefail
  for i in {1..30}; do
    aws sts get-caller-identity >/dev/null 2>&1 && break
    sleep 2
  done

Alternative pattern — SSM Run Command on first boot: EC2 ready → EventBridge rule → SSM document → complete.
• Trade-offs: + no race window (the SSM agent is registered by the time it runs); + auditable in CloudTrail and SSM Run Command history; + idempotent — can re-run without rebuild; - adds ~10s to boot (EventBridge latency); - requires the SSM agent plus VPC endpoints in the instance subnet.

Detection: CW Logs metric filter on the cloud-init log (filter: UnrecognizedClient | InvalidSignature); alarm if non-zero in 5 min. Synthetic test: launch one instance every 4h and validate.
• cloud-init unit ordering: After=cloud-init.target, Wants=instance-meta.target; custom: After=imds-creds-ready.service.
• Bonus — Image Builder: bake the wait-for-iam loop into the AMI itself as a systemd unit; user-data then only does instance-specific config. Centralized in the image; no copy-paste in user-data.
Sim · EC2 lifecycle · S015 · 3/8

                              Hypotheses

# | Hypothesis | Disprove
H1 | IMDS not yet returning role creds | add wait-for-creds loop, observe
H2 | Network not up yet (race with eth0) | cloud-init.target unit ordering
H3 | VPCe DNS not yet resolving | getent hosts secretsmanager.us-east-1...
H4 | Time skew — SigV4 fails | chronyc sources
Cheeky: use aws sts get-caller-identity as a probe. Loop until it succeeds, then call Secrets Manager.
Lab · EC2 lifecycle · S015 · 4/8

                              Diagnose

                              # 1. Confirm the race
                              sudo grep -E 'UnrecognizedClient|InvalidSignatureException' /var/log/cloud-init-output.log
                              
                              # 2. Test from instance after boot
                              for i in 1 2 3; do
                                aws sts get-caller-identity || echo retry; sleep 1
                              done
                              # 3. Wait pattern in user-data
                              until aws sts get-caller-identity >/dev/null 2>&1; do sleep 2; done
                              SECRET=$(aws secretsmanager get-secret-value --secret-id orders-prod \
                                --query SecretString --output text)
Sim · EC2 lifecycle · S015 · 5/8

                              Root cause

                              1. EC2 reports running before instance profile credentials propagate to IMDS.
                              2. User-data scripts that hit AWS APIs during the first few seconds occasionally hit the gap.
                              3. SDK retries throttling but not UnrecognizedClient by default.
                              Gotcha: instance profile creds are usually < 1s but can be longer in busy regions. Always wait-for-creds before sensitive calls.
IaC · EC2 lifecycle · S015 · 6/8

                              Fix

                              # ud.sh template
                              #!/usr/bin/env bash
                              set -euo pipefail
# fetch an IMDSv2 token first — also proves IMDS is reachable under the SCP
# (the CLI below does its own IMDS calls; TOK is for any raw curl probes you add)
TOK=$(curl -s -X PUT http://169.254.169.254/latest/api/token \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 60")
                              for i in {1..30}; do
                                if aws sts get-caller-identity >/dev/null 2>&1; then break; fi
                                sleep 2
                              done
                              SECRET=$(aws secretsmanager get-secret-value --secret-id ${secret_id} \
                                --query SecretString --output text)
                              IaC note: ship a shared wait-for-iam.sh across all repos. Single source for the wait loop — never re-derive.

                              Or move secret pull to SSM Run Command

                              # EventBridge on EC2 running → SSM doc → pull secret
Concept · EC2 lifecycle · S015 · 7/8

                              Cheeky & prevention

                              Cheeky #1

                              Use SSM Parameter Store for non-secret bootstrap config — same race avoidance, simpler IAM.

                              Cheeky #2

                              For Windows, the EC2Launch v2 task graph supports dependencies — ensure secret-pull task waits on aws-cli-ready.

                              Cheeky #3

                              Use a CW alarm on user-data failures — metric filter on cloud-init log shipping.
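Wiring that filter might look like this — the log group name is a placeholder, and it assumes the CW agent already ships cloud-init-output.log:

# metric filter: count auth failures in shipped cloud-init logs
aws logs put-metric-filter \
  --log-group-name /gc/ec2/cloud-init-output \
  --filter-name imds-race-detector \
  --filter-pattern '?UnrecognizedClient ?InvalidSignatureException' \
  --metric-transformations \
    metricName=UserDataAuthFailures,metricNamespace=GC/Bootstrap,metricValue=1
# then alarm on UserDataAuthFailures > 0 over 5 minutes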

                              Prevent

                              cloud-init unit ordering: After=cloud-init.target + Wants=instance-meta.target.

                              Prevent

                              Bake the wait-for-iam loop into Image Builder component; user-data never repeats it.
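A sketch of the baked-in unit — the unit and script names are this scenario's conventions, not AWS ones:

# Image Builder component step (sketch): install the unit into the AMI
cat >/etc/systemd/system/imds-creds-ready.service <<'EOF'
[Unit]
Description=Block until instance-profile credentials are live in IMDS
After=network-online.target
Wants=network-online.target
Before=cloud-final.service

[Service]
Type=oneshot
ExecStart=/usr/local/bin/wait-for-iam.sh
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target
EOF
systemctl enable imds-creds-ready.service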

                              Prevent

                              Synthetic test: launch test instance every 4h, assert no UnrecognizedClient in logs.

Lab · EC2 lifecycle · S015 · 8/8

                              Interactive lab

Lab S015: Add the wait-for-iam loop (simulated)
Concept · EC2 lifecycle · S016 · 1/8

                                Symptom — tag policy case mismatch

                                Observed

                                • Some instances appear in console as “non-compliant” under Tag Policy view.
                                • Tag value is Production, policy wants prod.
                                • Resources still launch — tag policies don't deny by default.

                                Constraints

                                • Org tag policy enforces Env values: [prod, stg, dev].
                                • Old IaC writes Production.
                                • Drift surfaces in Tag Policy compliance, not in CloudTrail errorCode.
Visual · EC2 lifecycle · S016 · 2/8

                                Tag policy vs SCP

Tag policy — advisory by default (tag policy at org root → tag-aware reports).
• Capabilities: + enforces tag values (case-sensitive set); + enforces key casing (CostCenter, not costcenter); + surfaces non-compliant resources in reports; - doesn't block creation by default (needs enforced_for); - acts only on tag-aware operations, so an SCP fills the gap.

SCP — explicit deny at API time (SCP RequireTags → condition eval → denied).
• Capabilities: + hard pre-execution block; + works on every IAM-evaluated call; + a federated admin doesn't bypass it; - cannot validate the tag VALUE; - generic deny message that doesn't say which tag failed.

Best combo — layered policy: SCP checks presence, tag policy checks value, Config monitors, tflint enriches pre-merge.
• Why all four: tflint catches at PR time (fastest feedback); the SCP catches at API time (last line of defense); the tag policy validates the VALUE the SCP can't see; Config catches drift and pre-existing resources. Layered defense — one failure mode doesn't slip through.

The case mismatch — how it accumulates silently: old IaC → 42 resources → discovery → re-tag → strict mode.
• Tag policy strict mode (JSON):
  "Env": {
    "tag_value": {"@@assign": ["prod","stg","dev"]},
    "enforced_for": {"@@assign": ["ec2:instance","rds:db"]}
  }
• Auto-remediation Lambda: EventBridge on the TagResource event; the Lambda lowercases the Env value if it matches the enum. Self-correcting; reports anything non-fixable.
• Per-account tag standard module:
  module "stdtags" { source = "../../modules/stdtags" }
  provider "aws" { default_tags { tags = module.stdtags.tags } }
  Single source for tag values across the org — no copy-paste, no drift.
Sim · EC2 lifecycle · S016 · 3/8

                                Hypotheses

# | Hypothesis | Disprove
H1 | Stale Terraform writes the wrong case | grep Production
H2 | Tag policy not actually enforced | describe-policy + enforced_for
H3 | Auto-tagging Lambda overwrites | CloudTrail TagResource events
Cheeky: tag policies have quirky inheritance operators (@@assign, @@append, @@remove) plus the enforced_for block. Always check the effective policy at the OU/account level — not the policy doc.
Lab · EC2 lifecycle · S016 · 4/8

                                Diagnose

                                # 1. Effective tag policy at account level
                                aws --profile gc-mgmt organizations describe-effective-policy \
                                  --policy-type TAG_POLICY --target-id 666666666666
                                
                                # 2. Find non-compliant resources
                                aws resourcegroupstaggingapi get-resources \
                                  --tag-filters Key=Env,Values=Production
                                # compare to allowed: prod / stg / dev
                                # 3. Bulk re-tag
                                aws resourcegroupstaggingapi tag-resources \
                                  --resource-arn-list arn:aws:ec2:...:instance/i-0xx \
                                  --tags Env=prod
                                
                                # 4. Tag policy compliance summary
                                aws --profile gc-audit config describe-compliance-by-config-rule \
                                  --config-rule-names required-tags
Sim · EC2 lifecycle · S016 · 5/8

                                Root cause

                                1. Tag policy enforces Env values {prod, stg, dev}.
                                2. Old IaC pinned Env=Production → non-compliant but not blocked.
                                3. Compliance signal accumulated until audit team flagged.
Gotcha: tag policies are case-sensitive on values. Production ≠ prod.
IaC · EC2 lifecycle · S016 · 6/8

                                Fix

                                locals {
                                  tags = merge({
                                    Env        = "prod"
                                    CostCenter = "ENG-100"
                                    Owner      = "orders-team"
                                  }, var.extra_tags)
                                }
                                
                                provider "aws" {
                                  default_tags { tags = local.tags }
                                }
                                # tag policy strict mode
                                {
                                  "tags": {
                                    "Env": {
                                      "tag_key": { "@@assign": "Env" },
                                      "tag_value": { "@@assign": ["prod", "stg", "dev"] },
                                      "enforced_for": { "@@assign": ["ec2:instance", "rds:db"] }
                                    }
                                  }
                                }
                                IaC note: enforced_for with resource type list converts tag policy from advisory to enforced — tag-aware ops will fail.
Concept · EC2 lifecycle · S016 · 7/8

                                Cheeky & prevention

                                Cheeky #1

                                Add a Lambda that auto-remediates: on TagResource event, lowercase the value if matches enum.

                                Cheeky #2

                                Use Resource Groups with tag filters as the source of truth for “all prod EC2” — surfaces non-compliant tags fast.
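Creating that group once gives everyone the same query surface; a sketch with a placeholder group name:

# "all prod EC2" as a living group — wrong-case Env values simply won't appear here
aws resource-groups create-group --name prod-ec2 \
  --resource-query '{
    "Type": "TAG_FILTERS_1_0",
    "Query": "{\"ResourceTypeFilters\":[\"AWS::EC2::Instance\"],\"TagFilters\":[{\"Key\":\"Env\",\"Values\":[\"prod\"]}]}"
  }'
aws resource-groups list-group-resources --group-name prod-ec2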

                                Cheeky #3

                                tflint custom rule: deny Env values not in [prod, stg, dev].

                                Prevent

                                Tag policy in enforced_for mode + tflint: catch at PR-time and at API-time.

                                Prevent

                                Quarterly compliance review pulled from Tag Policy compliance API.

                                Prevent

                                One-line Terraform module everyone consumes: module "stdtags".

Lab · EC2 lifecycle · S016 · 8/8

                                Interactive lab

Lab S016: Find non-compliant Env values (simulated)
Concept · EC2 lifecycle · S017 · 1/8

                                  Symptom — subnet IPs exhausted

                                  Observed

                                  • ASG can't scale: InsufficientFreeAddressesInSubnet.
                                  • Subnet is /27 with 3 IPs free out of 32.
                                  • Many ENIs are detached but not released.

                                  Constraints

                                  • Subnet originally sized for 4 instances.
                                  • EKS pods consume secondary IPs aggressively.
                                  • Lambda VPC ENIs hold onto IPs.
Visual · EC2 lifecycle · S017 · 2/8

                                  Where IPs go in a /27

/27 IP budget — the trap: /27 = 32 IPs, 5 reserved by AWS, 27 usable → exhausted.
• When a /27 stops working: EKS without prefix delegation burns 10–20 IPs per EC2 node; Lambda VPC takes 1 ENI per (subnet, SG) combo, held ~40 min; ECS Fargate takes 1 IP per task ENI; RDS Multi-AZ takes 1 + n replicas. Standard sizing: never smaller than /22 in prod.

IP consumers — who's eating them: EKS CNI pod IPs; Lambda VPC ENIs held ~40 min.
• Quick math (m5.xlarge node, default CNI): 1 primary ENI + 3 ENIs × 14 IPs each = 43 IPs. A single node alone busts the /27 budget. Add 8 Lambda VPC ENIs, 6 ECS Fargate tasks, an ALB ENI, and NAT — the /27 is gone in seconds at any scale.
• Solution: prefix delegation OR a secondary VPC CIDR.

Fix path — Path A: secondary VPC CIDR (100.64/10). Path B: EKS prefix delegation (configurable per addon, idempotent, no node restart).

Detection + prevention — layered IP capacity hygiene: IPAM alarm at 20%, ENI cleaner, Config rule.
• Subnet sizing standard: prod private /22 minimum (1024 IPs); restricted /23 OK (512); DMZ /24 typical (256); transit /28 (a TGW ENI needs only 16).
• EKS-specific tuning: ENABLE_PREFIX_DELEGATION = true; WARM_PREFIX_TARGET = 1 (allocate sparingly); MINIMUM_IP_TARGET = 0 (no warm pool). Or use IPv6 mode if greenfield.
• Lambda VPC alternatives: use VPC Lattice (no Hyperplane ENI), run as an ECS Fargate task, or use VPC endpoints to skip Lambda VPC entirely.
• Quarterly capacity review: per-VPC IP forecast vs actual; identify subnets approaching 70% usage; plan a secondary CIDR or migrate workloads ahead of the crunch.
Sim · EC2 lifecycle · S017 · 3/8

                                  Hypotheses

# | Hypothesis | Disprove
H1 | Subnet truly full | describe-subnets AvailableIpAddressCount
H2 | EKS warm pool grabbing IPs | describe-network-interfaces by Description
H3 | Detached ENIs held by Lambda VPC / DLM | describe-network-interfaces Status=available
H4 | ECS task ENIs awaiting cleanup, never deleted | ECS service events
Cheeky: add a secondary CIDR to the VPC and migrate IP-heavy subnets there — no downtime, no re-subnetting.
Lab · EC2 lifecycle · S017 · 4/8

                                  Diagnose

                                  # 1. IP availability
                                  aws ec2 describe-subnets --filters Name=vpc-id,Values=vpc-0xx \
                                    --query 'Subnets[].{ID:SubnetId,AZ:AvailabilityZone,Free:AvailableIpAddressCount,CIDR:CidrBlock}' \
                                    --output table
                                  
                                  # 2. Who holds the ENIs?
                                  aws ec2 describe-network-interfaces \
                                    --filters Name=subnet-id,Values=subnet-priv-use1a \
                                    --query 'NetworkInterfaces[].{S:Status,D:Description,O:Attachment.InstanceOwnerId}' \
                                    --output table
                                  # 3. Free orphaned ENIs
                                  aws ec2 delete-network-interface --network-interface-id eni-0xx
                                  
                                  # 4. Add secondary CIDR + new subnet
                                  aws ec2 associate-vpc-cidr-block --vpc-id vpc-0xx \
                                    --cidr-block 100.64.0.0/16
aws ec2 create-subnet --vpc-id vpc-0xx \
  --cidr-block 100.64.8.0/22 --availability-zone us-east-1a
                                  
                                  # 5. Switch EKS to prefix delegation (more IPs/instance)
                                  kubectl set env -n kube-system ds aws-node ENABLE_PREFIX_DELEGATION=true
Sim · EC2 lifecycle · S017 · 5/8

                                  Root cause

                                  1. Subnet sized at /27 for 4 instances.
                                  2. EKS warm pool consumed all secondary IPs without prefix delegation.
                                  3. Lambda VPC Hyperplane ENI sat in subnet for 40 min after last invocation.
Gotcha: resizing the subnet in code feels right, but a subnet's CIDR is immutable once created. Use a secondary CIDR instead.
IaC · EC2 lifecycle · S017 · 6/8

                                  Fix

                                  resource "aws_vpc_ipv4_cidr_block_association" "secondary" {
                                    vpc_id     = aws_vpc.main.id
                                    cidr_block = "100.64.0.0/16"
                                  }
                                  resource "aws_subnet" "private_carrier" {
                                    count             = 3
                                    vpc_id            = aws_vpc.main.id
                                    cidr_block        = cidrsubnet("100.64.0.0/16", 6, count.index)
                                    availability_zone = local.azs[count.index]
                                    tags = merge(local.tags, { Tier="private-carrier", KubernetesCarrier="true" })
                                  }
                                  # enable prefix delegation in EKS
                                  resource "aws_eks_addon" "vpc_cni" {
                                    cluster_name = aws_eks_cluster.main.name
                                    addon_name   = "vpc-cni"
                                    configuration_values = jsonencode({
                                      env = { ENABLE_PREFIX_DELEGATION = "true" }
                                    })
                                  }
                                  IaC note: 100.64/10 (RFC 6598) is non-routable on the public internet but routable inside VPC; ideal for “extra IPs” without conflicting with corp 10/8.
Concept · EC2 lifecycle · S017 · 7/8

                                  Cheeky & prevention

                                  Cheeky #1

                                  Use VPC IPAM for centralized IP planning — alerts before exhaustion at OU scale.

                                  Cheeky #2

                                  For Lambda VPC, set EFS_DEPENDENCY_CHECK false + use VPC Lattice to bypass ENIs altogether.

                                  Cheeky #3

                                  Tag every detached ENI with OrphanCheck=true + Lambda cleans after 1h.
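The cleaner's core logic in CLI form — a sketch that assumes the OrphanCheck tagger already ran; note requester-managed ENIs (e.g. Lambda Hyperplane) refuse deletion until AWS releases them, so they're filtered out:

# delete detached ENIs flagged by the tagger
aws ec2 describe-network-interfaces \
  --filters Name=status,Values=available Name=tag:OrphanCheck,Values=true \
  --query 'NetworkInterfaces[?RequesterManaged==`false`].NetworkInterfaceId' \
  --output text | tr '\t' '\n' | while read -r eni; do
    [ -n "$eni" ] && aws ec2 delete-network-interface --network-interface-id "$eni"
done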

                                  Prevent

                                  CW alarm on subnet free IPs < 20%.
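Free-IP counts aren't a native CloudWatch metric, so publish them on a schedule first; a sketch, with an assumed namespace:

# cron/Lambda: publish per-subnet free-IP counts so the < 20% alarm has data
aws ec2 describe-subnets --filters Name=vpc-id,Values=vpc-0xx \
  --query 'Subnets[].[SubnetId,AvailableIpAddressCount]' --output text |
while read -r subnet free; do
  aws cloudwatch put-metric-data --namespace GC/Network \
    --metric-name SubnetFreeIPs \
    --dimensions SubnetId="$subnet" --value "$free"
done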

                                  Prevent

                                  Subnet sizing standard: never /27 in prod for EKS/ECS — minimum /22.

                                  Prevent

                                  Quarterly IP capacity review per VPC — growth forecast vs IPAM.

Lab · EC2 lifecycle · S017 · 8/8

                                  Interactive lab

Lab S017: Find who's holding IPs (simulated)
Concept · EC2 lifecycle · S018 · 1/8

                                    Symptom — ASG instance-refresh stuck

                                    Observed

                                    • Instance refresh stops at 30%; one instance won't terminate.
                                    • API: OperationNotPermitted: The instance has termination protection.
                                    • Someone manually enabled it during an incident, never removed.

                                    Constraints

                                    • ASG auto-scaling protections also apply.
                                    • Two flags: instance-level DisableApiTermination + ASG protected_from_scale_in.
Visual · EC2 lifecycle · S018 · 2/8

                                    Three places “protection” lives

EC2 termination protection (the bug here): EC2 attribute set → instance refresh stuck.
• Why it was set: during an incident an SRE flipped the flag — "don't let the ASG kill this; we're debugging" — and forgot to remove it after the incident closed. The next deploy hit it and the refresh halted.

ASG scale-in protection: ASG flag → blocks scale-in → refresh still works.
• When to use: stateful instances doing long-running batch work that you want to protect from scale-in churn while still allowing refresh/patching. This is the right primitive 90% of the time.

Stop API protection (a different flag): blocks stop only.
• Common confusion: DisableApiStop doesn't block terminate, and DisableApiTermination doesn't block stop. Three flags, three different effects — always check ALL THREE during incident debug.

Resolution sequence: 1. find → 2. clear → 3. re-run the refresh → 4. auto-clear → 5. audit.
• Cheeky escape hatch: aws autoscaling start-instance-refresh --preferences SkipMatching=true — skips instances already on the right LT; a workaround when one protected pet exists.

OpsHold tag — convention + auto-cleanup (SRE tags the instance; a Lambda clears it later). Tag schema:
  OpsHold = "true"
  OpsHoldSet = "2026-04-29T13:42:00Z"
  OpsHoldReason = "debugging-svc-domjoin"
  OpsHoldOwner = "alice@gc"
  OpsHoldExpires = "2026-04-30T13:42:00Z"
• Pre-deploy gate: describe-auto-scaling-group plus per-instance attributes must show no instance with protection set; otherwise fail the deploy with an explanation. Catches the scenario before it bites.
• Runbook reminder: don't use termination protection on cattle. For one-off debugging, use the ASG standby state instead (enter-standby + exit-standby).
• EventBridge audit: match ec2:ModifyInstanceAttribute with DisableApiTermination=true and route to Slack #ops-changes with caller info. Catches the human-error class at action time.
Sim · EC2 lifecycle · S018 · 3/8

                                    Hypotheses

# | Hypothesis | Disprove
H1 | EC2 DisableApiTermination=true | describe-instance-attribute --attribute disableApiTermination
H2 | ASG instance protected from scale-in | describe-auto-scaling-instances
H3 | Lifecycle hook stuck waiting | describe-lifecycle-hooks
Cheeky: instance refresh respects DisableApiTermination — you must clear it on the protected instance, OR use the SkipMatching preference if the AMI is identical anyway.
Lab · EC2 lifecycle · S018 · 4/8

                                    Diagnose

                                    # 1. Check both flags
                                    aws ec2 describe-instance-attribute --instance-id i-0xx \
                                      --attribute disableApiTermination
                                    
                                    aws autoscaling describe-auto-scaling-instances \
                                      --instance-ids i-0xx \
                                      --query 'AutoScalingInstances[].ProtectedFromScaleIn'
                                    # 2. Disable EC2 termination protection
                                    aws ec2 modify-instance-attribute --instance-id i-0xx \
                                      --no-disable-api-termination
                                    
                                    # 3. Disable ASG scale-in protect
                                    aws autoscaling set-instance-protection \
                                      --instance-ids i-0xx \
                                      --auto-scaling-group-name orders-asg \
                                      --no-protected-from-scale-in
                                    
# 4. Re-run the refresh (there is no resume API — cancel the stuck one, then start again)
aws autoscaling cancel-instance-refresh \
  --auto-scaling-group-name orders-asg
aws autoscaling start-instance-refresh \
  --auto-scaling-group-name orders-asg
Sim · EC2 lifecycle · S018 · 5/8

                                    Root cause

                                    1. During an incident, an SRE flipped DisableApiTermination=true on a known-good instance — to ensure ASG didn't kill it during diagnosis.
                                    2. Forgot to remove the flag.
                                    3. Next deploy, instance-refresh ran, hit the flag, halted.
                                    Gotcha: any operational pinning should be ticketed with auto-cleanup. Use a tag like OpsHold=true + nightly Lambda that warns + removes after 24h.
IaC · EC2 lifecycle · S018 · 6/8

                                    Fix

                                    # Tag-driven cleanup
                                    resource "aws_lambda_function" "ops_hold_cleanup" {
                                      function_name = "gc-ops-hold-cleanup"
                                      ...
                                    }
                                    resource "aws_cloudwatch_event_rule" "daily" {
                                      schedule_expression = "cron(0 8 * * ? *)"
                                    }
# Lambda body (excerpt, boto3 — assumes ec2 = boto3.client("ec2") and a slack() helper)
resp = ec2.describe_instances(Filters=[{"Name": "tag:OpsHold", "Values": ["true"]}])
for inst in (i for r in resp["Reservations"] for i in r["Instances"]):
    tags = {t["Key"]: t["Value"] for t in inst.get("Tags", [])}
    set_at = datetime.fromisoformat(tags["OpsHoldSet"].replace("Z", "+00:00"))
    if datetime.now(timezone.utc) - set_at > timedelta(hours=24):
        ec2.modify_instance_attribute(InstanceId=inst["InstanceId"],
                                      DisableApiTermination={"Value": False})
        ec2.create_tags(Resources=[inst["InstanceId"]],
                        Tags=[{"Key": "OpsHold", "Value": "cleared"}])
        slack("cleared OpsHold on " + inst["InstanceId"])
                                    IaC note: automation prevents leftover state from hardening into mystery failures during the next deploy.
Concept · EC2 lifecycle · S018 · 7/8

                                    Cheeky & prevention

                                    Cheeky #1

                                    Instance refresh --skip-matching ignores instances already on the right LT version — a workaround when one stuck pet exists.
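As a CLI preference it looks like this — note it only helps if the pet is already on the target LT version:

# refresh only instances NOT already on the target LT —
# the protected pet is skipped instead of blocking the whole refresh
aws autoscaling start-instance-refresh \
  --auto-scaling-group-name orders-asg \
  --preferences '{"SkipMatching": true, "MinHealthyPercentage": 90}'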

                                    Cheeky #2

                                    Use warm pool for fast scale-up; pool members aren't in service so refresh issues isolate.

                                    Cheeky #3

                                    SSM Automation doc GC-ClearOpsHold — one click clears all flags + tags.

                                    Prevent

                                    EventBridge on ModifyInstanceAttribute with DisableApiTermination=true → tag instance + Slack.

                                    Prevent

                                    Pre-deploy gate: describe-auto-scaling-group shows no instance with protection > 0; if any, fail deploy.

                                    Prevent

                                    Runbook: don't use termination protection on cattle. Use ASG protected_from_scale_in for the rare case.

Lab · EC2 lifecycle · S018 · 8/8

                                    Interactive lab

Lab S018: Find & clear the protection (simulated)
Concept · EC2 lifecycle · S019 · 1/8

                                      Symptom — auto-recovery never fires

                                      Observed

                                      • Instance went unhealthy due to underlying host issue. CW alarm StatusCheckFailed_System never fired the recovery action.
                                      • Alarm in INSUFFICIENT_DATA for the last 6h.

                                      Constraints

                                      • Alarm action: arn:aws:automate:us-east-1:ec2:recover.
                                      • EC2 type c6i.4xlarge supports recovery.
                                      • Metric data missing: instance not reporting.
Visual · EC2 lifecycle · S019 · 2/8

                                      Why metrics stop and alarm goes INSUFFICIENT_DATA

Failure timeline — metrics stop, the alarm can't decide: CW metrics flow → host failure → metrics stop → alarm evaluation stuck → recovery never fires → SLO breach.
• The trap — "missing" is the default: CloudWatch's default treat_missing_data is "missing". For failure-detection alarms, "missing" is exactly wrong. Set "breaching" so missing data counts as a breach and the alarm fires. (Use "notBreaching" for the opposite intent, e.g. a metric that should rarely arrive.) This is the most common alarm misconfig in AWS.

Route A — fix the alarm: CW alarm fixed → recovery action → recovered.
Route B — built-in auto-recovery, no alarm needed (AWS handles it internally):
  resource "aws_instance" "x" {
    maintenance_options { auto_recovery = "default" }
  }

Layered remediation — auto-recovery, ASG fallback, meta-monitor: recovery → ASG replace → TG health → meta-monitor.
• Recovery vs replace trade-offs: recovery keeps the same instance ID, is faster (~3 min), and preserves the IP + EBS; replace gets a new ID, is slower (~5–10 min), and starts from fresh state. Stateful workloads → prefer recovery; stateless ASG cattle → prefer replace.
• Instance type support: auto-recovery requires EBS-only storage. Instance store types (i3, c5d) cannot recover — those must use ASG replace.
• Game-day validation: annual chaos test simulating host failure (aws ec2 stop-instances --force, then observe recovery); assert recovery < 5 min and the alarm clears.
• Config rule: cloudwatch-alarm-action-check ensures actions are present; a custom rule validates treat_missing_data on failure-detection alarms. Catches the misconfig in audit, not at incident time.
Sim · EC2 lifecycle · S019 · 3/8

                                      Hypotheses

# | Hypothesis | Disprove
H1 | treat_missing_data not "breaching" | describe-alarms
H2 | Recovery action has the wrong ARN | compare to the AWS-supplied recovery ARN
H3 | Instance type doesn't support recovery | check the supported list
H4 | Alarm in a different region | region check
Cheeky: EC2 auto-recovery works only if the instance type uses EBS-only storage. Instance store types can't auto-recover.
Lab · EC2 lifecycle · S019 · 4/8

                                      Diagnose

                                      # 1. Alarm definition
                                      aws cloudwatch describe-alarms \
                                        --alarm-names orders-recover-i-0xx \
                                        --query 'MetricAlarms[].{T:TreatMissingData,A:AlarmActions,P:DatapointsToAlarm,E:EvaluationPeriods}'
                                      
                                      # 2. Last 6h metric data
aws cloudwatch get-metric-statistics --namespace AWS/EC2 \
  --metric-name StatusCheckFailed_System \
  --dimensions Name=InstanceId,Value=i-0xx \
  --start-time "$(date -u -d '6 hours ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 60 --statistics Maximum
                                      # 3. Fix the alarm
                                      aws cloudwatch put-metric-alarm \
                                        --alarm-name orders-recover-i-0xx \
                                        --metric-name StatusCheckFailed_System --namespace AWS/EC2 \
                                        --dimensions Name=InstanceId,Value=i-0xx \
                                        --statistic Maximum --period 60 --threshold 0 \
                                        --comparison-operator GreaterThanThreshold \
                                        --evaluation-periods 5 --datapoints-to-alarm 3 \
                                        --treat-missing-data breaching \
                                        --alarm-actions arn:aws:automate:us-east-1:ec2:recover
Sim · EC2 lifecycle · S019 · 5/8

                                      Root cause

                                      1. Alarm treat_missing_data=missing — default. When metrics stop, alarm goes INSUFFICIENT_DATA, no action.
                                      2. Auto-recovery requires alarm to enter ALARM; can never get there with missing data treated as missing.
                                      Gotcha: the “missing data” trap is the most common alarm misconfig in AWS. Always pick breaching for failure-detection alarms.
IaC · EC2 lifecycle · S019 · 6/8

                                      Fix

                                      resource "aws_cloudwatch_metric_alarm" "recover" {
                                        for_each = toset(var.instance_ids)
                                        alarm_name          = "recover-${each.value}"
                                        comparison_operator = "GreaterThanThreshold"
                                        evaluation_periods  = 5
                                        datapoints_to_alarm = 3
                                        metric_name         = "StatusCheckFailed_System"
                                        namespace           = "AWS/EC2"
                                        period              = 60
                                        statistic           = "Maximum"
                                        threshold           = 0
                                        treat_missing_data  = "breaching"
                                        alarm_actions       = ["arn:aws:automate:us-east-1:ec2:recover"]
                                        dimensions = { InstanceId = each.value }
                                      }
                                      IaC note: alarms-per-instance via for_each — or use EC2 instance auto-recovery (default behavior) which doesn't require explicit alarms on supported types.

                                      Even simpler

                                      resource "aws_instance" "x" {
                                        maintenance_options { auto_recovery = "default" }
                                      }
Concept · EC2 lifecycle · S019 · 7/8

                                      Cheeky & prevention

                                      Cheeky #1

                                      Use maintenance_options.auto_recovery=default — AWS handles it without alarms.

                                      Cheeky #2

                                      Pair recovery with a CW alarm on StatusCheckFailed_Instance → reboot action; covers OS hangs that aren't host failures.
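The sibling alarm has the same shape as the recovery alarm from the fix slide — only the metric and action ARN change; the alarm name is illustrative:

aws cloudwatch put-metric-alarm \
  --alarm-name orders-reboot-i-0xx \
  --metric-name StatusCheckFailed_Instance --namespace AWS/EC2 \
  --dimensions Name=InstanceId,Value=i-0xx \
  --statistic Maximum --period 60 --threshold 0 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 3 --datapoints-to-alarm 3 \
  --treat-missing-data breaching \
  --alarm-actions arn:aws:automate:us-east-1:ec2:reboot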

                                      Cheeky #3

                                      For ASG, prefer health-check replace: set ASG health_check_type=ELB; ASG kills + replaces unhealthy instances faster than recovery.

                                      Prevent

                                      CW alarm meta-monitor: alarm on any alarm in INSUFFICIENT_DATA > 30 min.
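One way to run the meta-monitor as a scheduled job — a sketch; route the output to SNS/Slack in your environment:

# scheduled check: any alarm sitting in INSUFFICIENT_DATA for > 30 min?
CUTOFF=$(date -u -d '30 minutes ago' +%Y-%m-%dT%H:%M:%SZ)
STUCK=$(aws cloudwatch describe-alarms --state-value INSUFFICIENT_DATA \
  --query "MetricAlarms[?StateUpdatedTimestamp<='${CUTOFF}'].AlarmName" --output text)
[ -z "$STUCK" ] || echo "stuck alarms: $STUCK"   # page here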

                                      Prevent

                                      Config rule: cloudwatch-alarm-action-check + custom check on treat_missing_data.

                                      Prevent

                                      Annual game day: simulate host failure (stop force) and assert recovery completes < 5 min.

Lab · EC2 lifecycle · S019 · 8/8

                                      Interactive lab

Lab S019: Fix the alarm (simulated)
Concept · EC2 lifecycle · S020 · 1/8

                                        Symptom — instance invisible to SSM

                                        Observed

                                        • SSM Console shows 0 managed instances for the new fleet.
                                        • aws ssm describe-instance-information returns empty.
                                        • SSM Session Manager fails with Target not found.

                                        Constraints

                                        • SSM agent installed on AMI — should auto-start.
                                        • Instance has IAM role with AmazonSSMManagedInstanceCore.
                                        • VPC has VPC endpoints for ssm/ssmmessages/ec2messages.
Visual · EC2 lifecycle · S020 · 2/8

                                        What SSM agent needs to register

SSM registration requirements: EC2 → VPCe ssm + VPCe ssmmessages + VPCe ec2messages, plus the IAM role and NTP.
• Three VPC endpoints — all required: ssm is the control plane (registration, inventory, Parameter Store); ssmmessages is the data plane (Session Manager, Run Command); ec2messages is the legacy data plane (older Run Command paths). Missing any one breaks some SSM features while the agent may still register — e.g. the agent registers via ssm but Session Manager fails because ssmmessages is missing.

The bug — private DNS off: privateDnsEnabled=false and no NAT path.
• How private DNS works: with private_dns_enabled = true, the VPC's R53 resolver overrides public AWS DNS, so ssm.us-east-1 resolves to the VPCe ENI's private IP. With private_dns_enabled = false, clients must explicitly use vpce-xxx.ssm.us-east-1.vpce.amazonaws.com.

Fix sequence + observability: enable DNS → restart agent → validate → DHMC org-wide.
• Config rule: ec2-instance-managed-by-systems-manager flags any EC2 not registered with SSM; an org-wide aggregator surfaces non-compliant accounts.
• Synthetic test: hourly, launch a test instance in sandbox and validate SSM ping < 5 min; alarm if registration takes longer.
• VPCe SG audit: the VPCe SG must allow 443 from the workload SG; an SCP forbids privateDnsEnabled=false on critical VPCes; EventBridge audits ModifyVpcEndpoint.
• Pre-deploy checklist (CI gate): validate ssm/ssmmessages/ec2messages VPCes present with private_dns_enabled; validate the instance role has AmazonSSMManagedInstanceCore; validate the VPCe SG allows 443 from the workload SG — checkov rule.
Sim · EC2 lifecycle · S020 · 3/8

                                        Hypotheses

# | Hypothesis | Disprove
H1 | SSM agent not running | systemctl status amazon-ssm-agent
H2 | IAM role missing or lacks permissions | describe-instance-attribute --attribute iamInstanceProfile
H3 | VPC endpoint SG denies 443 from instance | SG ingress rules
H4 | VPCe DNS resolution off | private_dns_enabled
H5 | Time skew breaks SigV4 | chronyc sources
Cheeky: curl -v https://ssm.us-east-1.amazonaws.com from the instance — if it resolves to a 10.x address, the VPCe is in play; a public IP means the NAT path. Either should return 403 — that's good (TLS works).
Lab · EC2 lifecycle · S020 · 4/8

                                        Diagnose

                                        # 1. From instance (via console direct connect or get-system-log)
                                        sudo systemctl status amazon-ssm-agent
                                        sudo journalctl -u amazon-ssm-agent --no-pager | tail -50
                                        sudo cat /var/log/amazon/ssm/amazon-ssm-agent.log | tail -100
                                        
                                        # 2. Reach the endpoint
                                        getent hosts ssm.us-east-1.amazonaws.com
                                        curl -v https://ssm.us-east-1.amazonaws.com 2>&1 | head -10
                                        # 3. From console
                                        aws ssm describe-instance-information \
                                          --filters Key=InstanceIds,Values=i-0xx
                                        # empty -> agent never registered
                                        
                                        # 4. Inspect VPCe SG
aws ec2 describe-vpc-endpoints --vpc-endpoint-ids vpce-0xx \
  --query 'VpcEndpoints[].{P:PrivateDnsEnabled,SG:Groups}'
Sim · EC2 lifecycle · S020 · 5/8

                                        Root cause

                                        1. VPC endpoint for ssm had private_dns_enabled=false (someone disabled it for a debug last week).
                                        2. ssm.us-east-1.amazonaws.com resolved to public IP, instance had no NAT → agent couldn't register.
                                        Gotcha: when private DNS is disabled on an interface VPCe, you need to explicitly use the per-AZ DNS name (e.g. vpce-xxx.ssm.us-east-1.vpce.amazonaws.com). SSM agent doesn't support that path.
IaC · EC2 lifecycle · S020 · 6/8

                                        Fix

                                        resource "aws_vpc_endpoint" "ssm" {
                                          vpc_id              = aws_vpc.main.id
                                          service_name        = "com.amazonaws.us-east-1.ssm"
                                          vpc_endpoint_type   = "Interface"
                                          subnet_ids          = local.private_subnets
                                          security_group_ids  = [aws_security_group.vpce.id]
                                          private_dns_enabled = true   # <-- must be true
                                          tags                = local.tags
                                        }
                                        # repeat for ssmmessages, ec2messages
                                        # SG for vpce
                                        resource "aws_security_group_rule" "vpce_ingress" {
                                          type              = "ingress"
                                          from_port         = 443
                                          to_port           = 443
                                          protocol          = "tcp"
                                          source_security_group_id = aws_security_group.workload.id
                                          security_group_id = aws_security_group.vpce.id
                                        }
                                        IaC note: all 3 SSM endpoints needed (ssm, ssmmessages, ec2messages). Missing any one breaks Run Command or Session Manager.
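Before the Terraform lands, the same fix works as a console-free quick patch — a sketch with placeholder endpoint IDs:

# ssm / ssmmessages / ec2messages endpoint IDs (placeholders)
for ep in vpce-0aaa vpce-0bbb vpce-0ccc; do
  aws ec2 modify-vpc-endpoint --vpc-endpoint-id "$ep" --private-dns-enabled
done
# then restart the agent so it re-resolves and registers
sudo systemctl restart amazon-ssm-agent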
                                        ConceptEC2 lifecycleS020 · 7/8

                                        Cheeky & prevention

                                        Cheeky #1

                                        SSM Fleet Manager can self-heal SSM agent state on managed instances — useful when agents drift.

                                        Cheeky #2

Use SSM Default Host Management Configuration — grants SSM permissions to every EC2 in the account without per-instance roles (the agent itself still ships with the AMI); no manual setup.

                                        Cheeky #3

                                        Ship CloudWatch Agent + SSM as a single Image Builder component; consistent across AMIs.

                                        Prevent

Config rule: ec2-instance-managed-by-systems-manager; non-compliant means the instance never registered.

                                        Prevent

                                        Synthetic launch + SSM ping every hour from sandbox; alarm if registration takes > 5 min.

                                        Prevent

                                        Pre-deploy checklist gates: SSM ping, VPCe health, IAM role attached.

                                        LabEC2 lifecycleS020 · 8/8

                                        Interactive lab

Lab S020: Get the agent registered (simulated)
                                          ConceptEC2 lifecycleS021 · 1/8

                                          Symptom — CW reboot races ASG kill

                                          Observed

                                          • App stops responding; CW alarm fires reboot action.
                                          • ~30s later, ASG ELB health check marks instance unhealthy and terminates.
                                          • Both happen — instance reboots and gets killed.
                                          • Net effect: 60s outage instead of 10s reboot.

                                          Constraints

                                          • ASG health_check_type=ELB, grace 60s.
                                          • CW alarm reboot action on app metric.
                                          • No coordination between actions.
                                          VisualEC2 lifecycleS021 · 2/8

                                          Two healers fighting

Two healers fighting — healer A: CW reboot; healer B: ASG ELB-health. The collision is the bug; the fix is to pick one.

Timeline — two healers compete on the same fault: T+0 hang → T+15 alarm → T+30 reboot (rebooting) → T+45 ASG kill (terminated) → 60s outage.

Why two healers fight:
• CW alarm fires fast (metric breach in 1-2 min); ALB TG health uses a different threshold (3 of 3 in 30s).
• Both react to the same root cause, but with different intent.
• The reboot action doesn't notify the ASG; the ASG doesn't pause for the reboot.
• Pick ONE healer per fault — the race conditions vanish.

Fix — one healer per fault:
• Path A (cattle): ASG ELB-health only — a single healer.
• Path B (pets): ASG standby + reboot isolation.

Detection + audit — find latent two-healer races: audit alarms · audit Lambda · churn alarm · game day.

Decoupled health checks:
• /readyz: app ready to serve (ALB health check)
• /livez: app process alive (kept simpler)
• /health: external composite for status pages
Different paths reduce false positives.

Service runbook — document one canonical healer per failure mode:
• "App hung" → ASG ELB-health replace
• "Host failure" → EC2 auto-recovery (a different fault)
• "OOM" → ExitOnOutOfMemory + ASG replace

SCP-style sanity check: a custom Config rule flags any CW alarm with an ec2:reboot/terminate action on an ASG instance as non-compliant; auto-ticket, 7-day SLA.

Cattle, not pets: reboot is a 2010-era pattern; replace recovers faster, leaves no state pollution (the new instance is clean), and aligns with immutable-infrastructure principles.
                                          SimEC2 lifecycleS021 · 3/8

                                          Hypotheses

#    Hypothesis                                                     Disprove
H1   ASG terminates while the instance is mid-reboot                CW + ASG events, same instance
H2   Grace period too short                                         ASG health_check_grace_period
H3   Health check hits /health, which returns 503 during shutdown   app shutdown logs
                                          Cheeky: aws autoscaling describe-scaling-activities + CW alarm history side-by-side reveal who killed first.
                                          LabEC2 lifecycleS021 · 4/8

                                          Diagnose

                                          # 1. ASG events
                                          aws autoscaling describe-scaling-activities \
                                            --auto-scaling-group-name orders-asg \
                                            --max-records 5
                                          
                                          # 2. CW alarm history
                                          aws cloudwatch describe-alarm-history \
                                            --alarm-name orders-app-reboot --max-records 10
                                          # 3. Disable the dual healing
                                          aws cloudwatch delete-alarms --alarm-names orders-app-reboot
                                          
                                          # 4. Tune ASG grace
                                          aws autoscaling update-auto-scaling-group \
                                            --auto-scaling-group-name orders-asg \
                                            --health-check-grace-period 180
                                          SimEC2 lifecycleS021 · 5/8

                                          Root cause

                                          1. Two automated healers (CW reboot + ASG ELB health) fired in sequence.
                                          2. ASG grace expired during the reboot window → instance considered failed.
                                          3. Result: avoidable 60s outage, ASG churn, and noisy alerts.
                                          Gotcha: never have two automatic remediation paths on the same fault — they fight. Pick the cheaper-to-correct one (ASG replace) and remove the other.
                                          IaCEC2 lifecycleS021 · 6/8

                                          Fix

                                          # Drop the reboot alarm; rely on ASG ELB health
                                          # removed: aws_cloudwatch_metric_alarm.orders_app_reboot
                                          
                                          resource "aws_autoscaling_group" "orders" {
                                            health_check_type         = "ELB"
                                            health_check_grace_period = 180
                                            ...
                                          }
                                          IaC note: cattle, not pets. If the app misbehaves, replace the instance — faster recovery, no state pollution. Reboot was a 2010-era pattern.

                                          Health endpoint refinement

                                          # app /health returns 200 when ready; 503 when draining
                                          # ALB target group draining waits 30s
                                          ConceptEC2 lifecycleS021 · 7/8

                                          Cheeky & prevention

                                          Cheeky #1

                                          If reboot is necessary (kernel state), use ASG standby: put instance in standby, reboot, return to service. ASG won't terminate during standby.
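A minimal sketch of that standby sequence (ASG name and instance ID are placeholders):

aws autoscaling enter-standby \
  --auto-scaling-group-name orders-asg --instance-ids i-0xx \
  --should-decrement-desired-capacity
aws ec2 reboot-instances --instance-ids i-0xx
# wait for the app to come back up, then return the instance to service
aws autoscaling exit-standby \
  --auto-scaling-group-name orders-asg --instance-ids i-0xx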

                                          Cheeky #2

Decouple readiness from liveness. ALB health checks readiness; ASG checks liveness via instance status checks. Fewer false positives.

                                          Cheeky #3

                                          Add CW alarm on the count of InstanceRefresh events; pages if ASG is churning.

                                          Prevent

                                          Audit: any CW alarm with action ec2:reboot or ec2:terminate on instances inside an ASG → warn.
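A one-liner sketch of that audit — a JMESPath filter over alarm action ARNs (cross-reference hits against ASG membership separately):

aws cloudwatch describe-alarms --query \
  "MetricAlarms[?AlarmActions[?contains(@, 'ec2:reboot') || contains(@, 'ec2:terminate')]].AlarmName"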

                                          Prevent

                                          Service runbook: one canonical healer per failure mode.

                                          Prevent

                                          Game day: simulate hung app; assert single healer fires.

                                          LabEC2 lifecycleS021 · 8/8

                                          Interactive lab

Lab S021: Identify the dual-healer race (simulated)
                                            Batch 2 · 20 scenarios · Security Groups & NACLs

                                            Security Groups, NACLs & stateful gotchas

                                            SG vs NACL semantics, ephemeral ports, cross-VPC SG references via RAM, prefix-list explosions, ENI per-SG limits, drift from console edits, restricted-tier NACLs blocking VPCe replies. All diagrams use the AWS standard iconography from now on.
S022 → S041 · 160 slides · 20 lab terminals
                                            ConceptSG/NACLS022 · 1/8

                                            Symptom — SG allows SSH but bastion still cannot connect

                                            Observed

                                            • SSH from corp jumphost to bastion 10.20.0.42:22 hangs.
                                            • SG sg-bastion-ingress allows 22/tcp from pl-corp-onprem.
                                            • Packet capture on bastion: SYN arrives, SYN-ACK leaves, but client never gets it.
                                            • Other instances in the same subnet behave the same way.

                                            Constraints

Subnet               DMZ subnet-dmz-use1a (10.20.0.0/24)
NACL                 nacl-dmz applied to subnet
NACL outbound rule   recently "hardened": allow 80/443 to 0/0; deny all else
SG semantics         stateful (return traffic allowed automatically)
NACL semantics       stateless (return traffic must be explicitly allowed)
                                            Note: SGs are stateful, NACLs are not. A “security hardening” that locks NACL outbound to 80/443 is the #1 trigger of this class of failure.
                                            VisualSG/NACLS022 · 2/8

                                            Where the SYN-ACK gets dropped

Where the SYN-ACK gets dropped — the request hop is allowed; the SYN-ACK is dropped at NACL outbound. The NACL ruleset is stateless; the SG (stateful) is fine.

Path: on-prem corp jumphost :54321 (ephemeral; RHEL 8: 32768-60999, Windows: 49152-65535) → DX → TGW → us-east-1 → gc-prod-app VPC 10.20.0.0/16 → us-east-1a DMZ subnet 10.20.0.0/24 (nacl-dmz applied) → bastion 10.20.0.42:22 (sg-bastion — OK).

NACL nacl-dmz (stateless — rules evaluated in order):
• in: 100 allow 22 from 10.0.0.0/8 · 200 allow 80/443 from 0/0
• out: 100 allow 443 to 0/0 · 200 allow 80 to 0/0 · * deny all else ⇐ the SYN-ACK to dst:54321 hits HERE
• FIX: out 110 allow tcp 1024-65535 to 10.0.0.0/8

Private subnet 10.20.10.0/24 (orders-api, RDS, VPCe) — why no NACL issue here? It uses the default NACL: allow all in/out. Lesson: NACLs are broad strokes; SGs do per-resource policy. Tightening NACLs without thinking about ephemeral ports is the #1 trap.

Stateful vs stateless cheatsheet:
• SG is STATEFUL: allow the request and the return is auto-allowed.
• NACL is STATELESS: both directions must be allowed explicitly.
• If you don't allow ephemeral 1024-65535 outbound → everything breaks.
                                            SimSG/NACLS022 · 3/8

                                            Hypotheses

#    Hypothesis                                        Disprove
H1   SG ingress missing 22                             describe-security-groups
H2   NACL inbound 22 missing                           describe-network-acls
H3   NACL outbound ephemeral missing — SYN-ACK drops   NACL outbound rules
H4   Asymmetric routing (TGW return path differs)      route table inspection
H5   Host firewall (Defender/iptables)                 local netsh / iptables -L
                                            Cheeky: NACL behaviour is identical to a Linux iptables policy in stateless mode. If you ever wrote “iptables -A OUTPUT -p tcp --dport 80 -j ACCEPT” without remembering the conntrack module, you've made this exact mistake.
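To make the analogy concrete — a sketch of both styles, assuming a default-DROP policy on the SSH server:

# Stateless (NACL-like): the return path must be allowed explicitly
iptables -A OUTPUT -p tcp --sport 22 -j ACCEPT           # SYN-ACK out of the SSH server
iptables -A OUTPUT -p tcp --dport 1024:65535 -j ACCEPT   # ephemeral-port returns
# Stateful (SG-like): conntrack auto-allows anything belonging to an accepted flow
iptables -A INPUT -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT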
                                            LabSG/NACLS022 · 4/8

                                            Diagnose

                                            # 1. Reachability Analyzer (config-time check)
aws ec2 create-network-insights-path \
  --source eni-jumphost --destination i-bastion \
  --protocol tcp --destination-port 22
                                            
                                            # 2. NACL outbound rules
                                            aws ec2 describe-network-acls \
                                              --filters Name=association.subnet-id,Values=subnet-dmz-use1a \
                                              --query 'NetworkAcls[].Entries[?Egress==`true`]'
                                            
                                            # 3. Live capture on bastion
                                            sudo tcpdump -ni any host 10.0.5.10 and port 22 -w /tmp/cap.pcap
                                            # 4. Traffic Mirroring (cheeky)
                                            aws ec2 create-traffic-mirror-session \
                                              --network-interface-id eni-bastion \
                                              --traffic-mirror-target-id tmt-0xx \
                                              --traffic-mirror-filter-id tmf-0xx --session-number 1
                                            
                                            # 5. VPC Flow Log query
aws logs filter-log-events --log-group-name /aws/vpc/flow \
  --filter-pattern '"10.20.0.42" "10.0.5.10" "REJECT"'
                                            Watch for: action=REJECT in flow log with high dst port (32768-60999) — that's the NACL drop signature.
                                            SimSG/NACLS022 · 5/8

                                            Root cause

                                            1. Security team “hardened” the DMZ NACL outbound to 80/443 only in a recent control change.
                                            2. Bastion accepted SYN, sent SYN-ACK to client's ephemeral port (e.g. 54321).
                                            3. NACL outbound denied port 54321 → SYN-ACK dropped at the subnet boundary.
                                            4. Client retransmits, all dropped, eventual timeout. Symptom: “SSH hangs.”
                                            Gotcha: NACLs are stateless. Allowing inbound 22 means nothing if the egress side blocks the ephemeral return port. SGs hide this for you because they're stateful.
                                            IaCSG/NACLS022 · 6/8

                                            Fix — ephemeral range on NACL outbound

                                            resource "aws_network_acl_rule" "dmz_out_ephem" {
                                              network_acl_id = aws_network_acl.dmz.id
                                              egress         = true
                                              rule_number    = 110
                                              rule_action    = "allow"
                                              protocol       = "6"
                                              cidr_block     = "10.0.0.0/8"     # corp space
                                              from_port      = 1024
                                              to_port        = 65535
                                            }
                                            resource "aws_network_acl_rule" "dmz_out_https" {
                                              network_acl_id=aws_network_acl.dmz.id; egress=true
                                              rule_number=120; rule_action="allow"
                                              protocol="6"; cidr_block="0.0.0.0/0"; from_port=443; to_port=443
                                            }
                                            IaC standard: our shared module modules/nacl-tier always emits an ephemeral-out rule (1024-65535) to the corp prefix and to 0/0. The “hardening” PR that broke this should have failed the module's test suite.

                                            Lint guard

# .tflint.hcl — enable the org's custom rule (the rule logic itself
# lives in a tflint ruleset plugin; this block only switches it on)
rule "aws_network_acl_must_have_ephemeral_egress" {
  enabled = true
}
# intent: "NACL egress must include 1024-65535 (ephemeral)"
                                            ConceptSG/NACLS022 · 7/8

                                            Cheeky & prevention

                                            Cheeky #1

                                            Many shops just “don't use NACLs except for big strokes” (block known bad ports/CIDRs at the subnet boundary). Use SGs as the per-resource policy. Less footgun surface.

                                            Cheeky #2

                                            Linux ephemeral range varies. RHEL 8: 32768-60999. Older: 1024-65535. Windows: 49152-65535. NACLs need to cover all 1024-65535 for safety.

                                            Cheeky #3

                                            VPC Reachability Analyzer evaluates NACL config; it would have caught this before deploy — if anyone had run it.

                                            Prevent

                                            Pre-deploy: every NACL change runs Reachability Analyzer for representative source/dest pairs. CI fail on REJECT.
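A sketch of what that CI step could run — path creation mirrors the Diagnose slide; IDs are placeholders:

path=$(aws ec2 create-network-insights-path \
  --source eni-jumphost --destination i-bastion \
  --protocol tcp --destination-port 22 \
  --query 'NetworkInsightsPath.NetworkInsightsPathId' --output text)
aws ec2 start-network-insights-analysis --network-insights-path-id "$path"
sleep 60   # analyses usually finish well within a minute
aws ec2 describe-network-insights-analyses \
  --network-insights-path-id "$path" \
  --query 'NetworkInsightsAnalyses[0].NetworkPathFound'
# false → fail the pipeline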

                                            Prevent

                                            VPC Flow Log alarm on action=REJECT to subnet-internal IPs over 5-min baseline.

                                            Prevent

Module-only NACLs — an SCP denies ec2:CreateNetworkAcl to everything except the CI/CD role, so NACLs can only come from the shared module.

                                            LabSG/NACLS022 · 8/8

                                            Interactive lab

Lab S022: Find which side of the NACL drops the return (simulated)
                                              ConceptSG/NACLS023 · 1/8

                                              Symptom — cross-VPC SG reference is rejected

                                              Observed

                                              • Terraform creates SG rule referencing sg-12345 from another VPC (TGW peer): InvalidGroup.NotFound.
                                              • Same source SG works for in-VPC peers.
                                              • Both VPCs in same region; both in same AWS account.

                                              Constraints

• SG-to-SG references work within a VPC, and across VPC peering in the same region.
• SGs cannot be referenced over TGW by default unless the SG is shared via RAM.
• SG-reference rules do not work across regions, or across accounts unless RAM-shared.
                                              Note: the cleaner pattern is to use customer-managed prefix lists referenced in SGs, not cross-VPC SG IDs.
                                              VisualSG/NACLS023 · 2/8

                                              SG ref vs Prefix List vs RAM

SG ref vs Prefix List vs RAM — broken: cross-VPC SG ID; fix: managed prefix list + RAM.

The failed attempt (us-east-1): VPC A (prod-app, 10.20.0.0/16) hosts orders-api / sg-orders, which tried ingress 22 from sg-bastion → InvalidGroup.NotFound. SG IDs are VPC-scoped — a cross-VPC ref fails even within the same account and region. VPC B (shared-svcs, 10.30.0.0/16) hosts the jumphost / sg-bastion. SG-to-SG refs CAN work across VPC peering (same region, RAM-shared); they CANNOT work across TGW (L3 routing only).

Solution: managed prefix list pl-shared-svcs, RAM-shared, referenced by sg-orders. Why this works everywhere:
• A prefix list is a list of CIDRs, not SG IDs.
• The SG rule references the LIST id, not its contents.
• Works across VPC peering, TGW, regions, and accounts.
• Updates to entries propagate to all consumers.

Decision matrix — SG-to-SG vs prefix list vs SG via RAM:
• SG-to-SG (within VPC): always works (e.g., ALB SG → target SG); tight, dynamic, no CIDR maintenance; best for in-VPC service-to-service.
• SG-to-SG (cross-VPC peered): works if the VPCs are peered + same region; cross-account requires a RAM share of the SG; cross-region and cross-TGW: NOT supported.
• Prefix list (universal): works across VPC peering, TGW, regions, accounts; CIDR-based, no SG ID dependency; centralized in gc-network and RAM-shared; the default choice cross-VPC.
• SG via RAM (specialized): share an SG to other accounts; recipients reference it in their SG rules; useful for shared LB SGs; adds operational coupling — use sparingly.

Recommended pattern: in-VPC → SG-to-SG (tightest); cross-VPC same-region peered → SG-to-SG via RAM (sometimes); cross-TGW / cross-region / cross-account → prefix list (default).

Quota optimization bonus: a prefix-list ref counts as 1 SG rule (not 50); a 50-CIDR list used in 5 SG rules = 5 rules toward the quota. A massive saver for partner-IP scenarios.
                                              SimSG/NACLS023 · 3/8

                                              Hypotheses

#    Hypothesis                                       Disprove
H1   SG ID typo                                       describe-security-groups
H2   SG in a different VPC; no RAM share              ram list-resources
H3   VPCs in different regions                        compare regions
H4   Terraform provider points at the wrong account   provider alias check
                                              Cheeky: instead of debating cross-VPC SG references, switch to customer-managed prefix lists as the source-of-truth address pool. SGs reference prefix list IDs natively.
                                              LabSG/NACLS023 · 4/8

                                              Diagnose

                                              # 1. Confirm SG exists where you think
                                              aws ec2 describe-security-groups --group-ids sg-bastion \
                                                --query 'SecurityGroups[].{V:VpcId,O:OwnerId,N:GroupName}'
                                              
                                              # 2. Is it RAM-shared?
                                              aws ram list-resources --resource-owner SELF \
                                                --resource-type ec2:SecurityGroup
                                              # 3. Switch to prefix-list approach
                                              aws ec2 create-managed-prefix-list \
                                                --address-family IPv4 --max-entries 50 \
                                                --prefix-list-name pl-shared-svcs \
                                                --entries 'Cidr=10.30.0.0/16,Description=shared'
                                              
                                              # 4. SG rule using prefix list
                                              aws ec2 authorize-security-group-ingress --group-id sg-orders \
                                                --ip-permissions 'IpProtocol=tcp,FromPort=22,ToPort=22,
                                                  PrefixListIds=[{PrefixListId=pl-0xx}]'
                                              SimSG/NACLS023 · 5/8

                                              Root cause

1. SGs are scoped to a VPC. SG-to-SG ingress rules require both SGs in the same VPC, or a RAM-shared SG plus a VPC-peering or shared-subnet relationship.
                                              2. Across TGW (which is L3, not L2), SG IDs are not dereferenceable without RAM.
                                              3. Most teams discover this only when scaling beyond a single VPC.
                                              Gotcha: RAM-shared SGs add operational coupling: shared owner, shared deletes, blast radius. Prefix lists usually win on operational simplicity.
                                              IaCSG/NACLS023 · 6/8

                                              Fix

                                              # Owner: gc-network repo
                                              resource "aws_ec2_managed_prefix_list" "shared_svcs" {
                                                name           = "pl-shared-svcs"
                                                address_family = "IPv4"
                                                max_entries    = 50
  entry {
    cidr        = "10.30.0.0/16"
    description = "shared-svcs VPC"
  }
                                                tags = local.tags
                                              }
                                              resource "aws_ram_resource_share" "pl" {
                                                name       = "gc-prefix-lists"
                                                principals = [for o in local.spoke_ous : o]
                                              }
                                              resource "aws_ram_resource_association" "pl_shared" {
                                                resource_share_arn = aws_ram_resource_share.pl.arn
                                                resource_arn       = aws_ec2_managed_prefix_list.shared_svcs.arn
                                              }
                                              # Consumer: gc-prod-app repo
                                              data "aws_ec2_managed_prefix_list" "shared" {
                                                name = "pl-shared-svcs"
                                              }
                                              resource "aws_security_group_rule" "orders_in" {
                                                type              = "ingress"
                                                from_port         = 22
                                                to_port           = 22
                                                protocol          = "tcp"
                                                prefix_list_ids   = [data.aws_ec2_managed_prefix_list.shared.id]
                                                security_group_id = aws_security_group.orders.id
                                              }
                                              IaC note: RAM-share once in gc-network; consume by name in every spoke. Prefix list updates fan out automatically.
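A consumer-side sanity check (sketch) that the share actually arrived:

aws ec2 describe-managed-prefix-lists \
  --filters Name=prefix-list-name,Values=pl-shared-svcs \
  --query 'PrefixLists[].{Id:PrefixListId,Owner:OwnerId,State:State}'
# empty output usually means the RAM invitation wasn't accepted,
# or the account isn't under a shared OU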
                                              ConceptSG/NACLS023 · 7/8

                                              Cheeky & prevention

                                              Cheeky #1

                                              Prefix lists count as 1 rule per prefix list ref in SG limits, regardless of entry count. Big quota saver.

                                              Cheeky #2

                                              SG references work cross-VPC for ALB target groups inside same account — useful for shared-services LBs.

                                              Cheeky #3

                                              For ECS tasks across services, share an SG via RAM and reference it directly — cleaner than maintaining IPs.

                                              Prevent

                                              Custom checkov rule: forbid source_security_group_id with hardcoded sg-* across VPCs — force prefix list use.

                                              Prevent

                                              Spoke account README documents pl-* names + how to consume.

                                              Prevent

                                              RAM resource share reviewed quarterly; unused shares removed.

                                              LabSG/NACLS023 · 8/8

                                              Interactive lab

Lab S023: Cross-VPC SG ref — pivot to prefix list (simulated)
                                                ConceptSG/NACLS024 · 1/8

                                                Symptom — RulesPerSecurityGroupLimitExceeded

                                                Observed

                                                • Terraform apply fails: RulesPerSecurityGroupLimitExceeded.
                                                • SG already has 60 inbound rules; adding more for new microservice.
• Soft limit 60 in/out; raised to 250 via quota request, which then becomes the hard ceiling.

                                                Constraints

• Quota L-0EA8095F (rules per SG).
• Raising it reduces the SGs-per-ENI quota in proportion — the product is capped.
• Org default is 5 SGs per ENI (can be raised to 16).
                                                Note: SGs × rules × ENIs is one product — raising any one tightens the others.
                                                VisualSG/NACLS024 · 2/8

                                                Rules × ENIs × quota math

Rules × ENIs × quota math — the limit, the SG × ENI math, and the collapse via prefix list.

Quota math — rules × SGs × ENI:
• Default: 5 SGs/ENI × 60 rules = 300 effective rules per ENI.
• Quota dimensions: rules-per-SG (L-0EA8095F), SGs-per-ENI (L-2AFB9258), SGs-per-VPC (10000).
• Raising rules-per-SG to 250 reduces SGs-per-ENI to 1 — the trade is real.
• Raising SGs-per-ENI to 16 reduces rules-per-SG to 50.
• A quota raise is rarely the right answer; collapse rules instead. A prefix-list ref counts as 1 SG rule regardless of how many entries are inside.

Collapse via prefix list: 50 partners → one prefix list → 1 SG rule.
• Quota delta — before: 50 rules + 10 other = 60 (at cap); after: 1 rule + 10 other = 11 (way under), with headroom for expansion.
• Future adds = a prefix-list update; the SG is unchanged — no rule churn, no instance refresh.

Other collapse tricks:
• Port ranges: 80-8080 = 1 rule (vs 7 rules).
• CIDR aggregation: 10.0.0.0/16 vs 256 /24s.
• An SG-to-SG ref counts as 1 (vs N CIDRs).
• Don't over-merge ports (22-3389 leaks RDP).

Monitoring + hygiene:
• Pre-merge tflint: detect >3 individual aws_security_group_rule with the same protocol+port → suggest the prefix-list pattern; catches the anti-pattern at PR time.
• Quarterly SG cleanup: identify SGs with 0 ENI attachments and rules referencing deleted SGs; retire dead app SGs.
• VPC Lattice as an alternative: L7 service-to-service auth (IAM-based); SGs only on the ingress edge — removes SG rule pressure for inter-service traffic entirely.
                                                SimSG/NACLS024 · 3/8

                                                Hypotheses

#    Hypothesis                                   Disprove
H1   60-rule cap reached                          count rules per SG
H2   Each microservice CIDR added separately      look for repeating /32s
H3   Rules could be folded into a prefix list     check duplicate descriptions
                                                Cheeky: describe-security-group-rules --filters Name=group-id,Values=sg-x | jq '.SecurityGroupRules | length' tells you exactly how close to the limit you are.
                                                LabSG/NACLS024 · 4/8

                                                Diagnose

                                                # 1. Count rules
                                                aws ec2 describe-security-group-rules \
                                                  --filters Name=group-id,Values=sg-orders \
                                                  --query 'length(SecurityGroupRules)'
                                                
                                                # 2. Find duplicates / mergeable rules
                                                aws ec2 describe-security-group-rules \
                                                  --filters Name=group-id,Values=sg-orders \
                                                  --query 'SecurityGroupRules[].{P:IpProtocol,F:FromPort,T:ToPort,C:CidrIpv4}' \
                                                  | jq 'group_by(.C) | map({C:.[0].C, ports: map([.F,.T])})'
                                                # 3. Quota
                                                aws service-quotas get-service-quota \
                                                  --service-code vpc --quota-code L-0EA8095F
                                                
                                                # 4. Request raise
                                                aws service-quotas request-service-quota-increase \
                                                  --service-code vpc --quota-code L-0EA8095F --desired-value 250
                                                SimSG/NACLS024 · 5/8

                                                Root cause

                                                1. App SG accreted CIDR rules per microservice (40+ /32s for partner IPs).
                                                2. Default 60-rule SG cap reached.
                                                3. Add another /32 → deny.
                                                Gotcha: raising the per-SG rule cap reduces SGs-per-ENI quota in the same proportion. Read the docs — the trade is real.
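Check both sides of that trade before filing the raise — a sketch using the two quota codes from the diagram:

aws service-quotas get-service-quota \
  --service-code vpc --quota-code L-0EA8095F \
  --query 'Quota.{Name:QuotaName,Value:Value}'   # rules per SG
aws service-quotas get-service-quota \
  --service-code vpc --quota-code L-2AFB9258 \
  --query 'Quota.{Name:QuotaName,Value:Value}'   # SGs per ENI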
                                                IaCSG/NACLS024 · 6/8

                                                Fix — collapse to prefix list

                                                resource "aws_ec2_managed_prefix_list" "partners" {
                                                  name           = "pl-orders-partners"
                                                  address_family = "IPv4"
                                                  max_entries    = 60
                                                
                                                  dynamic "entry" {
                                                    for_each = var.partner_ips
                                                    content {
                                                      cidr        = "${entry.value.cidr}/32"
                                                      description = entry.value.name
                                                    }
                                                  }
                                                }
                                                resource "aws_security_group_rule" "orders_partners" {
                                                  type              = "ingress"; from_port=443; to_port=443; protocol="tcp"
                                                  prefix_list_ids   = [aws_ec2_managed_prefix_list.partners.id]
                                                  security_group_id = aws_security_group.orders.id
                                                }
                                                IaC note: 50 partners now consume 1 SG rule. Future partner adds = prefix list update only, no SG rule churn.

                                                Bonus

                                                # Lambda updates pl-orders-partners from a CSV in S3 daily
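A sketch of that sync job's core loop, expressed as CLI (bucket/key are assumptions; CSV = one CIDR per line; idempotency omitted — re-adding an existing entry errors):

pl=pl-0xx
ver=$(aws ec2 describe-managed-prefix-lists --prefix-list-ids "$pl" \
  --query 'PrefixLists[0].Version' --output text)
aws s3 cp s3://gc-partner-ips/partners.csv - | while read -r cidr; do
  aws ec2 modify-managed-prefix-list --prefix-list-id "$pl" \
    --current-version "$ver" --add-entries "Cidr=${cidr},Description=csv-sync"
  ver=$((ver + 1))   # each successful modify bumps the list version
done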
                                                ConceptSG/NACLS024 · 7/8

                                                Cheeky & prevention

                                                Cheeky #1

                                                Prefix list version increments on every change — SGs auto-reference latest. No SG churn.

                                                Cheeky #2

                                                Don't over-merge ports. 22-3389 looks tight but lets RDP through where you only wanted SSH. Be specific.

                                                Cheeky #3

                                                Use VPC Lattice for L7 service-to-service auth where possible — SGs only on ingress edge.

                                                Prevent

                                                CW alarm: per-SG rule count > 50 (warn) / > 58 (alert) via Config rule.

                                                Prevent

                                                Pre-merge tflint: detect >3 individual aws_security_group_rule with same protocol+port → suggest prefix list.

                                                Prevent

                                                Quarterly SG cleanup: deduplicate, collapse, retire dead apps.

                                                LabSG/NACLS024 · 8/8

                                                Interactive lab

Lab S024: Count rules and propose collapse (simulated)
                                                  ConceptSG/NACLS025 · 1/8

                                                  Symptom — ALB targets unhealthy

                                                  Observed

                                                  • ALB shows targets unhealthy with reason Health checks failed.
                                                  • App responds locally (curl localhost:8080/health returns 200).
                                                  • From bastion, app responds.
                                                  • From ALB to target? No.

                                                  Constraints

                                                  • ALB has its own SG (sg-alb-orders).
                                                  • Target SG (sg-orders-task) ingress doesn't allow ALB SG.
                                                  • Common pattern miss when migrating from CLB to ALB.
                                                  VisualSG/NACLS025 · 2/8

                                                  ALB SG → target SG reference

ALB SG → target SG reference — DMZ subnet (ALB), private subnet (target). The missing SG rule is the bug; the fix is ALB SG → target SG.

Symptom recap: ALB target group health = unhealthy, reason "Health checks failed"; the app responds locally and via bastion.

VPC gc-prod-app 10.20.0.0/16:
• DMZ subnet 10.20.0.0/24 — ALB orders, sg-alb-orders (ingress: 443 from 0/0; egress: 8080 to sg-orders-task). An ALB SG's outbound CAN target another SG in the same VPC. Health-check probes hit the target on the TG port → probe :8080 DROPPED.
• Private subnet 10.20.10.0/24 — orders task, sg-orders-task (in: 8080 from sg-alb-orders — MISSING; in: 22 from pl-corp for ssh; out: 0/0 default). The app receives no probes → passively "unhealthy".
• Bastion: SSH OK (22) via SSM Session; curl localhost:8080 → 200 OK locally — which misleads the operator.

Fix — ALB SG as the source: sg-orders-task ingress from sg-alb-orders. Why this is canonical:
• Tightest scope: only the ALB's ENIs can reach the target.
• Survives ALB IP changes (the ENIs change all the time).
• An SG-to-SG ref counts as 1 rule (vs N CIDRs).
• Same VPC = no RAM ceremony.
• Default in the Terraform module — no manual fix needed.

Module pattern — alb-target-sg-pair as a unit: inputs vpc_id, target_port, alb_listener_port; emits aws_security_group.alb + aws_security_group.target plus the aws_security_group_rule wiring (target ingress from ALB); the module test deploys and asserts healthy in < 60s. App teams can't forget to wire the SGs.

Health-check check:
aws elbv2 describe-target-health \
  --target-group-arn ... \
  --query 'TargetHealthDescriptions[].{T:Target.Id,S:State,R:Reason}'
• Reason 'Target.FailedHealthChecks' = SG/network.
• A reason pointing at the health-check path = the app's /health 404s.

Custom Config rule: for each target group, validate the target SG has ingress from the ALB SG on the target group's port; non-compliant resources flagged in the dashboard.
Synthetic monitor: probe the ALB hostname every 1 min; alarm on any TG with > 0 unhealthy targets for > 2 min.
NLB note: NLB has SG support since 2023; older NLBs may not have one (add via set-security-groups).
                                                  SimSG/NACLS025 · 3/8

                                                  Hypotheses

#    Hypothesis                                  Disprove
H1   Target SG missing ingress from ALB SG       describe-security-groups
H2   Health-check path 404s                      app log
H3   Target port mismatch (TG 80 vs app 8080)    describe-target-groups
H4   Target deregistered                         describe-target-health
                                                  Cheeky: Target SG “allow ALB SG” pattern is symmetric — you can have many ALBs share sg-alb-edge; targets need only one allow rule.
                                                  LabSG/NACLS025 · 4/8

                                                  Diagnose

                                                  # 1. Target health
                                                  aws elbv2 describe-target-health \
                                                    --target-group-arn arn:...:targetgroup/orders \
                                                    --query 'TargetHealthDescriptions[].{T:Target.Id,S:TargetHealth.State,R:TargetHealth.Reason,D:TargetHealth.Description}'
                                                  
                                                  # 2. ALB and target SGs
                                                  aws elbv2 describe-load-balancers --names orders-alb \
                                                    --query 'LoadBalancers[].SecurityGroups'
                                                  aws ec2 describe-security-groups --group-ids sg-orders-task \
                                                    --query 'SecurityGroups[].IpPermissions'
                                                  # 3. Add ingress from ALB SG
                                                  aws ec2 authorize-security-group-ingress \
                                                    --group-id sg-orders-task \
                                                    --ip-permissions 'IpProtocol=tcp,FromPort=8080,ToPort=8080,
                                                      UserIdGroupPairs=[{GroupId=sg-alb-orders}]'
                                                  
                                                  # 4. Re-check health (~30s)
                                                  sleep 30
                                                  aws elbv2 describe-target-health --target-group-arn ...
                                                  SimSG/NACLS025 · 5/8

                                                  Root cause

                                                  1. Migrating from Classic LB (which used amazon-elb SG — deprecated) to ALB.
                                                  2. Old target SG had amazon-elb/sg-AAA as ingress.
                                                  3. New ALB SG sg-alb-orders not added; ALB-to-target traffic dropped.
                                                  Gotcha: ALB doesn't use the legacy “amazon-elb” SG — you must explicitly create your own ALB SG and reference it.
                                                  IaCSG/NACLS025 · 6/8

                                                  Fix

                                                  resource "aws_security_group" "alb_orders" {
                                                    vpc_id = aws_vpc.main.id
  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
                                                    tags    = local.tags
                                                  }
                                                  resource "aws_security_group_rule" "orders_task_in_alb" {
                                                    type                     = "ingress"
                                                    from_port                = 8080
                                                    to_port                  = 8080
                                                    protocol                 = "tcp"
                                                    source_security_group_id = aws_security_group.alb_orders.id
                                                    security_group_id        = aws_security_group.orders_task.id
                                                    description              = "ALB orders → task"
                                                  }
                                                  IaC note: wrap this in a module modules/alb-target-sg-pair that emits both SGs as a unit. Module test: target SG must have ingress from the LB SG.

                                                  Health-check ingress hint

# Pre-flight: the TG health check may probe a different port than the app port.
# Make sure the target SG allows that exact port.
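A sketch to read what the health check actually probes (TG name assumed):

aws elbv2 describe-target-groups --names orders \
  --query 'TargetGroups[].{Port:Port,HCPort:HealthCheckPort,HCPath:HealthCheckPath}'
# HealthCheckPort is usually "traffic-port", but it can be overridden —
# the target SG must allow whichever port it resolves to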
                                                  ConceptSG/NACLS025 · 7/8

                                                  Cheeky & prevention

                                                  Cheeky #1

ALB cross-zone load balancing is on by default; the real win is tuning deregistration_delay down to 30s. Faster blue/green flips.

                                                  Cheeky #2

                                                  For NLB targets, SGs apply only when target = instance. Target = IP uses subnet/SG of the IP's ENI — trickier auditing.

                                                  Cheeky #3

                                                  NLB has SG support since 2023 — older NLBs may not have one attached. Add via set-security-groups.

                                                  Prevent

                                                  Health-check probes from synthetic; alarm if any TG has > 0 unhealthy > 2 min.

                                                  Prevent

                                                  Module test: deploy ALB+target, assert healthy in < 60s.

                                                  Prevent

Custom Config rule: the target SG must have at least one ingress rule referencing the parent ALB SG, on the target group's port.

                                                  LabSG/NACLS025 · 8/8

                                                  Interactive lab

Lab S025: Add the ALB SG to target SG ingress (simulated)