You don't just read about a problem. You see it diagrammed, watch it play out, then run the commands yourself in a guided terminal.
| # | Slide | Layer |
|---|---|---|
| 1 | Symptom & business impact | Concept |
| 2 | Architecture diagram (where it's blocking) | Visual |
| 3 | Hypotheses & debug method | Sim |
| 4 | Diagnose — native commands | Lab |
| 5 | Root cause | Sim |
| 6 | Fix — commands | Lab |
| 7 | IaC change (Terraform) | IaC |
| 8 | Cheeky / non-obvious | Concept |
| 9 | Prevent / monitor | Concept |
| 10 | Interactive lab terminal | Lab |
| Marker | Meaning |
|---|---|
| tip | Best practice or non-obvious trick |
| note | Common assumption to verify |
| gotcha | Bites you in production |
| IaC | Terraform / IaC change to lock in the fix |
| Account | Alias | Purpose |
|---|---|---|
| 111111111111 | gc-mgmt | Org root, billing |
| 222222222222 | gc-log-archive | Central CloudTrail/Config logs |
| 333333333333 | gc-audit | Security Hub, GuardDuty admin |
| 444444444444 | gc-network | TGW, R53 Resolver, DX, network firewall |
| 555555555555 | gc-shared-svcs | FSx, AD connector, jump hosts |
| 666666666666 | gc-prod-app | Customer-facing microservices |
| 777777777777 | gc-prod-data | RDS, ElastiCache, FSx for SQL backups |
| 888888888888 | gc-stg-app | Staging mirror |
| 999999999999 | gc-dev-app | Dev workloads |
| 121212121212 | gc-tools-cicd | GitHub Actions OIDC, artifacts |
| 131313131313 | gc-finsub-prod | FinSub subsidiary prod |
| 141414141414 | gc-retailsub-poc | RetailSub PoC |
Every scenario diagram is built from this fixed set: 70+ AWS service icons in standard category colours, 7 container styles (region/AZ/VPC/subnet tiers/account/on-prem), and ~10 composite mini-diagrams. No bespoke geometry per scenario — updates here propagate everywhere.
Linux toolkit: `ip route get <ip>` · `ss -tnp` · `tracepath -n` · `mtr --report-wide -c 50` · `tcpdump -ni any host X and port Y` · `getent hosts` / `resolvectl` · `curl http://169.254.169.254/...` (IMDSv2 token) · `cloud-init-output.log` · `amazon-ssm-agent` logs

Windows toolkit: `Test-NetConnection -CommonTCPPort RDP -ComputerName X` · `Get-NetRoute` / `Get-NetIPConfiguration` · `Resolve-DnsName -Server X` · `nltest /sc_query:corp.globalcorp.local` · `dsregcmd /status` · `klist` / `klist purge` · `w32tm /query /status /verbose` · `Get-WinEvent -LogName System`

| # | Category | Count | Status |
|---|---|---|---|
| 01 | EC2 lifecycle & provisioning | 20 | live (S002–S021) |
| 02 | Security Groups & NACLs | 20 | partial (S022–S025 of 20) |
| 03 | IAM, instance roles, cross-account | 20 | queued |
| 04 | VPC, subnets, route tables | 15 | queued |
| 05 | Transit Gateway & cross-acct routing | 15 | queued |
| 06 | DNS / Route 53 / Resolver | 15 | queued |
| 07 | Active Directory & domain join | 15 | queued |
| 08 | Systems Manager (SSM) | 15 | queued |
| # | Category | Count | Status |
|---|---|---|---|
| 09 | VPC endpoints | 15 | queued |
| 10 | CloudWatch Logs & Metrics | 15 | queued |
| 11 | Load balancers (ALB/NLB) | 15 | queued |
| 12 | Backup & DR | 15 | queued |
| 13 | FSx & storage | 10 | queued |
| 14 | Okta / federation / MFA | 15 | queued |
| 15 | Terraform / IaC operations | 20 | queued |
| 16 | Org / SCP / Landing Zone | 10 | queued |
Symptom: `i-0abc123` reaches `running`, but `Add-Computer` fails with "An Active Directory domain controller for the domain could not be contacted." App team is blocked.

- System > NETLOGON > Event 5719: "This computer was not able to set up a secure session with a domain controller in domain corp.globalcorp.local because of the following: The remote procedure call was cancelled."
- `cloud-init-output.log`: `Add-Computer : An Active Directory domain controller (AD DC) for the domain corp.globalcorp.local could not be contacted.`
| Constraint | Implication |
|---|---|
| SCP requires IMDSv2 | User-data must use IMDSv2 token call |
| SCP denies iam:CreateUser | Domain-join uses a vaulted AD service account, not an IAM user |
| VPC has no public subnet | No direct internet to secretsmanager.amazonaws.com — must use VPCe |
| R53 Resolver outbound rule for corp.globalcorp.local | DNS query must reach Resolver → corp DCs over TGW |
| SG sg-prod-private-windows-ingress | Egress 53/88/389/445/etc. to pl-corp-onprem |
| # | Layer | Hypothesis | Falsify with |
|---|---|---|---|
| H1 | Identity | Instance can't fetch svc-domjoin secret (no role / wrong KMS) | aws sts get-caller-identity · get-secret-value from instance |
| H2 | Reach | SG egress missing port to corp DCs | Reachability Analyzer 53/88/389/445 |
| H3 | Reach | TGW route table missing 10.0.0.0/8 | aws ec2 search-tgw-routes |
| H4 | Reach | Inspection FW dropping RPC dynamic 49152–65535 | FW logs + Test-NetConnection -Port 50000 |
| H5 | DNS | R53 Resolver rule missing/disassociated from VPC | list-resolver-rule-associations |
| H6 | Auth | Time skew > 5 min → Kerberos refuses | w32tm /query /status |
| H7 | Auth | Service account svc-domjoin lacks Add Computer right on target OU | DC sec event 4625 + delegation review |
First checks from the instance: `Resolve-DnsName -Type SRV _ldap._tcp.dc._msdcs.corp.globalcorp.local`, then `Test-NetConnection 10.10.0.10 -Port 389` and `-Port 445`.

```bash
# confirm identity + assume into prod-app
aws sts get-caller-identity
aws sts assume-role --role-arn arn:aws:iam::666666666666:role/FedAppDev \
  --role-session-name s001 --query Credentials

# Reachability Analyzer (config-time check)
aws ec2 create-network-insights-path \
  --source i-0abc123 --destination 10.10.0.10 \
  --protocol TCP --destination-port 445
aws ec2 start-network-insights-analysis --network-insights-path-id nip-...

# run the Windows-side diagnostics via SSM
aws ssm send-command --instance-ids i-0abc123 \
  --document-name AWS-RunPowerShellScript \
  --parameters 'commands=[
    "Resolve-DnsName -Type SRV _ldap._tcp.dc._msdcs.corp.globalcorp.local",
    "Test-NetConnection 10.10.0.10 -Port 389",
    "Test-NetConnection 10.10.0.10 -Port 445",
    "Test-NetConnection 10.10.0.10 -Port 50000",
    "w32tm /query /status",
    "klist"
  ]'
```
```text
Name                                     Type TTL Section
----                                     ---- --- -------
_ldap._tcp.dc._msdcs.corp.globalcorp...  SRV  600 Answer
Priority : 0   Port : 389   Target : dc1-ny.corp...

ComputerName     : 10.10.0.10
RemoteAddress    : 10.10.0.10
RemotePort       : 389
TcpTestSucceeded : True

ComputerName     : 10.10.0.10
RemotePort       : 50000
TcpTestSucceeded : False   # <-- this is our smoking gun
```
```text
# Network Firewall log query in CloudWatch Logs Insights
fields @timestamp, src, dst, dst_port, action
| filter dst = "10.10.0.10"
| filter src like /^10\.20\.10\./
| filter action = "DROP"
| stats count() by dst_port
```
Root cause: the rule group `gc-corp-ad-allow` permits only the well-known AD ports: 53/88/389/636/3268/3269/445. DNS and LDAP therefore work during `Add-Computer`, but the secure channel setup needs a dynamic RPC port → FW drops → client retries → eventually surfaces as "RPC was cancelled".

| Misleading signal | Why |
|---|---|
| NETLOGON 5719 fires for many causes | Same code, 7+ root causes |
| Reachability Analyzer (cfg) passes 445 | Doesn't test stateful FW dynamic ports |
| FW logs in central acct | Devs lack read on log-archive |
| Sometimes works (race) | If RPC happens to negotiate <49152, it passes |
Fix: pin the DCs' dynamic RPC range to 50000–50099 and only allow that subset through the firewall.

```bash
# 1. Patch the Network Firewall rule group to permit
#    the AD RPC dynamic range OR a constrained sub-range.
aws network-firewall describe-rule-group \
  --rule-group-name gc-corp-ad-allow --type STATEFUL \
  --query 'RuleGroup' > rg.json

# 2. Append rule (Suricata syntax) and update.
#    pass tcp $HOME_NET any -> $CORP_DC any (msg:"AD RPC dyn"; \
#      flow:to_server,established; sid:1000201; rev:1; \
#      dst_port:[49152:65535];)
aws network-firewall update-rule-group \
  --rule-group-name gc-corp-ad-allow --type STATEFUL \
  --rule-group file://rg.json --update-token <token>
```
```bash
aws ssm send-command --instance-ids i-0abc123 \
  --document-name AWS-RunPowerShellScript \
  --parameters 'commands=[
    "Test-NetConnection 10.10.0.10 -Port 50000",
    "Add-Computer -DomainName corp.globalcorp.local -Credential $cred -Restart"
  ]'
```
Confirm SG egress still lines up with pl-corp-onprem, and tag the rule group Owner=netsec, References=AD-DC-Ports.

```bash
# in gc-network repo
git checkout -b fix/nf-ad-rpc-dynamic
# edit modules/inspection-fw/rules/ad-allow.suricata
git diff --stat
terraform fmt -recursive
terraform validate
terraform plan -var-file=envs/us-east-1.tfvars \
  -target=module.inspection_fw.aws_networkfirewall_rule_group.ad_allow
```
resource "aws_networkfirewall_rule_group" "ad_allow" { capacity = 200 name = "gc-corp-ad-allow" type = "STATEFUL" rule_group { rules_source { rules_string = file("${path.module}/rules/ad-allow.suricata") } rule_variables { ip_sets { key = "CORP_DC" ip_set { definition = ["10.10.0.10/32", "10.10.0.11/32"] } } } } tags = local.tags }
```text
# existing well-known AD ports...
pass tcp $HOME_NET any -> $CORP_DC [53,88,389,636,3268,3269,445] \
  (msg:"AD well-known"; sid:1000101; rev:2;)

# NEW: AD RPC dynamic range (constrained to 50000-50099)
pass tcp $HOME_NET any -> $CORP_DC 50000:50099 \
  (msg:"AD RPC dyn constrained"; flow:to_server,established; sid:1000201; rev:1;)
```
- CI: checkov + tflint green; CI posts the plan to the PR.
- Review: @netsec-leads, @ad-leads.
- Apply: `apply.yml` assumes `FedTerraformApply` via OIDC, runs `terraform apply`.
- Keep `var.ad_rpc_range = "50000-50099"` with the same value referenced by the per-spoke SG egress modules — one source of truth.

Note you never RDP'd into the host: `aws ssm send-command` with `AWS-RunPowerShellScript` ran the diagnostics with full audit (CloudTrail + SSM Run Command history). For interactive work, `aws ssm start-session --target i-...` is your shell.
Reachability Analyzer evaluates configuration only: SG, NACL, route tables, TGW. It does not evaluate stateful Network Firewall rules. If RA says reachable but traffic isn't, suspect the inspection FW, host firewall, MTU, or asymmetric routes.
Devs often lack read on gc-log-archive. We expose a cross-account CloudTrail Lake datastore + a read-only log-insights view via aws-vault assume-role onto a FedNetTroubleshoot role — so devs can query FW logs without copying data.
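A minimal sketch of that cross-account query path (the aws-vault profile name and log group are illustrative, not from this runbook):

```bash
# assume the read-only troubleshooting role, then query the central FW logs
aws-vault exec fednet-troubleshoot -- \
  aws logs start-query \
    --log-group-name /gc/network-firewall/flow \
    --start-time "$(date -d '-1 hour' +%s)" --end-time "$(date +%s)" \
    --query-string 'fields @timestamp, src, dst, dst_port, action
      | filter action = "DROP" | stats count() by dst_port'

# then poll for results with the returned queryId
aws-vault exec fednet-troubleshoot -- \
  aws logs get-query-results --query-id <query-id>
```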
Default AD RPC dynamic range is huge. We pin DCs to 50000–50099 and document it as the AD-team contract. FW rule shrinks from a 16k-port hole to 100 ports.
SCP enforces IMDSv2. Our user-data fetches the IMDSv2 token first, then the role creds, then the secret. If you script the IMDSv1 way it silently 401s and the domain-join “just” fails.
```powershell
# PowerShell line continuation is a backtick, not a backslash
$tk = Invoke-RestMethod -Headers @{"X-aws-ec2-metadata-token-ttl-seconds"="300"} `
    -Method PUT -Uri "http://169.254.169.254/latest/api/token"
# then present the token on every metadata call, e.g. the role credentials:
Invoke-RestMethod -Headers @{"X-aws-ec2-metadata-token"=$tk} `
    -Uri "http://169.254.169.254/latest/meta-data/iam/security-credentials/"
```
Add DomainJoin=required tag at launch. A maintenance window step waits for that tag, then runs the GC-JoinOnPremAD doc — lets you re-run domain join idempotently after a fix without rebuilding the host.
Prevent / monitor:

- CW alarm on `DroppedPackets` with dimension `StatefulRuleGroup=gc-corp-ad-allow` — non-zero in 5 min → PagerDuty; dashboard `gc-ad-health`.
- Config rule `required-network-firewall-rule-group-tags` — rule groups must carry the `References=AD-DC-Ports` tag (…so they show up in this audit).
- SSM Run Command success/failure for `GC-JoinOnPremAD`; failures trigger a Lambda that posts diagnostics to Slack #ad-domain-join.
- Canary: `Test-NetConnection` matrix to all corp DCs every 5 min, emits a CW custom metric.
- `aws_networkfirewall_rule_group` changes require @ad-leads review (CODEOWNERS).

In the lab terminal, type `hint`, `show`, `reset`, or `list` at any time.

Symptom: `aws ec2 run-instances` returns an InstanceId, but the launch fails — the instance sits `pending` for ~8 min, then transitions to `shutting-down` → `terminated` with `Server.InternalError: Internal error on launch`.

| Item | Detail |
|---|---|
| AMI | shared from gc-tools-cicd (121212121212) |
| Root volume | EBS encrypted with customer KMS key in gc-tools-cicd |
| Launching account | gc-prod-app (666666666666) |
| Default EBS encryption | on, with account-default KMS key in 666... (different key) |
| Service role | AWSServiceRoleForAutoScaling |
Server.InternalError is the polite version of "something on the EC2 side blew up" — almost always EBS attach, ENI attach, or KMS.

| # | Hypothesis | Disprove with |
|---|---|---|
| H1 | EBS attach fails — KMS cross-acct grant missing | describe-instance-attribute --attribute reason |
| H2 | ENI attach fails — subnet/AZ ran out of IPs | describe-subnets AvailableIpAddressCount |
| H3 | AZ capacity (Insufficient) | StateReasonMessage contains Insufficient capacity |
| H4 | Tenancy mismatch (dedicated host expired) | describe-host-reservations |
| H5 | SCP blocking iam:PassRole during launch | CloudTrail event RunInstances errorCode |
```bash
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=ResourceName,AttributeValue=i-0xx \
  --max-results 5 --query 'Events[].CloudTrailEvent' \
  | jq -r '.[]' | jq 'select(.errorCode != null) | {errorCode, errorMessage}'
```
```bash
# 1. Pull the StateReason directly
aws ec2 describe-instances --instance-ids i-0xx \
  --query 'Reservations[].Instances[].{S:State.Name,R:StateReason}'

# 2. Pull instance-status (more granular)
aws ec2 describe-instance-status --instance-ids i-0xx \
  --include-all-instances

# 3. Inspect the snapshot encryption + KMS key
aws ec2 describe-snapshots --snapshot-ids snap-0xx \
  --query 'Snapshots[].{Enc:Encrypted,KMS:KmsKeyId,Owner:OwnerId}'

# 4. Check key policy in the source account
aws --profile gc-tools kms get-key-policy \
  --key-id alias/tools-ami --policy-name default | jq
```
```bash
# 5. List grants on the key (look for our role)
aws --profile gc-tools kms list-grants \
  --key-id alias/tools-ami \
  --query 'Grants[?contains(GranteePrincipal,`666666666666`)]'

# 6. Try the decrypt directly with an exec-role on a test instance
aws ssm send-command --instance-ids i-test \
  --document-name AWS-RunShellScript \
  --parameters 'commands=[
    "aws kms describe-key --key-id arn:aws:kms:us-east-1:121212121212:key/aaa..."
  ]'
```
Root cause — CloudTrail shows AccessDenied on kms:Decrypt for the principal AWSServiceRoleForAutoScaling, not the user/role that called RunInstances:

- ami-prod-base uses an encrypted snapshot backed by KMS key arn:aws:kms:us-east-1:121212121212:key/aaa-tools-ami.
- When the ASG launches in gc-prod-app, the launch goes through the service-linked role AWSServiceRoleForAutoScaling.
- That role needs kms:CreateGrant on the source key on behalf of EC2/EBS.
- The key policy never granted kms:CreateGrant to aws-service-role/autoscaling.amazonaws.com.
- The instance sits pending; EC2 retries the EBS attach for ~8 min, then gives up → Server.InternalError.

Fix, part 1 (in the gc-tools-cicd repo): let the spoke account use the key.

```hcl
data "aws_iam_policy_document" "tools_ami_key" {
  statement {
    sid = "AllowSpokeAccountsToUseKey"
    actions = [
      "kms:Decrypt", "kms:DescribeKey",
      "kms:ReEncrypt*", "kms:GenerateDataKey*",
      "kms:CreateGrant",
    ]
    principals {
      type        = "AWS"
      identifiers = ["arn:aws:iam::666666666666:root"]
    }
    resources = ["*"]
    condition {
      test     = "StringEquals"
      variable = "kms:ViaService"
      values   = ["ec2.us-east-1.amazonaws.com"]
    }
  }
}
```
resource "aws_iam_role_policy" "asg_kms" { role = "AWSServiceRoleForAutoScaling" policy = jsonencode({ Version="2012-10-17", Statement=[{ Effect="Allow", Action=["kms:CreateGrant","kms:Decrypt", "kms:ReEncrypt*","kms:GenerateDataKey*", "kms:DescribeKey"], Resource="arn:aws:kms:us-east-1:121212121212:key/aaa-tools-ami" }] }) }
A shared module takes spoke_account_ids + kms_key_arn and emits both the key policy statement and the spoke-side grant from a single locals.tf source of truth.

Use VPC Reachability Analyzer? No — this is KMS, not network. Use IAM Access Analyzer (cross-account) to surface keys exposed/granted across accounts before the launch even happens.
Pre-flight: aws ec2 run-instances --dry-run only checks the calling principal — not the EBS KMS chain. Bake an explicit kms:DescribeKey probe into your AMI promotion job.
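One possible shape for that probe, as a sketch — it assumes the promotion job has a pre-configured assume-role profile per spoke account (profile naming and account list are illustrative):

```bash
#!/usr/bin/env bash
# Fail AMI promotion early if a spoke can't even describe the source KMS key.
set -euo pipefail
KEY_ARN="arn:aws:kms:us-east-1:121212121212:key/aaa-tools-ami"

for acct in 666666666666 888888888888; do
  profile="spoke-$acct"   # assume-role profile into the spoke account
  if ! aws --profile "$profile" kms describe-key \
       --key-id "$KEY_ARN" >/dev/null 2>&1; then
    echo "FAIL: $acct cannot DescribeKey on $KEY_ARN" >&2
    exit 1
  fi
done
echo "KMS pre-flight OK"
```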
If you can't change the source key policy, copy the AMI into the spoke account and re-encrypt with the local default key. The cost is double-storage; the win is no cross-acct grants to maintain.
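A hedged sketch of that copy-and-re-encrypt path, run from the spoke account once the AMI is shared (name and key alias illustrative):

```bash
# run in gc-prod-app: copy the shared AMI and re-encrypt with the local key
aws ec2 copy-image \
  --source-image-id ami-0abc123def456 \
  --source-region us-east-1 --region us-east-1 \
  --name orders-api-local-copy \
  --encrypted --kms-key-id alias/aws/ebs
```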
Alarm on the EBS metric VolumeAttachFailures (custom via EventBridge on AttachVolume errorCode), routes to #platform-pager.
kms-cmk-not-scheduled-for-deletion + a custom rule that flags any KMS key shared cross-acct that is missing CreateGrant to autoscaling.amazonaws.com.
The AMI promotion product validates: launch perm + snapshot share + KMS grant exist for every spoke account in scope. If not, promotion fails.
Symptom: user-data never ran — cloud-init-output.log is empty on Linux; the EC2Launch log shows "UserData persist disabled" on Windows.

| Item | Detail |
|---|---|
| OS | Linux: AL2023; Windows: Server 2022 EC2Launch v2 |
| Launch source | Launch Template v6 (just promoted) |
| AMI | baked yesterday from custom pipeline |
| User-data | shell script (Linux) / <powershell>...</powershell> (Win) |
Four usual suspects: (1) EC2Launch persist flag, (2) cloud-init disabled in the baked image, (3) MIME multi-part malformed, (4) #cloud-config typo.

| # | Hypothesis | Disproof |
|---|---|---|
| H1 | AMI baked with cloud-init semaphores already present (Linux) | ls /var/lib/cloud/sem/ on baked AMI |
| H2 | AMI baked w/o running EC2Launch SysprepInstance (Win) | EC2Launch.exe sysprep --shutdown log |
| H3 | User-data MIME multi-part missing Content-Type: text/x-shellscript | head -c 500 /var/lib/cloud/instance/user-data.txt |
| H4 | #cloud-config YAML invalid — cloud-init silently no-ops | cloud-init schema --system |
| H5 | Launch Template v6 has empty UserData field | describe-launch-template-versions |
Linux: `cloud-init status --long` and `journalctl -u cloud-final`. Windows: `Get-Content C:\ProgramData\Amazon\EC2Launch\log\agent.log -Tail 200`. Always compare against the rendered user-data at `http://169.254.169.254/latest/user-data`: if it's wrong there, the LT is wrong; if it's right there but didn't run, it's the AMI.

```bash
# IMDSv2 token first
TOKEN=$(curl -s -X PUT http://169.254.169.254/latest/api/token \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 60")

# Rendered user-data
curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
  http://169.254.169.254/latest/user-data | head -40

# cloud-init status + log
cloud-init status --long
sudo journalctl -u cloud-final --no-pager | tail -100

# Look for stale semaphores baked into AMI
ls -la /var/lib/cloud/sem/
ls -la /var/lib/cloud/instance/
```
```powershell
# EC2Launch v2 task state
Get-Service AmazonSSMAgent
Get-Content "C:\ProgramData\Amazon\EC2Launch\log\agent.log" `
  -Tail 200

# Has UserData been marked "run-once"?
Test-Path "C:\ProgramData\Amazon\EC2Launch\state\.run-once"

# Re-arm UserData for next boot
& "C:\Program Files\Amazon\EC2Launch\EC2Launch.exe" reset --schedule

# Check the rendered user-data
Invoke-WebRequest -Headers @{"X-aws-ec2-metadata-token"=$tk} `
  -Uri "http://169.254.169.254/latest/user-data"
```
Root cause: the bake never cleaned /var/lib/cloud/sem/ before aws ec2 create-image, so cloud-init's per-instance scripts-user module thought it had already run; on Windows the .run-once flag survived because the bake skipped EC2Launch.exe sysprep.

```hcl
# last provisioner before snapshot
provisioner "shell" {
  inline = [
    "sudo cloud-init clean --logs",
    "sudo rm -rf /var/lib/cloud/sem/* /var/lib/cloud/instance",
    "sudo rm -f /etc/machine-id && sudo touch /etc/machine-id",
    "sudo truncate -s 0 /etc/hostname",
    "sudo rm -rf /root/.ssh /home/ec2-user/.ssh",
  ]
}
```
provisioner "powershell" { inline = [ "& 'C:\\Program Files\\Amazon\\EC2Launch\\EC2Launch.exe' reset", "& 'C:\\Program Files\\Amazon\\EC2Launch\\EC2Launch.exe' sysprep --shutdown" ] }
```hcl
# in tools-cicd: promotion job
resource "aws_ssm_parameter" "prod_ami_id" {
  name  = "/gc/prod/ami/orders-api"
  type  = "String"
  value = data.aws_ami.candidate.id

  lifecycle {
    precondition {
      condition     = data.aws_ami.candidate.tags["PackerCleanupRun"] == "true"
      error_message = "AMI must be tagged PackerCleanupRun=true."
    }
  }
}
```
The bake tags the AMI PackerCleanupRun=true only after the cleanup step; promotion to prod refuses without that tag.

Force user-data to re-run on next boot via SSM — no console:

```bash
aws ssm send-command --document-name AWS-RunShellScript \
  --parameters 'commands=["sudo cloud-init clean --logs && sudo cloud-init init"]'
```
Switch the bake from “run user-data” to a cfn-init-style metadata pull. Move agent installs into Image Builder components — AMI ships ready, user-data only does instance-specific config.
Test in the bake: add a Packer post-processor that launches the AMI in a sandbox subnet with a probe user-data; if probe doesn't run, fail the build.
EventBridge rule on EC2 Instance State-change Notification with state=running and a Lambda that probes IMDS user-data & cloud-init status; emits CW custom metric UserDataExecuted=0/1.
Config rule: ec2-instance-managed-by-systems-manager (catches the wider problem — if your bake breaks SSM agent registration too).
Bake CI uploads a bake-report.json to a central bucket; the AMI promotion job validates the report contains cloud_init_clean: true.
Symptom: the app calls aws sts get-caller-identity from inside the instance and gets the old role FedAdmin — not the expected orders-api-task-role. S3 calls fail with AccessDenied; reads from a different bucket succeed.

| Item | Detail |
|---|---|
| Launch path | ASG → Launch Template (just bumped to v7) |
| Old profile | ip-orders-api-v1 with role FedAdmin (yes, sloppy) |
| New profile | ip-orders-api-v2 with role orders-api-task-role |
| EC2 metadata cache | credentials cached by SDK for ~6 hr |
| # | Hypothesis | Disprove |
|---|---|---|
| H1 | Profile swap not yet applied to running instances | describe-iam-instance-profile-associations |
| H2 | Profile applied, but SDK cached old creds | IMDS shows new role; SDK shows old |
| H3 | Instance manually overrides creds via env | printenv \| grep AWS_ |
| H4 | App container is using a task role from ECS not EC2 | curl $AWS_CONTAINER_CREDENTIALS_RELATIVE_URI |
| H5 | Profile resource policy denies AssumeRole on new role | CloudTrail AssumeRole error |
`curl http://169.254.169.254/latest/meta-data/iam/security-credentials/` returns the role currently associated with the instance profile. If that's wrong, IMDS hasn't flipped yet.

```bash
# 1. What does the API say is associated?
aws ec2 describe-iam-instance-profile-associations \
  --filters Name=instance-id,Values=i-0xx \
  --query 'IamInstanceProfileAssociations[].{S:State,Arn:IamInstanceProfile.Arn}'

# 2. What does IMDS say?
TK=$(curl -s -X PUT http://169.254.169.254/latest/api/token \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 60")
curl -s -H "X-aws-ec2-metadata-token: $TK" \
  http://169.254.169.254/latest/meta-data/iam/security-credentials/

# 3. SDK says what?
aws sts get-caller-identity
```
```bash
# 4. Force-rotate by re-associating profile
ASSOC=$(aws ec2 describe-iam-instance-profile-associations \
  --filters Name=instance-id,Values=i-0xx \
  --query 'IamInstanceProfileAssociations[0].AssociationId' --output text)
aws ec2 replace-iam-instance-profile-association \
  --association-id $ASSOC \
  --iam-instance-profile Name=ip-orders-api-v2

# 5. Restart the app or SSM agent
sudo systemctl restart orders-api
sudo systemctl restart amazon-ssm-agent

# 6. Confirm
sleep 30 && aws sts get-caller-identity
```
If the app still sees old creds after the flip, check AWS_EC2_METADATA_DISABLED=false and clear the SDK's in-memory cache (boto3: session.get_credentials().refresh()).

Root cause: the LT bump changed aws_iam_instance_profile from ip-orders-api-v1 to ip-orders-api-v2 on the launch template; the team ran terraform apply and assumed the fleet refreshed — but the ASG only refreshes on instance-refresh or scale events. Immediate fix: replace-iam-instance-profile-association across the fleet.

```hcl
resource "aws_launch_template" "orders_api" {
  name_prefix = "orders-api-"

  iam_instance_profile {
    name = aws_iam_instance_profile.orders_v2.name
  }

  user_data = base64encode(templatefile("ud.sh.tftpl", {}))

  metadata_options {
    http_tokens                 = "required"
    http_put_response_hop_limit = 2
  }

  tag_specifications {
    resource_type = "instance"
    tags          = local.tags
  }
}

resource "aws_autoscaling_group" "orders_api" {
  # ...
  launch_template {
    id      = aws_launch_template.orders_api.id
    version = "$Latest"
  }

  instance_refresh {
    strategy = "Rolling"
    preferences { min_healthy_percentage = 90 }
    triggers = ["launch_template"] # <-- key
  }
}
```
triggers = ["launch_template"] matters"launch_template" means any LT version bump (incl. instance profile) auto-rolls the fleet.min_healthy_percentage = 90, the rollout is safe.force_delete; some shops also gate on a checkov rule that requires http_tokens = "required" (IMDSv2) on every launch template — matches the SCP guardrail.Use aws sts get-caller-identity output as the source of truth in app boot logs. If the assumed role doesn't match expected, panic-exit the process — let ASG kill and replace.
One profile, multiple roles? Not possible. Instance profiles take exactly one role. Use STS chain (orders-api-bootstrap-role → orders-api-task-role) for runtime privilege downgrade.
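A sketch of that chain from inside the instance — role names are from this scenario, and it assumes the trust policy on orders-api-task-role allows the bootstrap role to assume it:

```bash
# instance profile carries orders-api-bootstrap-role; hop to the task role
CREDS=$(aws sts assume-role \
  --role-arn arn:aws:iam::666666666666:role/orders-api-task-role \
  --role-session-name runtime-downgrade \
  --query 'Credentials.[AccessKeyId,SecretAccessKey,SessionToken]' \
  --output text)
read -r AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_SESSION_TOKEN <<<"$CREDS"
export AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_SESSION_TOKEN

aws sts get-caller-identity   # now reports the task role
```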
Set SDK AWS_METADATA_SERVICE_TIMEOUT=2 + AWS_METADATA_SERVICE_NUM_ATTEMPTS=3 so credential rotation issues fail loudly, not silently.
Synthetic canary calls get-caller-identity every minute, emits a CW metric InstanceRoleId. Alarm if it diverges from expected for > 10 min.
EventBridge rule on AssociateIamInstanceProfile + ReplaceIamInstanceProfileAssociation → Slack #iam-changes.
Config rule: ec2-instance-profile-attached, plus a custom rule that asserts the profile name matches expected per environment tag.
Symptom: Client.InvalidAMIID.NotFound — "An error occurred (InvalidAMIID.NotFound) when calling the RunInstances operation."

| Item | Value |
|---|---|
| Source acct | gc-tools-cicd (1212...) |
| Spoke acct | gc-prod-app (6666...) |
| AMI ID | ami-0abc123def456 |
| Region scope | us-east-1 only (no copy to eu-west-1) |
| # | Hypothesis | Disprove |
|---|---|---|
| H1 | AMI deregistered in source acct | aws ec2 describe-images --owners 1212 --image-ids ami-0xx |
| H2 | Launch permission revoked for spoke | describe-image-attribute --attribute launchPermission |
| H3 | Wrong region (LT in eu-west-1 referencing us-east-1 AMI) | region in LT vs ASG |
| H4 | Encrypted snapshot share missing (AMI is encrypted) | describe-snapshot-attribute |
Check for DeregisterImage in the source account to disambiguate fast.

```bash
# 1. Does the AMI exist for the source owner?
aws ec2 describe-images --owners 121212121212 \
  --image-ids ami-0abc123def456 \
  --query 'Images[].{ID:ImageId,State:State,Name:Name}' \
  --output table

# 2. Was it deregistered?
aws --profile gc-tools cloudtrail lookup-events \
  --lookup-attributes AttributeKey=ResourceName,AttributeValue=ami-0abc123def456 \
  --max-results 5 --query 'Events[].{T:EventTime,N:EventName,U:Username}'
```
```bash
# 3. Resolve via SSM parameter (what should happen)
aws ssm get-parameter --name /gc/prod/ami/orders-api \
  --query Parameter.Value --output text

# 4. Check launch permission
aws --profile gc-tools ec2 describe-image-attribute \
  --image-id ami-0abc123def456 --attribute launchPermission

# 5. Check snapshot share for encrypted AMI
aws --profile gc-tools ec2 describe-snapshot-attribute \
  --snapshot-id snap-0xx --attribute createVolumePermission
```
Root cause: the AMI was deregistered in the source account while the spoke LT still referenced the raw ID — hence InvalidAMIID.NotFound. Fix: reference an SSM parameter, never a hard-coded AMI ID.

```hcl
data "aws_ssm_parameter" "orders_ami" {
  name = "/gc/prod/ami/orders-api"
}

resource "aws_launch_template" "orders_api" {
  image_id = data.aws_ssm_parameter.orders_ami.value
  # ...
}

# in tools-cicd: write parameter on every promotion
resource "aws_ssm_parameter" "prod_ami" {
  name      = "/gc/prod/ami/orders-api"
  type      = "String"
  data_type = "aws:ec2:image"
  value     = aws_ami_copy.candidate.id
  overwrite = true
}
```
Resolve through the SSM parameter everywhere (handy even as an aws CLI alias). data_type = "aws:ec2:image" makes SSM validate the AMI ID format and existence at write-time — you can't accidentally write a typo. EC2 LaunchTemplate also accepts resolve:ssm:/gc/prod/ami/orders-api directly in image_id — no data source needed.
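For example (a sketch; the LT ID is illustrative), a new LT version can reference the parameter instead of an AMI ID:

```bash
# EC2 resolves the parameter to the current AMI at launch time
aws ec2 create-launch-template-version \
  --launch-template-id lt-0xx --source-version '$Latest' \
  --launch-template-data \
  '{"ImageId":"resolve:ssm:/gc/prod/ami/orders-api"}'
```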
Keep the prior AMI: bake step writes /gc/prod/ami/orders-api/previous. Roll-back is one parameter version flip + ASG instance-refresh.
For DR region, mirror the parameter via Lambda in tools-cicd that runs on parameter change and copies to eu-west-1 with the eu-west-1 AMI ID.
EventBridge on DeregisterImage; if the deregistered AMI is referenced by any LT (search via Config), page the team.
Bake pipeline keeps last 5 AMIs plus any AMI referenced by a non-deleted LT (cross-account introspection).
ASG instance-refresh with auto-rollback on health failure: rolling out a bad AMI auto-reverts.
| Item | Detail |
|---|---|
| Subnet | subnet-dmz-use1a (10.20.0.0/24) |
| Auto-assign IP | was set; SCP recently flipped it off org-wide |
| EIP allocation | requested in user-data via aws ec2 associate-address |
| Instance role | lacks ec2:AssociateAddress |
| # | Hypothesis | Disprove |
|---|---|---|
| H1 | Subnet auto-assign disabled (SCP) | describe-subnets MapPublicIpOnLaunch |
| H2 | EIP not associated — user-data role missing perm | cloud-init log + IAM SimulatePrincipalPolicy |
| H3 | EIP exhausted — account quota | describe-account-attributes |
| H4 | EIP allocated in different region | describe-addresses --region eu-west-1 |
Tip: `aws iam simulate-principal-policy --policy-source-arn <role> --action-names ec2:AssociateAddress --resource-arns '*'` proves the permission without running anything.

```bash
# 1. Subnet flag
aws ec2 describe-subnets --subnet-ids subnet-dmz-use1a \
  --query 'Subnets[].{Auto:MapPublicIpOnLaunch,IPs:AvailableIpAddressCount}'

# 2. Instance state
aws ec2 describe-instances --instance-ids i-0xx \
  --query 'Reservations[].Instances[].{Pub:PublicIpAddress,Priv:PrivateIpAddress}'

# 3. EIP available?
aws ec2 describe-addresses \
  --query 'Addresses[?AssociationId==`null`].PublicIp'
```
```bash
# 4. Did user-data fail silently?
sudo cat /var/log/cloud-init-output.log | grep -i associate

# 5. IAM perm proof
aws iam simulate-principal-policy \
  --policy-source-arn arn:aws:iam::666...:role/bastion-role \
  --action-names ec2:AssociateAddress \
  --resource-arns arn:aws:ec2:us-east-1:666...:elastic-ip/eipalloc-0xx

# 6. Manually associate (fix attempt)
aws ec2 associate-address \
  --instance-id i-0xx --allocation-id eipalloc-0xx
```
Root cause chain:

- The SCP NoAutoPublicIp denies RunInstances with AssociatePublicIpAddress=true on ENIs, so subnet auto-assign no longer applies.
- User-data falls back to associate-address, but the bastion role only had ec2:DescribeAddresses, not ec2:AssociateAddress.
- The user-data script didn't fail the boot (no set -e); the instance reached running with no public IP and no alarm.
- Note associate-address requires permission on the EIP allocation and on the instance ENI — forgetting the ENI ARN is a frequent IAM cause.

```hcl
resource "aws_iam_role_policy" "bastion_eip" {
  role = aws_iam_role.bastion.id
  policy = jsonencode({
    Version = "2012-10-17",
    Statement = [{
      Effect = "Allow",
      Action = [
        "ec2:AssociateAddress", "ec2:DisassociateAddress",
        "ec2:DescribeAddresses",
      ],
      Resource = "*",
      Condition = {
        StringEquals = { "aws:ResourceTag/Role" = "bastion" }
      }
    }]
  })
}
```
```hcl
# user-data hardening
provisioner "file" {
  content = <<-EOT
    #!/usr/bin/env bash
    set -euo pipefail
    TOKEN=$(curl -s -X PUT \
      http://169.254.169.254/latest/api/token \
      -H "X-aws-ec2-metadata-token-ttl-seconds: 60")
    INST=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
      http://169.254.169.254/latest/meta-data/instance-id)
    aws ec2 associate-address \
      --instance-id $INST --allocation-id ${var.eip_alloc_id}
  EOT
}
```
Harden the script with `set -euo pipefail` and probe at the end — `aws ec2 describe-instances --instance-ids $INST --query 'Reservations[].Instances[].PublicIpAddress'` — if blank, exit 1 → ASG kills the instance.

For bastions, prefer SSM Session Manager with port-forwarding — no public IP, no SSH key, fully audited.
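A sketch of the Session Manager port-forward (the local port is arbitrary):

```bash
# forward local 9022 to the bastion's sshd over SSM -- no public IP needed
aws ssm start-session --target i-0xx \
  --document-name AWS-StartPortForwardingSession \
  --parameters '{"portNumber":["22"],"localPortNumber":["9022"]}'

# in another shell
ssh -p 9022 ec2-user@localhost
```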
Tag the EIP with Role=bastion + InstanceTag=bastion-prod. Use IAM aws:ResourceTag condition to scope ec2:AssociateAddress to only EIPs you own.
Avoid auto-assign public IP at the subnet level for any production tier — it's implicit and easy to leak. Always EIP+explicit assoc.
EventBridge on EC2 Instance State-change Notification · running + Lambda asserts PublicIpAddress != null for tagged bastions.
Synthetic canary: every 1 min, attempt nc -vz from external runner to bastion EIP:22; alarm on failure.
Config rule: elastic-ip-required-tags + custom rule that flags any unassociated EIP > 1 day old (cost).
Symptom: an instance stop/start moved the private IP (10.20.10.99 → 10.20.10.121). orders-api.gcaws.internal still resolves to the old IP for ~10 minutes; clients get connection refused.

| Item | Detail |
|---|---|
| DNS | R53 PHZ gcaws.internal (associated to prod-app VPC) |
| Record | A orders-api → literal IP, TTL 60 |
| Update path | manual today; nobody updated the record |
| App | Java app, DNS cached forever (default sec.policy) |
Java's DNS cache is controlled by networkaddress.cache.ttl; with the default security policy the JVM caches successful lookups until process restart. Add -Dsun.net.inetaddr.ttl=60.

| # | Hypothesis | Disprove |
|---|---|---|
| H1 | PHZ record stale | list-resource-record-sets → compare to instance IP |
| H2 | Client DNS cache (Java) | jcmd <pid> VM.system_properties \| grep ttl |
| H3 | Connection pool pinned to old socket | app metric / process restart fixes |
| H4 | NLB cross-zone disabled, target re-registration delayed | describe-target-health |
`getent hosts orders-api.gcaws.internal` on the host queries the resolver directly. If it's right but the app sees the old IP, it's the JVM/SDK cache.

```bash
# 1. Current PHZ record
aws route53 list-resource-record-sets \
  --hosted-zone-id Z0XXX \
  --query "ResourceRecordSets[?Name=='orders-api.gcaws.internal.']"

# 2. Current instance IP
aws ec2 describe-instances --instance-ids i-0xx \
  --query 'Reservations[].Instances[].PrivateIpAddress'

# 3. From inside the host
getent hosts orders-api.gcaws.internal
dig +short orders-api.gcaws.internal
```
```bash
# 4. Update the record (immediate fix)
aws route53 change-resource-record-sets \
  --hosted-zone-id Z0XXX --change-batch file://upsert.json

# 5. Force JVM to re-resolve (cheeky)
sudo systemctl restart orders-api   # cleanest
# or via JMX: jcmd <pid> VM.system_properties | grep -i ttl

# 6. Confirm
ss -tnp | grep orders-api   # new sockets to right IP
```
resource "aws_lb" "orders" { internal=true; load_balancer_type="application"; ... } resource "aws_lb_target_group" "orders" { ... } resource "aws_route53_record" "orders" { zone_id = data.aws_route53_zone.gcaws.zone_id name = "orders-api" type = "A" alias { name = aws_lb.orders.dns_name; zone_id = aws_lb.orders.zone_id; evaluate_target_health = true } }
resource "aws_cloudwatch_event_rule" "ec2_state" { event_pattern = jsonencode({ source = ["aws.ec2"], detail-type = ["EC2 Instance State-change Notification"], detail = { state=["running"] } }) } resource "aws_lambda_function" "phz_updater" { ... } # Lambda reads instance tag DnsName, upserts PHZ record
JVM DNS cache fix without app restart: java.security.Security.setProperty("networkaddress.cache.ttl","60") at boot, or env-level JAVA_OPTS=-Dsun.net.inetaddr.ttl=60.
Avoid stop/start on prod EC2 entirely — replace the instance via ASG instance-refresh. Cattle, not pets.
Need a stable IP without LB? Attach a secondary ENI you provision separately. ENI persists; primary IP is on the ENI; the ENI moves with the instance.
Synthetic canary on every named PHZ entry — periodically validates DNS-vs-target IP. Alarm on divergence > 5 min.
Config rule: route53-records-only-pointing-to-running-resources (custom).
SCP doesn't directly help. Lint rule in Terraform: forbid aws_route53_record with type=A and records=[] — force ALB-alias.
Symptom: InsufficientInstanceCapacity — "We currently do not have sufficient c6i.4xlarge capacity in the AZ you requested (us-east-1a)."

| Item | Detail |
|---|---|
| ASG AZs | us-east-1a only (legacy) |
| Instance type | c6i.4xlarge only |
| Capacity reservation | none |
| SCP region lock | us-east-1, eu-west-1 |
| # | Hypothesis | Disprove |
|---|---|---|
| H1 | Genuine AZ capacity shortage at peak | StateReason + EventBridge ASG events |
| H2 | Account-level on-demand vCPU quota hit | Service Quotas: L-1216C47A |
| H3 | Subnet IP exhausted (looks similar) | describe-subnets AvailableIpAddressCount |
| H4 | SCP denies new types beyond approved list | simulate run-instances |
ICE is usually AZ-local — probe the other AZs; capacity often exists one AZ over.

```bash
# 1. ASG scaling activity
aws autoscaling describe-scaling-activities \
  --auto-scaling-group-name orders-asg --max-records 10 \
  --query 'Activities[].{T:StartTime,S:StatusCode,M:StatusMessage}'

# 2. Quota
aws service-quotas get-service-quota \
  --service-code ec2 --quota-code L-1216C47A

# 3. Subnet IPs
aws ec2 describe-subnets --subnet-ids subnet-priv-use1a \
  --query 'Subnets[].AvailableIpAddressCount'
```
```bash
# 4. Probe other AZs (dry-run)
for az in us-east-1a us-east-1b us-east-1c; do
  echo $az
  aws ec2 run-instances --dry-run --instance-type c6i.4xlarge \
    --image-id ami-0xx --subnet-id $(subnet_for $az) \
    --query Errors --output text 2>&1 | head -2
done

# 5. ODCR check
aws ec2 describe-capacity-reservations \
  --filters Name=state,Values=active \
  --query 'CapacityReservations[].{T:InstanceType,AZ:AvailabilityZone,Avail:AvailableInstanceCount}'
```
Root cause: a genuine AZ-level shortage of c6i.4xlarge at the 9am peak (regional event affecting many tenants) — and the ASG was pinned to one AZ and one instance type.

```hcl
resource "aws_autoscaling_group" "orders" {
  vpc_zone_identifier = local.private_subnets_3az
  min_size            = 4
  desired_capacity    = 8
  max_size            = 40

  mixed_instances_policy {
    launch_template {
      launch_template_specification {
        launch_template_id = aws_launch_template.orders.id
        version            = "$Latest"
      }
      override { instance_type = "c6i.4xlarge" }
      override { instance_type = "c6a.4xlarge" }
      override { instance_type = "c5.4xlarge" }
      override { instance_type = "m6i.4xlarge" }
    }
    instances_distribution {
      on_demand_base_capacity                  = 4
      on_demand_percentage_above_base_capacity = 50
      spot_allocation_strategy                 = "capacity-optimized"
    }
  }
}

# on the launch template: accept open capacity reservations
resource "aws_launch_template" "orders" {
  # ...
  capacity_reservation_specification {
    capacity_reservation_preference = "open"
  }
}
```
resource "aws_ec2_capacity_reservation" "orders_floor" { instance_type = "c6i.4xlarge" instance_platform = "Linux/UNIX" availability_zone = "us-east-1a" instance_count = 4 end_date_type = "unlimited" instance_match_criteria = "open" tags = local.tags }
Use attribute-based instance type selection (InstanceRequirements) instead of explicit type list — AWS picks any matching family; broadest capacity pool.
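To preview what a requirements-based selection would match, a sketch (the vCPU/memory shape is illustrative):

```bash
# list every instance type matching the requirements shape
aws ec2 get-instance-types-from-instance-requirements \
  --architecture-types x86_64 --virtualization-types hvm \
  --instance-requirements 'VCpuCount={Min=16,Max=16},MemoryMiB={Min=32768}'
```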
Spot Placement Score API tells you which region/AZ has best spot capacity right now for your shape — pre-flight checker for big batch jobs.
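A quick pre-flight sketch (instance types from this scenario; regions illustrative):

```bash
# 1-10 score per region/AZ for this capacity shape, right now
aws ec2 get-spot-placement-scores \
  --instance-types c6i.4xlarge c6a.4xlarge \
  --target-capacity 8 \
  --region-names us-east-1 us-east-2 \
  --single-availability-zone
```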
Reserve 4 ODCR seats. ASG burst beyond into on-demand, then spot. ICE in spot tier doesn't kill SLO because the floor is reserved.
CW alarm on ASG metric GroupPendingInstances > 0 for 5 min → PagerDuty.
Config + custom rule: ASGs in prod must specify mixed_instances_policy with at least 3 overrides.
Annual capacity review in Q4 — quotas raised, ODCR sized to next year traffic forecast.
Symptom: RunInstances denied for a missing CostCenter tag — UnauthorizedOperation: ... with an explicit deny in a service control policy.

| Item | Detail |
|---|---|
| SCP | RequireTags on Workloads OU |
| Required tags | CostCenter, Owner, Env |
| Tag enforcement | at RunInstances via aws:RequestTag/CostCenter |
| Bypass | FedAdmin role does not bypass SCP |
| # | Hypothesis | Disprove |
|---|---|---|
| H1 | Tag missing entirely | compare Terraform plan to SCP |
| H2 | Tag on instance but not on volume | review tag_specifications |
| H3 | Case mismatch | SCP aws:RequestTag/CostCenter is case-sensitive |
| H4 | Tag value not in allowed set (tag policy) | describe-organizations-policies |
| H5 | SCP applies to OU; account moved recently | list-parents + list-policies-for-target |
Note: `aws iam simulate-principal-policy` doesn't evaluate SCPs. Use IAM Access Analyzer policy validation + AWS Organizations `list-policies-for-target` to spot which SCPs are in scope before debugging.

```bash
# 1. Show the failing API call from CloudTrail
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=EventName,AttributeValue=RunInstances \
  --max-results 1 --query 'Events[].CloudTrailEvent' \
  | jq '.[0] | fromjson | {errorCode, errorMessage}'

# 2. Pull SCPs in scope
aws --profile gc-mgmt organizations list-policies-for-target \
  --target-id ou-xxx --filter SERVICE_CONTROL_POLICY
```
```bash
# 3. Validate Terraform tag plan
terraform show -json tfplan | jq '.. | objects
  | select(.tag_specifications) | .tag_specifications'

# 4. Test directly
aws ec2 run-instances --image-id ami-0xx --instance-type t3.micro \
  --tag-specifications 'ResourceType=instance,Tags=[{Key=CostCenter,Value=ENG-100},{Key=Owner,Value=alice},{Key=Env,Value=dev}]' \
    'ResourceType=volume,Tags=[{Key=CostCenter,Value=ENG-100}]' \
  --dry-run
```
Root cause: default_tags set CostCenter on the provider, but the older AWS provider didn't propagate it to the volume on RunInstances (only to the instance resource). The RequireTags SCP evaluates each TagSpecification separately — the volume tag was missing → explicit deny. Fix: upgrade the provider so default_tags reach all sub-resources, or set tag_specifications explicitly per resource type.

```hcl
resource "aws_launch_template" "orders" {
  # ...
  tag_specifications {
    resource_type = "instance"
    tags          = local.tags
  }
  tag_specifications {
    resource_type = "volume"
    tags          = local.tags
  }
  tag_specifications {
    resource_type = "network-interface"
    tags          = local.tags
  }
}

provider "aws" {
  region = "us-east-1"
  default_tags { tags = local.tags }
}

locals {
  tags = {
    CostCenter = "ENG-100"
    Owner      = "orders-team"
    Env        = "prod"
  }
}
```
```hcl
# tflint plugin: aws-ruleset
plugin "aws" {
  enabled = true
  version = "0.30.0"
  source  = "github.com/terraform-linters/tflint-ruleset-aws"
}

rule "aws_resource_missing_tags" {
  enabled = true
  tags    = ["CostCenter", "Owner", "Env"]
}
```
Use tag policies (separate from SCP) to enforce case: tag_key: { @@assign: "CostCenter" }. Org standardizes “CostCenter” (not costcenter).
SCP message is generic. Add a Lambda that listens on UnauthorizedOperation CloudTrail events, parses the SCP, posts the missing-tag hint to the developer.
Pre-merge: terraform plan + parse JSON, check every tag_specifications contains required tags. Fail PR with the missing tag named.
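One possible shape for that pre-merge check, as a sketch — it assumes tags land in `change.after.tags`, which varies by resource type, so adapt to your standard:

```bash
# fail the PR if any instance/LT in the plan lacks a required tag
terraform show -json tfplan | jq -e '
  [ .resource_changes[]
    | select(.type == "aws_instance" or .type == "aws_launch_template")
    | .change.after.tags // {}
    | has("CostCenter") and has("Owner") and has("Env") ]
  | all' >/dev/null || { echo "missing required tags"; exit 1; }
```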
Config rule: required-tags across resource types; non-compliance → auto-tag (where allowed) or remediation Lambda.
Service Catalog product abstracts tag handling so app teams can't forget.
Per-account README pre-commit hook: requires CostCenter in locals.tf.
Symptom: the launch template still has HttpTokens=optional (allows IMDSv1), and a new SCP denies RunInstances when ec2:MetadataHttpTokens != required.

| # | Hypothesis | Disprove |
|---|---|---|
| H1 | LT has http_tokens=optional | describe-launch-template-versions |
| H2 | App SDK is too old for IMDSv2 | SDK version table check |
| H3 | Container hop-limit not 2 (ECS/k8s) | http_put_response_hop_limit |
| H4 | SCP not applied to this account — some other deny | list-policies-for-target |
```bash
# 1. LT current setting
aws ec2 describe-launch-template-versions \
  --launch-template-id lt-0xx --versions '$Latest' \
  --query 'LaunchTemplateVersions[].LaunchTemplateData.MetadataOptions'

# 2. Per-instance audit
aws ec2 describe-instances --instance-ids i-0xx \
  --query 'Reservations[].Instances[].MetadataOptions'

# 3. Find IMDSv1 callers across fleet
aws cloudwatch get-metric-statistics --namespace AWS/EC2 \
  --metric-name MetadataNoToken --dimensions Name=InstanceId,Value=i-0xx \
  --start-time -1h --end-time now --period 60 --statistics Sum
```
```bash
# 4. Update LT (new version)
aws ec2 create-launch-template-version \
  --launch-template-id lt-0xx \
  --source-version '$Latest' \
  --launch-template-data '{
    "MetadataOptions": {
      "HttpTokens": "required",
      "HttpEndpoint": "enabled",
      "HttpPutResponseHopLimit": 2
    }
  }'

# 5. Live-modify existing instances
aws ec2 modify-instance-metadata-options --instance-id i-0xx \
  --http-tokens required --http-put-response-hop-limit 2
```
Root cause: the LT predates the guardrail — http_tokens defaulted to optional (IMDSv1+v2). Gotcha: flipping http_tokens=required immediately breaks any app still using IMDSv1. Run the audit first; flip after 7 days of zero MetadataNoToken.

```hcl
resource "aws_launch_template" "orders" {
  metadata_options {
    http_endpoint               = "enabled"
    http_tokens                 = "required" # IMDSv2
    http_put_response_hop_limit = 2
    instance_metadata_tags      = "enabled"
  }
}
```
```yaml
# .checkov.yml
check:
  - CKV_AWS_79   # EC2 should require IMDSv2
  - CKV_AWS_341  # LT hop_limit <= 2
```
Run a fleet-wide modify-instance-metadata-options in a maintenance window via SSM Automation document. No restart needed.
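A sketch of that sweep (the describe-instances filter narrows to instances still on optional tokens):

```bash
for id in $(aws ec2 describe-instances \
    --filters Name=metadata-options.http-tokens,Values=optional \
    --query 'Reservations[].Instances[].InstanceId' --output text); do
  aws ec2 modify-instance-metadata-options --instance-id "$id" \
    --http-tokens required --http-put-response-hop-limit 2
done
```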
For containerized workloads with hop limit issues: set hop limit to 1 if IMDS shouldn't reach pods (most secure), or 2 if needed for ECS task role pickup.
Use instance metadata tags (instance_metadata_tags=enabled) so apps can read tags without IAM perms — great for cost-center decoration in logs.
CW dashboard tracks org-wide MetadataNoToken sum; alarm if any account has >0 over rolling 7 days.
Config rule ec2-imdsv2-check flags non-compliant instances/LTs.
SCP also denies ec2:ModifyInstanceMetadataOptions with HttpTokens=optional in request — can't weaken once enforced.
Symptom: during a DR restore, aws ec2 attach-volume fails with InvalidVolume.ZoneMismatch or just times out. The data lives in gc-prod-data (us-east-1); the restore target is eu-west-1.

| # | Hypothesis | Disprove |
|---|---|---|
| H1 | Volume in different AZ than instance | describe-volumes AZ vs instance AZ |
| H2 | KMS key region mismatch | describe-volume KmsKeyId region |
| H3 | Snapshot not yet completed | describe-snapshots Progress |
| H4 | Snapshot not shared cross-acct | describe-snapshot-attribute create-volume-permission |
```bash
# 1. Volume + instance AZ
aws --region eu-west-1 ec2 describe-volumes \
  --volume-ids vol-0xx --query 'Volumes[].{AZ:AvailabilityZone,KMS:KmsKeyId}'
aws --region eu-west-1 ec2 describe-instances \
  --instance-ids i-0xx --query 'Reservations[].Instances[].Placement.AvailabilityZone'

# 2. Re-create volume in correct AZ
aws --region eu-west-1 ec2 create-volume \
  --snapshot-id snap-dst --availability-zone eu-west-1c \
  --volume-type gp3 --encrypted --kms-key-id alias/eu-data
```
```bash
# 3. Snapshot progress
aws --region eu-west-1 ec2 describe-snapshots \
  --snapshot-ids snap-dst --query 'Snapshots[].{P:Progress,S:State,K:KmsKeyId}'

# 4. Cross-account share check
aws --profile gc-prod-data ec2 describe-snapshot-attribute \
  --snapshot-id snap-src --attribute createVolumePermission

# 5. Attach
aws --region eu-west-1 ec2 attach-volume \
  --volume-id vol-new --instance-id i-0xx --device /dev/sdf
```
Root cause: the volume was created in eu-west-1a (the default) while the instance launched in eu-west-1c (chosen for capacity) → AZ mismatch.

```hcl
# SSM Automation document (Terraform-managed)
resource "aws_ssm_document" "dr_restore_volume" {
  name          = "GC-DR-RestoreEBS"
  document_type = "Automation"
  content       = file("docs/dr-restore-ebs.yaml")
}
```
```yaml
# dr-restore-ebs.yaml (excerpt)
parameters:
  SnapshotId: { type: String }
  TargetAz:   { type: String, default: "eu-west-1c" }
  KmsKeyId:   { type: String, default: "alias/eu-data" }
mainSteps:
  - name: copy
    action: aws:executeAwsApi
    inputs: { Service: ec2, Api: CopySnapshot, ... }
  - name: wait
    action: aws:waitForAwsResourceProperty
  - name: create_volume
    action: aws:executeAwsApi
    inputs: { Api: CreateVolume, AvailabilityZone: "{{ TargetAz }}" }
```
```hcl
# DLM lifecycle policy
resource "aws_dlm_lifecycle_policy" "orders_data" {
  description = "orders-data daily snap + DR copy"
  state       = "ENABLED"

  policy_details {
    schedule {
      cross_region_copy_rule {
        target    = "eu-west-1"
        encrypted = true
        cmk_arn   = aws_kms_alias.eu_data.target_key_arn
        retain_rule {
          interval      = 30
          interval_unit = "DAYS"
        }
      }
    }
  }
}
```
Tag the snapshot with SourceVolumeAz; the DR doc reads it and tries to match the target AZ first.
Use EBS Multi-Attach (io1/io2) only with apps that support distributed locking; otherwise corruption.
Convert old gp2 to gp3 for free baseline IOPS bump — one modify-volume call, no downtime.
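The conversion itself is one call (volume ID illustrative):

```bash
aws ec2 modify-volume --volume-id vol-0xx --volume-type gp3
# track progress (optimizing -> completed)
aws ec2 describe-volumes-modifications --volume-ids vol-0xx
```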
Quarterly DR game-day uses the SSM Automation doc end-to-end. Failure auto-creates Jira ticket.
Config rule: ebs-snapshot-public-restorable-check + custom rule for cross-region copy presence.
Backup & DR tag enforced via SCP — instances missing BackupPolicy tag get denied at launch.
Context: -Xmx not set, so the JVM defaults to 25% of instance memory; the LT has ebs-optimized=false; the account has ebs-encryption-by-default=true.

| # | Hypothesis | Disprove |
|---|---|---|
| H1 | Root volume too small / full | df -h / |
| H2 | JVM heap default too small/large | jcmd VM.flags |
| H3 | EBS burst credits exhausted (gp2) | CW BurstBalance metric |
| H4 | OOM-killer killed app, not Java OOM | dmesg \| grep -i oom |
Gotcha: a kernel OOM kill produces no Java OutOfMemoryError; the kernel logs the oom-kill. Always check dmesg + journalctl -k.

```bash
# 1. Disk + memory
df -h /
free -m
swapon --show

# 2. JVM flags
sudo -u app jps -l
sudo -u app jcmd <pid> VM.flags | grep -E 'MaxHeap|MinHeap|UseG1'
sudo -u app jcmd <pid> VM.system_properties | grep mx
```
```bash
# 3. EBS burst (gp2)
aws cloudwatch get-metric-statistics --namespace AWS/EBS \
  --metric-name BurstBalance --dimensions Name=VolumeId,Value=vol-0xx \
  --start-time -1h --end-time now --period 60 --statistics Minimum

# 4. Linux OOM-kill
sudo dmesg | grep -i 'killed process'
sudo journalctl -k --since "1 hour ago" | grep -i oom
```
Root cause: jvm.log was redirected to /var/log, and journalctl plus log-rotation lag filled /var quickly. Pin a separate volume for /var/log.

```hcl
resource "aws_launch_template" "orders" {
  ebs_optimized = true

  block_device_mappings {
    device_name = "/dev/xvda"
    ebs {
      volume_size = 30
      volume_type = "gp3"
      iops        = 3000
      throughput  = 125
      encrypted   = true
    }
  }

  block_device_mappings {
    device_name = "/dev/sdb" # /var/log
    ebs {
      volume_size = 20
      volume_type = "gp3"
      encrypted   = true
    }
  }

  user_data = base64encode(file("ud.sh")) # mounts + JVM tuning
}
```
```bash
# ud.sh excerpt
mkfs.xfs /dev/nvme1n1
mount /dev/nvme1n1 /var/log
echo "/dev/nvme1n1 /var/log xfs defaults,nofail 0 2" >> /etc/fstab

# swap
fallocate -l 2G /swapfile && chmod 600 /swapfile
mkswap /swapfile && swapon /swapfile

# JVM
echo 'JAVA_OPTS="-Xms4g -Xmx5g -XX:+UseG1GC -XX:+ExitOnOutOfMemoryError"' \
  >> /etc/orders-api.env
```
Use volume_type=gp3 uniformly; gp2's burst credits are a frequent source of mysterious p99 spikes.

For containerized apps: ditch swap; cap memory at the cgroup level. Java 11+ honors cgroup memory automatically.
Use tmpfs for /tmp with size cap — prevents tmp file bombs from filling root.
Pre-warm gp3: 3000 IOPS / 125 MB/s baseline is free; bump to 5000 IOPS at $0.005 per IOP-hour. Cheap p99 win.
CW agent installs diskspace+swap custom metrics; alarm on FilesystemUsedPct > 80 for / and /var/log.
SSM Compliance state pack — instance must report log-rotate active.
Pre-deploy unit test: spin instance with target user-data in sandbox; run app + chaos load; assert no OOM in first 5 min.
| # | Hypothesis | Disprove |
|---|---|---|
| H1 | No ASG lifecycle hook for terminate | describe-lifecycle-hooks |
| H2 | Hook exists but no handler subscribed | EventBridge target wired? |
| H3 | ALB deregistration_delay too long, instance gone before drain | TG attribute |
| H4 | Health check passes but pool keeps dead conns | Keep-alive timeout vs idle |
```bash
# 1. Lifecycle hooks on the ASG
aws autoscaling describe-lifecycle-hooks --auto-scaling-group-name orders-asg

# 2. ALB TG drain
aws elbv2 describe-target-group-attributes \
  --target-group-arn arn:...:targetgroup/orders/...
# expect deregistration_delay.timeout_seconds <= 60 for most apps

# 3. Spot interruption history
aws ec2 describe-spot-instance-requests --filters Name=state,Values=closed
```
```bash
# 4. Listen for interruption from inside instance
TOKEN=$(curl -s -X PUT http://169.254.169.254/latest/api/token \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 60")
curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
  http://169.254.169.254/latest/meta-data/spot/instance-action

# 5. Run NTH locally
sudo systemctl status aws-node-termination-handler
```
Gotcha: the EC2 Spot Instance Interruption Warning fires at T-2min, but the instance dies at T-0 regardless of ASG hook delay — drain what you can inside that window.

```hcl
resource "aws_autoscaling_lifecycle_hook" "terminate" {
  name                    = "orders-terminate"
  autoscaling_group_name  = aws_autoscaling_group.orders.name
  lifecycle_transition    = "autoscaling:EC2_INSTANCE_TERMINATING"
  default_result          = "CONTINUE"
  heartbeat_timeout       = 90
  notification_target_arn = aws_sns_topic.lifecycle.arn
  role_arn                = aws_iam_role.lifecycle.arn
}

resource "aws_lb_target_group" "orders" {
  # ...
  deregistration_delay = 30
}
```
```hcl
# NTH on each instance (DaemonSet for k8s, systemd for plain EC2)
provisioner "file" {
  destination = "/etc/systemd/system/aws-node-termination-handler.service"
  content     = file("nth.service")
}
```
Set connection_termination=true on NLB TGs — existing flows are reset on deregister, faster recovery.
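Setting it via CLI (TG ARN elided as elsewhere in this doc):

```bash
aws elbv2 modify-target-group-attributes \
  --target-group-arn arn:...:targetgroup/orders/... \
  --attributes Key=deregistration_delay.connection_termination.enabled,Value=true
```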
Mix on-demand base + spot for bursty workloads; SLO floor never on spot.
Use capacity-optimized-prioritized spot allocation if order matters; lowers interruption rate.
CW alarm: HTTPCode_ELB_5XX_Count > baseline + 3sigma during spot events.
Chaos game day: trigger fake interruption via describe-spot-fleet-request-history sim; assert 0 5xx.
Spot Placement Score > 7 required by Terraform pre-flight check.
Symptom: new instances launch with stale config — the LT's $Default points to v3 while v5 is latest; the ASG references launch_template.version = "$Default" in Terraform, and a manual set-default-version happened in the console.

| # | Hypothesis | Disprove |
|---|---|---|
| H1 | $Default not bumped to v5 | describe-launch-templates DefaultVersion |
| H2 | ASG uses pinned version, not $Default | describe-auto-scaling-groups LaunchTemplate.Version |
| H3 | Console-edited LT outside Terraform | diff Terraform state |
Prefer $Latest plus ASG instance_refresh.triggers=["launch_template"]: Terraform updates the LT, the ASG auto-rolls.

```bash
# 1. LT versions
aws ec2 describe-launch-template-versions \
  --launch-template-id lt-0xx \
  --query 'LaunchTemplateVersions[].{V:VersionNumber,D:DefaultVersion,I:LaunchTemplateData.ImageId}'

# 2. ASG launch config
aws autoscaling describe-auto-scaling-groups \
  --auto-scaling-group-names orders-asg \
  --query 'AutoScalingGroups[].LaunchTemplate'
```
```bash
# 3. Force ASG to v5 explicit
aws autoscaling update-auto-scaling-group \
  --auto-scaling-group-name orders-asg \
  --launch-template LaunchTemplateId=lt-0xx,Version=5

# 4. Trigger refresh
aws autoscaling start-instance-refresh \
  --auto-scaling-group-name orders-asg \
  --preferences MinHealthyPercentage=90,InstanceWarmup=120
```
Root cause: the ASG used launch_template { version = "$Default" } and someone ran a console set-default-version. Terraform treats $Default as a static string — it doesn't track LT version drift, so manual changes never register as drift.

```hcl
resource "aws_autoscaling_group" "orders" {
  launch_template {
    id      = aws_launch_template.orders.id
    version = aws_launch_template.orders.latest_version # pin explicit
  }

  instance_refresh {
    strategy = "Rolling"
    triggers = ["launch_template"]
    preferences {
      min_healthy_percentage = 90
      instance_warmup        = 120
    }
  }
}
```
The latest_version attribute makes Terraform track every LT bump. Combined with the instance-refresh trigger, every PR rolls the fleet automatically.

```bash
# nightly cron in CI
terraform plan -refresh-only -detailed-exitcode
# exit 2 = drift; raise issue
```
Tag every LT version with PromotedAt; promotion job blocks promotion of versions older than 30 days — forces fresh bakes.
Use checkpoint instance refresh: roll a small percent first, observe metrics, continue.
If you must use $Default, add a Lambda that asserts DefaultVersion == latest_version daily — closes the drift gap.
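The assertion that Lambda needs boils down to one describe call — a sketch, equally runnable as a CLI check (LT ID illustrative):

```bash
read -r DEF LATEST < <(aws ec2 describe-launch-templates \
  --launch-template-ids lt-0xx \
  --query 'LaunchTemplates[0].[DefaultVersionNumber,LatestVersionNumber]' \
  --output text)
[ "$DEF" = "$LATEST" ] || echo "LT drift: default=$DEF latest=$LATEST"
```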
SCP denies ec2:ModifyLaunchTemplate in prod accounts — only CI role can change.
Tags on LT version (BuildSha, BuildAt) so post-mortems can identify which LT version a misbehaving instance came from.
EventBridge rule on ModifyLaunchTemplate outside CI role → alert.
Symptom: user-data calls aws secretsmanager get-secret-value at boot and fails with UnrecognizedClientException: The security token included in the request is invalid.

| # | Hypothesis | Disprove |
|---|---|---|
| H1 | IMDS not yet returning role creds | add wait-for-creds loop, observe |
| H2 | Network not up yet (race with eth0) | cloud-init cloud-init.target ordering |
| H3 | VPCe DNS not yet resolving | getent hosts secretsmanager.us-east-1... |
| H4 | Time skew — SigV4 fails | chronyc sources |
Use `aws sts get-caller-identity` as a probe. Loop until it succeeds; then call Secrets Manager.

```bash
# 1. Confirm the race
sudo grep -E 'UnrecognizedClient|InvalidSignatureException' \
  /var/log/cloud-init-output.log

# 2. Test from instance after boot
for i in 1 2 3; do
  aws sts get-caller-identity || echo retry
  sleep 1
done

# 3. Wait pattern in user-data
until aws sts get-caller-identity >/dev/null 2>&1; do sleep 2; done
SECRET=$(aws secretsmanager get-secret-value --secret-id orders-prod \
  --query SecretString --output text)
```
Root cause: instance-profile credentials appear in IMDS a few seconds after boot; SDK calls that race ahead of them fail with UnrecognizedClient by default.

```bash
# ud.sh template
#!/usr/bin/env bash
set -euo pipefail
TOK=$(curl -s -X PUT http://169.254.169.254/latest/api/token \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 60")
for i in {1..30}; do
  if aws sts get-caller-identity >/dev/null 2>&1; then break; fi
  sleep 2
done
SECRET=$(aws secretsmanager get-secret-value --secret-id ${secret_id} \
  --query SecretString --output text)
```
Ship wait-for-iam.sh as a shared template across all repos — a single source for the wait loop, never re-derived. Alternative pattern: EventBridge on EC2 running → SSM doc → pull secret.
Use SSM Parameter Store for non-secret bootstrap config — same race avoidance, simpler IAM.
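A sketch of that pattern (parameter name illustrative):

```bash
# same wait-for-IAM guard, then a plain parameter read -- no KMS decrypt needed
until aws sts get-caller-identity >/dev/null 2>&1; do sleep 2; done
CONFIG=$(aws ssm get-parameter --name /gc/prod/config/orders-api \
  --query Parameter.Value --output text)
```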
For Windows, the EC2Launch v2 task graph supports dependencies — ensure secret-pull task waits on aws-cli-ready.
Use a CW alarm on user-data failures — metric filter on cloud-init log shipping.
cloud-init unit ordering: After=cloud-init.target + Wants=instance-meta.target.
Bake the wait-for-iam loop into Image Builder component; user-data never repeats it.
Synthetic test: launch test instance every 4h, assert no UnrecognizedClient in logs.
Symptom: resources are tagged Env=Production, but the tag policy wants prod — allowed Env values: [prod, stg, dev].

| # | Hypothesis | Disprove |
|---|---|---|
| H1 | Stale Terraform writes wrong case | grep Production |
| H2 | Tag policy not actually enforced | describe-policy + enforced_for |
| H3 | Auto-tagging Lambda overwrites | CloudTrail TagResource events |
Tag policies merge down the OU tree (@@assign, @@append, @@enforced_for). Always check the effective policy at the OU/account level — not the policy doc.

```bash
# 1. Effective tag policy at account level
aws --profile gc-mgmt organizations describe-effective-policy \
  --policy-type TAG_POLICY --target-id 666666666666

# 2. Find non-compliant resources
aws resourcegroupstaggingapi get-resources \
  --tag-filters Key=Env,Values=Production
# compare to allowed: prod / stg / dev
```
```bash
# 3. Bulk re-tag
aws resourcegroupstaggingapi tag-resources \
  --resource-arn-list arn:aws:ec2:...:instance/i-0xx \
  --tags Env=prod

# 4. Tag policy compliance summary
aws --profile gc-audit config describe-compliance-by-config-rule \
  --config-rule-names required-tags
```
Root cause: the tag policy allows Env values {prod, stg, dev}; Terraform wrote Env=Production. Tag values are case-sensitive, so Production ≠ prod — and without enforced_for the resource is merely flagged non-compliant, not blocked.

```hcl
locals {
  tags = merge({
    Env        = "prod"
    CostCenter = "ENG-100"
    Owner      = "orders-team"
  }, var.extra_tags)
}

provider "aws" {
  default_tags { tags = local.tags }
}
```

Tag policy, strict mode:

```json
{
  "tags": {
    "Env": {
      "tag_key": { "@@assign": "Env" },
      "tag_value": { "@@assign": ["prod", "stg", "dev"] },
      "enforced_for": { "@@assign": ["ec2:instance", "rds:db"] }
    }
  }
}
```
gotcha: enforced_for with a resource-type list converts the tag policy from advisory to enforced — non-compliant tag operations on those types will fail.
Add a Lambda that auto-remediates: on a TagResource event, lowercase the value if it matches the enum.
Use Resource Groups with tag filters as the source of truth for “all prod EC2” — surfaces non-compliant tags fast.
tflint custom rule: deny Env values not in [prod, stg, dev].
Tag policy in enforced_for mode + tflint: catch at PR-time and at API-time.
Quarterly compliance review pulled from the Tag Policy compliance API (sketch below).
One-line Terraform module everyone consumes: module "stdtags".
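A sketch of that quarterly pull, assuming org-level access from gc-mgmt; get-compliance-summary is the Resource Groups Tagging API call behind tag-policy compliance:
# Non-compliance counts grouped by account and region
aws --profile gc-mgmt resourcegroupstaggingapi get-compliance-summary \
  --group-by TARGET_ID REGION \
  --query 'SummaryList[?NonCompliantResources>`0`]' --output table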
Launches fail with InsufficientFreeAddressesInSubnet — the subnet is out of free IPs.
| # | Hypothesis | Disprove |
|---|---|---|
| H1 | Subnet truly full | describe-subnets AvailableIpAddressCount |
| H2 | EKS warm pool grabbing IPs | describe-network-interfaces by Description |
| H3 | Detached ENIs held by Lambda VPC / DLM | describe-network-interfaces Status=available |
| H4 | ECS tasks awaiting ENIs, never deleted | ECS service events |
# 1. IP availability
aws ec2 describe-subnets --filters Name=vpc-id,Values=vpc-0xx \
  --query 'Subnets[].{ID:SubnetId,AZ:AvailabilityZone,Free:AvailableIpAddressCount,CIDR:CidrBlock}' \
  --output table
# 2. Who holds the ENIs?
aws ec2 describe-network-interfaces \
  --filters Name=subnet-id,Values=subnet-priv-use1a \
  --query 'NetworkInterfaces[].{S:Status,D:Description,O:Attachment.InstanceOwnerId}' \
  --output table
# 3. Free orphaned ENIs
aws ec2 delete-network-interface --network-interface-id eni-0xx
# 4. Add secondary CIDR + new subnet
aws ec2 associate-vpc-cidr-block --vpc-id vpc-0xx \
  --cidr-block 100.64.0.0/16
aws ec2 create-subnet --vpc-id vpc-0xx \
  --cidr-block 100.64.10.0/22 --availability-zone us-east-1a
# 5. Switch EKS to prefix delegation (more IPs/instance)
kubectl set env -n kube-system ds aws-node ENABLE_PREFIX_DELEGATION=true
resource "aws_vpc_ipv4_cidr_block_association" "secondary" { vpc_id = aws_vpc.main.id cidr_block = "100.64.0.0/16" } resource "aws_subnet" "private_carrier" { count = 3 vpc_id = aws_vpc.main.id cidr_block = cidrsubnet("100.64.0.0/16", 6, count.index) availability_zone = local.azs[count.index] tags = merge(local.tags, { Tier="private-carrier", KubernetesCarrier="true" }) }
# enable prefix delegation in EKS
resource "aws_eks_addon" "vpc_cni" {
  cluster_name = aws_eks_cluster.main.name
  addon_name   = "vpc-cni"
  configuration_values = jsonencode({
    env = { ENABLE_PREFIX_DELEGATION = "true" }
  })
}
Use VPC IPAM for centralized IP planning — alerts before exhaustion at OU scale.
For Lambda VPC, set EFS_DEPENDENCY_CHECK false + use VPC Lattice to bypass ENIs altogether.
Tag every detached ENI with OrphanCheck=true + Lambda cleans after 1h.
CW alarm on subnet free IPs < 20% (publisher sketch below).
Subnet sizing standard: never /27 in prod for EKS/ECS — minimum /22.
Quarterly IP capacity review per VPC — growth forecast vs IPAM.
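CloudWatch has no native free-IP metric, so the <20% alarm needs a publisher. A cron-style sketch (namespace GC/VPC is an assumption):
#!/usr/bin/env bash
# Publish AvailableIpAddressCount per subnet as a custom metric
set -euo pipefail
aws ec2 describe-subnets --filters Name=vpc-id,Values=vpc-0xx \
  --query 'Subnets[].[SubnetId,AvailableIpAddressCount]' --output text \
| while read -r subnet free; do
    aws cloudwatch put-metric-data --namespace GC/VPC \
      --metric-name FreeIPs --unit Count --value "$free" \
      --dimensions SubnetId="$subnet"
  done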
Instance refresh stalls: terminate fails with OperationNotPermitted: The instance has termination protection. Two separate guards can block it: EC2 DisableApiTermination + ASG protected_from_scale_in.
| # | Hypothesis | Disprove |
|---|---|---|
| H1 | EC2 DisableApiTermination=true | describe-instance-attribute --attribute disableApiTermination |
| H2 | ASG instance protect from scale-in | describe-auto-scaling-instances |
| H3 | Lifecycle hook stuck waiting | describe-lifecycle-hooks |
note: Instance refresh cannot replace an instance with DisableApiTermination; you must clear it on the protected instance OR use --skip-matching if the AMI is identical anyway.
# 1. Check both flags
aws ec2 describe-instance-attribute --instance-id i-0xx \
  --attribute disableApiTermination
aws autoscaling describe-auto-scaling-instances \
  --instance-ids i-0xx \
  --query 'AutoScalingInstances[].ProtectedFromScaleIn'
# 2. Disable EC2 termination protection
aws ec2 modify-instance-attribute --instance-id i-0xx \
  --no-disable-api-termination
# 3. Disable ASG scale-in protect
aws autoscaling set-instance-protection \
  --instance-ids i-0xx \
  --auto-scaling-group-name orders-asg \
  --no-protected-from-scale-in
# 4. Resume refresh
aws autoscaling resume-processes \
  --auto-scaling-group-name orders-asg
tip: During diagnosis, set DisableApiTermination=true on a known-good instance — to ensure the ASG didn't kill it while you debug. Tag it OpsHold=true + run a nightly Lambda that warns and removes the flag after 24h.
# Tag-driven cleanup
resource "aws_lambda_function" "ops_hold_cleanup" {
  function_name = "gc-ops-hold-cleanup"
  ...
}
resource "aws_cloudwatch_event_rule" "daily" {
  schedule_expression = "cron(0 8 * * ? *)"
}
# Lambda body (excerpt) — instances_with_tag / tag / parse_ts / slack are helpers
for inst in instances_with_tag("OpsHold", "true"):
    age = now - parse_ts(tag(inst, "OpsHoldSet"))
    if age > timedelta(hours=24):
        ec2.modify_instance_attribute(
            InstanceId=inst["InstanceId"],
            DisableApiTermination={"Value": False})
        ec2.create_tags(
            Resources=[inst["InstanceId"]],
            Tags=[{"Key": "OpsHold", "Value": "cleared"}])
        slack("cleared OpsHold on " + inst["InstanceId"])
Instance refresh --skip-matching ignores instances already on the right LT version — a workaround when one stuck pet exists.
Use warm pool for fast scale-up; pool members aren't in service so refresh issues isolate.
SSM Automation doc GC-ClearOpsHold — one click clears all flags + tags.
EventBridge on ModifyInstanceAttribute with DisableApiTermination=true → tag instance + Slack.
Pre-deploy gate: describe-auto-scaling-group shows no instance with protection > 0; if any, fail deploy (sketch below).
Runbook: don't use termination protection on cattle. Use ASG protected_from_scale_in for the rare case.
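A sketch of that pre-deploy gate as a CI step (ASG name hardcoded for illustration):
#!/usr/bin/env bash
# Fail the deploy if any instance in the ASG still carries scale-in protection
set -euo pipefail
protected=$(aws autoscaling describe-auto-scaling-groups \
  --auto-scaling-group-names orders-asg \
  --query 'length(AutoScalingGroups[0].Instances[?ProtectedFromScaleIn==`true`])')
if (( protected > 0 )); then
  echo "deploy blocked: $protected protected instance(s) in orders-asg" >&2
  exit 1
fi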
A recovery alarm on StatusCheckFailed_System never fired the recovery action; the alarm sat in INSUFFICIENT_DATA for the last 6h. Recovery action ARN: arn:aws:automate:us-east-1:ec2:recover.
| # | Hypothesis | Disprove |
|---|---|---|
| H1 | treat_missing_data not breaching | describe-alarms |
| H2 | Recovery action wrong ARN | compare to AWS-supplied recovery ARN |
| H3 | Instance type doesn't support recovery | check supported list |
| H4 | Alarm in different region | region check |
# 1. Alarm definition
aws cloudwatch describe-alarms \
  --alarm-names orders-recover-i-0xx \
  --query 'MetricAlarms[].{T:TreatMissingData,A:AlarmActions,P:DatapointsToAlarm,E:EvaluationPeriods}'
# 2. Last 6h metric data
aws cloudwatch get-metric-statistics --namespace AWS/EC2 \
  --metric-name StatusCheckFailed_System \
  --dimensions Name=InstanceId,Value=i-0xx \
  --start-time -6h --end-time now --period 60 --statistics Maximum
# 3. Fix the alarm
aws cloudwatch put-metric-alarm \
  --alarm-name orders-recover-i-0xx \
  --metric-name StatusCheckFailed_System --namespace AWS/EC2 \
  --dimensions Name=InstanceId,Value=i-0xx \
  --statistic Maximum --period 60 --threshold 0 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 5 --datapoints-to-alarm 3 \
  --treat-missing-data breaching \
  --alarm-actions arn:aws:automate:us-east-1:ec2:recover
Root cause: treat_missing_data=missing — the default. When metrics stop, the alarm goes INSUFFICIENT_DATA and takes no action. Recovery only runs from ALARM; it can never get there with missing data treated as missing. Use breaching for failure-detection alarms.
resource "aws_cloudwatch_metric_alarm" "recover" {
  for_each            = toset(var.instance_ids)
  alarm_name          = "recover-${each.value}"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 5
  datapoints_to_alarm = 3
  metric_name         = "StatusCheckFailed_System"
  namespace           = "AWS/EC2"
  period              = 60
  statistic           = "Maximum"
  threshold           = 0
  treat_missing_data  = "breaching"
  alarm_actions       = ["arn:aws:automate:us-east-1:ec2:recover"]
  dimensions          = { InstanceId = each.value }
}
tip: Stamp the alarm per instance with for_each — or use EC2 instance auto-recovery (default behavior), which doesn't require explicit alarms on supported types.
resource "aws_instance" "x" {
  maintenance_options { auto_recovery = "default" }
}
Use maintenance_options.auto_recovery=default — AWS handles it without alarms.
Pair recovery with a CW alarm on StatusCheckFailed_Instance → reboot action; covers OS hangs that aren't host failures.
For ASG, prefer health-check replace: set ASG health_check_type=ELB; ASG kills + replaces unhealthy instances faster than recovery.
CW alarm meta-monitor: alarm on any alarm in INSUFFICIENT_DATA > 30 min (sketch below).
Config rule: cloudwatch-alarm-action-check + custom check on treat_missing_data.
Annual game day: simulate host failure (stop force) and assert recovery completes < 5 min.
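The meta-monitor can start life as a scheduled script; a sketch (the recover- name prefix is an assumption, and the 30-min threshold comes from running it on a 30-min schedule and comparing consecutive outputs):
#!/usr/bin/env bash
# List recovery alarms sitting in INSUFFICIENT_DATA, with when they entered it
set -euo pipefail
aws cloudwatch describe-alarms --state-value INSUFFICIENT_DATA \
  --alarm-name-prefix recover- \
  --query 'MetricAlarms[].[AlarmName,StateUpdatedTimestamp]' --output text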
New instance never registers with SSM: aws ssm describe-instance-information returns empty, and Session Manager reports Target not found — even though the instance role has AmazonSSMManagedInstanceCore.
| # | Hypothesis | Disprove |
|---|---|---|
| H1 | SSM agent not running | systemctl status amazon-ssm-agent |
| H2 | IAM role missing or perm | describe-instance-attribute --attribute iamInstanceProfile |
| H3 | VPC endpoint SG denies 443 from instance | SG ingress rules |
| H4 | VPCe DNS resolution off | private_dns_enabled |
| H5 | Time skew breaks SigV4 | chronyc sources |
tip: curl -v https://ssm.us-east-1.amazonaws.com from the instance — if it resolves to a 10.x address, VPCe is in play. If a public IP, NAT path. Either should give 403 — that's good (TLS works).
# 1. From instance (via console direct connect or get-system-log)
sudo systemctl status amazon-ssm-agent
sudo journalctl -u amazon-ssm-agent --no-pager | tail -50
sudo tail -100 /var/log/amazon/ssm/amazon-ssm-agent.log
# 2. Reach the endpoint
getent hosts ssm.us-east-1.amazonaws.com
curl -v https://ssm.us-east-1.amazonaws.com 2>&1 | head -10
# 3. From console
aws ssm describe-instance-information \
  --filters Key=InstanceIds,Values=i-0xx
# empty -> agent never registered
# 4. Inspect VPCe SG + private DNS
aws ec2 describe-vpc-endpoints --vpc-endpoint-ids vpce-0xx \
  --query 'VpcEndpoints[].{P:PrivateDnsEnabled,SG:Groups}'
Root cause: the SSM interface endpoint had private_dns_enabled=false (someone disabled it for a debug last week). ssm.us-east-1.amazonaws.com resolved to a public IP, the instance had no NAT → the agent couldn't register. Without private DNS you'd have to call the endpoint-specific name (vpce-xxx.ssm.us-east-1.vpce.amazonaws.com); the SSM agent doesn't support that path.
resource "aws_vpc_endpoint" "ssm" {
  vpc_id              = aws_vpc.main.id
  service_name        = "com.amazonaws.us-east-1.ssm"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = local.private_subnets
  security_group_ids  = [aws_security_group.vpce.id]
  private_dns_enabled = true # <-- must be true
  tags                = local.tags
}
# repeat for ssmmessages, ec2messages
# SG for vpce
resource "aws_security_group_rule" "vpce_ingress" {
  type                     = "ingress"
  from_port                = 443
  to_port                  = 443
  protocol                 = "tcp"
  source_security_group_id = aws_security_group.workload.id
  security_group_id        = aws_security_group.vpce.id
}
SSM Fleet Manager can self-heal SSM agent state on managed instances — useful when agents drift.
Use SSM Default Host Management Configuration — auto-attaches the SSM role + agent to all EC2 in the account, no manual setup.
Ship CloudWatch Agent + SSM as a single Image Builder component; consistent across AMIs.
Config rule: ec2-instance-managed-by-systems-manager; non-compliant means missing.
Synthetic launch + SSM ping every hour from sandbox; alarm if registration takes > 5 min (sketch below).
Pre-deploy checklist gates: SSM ping, VPCe health, IAM role attached.
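A sketch of that hourly synthetic (the gc-ssm-canary launch template is a placeholder):
#!/usr/bin/env bash
# Launch a canary, assert SSM registration within 5 min, then clean up
set -euo pipefail
IID=$(aws ec2 run-instances --launch-template LaunchTemplateName=gc-ssm-canary \
  --query 'Instances[0].InstanceId' --output text)
deadline=$((SECONDS + 300))
until aws ssm describe-instance-information \
    --filters "Key=InstanceIds,Values=$IID" \
    --query 'InstanceInformationList[0].PingStatus' --output text 2>/dev/null \
    | grep -q Online; do
  (( SECONDS > deadline )) && { echo "SSM registration > 5 min for $IID" >&2; break; }
  sleep 15
done
aws ec2 terminate-instances --instance-ids "$IID" >/dev/null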
Instances churn: a CW alarm reboots on app failure while the ASG (health_check_type=ELB, grace 60s) terminates the same instance mid-reboot — two healers fighting.
| # | Hypothesis | Disprove |
|---|---|---|
| H1 | ASG terminates while CW-triggered reboot is in flight | CW + ASG events, same instance |
| H2 | Grace period too short | ASG health_check_grace_period |
| H3 | Health check uses /health that goes 503 in shutdown phase | app shutdown logs |
tip: aws autoscaling describe-scaling-activities + CW alarm history side-by-side reveal who killed first.
# 1. ASG events
aws autoscaling describe-scaling-activities \
  --auto-scaling-group-name orders-asg \
  --max-records 5
# 2. CW alarm history
aws cloudwatch describe-alarm-history \
  --alarm-name orders-app-reboot --max-records 10
# 3. Disable the dual healing
aws cloudwatch delete-alarms --alarm-names orders-app-reboot
# 4. Tune ASG grace
aws autoscaling update-auto-scaling-group \
  --auto-scaling-group-name orders-asg \
  --health-check-grace-period 180
# Drop the reboot alarm; rely on ASG ELB health
# removed: aws_cloudwatch_metric_alarm.orders_app_reboot
resource "aws_autoscaling_group" "orders" {
  health_check_type         = "ELB"
  health_check_grace_period = 180
  ...
}
# app /health returns 200 when ready; 503 when draining
# ALB target group draining waits 30s
If reboot is necessary (kernel state), use ASG standby: put instance in standby, reboot, return to service. ASG won't terminate during standby.
Decouple readiness from liveness. ALB health checks readiness; ASG checks liveness via instance status checks. Fewer false positives.
Add CW alarm on the count of InstanceRefresh events; pages if ASG is churning.
Audit: any CW alarm with action ec2:reboot or ec2:terminate on instances inside an ASG → warn (sketch below).
Service runbook: one canonical healer per failure mode.
Game day: simulate hung app; assert single healer fires.
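A sketch of that audit — list alarms whose actions reboot or terminate EC2, then cross-check ASG membership by hand:
#!/usr/bin/env bash
# Alarms with ec2:reboot / ec2:terminate actions are dual-healer candidates
aws cloudwatch describe-alarms \
  --query "MetricAlarms[?AlarmActions[?contains(@, ':ec2:reboot') || contains(@, ':ec2:terminate')]].{Name:AlarmName,Actions:AlarmActions}" \
  --output table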
SSH to the bastion at 10.20.0.42:22 hangs. sg-bastion-ingress allows 22/tcp from pl-corp-onprem.
| Subnet | DMZ subnet-dmz-use1a (10.20.0.0/24) |
| NACL | nacl-dmz applied to subnet |
| NACL outbound rule | recently “hardened”: allow 80/443 to 0/0; deny all else |
| SG | stateful (allows return automatically) |
| NACL | stateless (return must be explicitly allowed) |
| # | Hypothesis | Disprove |
|---|---|---|
| H1 | SG ingress missing 22 | describe-security-groups |
| H2 | NACL inbound 22 missing | describe-network-acls |
| H3 | NACL outbound ephemeral missing — SYN-ACK drops | NACL out rules |
| H4 | Asymmetric routing (TGW back path different) | RT inspection |
| H5 | Host firewall (Defender/iptables) | local netsh / iptables -L |
# 1. Reachability Analyzer (config-time check)
aws ec2 create-network-insights-path \
  --source $(jumphost-eni) --destination i-bastion \
  --protocol TCP --destination-port 22
# 2. NACL outbound rules
aws ec2 describe-network-acls \
  --filters Name=association.subnet-id,Values=subnet-dmz-use1a \
  --query 'NetworkAcls[].Entries[?Egress==`true`]'
# 3. Live capture on bastion
sudo tcpdump -ni any host 10.0.5.10 and port 22 -w /tmp/cap.pcap
# 4. Traffic Mirroring (cheeky)
aws ec2 create-traffic-mirror-session \
  --network-interface-id eni-bastion \
  --traffic-mirror-target-id tmt-0xx \
  --traffic-mirror-filter-id tmf-0xx --session-number 1
# 5. VPC Flow Log query
aws logs filter-log-events --log-group-name /aws/vpc/flow \
  --filter-pattern 'srcaddr=10.20.0.42 dstaddr=10.0.5.10 action=REJECT'
Root cause: NACL egress was “hardened” to 80/443 only in a recent control change. NACLs are stateless, so the bastion's SYN-ACK back to the client's ephemeral port is dropped.
resource "aws_network_acl_rule" "dmz_out_ephem" {
  network_acl_id = aws_network_acl.dmz.id
  egress         = true
  rule_number    = 110
  rule_action    = "allow"
  protocol       = "6"
  cidr_block     = "10.0.0.0/8" # corp space
  from_port      = 1024
  to_port        = 65535
}
resource "aws_network_acl_rule" "dmz_out_https" {
  network_acl_id = aws_network_acl.dmz.id
  egress         = true
  rule_number    = 120
  rule_action    = "allow"
  protocol       = "6"
  cidr_block     = "0.0.0.0/0"
  from_port      = 443
  to_port        = 443
}
IaC: modules/nacl-tier always emits an ephemeral-out rule (1024-65535) to the corp prefix and to 0/0. The “hardening” PR that broke this should have failed the module's test suite.
# tflint custom rule
rule "aws_network_acl_must_have_ephemeral_egress" {
  enabled = true
  message = "NACL egress must include 1024-65535 (ephemeral)"
}
Many shops just don't use NACLs except for broad strokes (block known bad ports/CIDRs at the subnet boundary). Use SGs as the per-resource policy. Smaller footgun surface.
Linux ephemeral range varies. RHEL 8: 32768-60999. Older: 1024-65535. Windows: 49152-65535. NACLs need to cover all 1024-65535 for safety.
VPC Reachability Analyzer evaluates NACL config; it would have caught this before deploy — if anyone had run it.
Pre-deploy: every NACL change runs Reachability Analyzer for representative source/dest pairs. CI fail on REJECT (sketch below).
VPC Flow Log alarm on action=REJECT to subnet-internal IPs over 5-min baseline.
Module-only NACLs — SCP forbids creating aws_network_acl outside the module path.
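A CI sketch of that gate (the path ID nip-0xx is a placeholder; poll because analyses run async):
#!/usr/bin/env bash
set -euo pipefail
AID=$(aws ec2 start-network-insights-analysis \
  --network-insights-path-id nip-0xx \
  --query 'NetworkInsightsAnalysis.NetworkInsightsAnalysisId' --output text)
for i in {1..30}; do
  status=$(aws ec2 describe-network-insights-analyses \
    --network-insights-analysis-ids "$AID" \
    --query 'NetworkInsightsAnalyses[0].Status' --output text)
  [[ "$status" == "running" ]] || break
  sleep 5
done
found=$(aws ec2 describe-network-insights-analyses \
  --network-insights-analysis-ids "$AID" \
  --query 'NetworkInsightsAnalyses[0].NetworkPathFound' --output text)
[[ "$found" == "True" ]] || { echo "unreachable — NACL change rejected" >&2; exit 1; }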
An SG rule references sg-12345 from another VPC (TGW peer) and fails: InvalidGroup.NotFound.
| # | Hypothesis | Disprove |
|---|---|---|
| H1 | SG ID typo | describe-security-groups |
| H2 | SG in different VPC; no RAM share | ram list-resources |
| H3 | VPCs in different regions | compare regions |
| H4 | Provider in Terraform points to wrong account | provider alias check |
# 1. Confirm SG exists where you think
aws ec2 describe-security-groups --group-ids sg-bastion \
  --query 'SecurityGroups[].{V:VpcId,O:OwnerId,N:GroupName}'
# 2. Is it RAM-shared?
aws ram list-resources --resource-owner SELF \
  --resource-type ec2:SecurityGroup
# 3. Switch to prefix-list approach
aws ec2 create-managed-prefix-list \
  --address-family IPv4 --max-entries 50 \
  --prefix-list-name pl-shared-svcs \
  --entries 'Cidr=10.30.0.0/16,Description=shared'
# 4. SG rule using prefix list
aws ec2 authorize-security-group-ingress --group-id sg-orders \
  --ip-permissions 'IpProtocol=tcp,FromPort=22,ToPort=22,PrefixListIds=[{PrefixListId=pl-0xx}]'
# Owner: gc-network repo
resource "aws_ec2_managed_prefix_list" "shared_svcs" {
  name           = "pl-shared-svcs"
  address_family = "IPv4"
  max_entries    = 50
  entry {
    cidr        = "10.30.0.0/16"
    description = "shared-svcs VPC"
  }
  tags = local.tags
}
resource "aws_ram_resource_share" "pl" {
  name = "gc-prefix-lists"
}
# principals attach via association, not on the share itself
resource "aws_ram_principal_association" "pl_ous" {
  for_each           = toset(local.spoke_ous)
  resource_share_arn = aws_ram_resource_share.pl.arn
  principal          = each.value
}
resource "aws_ram_resource_association" "pl_shared" {
  resource_share_arn = aws_ram_resource_share.pl.arn
  resource_arn       = aws_ec2_managed_prefix_list.shared_svcs.arn
}
# Consumer: gc-prod-app repo
data "aws_ec2_managed_prefix_list" "shared" {
  name = "pl-shared-svcs"
}
resource "aws_security_group_rule" "orders_in" {
  type              = "ingress"
  from_port         = 22
  to_port           = 22
  protocol          = "tcp"
  prefix_list_ids   = [data.aws_ec2_managed_prefix_list.shared.id]
  security_group_id = aws_security_group.orders.id
}
gotcha: A customer-managed prefix list referenced in an SG rule counts as its max_entries against the per-SG rule quota — it centralizes maintenance more than it saves quota. Size max_entries deliberately.
SG references work cross-VPC for ALB target groups inside same account — useful for shared-services LBs.
For ECS tasks across services, share an SG via RAM and reference it directly — cleaner than maintaining IPs.
Custom checkov rule: forbid source_security_group_id with hardcoded sg-* across VPCs — force prefix list use.
Spoke account README documents pl-* names + how to consume (verification sketch below).
RAM resource share reviewed quarterly; unused shares removed.
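Consumer-side sanity check, a sketch — confirm the shared list is actually visible before a plan references it:
# From gc-prod-app: RAM-shared prefix lists appear next to local ones
aws ec2 describe-managed-prefix-lists \
  --filters Name=prefix-list-name,Values=pl-shared-svcs \
  --query 'PrefixLists[].{Id:PrefixListId,Owner:OwnerId,State:State}' --output table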
Adding a rule fails with RulesPerSecurityGroupLimitExceeded. Quota L-0EA8095F caps rules per SG.
| # | Hypothesis | Disprove |
|---|---|---|
| H1 | 60-rule cap reached | count rules per SG |
| H2 | Each microservice CIDR added separately | look for repeating /32s |
| H3 | Could be folded into prefix list | check duplicate descriptions |
tip: aws ec2 describe-security-group-rules --filters Name=group-id,Values=sg-x | jq '.SecurityGroupRules | length' tells you exactly how close to the limit you are.
# 1. Count rules
aws ec2 describe-security-group-rules \
  --filters Name=group-id,Values=sg-orders \
  --query 'length(SecurityGroupRules)'
# 2. Find duplicates / mergeable rules
aws ec2 describe-security-group-rules \
  --filters Name=group-id,Values=sg-orders \
  --query 'SecurityGroupRules[].{P:IpProtocol,F:FromPort,T:ToPort,C:CidrIpv4}' \
  | jq 'group_by(.C) | map({C: .[0].C, ports: map([.F, .T])})'
# 3. Quota
aws service-quotas get-service-quota \
  --service-code vpc --quota-code L-0EA8095F
# 4. Request raise
aws service-quotas request-service-quota-increase \
  --service-code vpc --quota-code L-0EA8095F --desired-value 250
resource "aws_ec2_managed_prefix_list" "partners" { name = "pl-orders-partners" address_family = "IPv4" max_entries = 60 dynamic "entry" { for_each = var.partner_ips content { cidr = "${entry.value.cidr}/32" description = entry.value.name } } } resource "aws_security_group_rule" "orders_partners" { type = "ingress"; from_port=443; to_port=443; protocol="tcp" prefix_list_ids = [aws_ec2_managed_prefix_list.partners.id] security_group_id = aws_security_group.orders.id }
# Lambda updates pl-orders-partners from a CSV in S3 daily
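A cron-flavoured sketch of that updater (bucket and CSV layout are assumptions; a real sync would diff against existing entries first):
#!/usr/bin/env bash
set -euo pipefail
PL=pl-0xx
VER=$(aws ec2 describe-managed-prefix-lists --prefix-list-ids "$PL" \
  --query 'PrefixLists[0].Version' --output text)
# CSV columns: ip,name → shorthand entries, one batched call
ENTRIES=$(aws s3 cp s3://gc-partner-ips/partners.csv - \
  | awk -F, '{printf "Cidr=%s/32,Description=%s ", $1, $2}')
# --current-version guards against concurrent edits (optimistic locking);
# $ENTRIES is deliberately unquoted so each entry becomes its own argument
aws ec2 modify-managed-prefix-list --prefix-list-id "$PL" \
  --current-version "$VER" --add-entries $ENTRIES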
Prefix list version increments on every change — SGs auto-reference latest. No SG churn.
Don't over-merge ports. A 22-3389 range looks compact but opens every port in between — RDP included — where you only wanted SSH. Be specific.
Use VPC Lattice for L7 service-to-service auth where possible — SGs only on ingress edge.
CW alarm: per-SG rule count > 50 (warn) / > 58 (alert) via Config rule (sweep sketch below).
Pre-merge tflint: detect >3 individual aws_security_group_rule with same protocol+port → suggest prefix list.
Quarterly SG cleanup: deduplicate, collapse, retire dead apps.
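A sweep sketch for the >50 warn threshold, usable before the Config rule exists:
#!/usr/bin/env bash
# Flag SGs within striking distance of the per-SG rule quota
aws ec2 describe-security-groups --query 'SecurityGroups[].GroupId' --output text \
| tr '\t' '\n' | while read -r sg; do
    n=$(aws ec2 describe-security-group-rules \
      --filters Name=group-id,Values="$sg" \
      --query 'length(SecurityGroupRules)')
    (( n > 50 )) && echo "WARN $sg has $n rules"
  done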
All targets show unhealthy with reason Health checks failed, yet the app is fine (curl localhost:8080/health returns 200). The ALB uses sg-alb-orders; the task SG (sg-orders-task) ingress doesn't allow the ALB SG.
| # | Hypothesis | Disprove |
|---|---|---|
| H1 | Target SG missing ingress from ALB SG | describe-security-groups |
| H2 | Health-check path 404 | app log |
| H3 | Target port mismatch (TG 80 vs app 8080) | describe-target-groups |
| H4 | Target deregistered | describe-target-health |
note: Health checks originate from the ALB's ENIs with the ALB's SG attached (e.g. sg-alb-edge); targets need only one allow rule.
# 1. Target health
aws elbv2 describe-target-health \
  --target-group-arn arn:...:targetgroup/orders \
  --query 'TargetHealthDescriptions[].{T:Target.Id,S:TargetHealth.State,R:TargetHealth.Reason,D:TargetHealth.Description}'
# 2. ALB and target SGs
aws elbv2 describe-load-balancers --names orders-alb \
  --query 'LoadBalancers[].SecurityGroups'
aws ec2 describe-security-groups --group-ids sg-orders-task \
  --query 'SecurityGroups[].IpPermissions'
# 3. Add ingress from ALB SG
aws ec2 authorize-security-group-ingress \
  --group-id sg-orders-task \
  --ip-permissions 'IpProtocol=tcp,FromPort=8080,ToPort=8080,UserIdGroupPairs=[{GroupId=sg-alb-orders}]'
# 4. Re-check health (~30s)
sleep 30
aws elbv2 describe-target-health --target-group-arn ...
Root cause: the task SG still listed a stale amazon-elb/sg-AAA as ingress. sg-alb-orders was never added; ALB-to-target traffic was dropped.
resource "aws_security_group" "alb_orders" {
  vpc_id = aws_vpc.main.id
  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
  tags = local.tags
}
resource "aws_security_group_rule" "orders_task_in_alb" {
  type                     = "ingress"
  from_port                = 8080
  to_port                  = 8080
  protocol                 = "tcp"
  source_security_group_id = aws_security_group.alb_orders.id
  security_group_id        = aws_security_group.orders_task.id
  description              = "ALB orders → task"
}
IaC: ship modules/alb-target-sg-pair that emits both SGs as a unit. Module test: target SG must have ingress from the LB SG.
# Pre-flight: the TG health check hits its configured port —
# make sure the SG allows that exact port (it might differ from the app port)
Set ALB enable_cross_zone_load_balancing + tune deregistration_delay to 30s. Faster blue/green flips.
For NLB targets, SGs apply only when target = instance. Target = IP uses subnet/SG of the IP's ENI — trickier auditing.
NLB has SG support since 2023 — older NLBs may not have one attached. Add via set-security-groups.
Health-check probes from synthetic; alarm if any TG has > 0 unhealthy > 2 min (sketch below).
Module test: deploy ALB+target, assert healthy in < 60s.
Custom Config rule: TG must have at least one ingress rule referencing parent ALB SG.
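A sketch of that sweep (alarm wiring omitted; run on a schedule from the synthetic account):
#!/usr/bin/env bash
# Count unhealthy targets per target group; page if any stay > 0
aws elbv2 describe-target-groups --query 'TargetGroups[].TargetGroupArn' --output text \
| tr '\t' '\n' | while read -r tg; do
    n=$(aws elbv2 describe-target-health --target-group-arn "$tg" \
      --query "length(TargetHealthDescriptions[?TargetHealth.State=='unhealthy'])")
    (( n > 0 )) && echo "UNHEALTHY $tg: $n target(s)"
  done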