I ran the exact audit from this post on our fleet and found three over-provisioned ASGs we’d been paying for since launch. Trimmed roughly 28% off compute in an afternoon. The “measure before you cut” framing saved us from a self-inflicted incident.
Re: “Right-sizing EC2: the audit that cut our bill 31%”
Kind words
What readers say
Notes from engineers, SREs, and builders who put these posts to work. Got your own? Leave a testimonial →
NAT Gateway data-processing charges were quietly our second-largest line item and nobody could explain why. This walkthrough pointed me straight at the cross-AZ chatter. VPC endpoints for S3 and ECR cut it almost overnight.
Re: “The hidden cost of NAT Gateways (and how to cut it)”
Easily the clearest end-to-end RAG writeup I’ve found for AWS. The chunking and retrieval trade-offs section matched exactly the problems we hit in production. I wish I’d read it two sprints earlier.
Re: “RAG on AWS: Bedrock Knowledge Bases end to end”
We were about to cargo-cult EKS because “everyone uses Kubernetes.” This decision framework gave us the language to push back. We landed on ECS + Fargate and shipped a month sooner.
Re: “Picking compute: ECS, EKS, or Fargate”
Migrated our stateless services to Graviton over a weekend after reading this. ~20% cheaper and measurably faster on our workloads. The “easiest cost cut you’re not taking” headline is not clickbait; it’s just true.
Re: “Graviton: the easiest 20% cost cut you’re not taking”
Finally deleted the last long-lived AWS access key from our CI. The OIDC trust-policy snippet worked first try, which never happens. This should be the default way every team wires GitHub Actions to AWS.
Re: “GitHub Actions to AWS with OIDC, no long-lived keys”
I have hated single-table design for years. This is the first explanation that started from access patterns instead of dogma, and it finally clicked. Our table is leaner and our queries are cheaper.
Re: “DynamoDB single-table design for people who hate it”
We used this as the blueprint for restructuring into a multi-account org with SCP guardrails. The “set this up before your first prod deploy” advice is something I now repeat to every new team.
Re: “A sane multi-account setup with AWS Organizations”
The cost-crossover analysis between Bedrock and self-hosting is exactly the math my leadership kept asking me for. I dropped the table straight into a decision doc and we stopped over-engineering a GPU cluster we didn’t need.
Re: “Bedrock vs self-hosting LLMs: a cost breakdown”
Practical and honest about where Spot bites you. The interruption-handling patterns let us move batch and CI workloads over with confidence. Real savings, no 3am surprises so far.
Re: “Spot Instances in production: where they pay off”
The S3 + DynamoDB state-locking layout in this post is now our team standard. No more “who applied what” mysteries on a Friday afternoon. Clear, correct, and copy-pasteable.
Re: “Managing Terraform state with S3 and DynamoDB locking”
I’ve audited a lot of S3 buckets and this checklist catches the misconfigurations that actually cause breaches, not just the theoretical ones. Sent it to every team that owns a bucket.
Re: “Securing S3 buckets: a checklist that actually helps”
A genuinely complete reference architecture: ingestion, retrieval, generation, and the evaluation harness that everyone skips. We benchmarked against it and closed two real quality gaps.
Re: “Building a production RAG system on AWS, start to finish”
Did a post help you?
I’d love to hear about it, and selected notes get featured right here.