Right-sizing EC2: the audit that cut our bill 31%
A repeatable process for finding over-provisioned instances and acting on them without breaking prod.
Last quarter our EC2 bill crossed a number that made our finance lead schedule a meeting. I owned the audit. We didn't migrate anything to Graviton and we didn't rearchitect a single service. We just measured what each instance actually did and matched it to the right size. The bill dropped 31% in six weeks.
Here's the exact process I used, including the data I pulled and the mistakes that cost us a week.
Start with utilization, not the instance list
The temptation is to open the EC2 console, sort by hourly price, and start shrinking the expensive ones. That's backwards. A pricey r5.4xlarge running at 80% CPU is fine; a "cheap" m5.xlarge sitting at 4% is the real waste. I pulled 14 days of CloudWatch metrics: CPU, network, and (critically) memory. Memory has to come from the CloudWatch agent, because EC2 does not report it by default.
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-0abc123def456 \
  --start-time 2026-06-01T00:00:00Z \
  --end-time 2026-06-15T00:00:00Z \
  --period 3600 \
  --statistics Average Maximum
The pairing of Average and Maximum matters. An instance at 9% average but 70% peak is probably carrying something like a nightly batch job, and downsizing it would tank that window. The ones worth cutting are flat-low on both.
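To make the average-versus-peak rule concrete, here is a minimal sketch that sorts instances into "flat-low" and "bursty" from hourly datapoints shaped like the `get-metric-statistics` output above. The function name `classify_cpu` and the 10% / 30% thresholds are illustrative assumptions, not numbers from the audit; tune them to your own workloads.

```python
from statistics import mean


def classify_cpu(datapoints, low_avg=10.0, low_peak=30.0):
    """Classify an instance from hourly CPU datapoints.

    Each datapoint is a dict with "Average" and "Maximum" keys, the same
    shape CloudWatch returns. The thresholds are assumed values.
    """
    if not datapoints:
        return "no-data"
    avg = mean(dp["Average"] for dp in datapoints)
    peak = max(dp["Maximum"] for dp in datapoints)
    if avg < low_avg and peak < low_peak:
        return "flat-low"  # low on both: a downsizing candidate
    if avg < low_avg:
        return "bursty"    # low average but real peaks: investigate first
    return "busy"


# Hypothetical sample data: one idle box, one with a single heavy hour per day.
idle_box = [{"Average": 4.0, "Maximum": 9.0}] * 24
batch_box = [{"Average": 5.0, "Maximum": 8.0}] * 23 + [
    {"Average": 45.0, "Maximum": 70.0}
]

print(classify_cpu(idle_box))
print(classify_cpu(batch_box))
```

With these sample inputs the idle box is classified as flat-low and the batch box as bursty, even though both have a single-digit daily average.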
Let Compute Optimizer do the first pass
AWS Compute Optimizer ingests this data for free and emits "Over-provisioned / Optimized / Under-provisioned" findings per instance, with a recommended type and projected savings. I treated it as a worklist, not gospel. It flagged 38 of our 112 instances as over-provisioned.
Compute Optimizer is only as good as its memory data. Without the CloudWatch agent reporting mem_used_percent, it assumes memory is fine and may recommend a downsize that triggers OOM kills in production.
I installed the agent fleet-wide first, then waited a full week for fresh memory metrics before trusting any recommendation that reduced RAM.
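That waiting rule is easy to encode as a gate in front of the worklist. The sketch below is one way to do it, under assumptions: the `Recommendation` dataclass and its fields are hypothetical (they are not the Compute Optimizer response shape), and you would need to populate `memory_metric_days` yourself from how long the agent has been reporting.

```python
from dataclasses import dataclass


@dataclass
class Recommendation:
    """Hypothetical, simplified view of one right-sizing recommendation."""
    instance_id: str
    current_mem_gib: float
    recommended_mem_gib: float
    memory_metric_days: int  # days of mem_used_percent data collected so far


def can_act_on(rec, min_memory_days=7):
    """Hold any RAM-reducing recommendation until enough memory data exists."""
    reduces_ram = rec.recommended_mem_gib < rec.current_mem_gib
    if not reduces_ram:
        return True  # same or more RAM: memory data is not the blocker
    return rec.memory_metric_days >= min_memory_days


# m5.2xlarge (32 GiB) to m5.xlarge (16 GiB), agent installed two days ago.
too_early = Recommendation("i-0abc123def456", 32, 16, memory_metric_days=2)
# Same change after a full week of memory metrics.
ready = Recommendation("i-0abc123def456", 32, 16, memory_metric_days=7)

print(can_act_on(too_early), can_act_on(ready))
```

The seven-day default mirrors the week I waited; a workload with a monthly cycle would need a longer window.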
Where the savings actually came from
The 31% wasn't evenly spread. Breaking it down clarified where to spend future effort.
| Change | Instances | Share of savings |
|---|---|---|
| Downsize one family step (e.g. m5.2xlarge → m5.xlarge) | 27 | ~52% |
| Terminate idle / orphaned instances | 9 | ~28% |
| Move to newer generation (m5 → m6i) | 14 | ~12% |
| Schedule non-prod shutdown nights/weekends | 18 | ~8% |
The orphaned instances stung. Three were dev boxes from people who'd left; two were a load test someone forgot to tear down. Tagging discipline would have caught all five.
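One way to read the breakdown is savings per instance touched, which hints at where the next hour of effort pays best. The sketch below just divides each category's share by its instance count, using the approximate figures from the table; the metric itself is a rough heuristic, not something I tracked during the audit.

```python
# (instances, approximate share of total savings in percentage points)
breakdown = {
    "downsize one family step": (27, 52),
    "terminate idle / orphaned": (9, 28),
    "move to newer generation": (14, 12),
    "schedule non-prod shutdown": (18, 8),
}


def share_per_instance(rows):
    """Percentage points of total savings attributable to each instance."""
    return {name: share / count for name, (count, share) in rows.items()}


per_instance = share_per_instance(breakdown)
ranked = sorted(per_instance, key=per_instance.get, reverse=True)

for name in ranked:
    print(f"{name}: {per_instance[name]:.2f} points per instance")
```

Termination comes out on top at roughly 3.1 points per instance, ahead of downsizing at roughly 1.9, which matches the intuition that an instance doing nothing is the cheapest saving to collect.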
Make downsizing safe and reversible
I batched changes by blast radius: non-prod first, then prod stateless services behind Auto Scaling, then stateful services last. Each change was a stop/modify/start, so I scripted it and verified the new type before restart.
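import boto3

ec2 = boto3.client("ec2")

INSTANCE_ID = "i-0abc123def456"
NEW_TYPE = "m5.xlarge"

# The instance type can only be changed while the instance is stopped.
ec2.stop_instances(InstanceIds=[INSTANCE_ID])
ec2.get_waiter("instance_stopped").wait(InstanceIds=[INSTANCE_ID])

ec2.modify_instance_attribute(
    InstanceId=INSTANCE_ID,
    InstanceType={"Value": NEW_TYPE},
)

# Verify the new type took effect before restarting.
attr = ec2.describe_instance_attribute(
    InstanceId=INSTANCE_ID, Attribute="instanceType"
)
if attr["InstanceType"]["Value"] != NEW_TYPE:
    raise RuntimeError(f"{INSTANCE_ID} did not change to {NEW_TYPE}")

ec2.start_instances(InstanceIds=[INSTANCE_ID])
ec2.get_waiter("instance_running").wait(InstanceIds=[INSTANCE_ID])
print(f"{INSTANCE_ID} is now {NEW_TYPE}")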
After each batch I watched CPU, memory, and p99 latency for 48 hours before moving on. Two instances had to bounce back up a size. Both were JVM services where heap pressure showed up as GC pauses, not CPU. That's exactly why memory metrics are non-negotiable.
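The 48-hour watch can be reduced to a simple before/after comparison. The following is a sketch, not the check I actually ran: the function name, the dictionary keys, and every threshold (80% CPU, 85% memory, a 20% p99 regression) are assumptions chosen for illustration.

```python
def rollback_reasons(before, after, max_cpu_pct=80.0, max_mem_pct=85.0,
                     max_p99_ratio=1.2):
    """Return the reasons an instance should go back up a size, if any.

    `before` and `after` are dicts with "cpu_pct", "mem_pct" and "p99_ms",
    summarising the observation window on each side of the change.
    All thresholds are assumed values.
    """
    reasons = []
    if after["cpu_pct"] > max_cpu_pct:
        reasons.append("cpu")
    if after["mem_pct"] > max_mem_pct:
        reasons.append("memory")
    if after["p99_ms"] > before["p99_ms"] * max_p99_ratio:
        reasons.append("p99 latency")
    return reasons


# Hypothetical JVM service: CPU looks healthy, but memory and latency do not.
jvm_before = {"cpu_pct": 22.0, "mem_pct": 48.0, "p99_ms": 180.0}
jvm_after = {"cpu_pct": 35.0, "mem_pct": 91.0, "p99_ms": 260.0}

print(rollback_reasons(jvm_before, jvm_after))
```

In this made-up case a CPU-only check would have passed the instance; memory and p99 latency are what flag it.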
Keep it from creeping back
Right-sizing is not a one-time event; provisioning drifts back up as teams round up "to be safe." I set a monthly Compute Optimizer review, a budget alert in AWS Budgets, and a tagging policy enforced via SCP so every instance carries an owner and environment tag. The audit found the savings; the guardrails keep them.
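An SCP stops new untagged instances, but it does not find the ones that already exist. A small audit over a `describe_instances` response can do that. This is a sketch under assumptions: the tag keys `owner` and `environment` are my guess at the exact spelling, the helper name is hypothetical, and the sample response is hand-written rather than real output.

```python
REQUIRED_TAGS = frozenset({"owner", "environment"})  # assumed key names


def instances_missing_tags(response, required=REQUIRED_TAGS):
    """Map instance ID to the required tag keys it lacks.

    `response` has the shape returned by boto3's ec2.describe_instances():
    Reservations -> Instances -> Tags (a list of {"Key", "Value"} dicts).
    """
    missing = {}
    for reservation in response.get("Reservations", []):
        for instance in reservation.get("Instances", []):
            keys = {tag["Key"].lower() for tag in instance.get("Tags", [])}
            gap = required - keys
            if gap:
                missing[instance["InstanceId"]] = sorted(gap)
    return missing


# Hand-written sample in the describe_instances shape.
sample = {
    "Reservations": [
        {"Instances": [
            {"InstanceId": "i-aaa", "Tags": [
                {"Key": "owner", "Value": "payments"},
                {"Key": "environment", "Value": "prod"},
            ]},
            {"InstanceId": "i-bbb", "Tags": [
                {"Key": "environment", "Value": "dev"},
            ]},
            {"InstanceId": "i-ccc"},  # no Tags key at all
        ]}
    ]
}

print(instances_missing_tags(sample))
```

Against a live account you would feed it pages from `boto3.client("ec2").get_paginator("describe_instances")` instead of the sample dict.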
Takeaways
- Sort by utilization, not price, and always pull both average and peak over at least 14 days.
- Install the CloudWatch agent for memory metrics before trusting any downsize recommendation.
- Idle and orphaned instances are pure waste; enforce owner/environment tags via SCP to find them automatically.
- Bank the win with recurring reviews and budget alerts, or provisioning will creep right back up.