Right-sizing EC2: the audit that cut our bill 31%, The Cloud Ledger

Last quarter our EC2 bill crossed a number that made our finance lead schedule a meeting. I owned the audit. We didn't migrate anything to Graviton and we didn't rearchitect a single service; we just measured what each instance actually did and matched it to the right size. The bill dropped 31% in six weeks.

Here's the exact process I used, including the data I pulled and the mistakes that cost us a week.

Start with utilization, not the instance list

The temptation is to open the EC2 console, sort by hourly price, and start shrinking the expensive ones. That's backwards. A pricey r5.4xlarge running at 80% CPU is fine; a "cheap" m5.xlarge sitting at 4% is the real waste. I pulled 14 days of CloudWatch metrics: CPU, network, and (critically) memory. Memory has to come from the CloudWatch agent, because EC2 does not publish memory metrics by default.

aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-0abc123def456 \
  --start-time 2026-06-01T00:00:00Z \
  --end-time 2026-06-15T00:00:00Z \
  --period 3600 \
  --statistics Average Maximum

The pairing of Average and Maximum matters. An instance at 9% average but 70% peak is doing bursty work, such as a nightly batch job; downsizing it would tank that window. The ones worth cutting are flat-low on both.
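As a rough sketch of that triage, the function below classifies an instance from datapoints shaped like the `Datapoints` list that `get-metric-statistics` returns. The thresholds (`low_avg`, `low_peak`) and the label names are illustrative assumptions, not values from the audit; tune them to your own workloads.

```python
def classify_cpu(datapoints, low_avg=10.0, low_peak=30.0):
    """Label an instance from hourly CPU datapoints.

    Each datapoint is a dict with "Average" and "Maximum" keys, as in the
    Datapoints list returned by CloudWatch get-metric-statistics.
    Thresholds are illustrative assumptions, in percent.
    """
    if not datapoints:
        return "no-data"
    # Mean of the hourly averages approximates the average over the window.
    avg = sum(p["Average"] for p in datapoints) / len(datapoints)
    # The highest hourly maximum is the peak over the whole window.
    peak = max(p["Maximum"] for p in datapoints)
    if avg < low_avg and peak < low_peak:
        return "flat-low"  # low on both: downsize candidate
    if avg < low_avg:
        return "bursty"    # low average, high peak: leave alone
    return "busy"


# Made-up sample data: one busy hour per day vs. a box that never wakes up.
nightly_batch = [{"Average": 4.0, "Maximum": 6.0}] * 23 + [{"Average": 60.0, "Maximum": 70.0}]
idle_box = [{"Average": 4.0, "Maximum": 9.0}] * 24

print(classify_cpu(nightly_batch))  # bursty
print(classify_cpu(idle_box))       # flat-low
```

Looking only at the average would put both sample instances in the same bucket; the peak is what separates them.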

Let Compute Optimizer do the first pass

AWS Compute Optimizer ingests this data for free and emits "Over-provisioned / Optimized / Under-provisioned" findings per instance, with a recommended type and projected savings. I treated it as a worklist, not gospel. It flagged 38 of our 112 instances as over-provisioned.

Compute Optimizer is only as good as its memory data. Without the CloudWatch agent reporting mem_used_percent, it assumes memory is fine and may recommend a downsize that triggers OOM kills in production.

I installed the agent fleet-wide first, then waited a full week for fresh memory metrics before trusting any recommendation that reduced RAM.
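One way to encode that rule is a small gate that holds any RAM-reducing recommendation until memory metrics exist. The record fields below (`has_memory_metrics`, `current_mem_gib`, `recommended_mem_gib`) are a hypothetical simplified shape for illustration, not the Compute Optimizer API response format.

```python
def gate_recommendation(rec):
    """Return "apply" or "hold" for one resize recommendation.

    `rec` is a hypothetical simplified record, not the Compute Optimizer
    response shape.
    """
    reduces_ram = rec["recommended_mem_gib"] < rec["current_mem_gib"]
    if reduces_ram and not rec["has_memory_metrics"]:
        # No agent data: the recommendation cannot account for memory pressure.
        return "hold"
    return "apply"


# m5.2xlarge (32 GiB) -> m5.xlarge (16 GiB), agent not yet reporting.
halves_ram = {"current_mem_gib": 32, "recommended_mem_gib": 16, "has_memory_metrics": False}
# m5.xlarge (16 GiB) -> m6i.xlarge (16 GiB): same RAM, so memory data is not the blocker.
same_ram = {"current_mem_gib": 16, "recommended_mem_gib": 16, "has_memory_metrics": False}

print(gate_recommendation(halves_ram))  # hold
print(gate_recommendation(same_ram))    # apply
```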

Where the savings actually came from

The 31% wasn't evenly spread. Breaking it down clarified where to spend future effort.

Change	Instances	Share of savings
Downsize one family step (e.g. `m5.2xlarge` → `m5.xlarge`)	27	~52%
Terminate idle / orphaned instances	9	~28%
Move to newer generation (`m5` → `m6i`)	14	~12%
Schedule non-prod shutdown nights/weekends	18	~8%
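
To read the table as a share of the original bill rather than a share of the savings, multiply each row by the 31% overall reduction. A quick check of that arithmetic, using the approximate shares from the table (so the results are approximate too):

```python
TOTAL_REDUCTION_PCT = 31.0  # overall drop in the EC2 bill

# Approximate share of total savings per change, from the table.
share_of_savings = {
    "downsize one family step": 0.52,
    "terminate idle/orphaned": 0.28,
    "move to newer generation": 0.12,
    "schedule non-prod shutdown": 0.08,
}

# Percentage points of the original bill removed by each change.
points_of_bill = {
    change: round(share * TOTAL_REDUCTION_PCT, 1)
    for change, share in share_of_savings.items()
}

for change, points in points_of_bill.items():
    print(f"{change}: ~{points} points of the original bill")
```

On these numbers, downsizing alone accounts for roughly 16 of the 31 points.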

The orphaned instances stung. Three were dev boxes from people who'd left; two were left over from a load test someone forgot to tear down. Tagging discipline would have caught all five.
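
A minimal sketch of the check that would have surfaced them: flag any instance missing an owner or environment tag. The input is shaped like the entries under `Reservations[].Instances[]` in a `describe-instances` response; the tag keys `owner` and `environment` and the sample instance IDs are assumptions for illustration.

```python
REQUIRED_TAGS = {"owner", "environment"}  # assumed tag keys; tag keys are case-sensitive


def missing_tags(instance, required=REQUIRED_TAGS):
    """Return the required tag keys an instance lacks, sorted.

    `instance` is a dict shaped like one entry under Reservations[].Instances[]
    in the EC2 describe-instances response. Tags live in a "Tags" list of
    {"Key": ..., "Value": ...} dicts; the list is absent on untagged instances.
    A tag with an empty value is treated as missing.
    """
    present = {t["Key"] for t in instance.get("Tags", []) if t.get("Value")}
    return sorted(required - present)


# Made-up sample inventory.
instances = [
    {"InstanceId": "i-0aaa", "Tags": [{"Key": "owner", "Value": "data-team"},
                                      {"Key": "environment", "Value": "dev"}]},
    {"InstanceId": "i-0bbb", "Tags": [{"Key": "Name", "Value": "loadtest-3"}]},
    {"InstanceId": "i-0ccc"},
]

untagged = {i["InstanceId"]: missing_tags(i) for i in instances if missing_tags(i)}
print(untagged)
```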

Make downsizing safe and reversible

I batched changes by blast radius: non-prod first, then stateless prod services behind Auto Scaling, then stateful services last. Each change was a stop/modify/start, so I scripted it and verified the new type before restart.

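import boto3

ec2 = boto3.client("ec2")
INSTANCE_ID = "i-0abc123def456"
NEW_TYPE = "m5.xlarge"

# The instance type can only be changed while the instance is stopped.
ec2.stop_instances(InstanceIds=[INSTANCE_ID])
ec2.get_waiter("instance_stopped").wait(InstanceIds=[INSTANCE_ID])

ec2.modify_instance_attribute(
    InstanceId=INSTANCE_ID,
    InstanceType={"Value": NEW_TYPE},
)

# Verify the new type took effect before restarting.
attr = ec2.describe_instance_attribute(
    InstanceId=INSTANCE_ID,
    Attribute="instanceType",
)
actual_type = attr["InstanceType"]["Value"]
if actual_type != NEW_TYPE:
    raise RuntimeError(f"{INSTANCE_ID} is {actual_type}, expected {NEW_TYPE}")

ec2.start_instances(InstanceIds=[INSTANCE_ID])
ec2.get_waiter("instance_running").wait(InstanceIds=[INSTANCE_ID])
print(f"{INSTANCE_ID} is now {NEW_TYPE}")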
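
The batching order described above can be sketched as a small planner that groups instances by blast radius before any resize script runs. The `id`, `environment`, and `stateful` fields and the batch names are hypothetical; adapt them to however your inventory is tagged.

```python
# Lower number = smaller blast radius = changed earlier.
BATCH_ORDER = {"non-prod": 0, "prod-stateless": 1, "prod-stateful": 2}


def plan_batches(instances):
    """Group instance IDs into batches, ordered by blast radius.

    Each instance is a dict with hypothetical "id", "environment" and
    "stateful" fields. Anything that is not prod goes in the first batch,
    whether or not it holds state.
    """
    batches = {name: [] for name in BATCH_ORDER}
    for inst in instances:
        if inst["environment"] != "prod":
            batches["non-prod"].append(inst["id"])
        elif inst["stateful"]:
            batches["prod-stateful"].append(inst["id"])
        else:
            batches["prod-stateless"].append(inst["id"])
    return [(name, batches[name]) for name in sorted(BATCH_ORDER, key=BATCH_ORDER.get)]


# Made-up sample inventory.
inventory = [
    {"id": "i-0db1", "environment": "prod", "stateful": True},
    {"id": "i-0web1", "environment": "prod", "stateful": False},
    {"id": "i-0dev1", "environment": "dev", "stateful": False},
    {"id": "i-0stg1", "environment": "staging", "stateful": True},
]

for name, ids in plan_batches(inventory):
    print(name, ids)
```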

After each batch I watched CPU, memory, and p99 latency for 48 hours before moving on. Two instances had to bounce back up a size. Both were JVM services where heap pressure showed up as GC pauses, not CPU. That's exactly why memory metrics are non-negotiable.
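
A hedged sketch of how that 48-hour check could be made mechanical: compare a pre-change baseline with post-change readings and list any reason to roll back. The field names, the thresholds, and the sample numbers are all assumptions for illustration; where p99 latency comes from (a load balancer, an APM tool) depends on your setup.

```python
def rollback_reasons(before, after, max_p99_increase=0.20,
                     mem_ceiling=85.0, cpu_ceiling=80.0):
    """Return a list of reasons to size back up; an empty list means keep the change.

    `before` and `after` are dicts with hypothetical keys "p99_latency_ms",
    "mem_used_percent" and "cpu_percent". Thresholds are illustrative.
    """
    reasons = []
    # Latency is compared relative to the baseline, not to a fixed ceiling.
    if after["p99_latency_ms"] > before["p99_latency_ms"] * (1 + max_p99_increase):
        reasons.append("p99 latency regressed")
    if after["mem_used_percent"] > mem_ceiling:
        reasons.append("memory above ceiling")
    if after["cpu_percent"] > cpu_ceiling:
        reasons.append("CPU above ceiling")
    return reasons


# Made-up numbers for a JVM-style failure: CPU looks fine, memory and latency do not.
before = {"p99_latency_ms": 120.0, "mem_used_percent": 48.0, "cpu_percent": 35.0}
after = {"p99_latency_ms": 310.0, "mem_used_percent": 91.0, "cpu_percent": 55.0}

print(rollback_reasons(before, after))
```

In the sample, a CPU-only check would have passed the change, which is the failure mode the JVM services showed.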

Keep it from creeping back

Right-sizing is not a one-time event; provisioning drifts back up as teams round up "to be safe." I set a monthly Compute Optimizer review, a budget alert in AWS Budgets, and a tagging policy enforced via SCP so every instance carries an owner and environment tag. The audit found the savings; the guardrails keep them.
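
For the tagging guardrail, a minimal sketch of what such an SCP might look like, generated as JSON. It follows the common pattern of denying `ec2:RunInstances` when a required tag is absent from the request; the tag keys `owner` and `environment` and the `Sid` names are assumptions, and this is not necessarily the exact policy used in the audit.

```python
import json

REQUIRED_TAG_KEYS = ["owner", "environment"]  # assumed tag keys


def build_tag_scp(tag_keys):
    """Build an SCP document that denies ec2:RunInstances when a required tag is absent.

    One statement per tag key: conditions inside a single statement are ANDed,
    which would only deny launches missing every tag at once.
    """
    statements = []
    for key in tag_keys:
        statements.append({
            "Sid": f"DenyRunInstancesWithout{key.capitalize()}Tag",
            "Effect": "Deny",
            "Action": "ec2:RunInstances",
            "Resource": "arn:aws:ec2:*:*:instance/*",
            # Null == "true" matches requests where the tag key is not supplied.
            "Condition": {"Null": {f"aws:RequestTag/{key}": "true"}},
        })
    return {"Version": "2012-10-17", "Statement": statements}


policy = build_tag_scp(REQUIRED_TAG_KEYS)
print(json.dumps(policy, indent=2))
```

A policy like this only covers tags at launch time; it does not stop someone from removing a tag later, so the recurring review still matters.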

Takeaways

Sort by utilization, not price, and always pull both average and peak over at least 14 days.
Install the CloudWatch agent for memory metrics before trusting any downsize recommendation.
Idle and orphaned instances are pure waste; enforce owner/environment tags via SCP so anything without a traceable owner stands out immediately.
Bank the win with recurring reviews and budget alerts, or provisioning will creep right back up.