Spot Instances in production: where they pay off, The Cloud Ledger

"Spot Instances are 90% off but they can disappear" is the entire mental model most people have, and it leads to two equally wrong conclusions: either "too risky, never in prod" or "free compute, put everything on it." I've run Spot in production for years now. The truth is narrower and more useful: Spot is excellent for a specific class of workload and quietly dangerous for another, and the line between them is about statefulness and interruptibility, not about "prod vs. not prod."

What Spot really is

Spot Instances are spare EC2 capacity sold at a steep discount, commonly 60-90% off On-Demand, with one catch: AWS can reclaim them with a two-minute warning when it needs the capacity back. That two-minute notice arrives via instance metadata and an EventBridge event. The discount is real and large; the question is only whether your workload can gracefully absorb a reclaim.
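
To make the EventBridge side of that notice concrete, here is a minimal sketch of a handler that recognizes the interruption event and pulls out the instance ID. The function name interrupted_instance and the sample instance ID are placeholders of mine; the source and detail-type strings are the ones AWS uses for this event.

```python
from typing import Optional

# detail-type AWS uses for the two-minute Spot interruption notice on EventBridge.
SPOT_INTERRUPTION = "EC2 Spot Instance Interruption Warning"


def interrupted_instance(event: dict) -> Optional[str]:
    """Return the instance ID from a Spot interruption event, else None."""
    if event.get("source") != "aws.ec2":
        return None
    if event.get("detail-type") != SPOT_INTERRUPTION:
        return None
    return event.get("detail", {}).get("instance-id")


# Trimmed sample event; the instance ID is a placeholder.
sample = {
    "source": "aws.ec2",
    "detail-type": "EC2 Spot Instance Interruption Warning",
    "detail": {"instance-id": "i-0123456789abcdef0", "instance-action": "terminate"},
}
print(interrupted_instance(sample))
```

A function like this could sit inside a Lambda target for the rule and hand the instance ID to whatever deregisters or drains the node.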

Spot is not "cheap On-Demand." It's a different contract: you trade a guarantee of uptime for a large discount. Design for the interruption and it pays off; ignore it and it eventually ruins your day.

Where it clearly pays off

Stateless web/API tiers behind a load balancer: if an instance vanishes, the ALB drains it and the others absorb the traffic. This is the highest-value prod use.
Containerized workloads on EKS/ECS: the scheduler reschedules pods/tasks onto surviving nodes.
Batch and data processing (Spark, ETL, CI runners, video encoding): work is chunked and retryable.
ML training that checkpoints: resume from the last checkpoint after a reclaim.

Where it bites

Stateful singletons: a primary database, a stateful leader, anything where losing the node loses data or quorum.
Long, uncheckpointed jobs: a 6-hour job with no checkpoints can be killed at hour 5 and have to start over.
Hard latency SLAs during capacity crunches: when Spot capacity tightens, you can lose many instances at once.
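
The long-job problem is the easiest of these to design away. The following is a rough sketch of a checkpointed loop, not a production framework: the function names and the JSON checkpoint layout are my own, and it assumes the checkpoint path lives on storage that survives the instance (a network filesystem, or a file you sync to object storage). Items processed after the last checkpoint are repeated on resume, so the per-item work should be idempotent.

```python
import json
import os
import tempfile


def load_checkpoint(path: str) -> int:
    """Return the index of the next unprocessed item (0 if no checkpoint)."""
    try:
        with open(path) as f:
            return json.load(f)["next_index"]
    except FileNotFoundError:
        return 0


def save_checkpoint(path: str, next_index: int) -> None:
    """Write the checkpoint atomically so a reclaim mid-write cannot corrupt it."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump({"next_index": next_index}, f)
    os.replace(tmp, path)  # atomic rename on the same filesystem


def run_job(items, process, path: str, every: int = 100) -> None:
    """Process items in order, checkpointing every `every` items."""
    start = load_checkpoint(path)
    for i in range(start, len(items)):
        process(items[i])
        if (i + 1) % every == 0:
            save_checkpoint(path, i + 1)
    save_checkpoint(path, len(items))
```

With this shape, a reclaim at hour 5 costs at most one checkpoint interval of rework instead of the whole run.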

The pattern that makes it safe: blended capacity

The mistake is running 100% Spot. In production I run a base of On-Demand (or Savings Plan / Reserved) capacity that can absorb my floor of traffic, and layer Spot on top for the rest. With EC2 Auto Scaling mixed instances policies you express this directly, and crucially you diversify across many instance types so a single type's capacity shortage doesn't take out your whole fleet:

MixedInstancesPolicy:
  InstancesDistribution:
    OnDemandBaseCapacity: 2          # always-on floor
    OnDemandPercentageAboveBaseCapacity: 20   # 80% of the rest is Spot
    SpotAllocationStrategy: capacity-optimized # pick deepest pools
  LaunchTemplate:
    Overrides:                       # diversify across pools
      - InstanceType: m6i.large
      - InstanceType: m5.large
      - InstanceType: m5a.large
      - InstanceType: m6a.large

The capacity-optimized strategy launches into the pools with the most spare capacity, which empirically interrupts far less often than chasing the absolute lowest price.
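
To see what the two On-Demand settings mean for a given fleet size, here is a small arithmetic sketch. The helper names are mine, the prices are hypothetical, and the rounding of fractional On-Demand instances (rounded up here) is an assumption; check the Auto Scaling documentation for the exact behavior.

```python
import math
from typing import Tuple


def split_capacity(desired: int, base: int, od_pct_above_base: int) -> Tuple[int, int]:
    """Return (on_demand, spot) instance counts for a desired capacity."""
    on_demand_base = min(base, desired)
    above_base = desired - on_demand_base
    # Assumption: a fractional On-Demand instance rounds up.
    on_demand_extra = math.ceil(above_base * od_pct_above_base / 100)
    on_demand = on_demand_base + on_demand_extra
    return on_demand, desired - on_demand


def blended_hourly_cost(on_demand: int, spot: int, od_price: float,
                        spot_discount: float) -> float:
    """Hourly fleet cost given an On-Demand price and a fractional Spot discount."""
    return on_demand * od_price + spot * od_price * (1 - spot_discount)
```

With the policy above and a desired capacity of 12, split_capacity(12, 2, 20) gives 4 On-Demand and 8 Spot instances. At a hypothetical $0.10/hour On-Demand price and a 70% Spot discount, that fleet costs about $0.64/hour against $1.20/hour for all On-Demand.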

Handle the two-minute warning

Graceful handling turns an interruption from an outage into a non-event. Catch the interruption notice (or the earlier rebalance recommendation) and drain. The interruption notice can be read from instance metadata:

TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 300")

curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
  http://169.254.169.254/latest/meta-data/spot/instance-action
# 404 = no interruption pending; a JSON body with "action":"terminate" = drain now

On EKS, run the AWS Node Termination Handler so this cordon-and-drain happens automatically; on ECS, enable Spot draining so tasks are rescheduled before the node dies.
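
If you are on plain EC2 without either of those, the same check can run as a small polling loop. This is a sketch under assumptions: watch, fetch_instance_action, and the drain callback are names I made up, the 5-second interval is arbitrary, and what "drain" means (deregistering from the load balancer, finishing in-flight requests) is up to your service.

```python
import json
import time
import urllib.error
import urllib.request
from typing import Callable, Optional

IMDS = "http://169.254.169.254/latest"


def fetch_instance_action() -> Optional[dict]:
    """Return the pending Spot action as a dict, or None if IMDS answers 404."""
    token_req = urllib.request.Request(
        f"{IMDS}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "300"},
    )
    with urllib.request.urlopen(token_req, timeout=2) as resp:
        token = resp.read().decode()
    req = urllib.request.Request(
        f"{IMDS}/meta-data/spot/instance-action",
        headers={"X-aws-ec2-metadata-token": token},
    )
    try:
        with urllib.request.urlopen(req, timeout=2) as resp:
            return json.loads(resp.read())
    except urllib.error.HTTPError as err:
        if err.code == 404:  # no interruption pending
            return None
        raise


def watch(fetch: Callable[[], Optional[dict]], drain: Callable[[dict], None],
          interval: float = 5.0, sleep: Callable[[float], None] = time.sleep) -> None:
    """Poll until an interruption notice appears, then call drain once."""
    while True:
        action = fetch()
        if action is not None:
            drain(action)
            return
        sleep(interval)
```

On an instance you would call watch(fetch_instance_action, drain) from a sidecar process or systemd unit. Passing fetch and sleep in as parameters keeps the loop testable off-instance, where the metadata endpoint does not exist.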

Takeaways

Spot trades guaranteed uptime for a 60-90% discount; the deciding factor is interruptibility, not whether it's "prod."
Great for stateless tiers, containers, batch, and checkpointed training; avoid for stateful singletons and long uncheckpointed jobs.
Never run 100% Spot in prod: keep an On-Demand base, layer Spot on top, and diversify across many instance types with capacity-optimized.
Always handle the two-minute interruption notice with cordon-and-drain (Node Termination Handler / ECS Spot draining).