Caching layers: ElastiCache patterns that scale
Cache-aside, write-through, and the failure modes that bite under load.
Our API's p99 latency was 380ms and climbing, and the database CPU graph told the whole story: the same product-catalog query was running thousands of times a second for data that changed maybe once an hour. We reached for ElastiCache, and within a day p99 dropped under 40ms. But getting caching right is less about adding Redis and more about choosing a pattern that survives traffic spikes and stale data.
Here are the ElastiCache patterns I actually use in production, when each applies, and the failure modes that bite you at scale.
Cache-aside is the default, and where it breaks
Cache-aside (lazy loading) is the pattern most teams start with: check the cache, on a miss read the database and populate the cache. It is simple and only caches data that is actually requested.
import json
import redis
r = redis.Redis(host="my-cluster.cache.amazonaws.com", port=6379)
def get_product(product_id: str) -> dict:
    key = f"product:{product_id}"
    cached = r.get(key)
    if cached is not None:  # only a missing key counts as a miss
        return json.loads(cached)
    # cache miss: hit the DB (db stands for the application's data layer)
    product = db.query_product(product_id)
    r.set(key, json.dumps(product), ex=3600)  # TTL of 1 hour
    return product
The failure mode is the thundering herd. When a hot key expires, every concurrent request misses simultaneously and stampedes the database, the exact spike you added caching to prevent. Two mitigations:
- Add jitter to TTLs so keys do not all expire at the same instant.
- Use a short-lived lock so only one request rebuilds the key while the others briefly wait or serve a stale value.
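Here is a minimal sketch of both mitigations, not a drop-in implementation. It assumes `r` is a redis-py style client exposing `get`, `set` (with `nx` and `ex`) and `delete`, and that `load` is whatever function reads the value from your database. The constants, the `lock:` key prefix and the function names are illustrative choices of mine.

```python
import json
import random
import time

BASE_TTL = 3600   # seconds; the one-hour TTL from the cache-aside example
JITTER = 300      # assumed spread: up to 5 extra minutes per key
LOCK_TTL = 10     # assumed upper bound on how long one rebuild may take


def jittered_ttl(base: int = BASE_TTL, jitter: int = JITTER) -> int:
    """Return base plus a random offset so keys written together expire apart."""
    return base + random.randint(0, jitter)


def get_with_rebuild_lock(r, key: str, load, retries: int = 20, wait: float = 0.05):
    """Cache-aside read where only one caller at a time rebuilds a missing key.

    r:    redis-py style client (get / set / delete)
    load: zero-argument callable that reads the value from the database
    """
    lock_key = f"lock:{key}"
    for _ in range(retries):
        cached = r.get(key)
        if cached is not None:
            return json.loads(cached)
        # SET with NX and EX succeeds for one caller until the lock expires.
        if r.set(lock_key, "1", nx=True, ex=LOCK_TTL):
            try:
                value = load()
                r.set(key, json.dumps(value), ex=jittered_ttl())
                return value
            finally:
                r.delete(lock_key)
        time.sleep(wait)  # someone else is rebuilding; wait, then re-check
    # The lock holder never finished in time: fall back to the database.
    return load()
```

This version makes waiters poll rather than serve stale data. The unconditional `delete` is a simplification: if a rebuild outlives `LOCK_TTL`, it can remove a lock that another caller has since acquired. A stricter version would store a unique token in the lock and only delete its own.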
Write-through for read-heavy, correctness-sensitive data
When stale reads are unacceptable (account balances, inventory counts), write-through updates the cache synchronously on every write. Reads are always fresh; writes pay a small latency tax.
Write-through trades write latency for read freshness. Use it when a stale read causes a real problem, not for data where a one-hour-old value is fine.
The trap is caching data that is rarely read. You pay the write cost on every update but get few cache hits in return. Pair write-through with a TTL so cold data eventually evicts itself rather than living in memory forever.
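A rough sketch of write-through with a TTL might look like the following. It assumes `r` is a redis-py style client. `db.save_inventory` and `db.load_inventory` are hypothetical data-access methods, and the one-day TTL and `inventory:` key prefix are placeholders.

```python
import json

WRITE_THROUGH_TTL = 86400  # assumed: one day, so cold keys eventually expire


def _cache_key(sku: str) -> str:
    return f"inventory:{sku}"


def update_inventory(r, db, sku: str, count: int) -> None:
    """Write to the database, then refresh the cache in the same request."""
    record = {"sku": sku, "count": count}
    db.save_inventory(sku, count)  # hypothetical; the database stays the source of truth
    r.set(_cache_key(sku), json.dumps(record), ex=WRITE_THROUGH_TTL)


def get_inventory(r, db, sku: str) -> dict:
    """Read from the cache; repopulate only if the key has expired or never existed."""
    cached = r.get(_cache_key(sku))
    if cached is not None:
        return json.loads(cached)
    record = db.load_inventory(sku)  # hypothetical
    r.set(_cache_key(sku), json.dumps(record), ex=WRITE_THROUGH_TTL)
    return record
```

Writing to the database first means a failed database write never leaves the cache ahead of the source of truth. The sketch does not handle the opposite failure: if the cache write fails after the database commit, reads stay stale until the key expires or is rewritten.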
Redis vs Memcached, and cluster mode
ElastiCache offers both engines. The choice is usually straightforward:
| Need | Engine |
|---|---|
| Data structures, sorted sets, pub/sub, persistence, replication | Redis (ElastiCache for Redis / Valkey) |
| Pure key-value, multi-threaded, simplest possible cache | Memcached |
For anything beyond a flat key-value store, I default to Redis. For high availability, run at least one replica per shard in a different AZ and turn on Multi-AZ with automatic failover; cluster mode adds shards on top of that for horizontal scale. A single-node cache is a single point of failure: when it dies, every request falls through to the database, and that sudden load can take the database down with it.
resource "aws_elasticache_replication_group" "cache" {
  replication_group_id       = "api-cache"
  description                = "Product API cache"
  engine                     = "redis"
  node_type                  = "cache.r7g.large"
  num_node_groups            = 3 # shards for horizontal scale
  replicas_per_node_group    = 1 # one replica per shard
  automatic_failover_enabled = true
  multi_az_enabled           = true
  at_rest_encryption_enabled = true
  transit_encryption_enabled = true
}
What to cache, and what not to
Caching everything is as wrong as caching nothing. My rules:
- Cache expensive, frequently-read, slowly-changing data: catalog pages, config, computed aggregates.
- Do not cache per-user data with low reuse, or anything where a stale value is a correctness or security problem (permissions, prices at checkout).
- Always set a TTL. An unbounded cache fills memory and starts evicting unpredictably; an explicit TTL makes staleness a decision, not an accident.
Sizing and eviction
Pick an eviction policy deliberately. allkeys-lru is right for a pure cache. volatile-lru (evict only keys with a TTL) is right when the instance also holds data you must not lose. Watch the DatabaseMemoryUsagePercentage and Evictions CloudWatch metrics; rising evictions together with a falling hit rate mean the working set no longer fits in memory, and you should scale up the node type or add shards.
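If you want the same signals straight from the engine, here is a small sketch that turns two snapshots of the Redis INFO counters into an eviction count and a hit rate for the interval between them. It assumes each snapshot is a dict shaped like the one redis-py returns from `info("stats")`, with the cumulative `keyspace_hits`, `keyspace_misses` and `evicted_keys` fields. The function name and return shape are my own.

```python
from typing import Optional


def interval_stats(before: dict, after: dict) -> dict:
    """Compare two INFO stats snapshots taken some time apart.

    The counters are cumulative since server start, so the difference
    between snapshots describes only the interval between them.
    """
    hits = after["keyspace_hits"] - before["keyspace_hits"]
    misses = after["keyspace_misses"] - before["keyspace_misses"]
    evictions = after["evicted_keys"] - before["evicted_keys"]
    lookups = hits + misses
    # None rather than 0.0 when there was no traffic, so it is not read as "0% hits".
    hit_rate: Optional[float] = hits / lookups if lookups else None
    return {"evictions": evictions, "hit_rate": hit_rate}
```

With a live client, the snapshots could be taken as `before = r.info("stats")`, a pause, then `after = r.info("stats")`. A server restart or failover resets the counters and can produce negative deltas, which this sketch does not guard against.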
Takeaways
- Start with cache-aside, but defend the hot keys with TTL jitter and a rebuild lock to avoid thundering herds.
- Use write-through only for correctness-sensitive data where stale reads cause real problems, and still pair it with a TTL.
- Default to Redis with cluster mode, cross-AZ replicas, and automatic failover so the cache is not a single point of failure.
- Always set a TTL and pick an eviction policy on purpose; watch Evictions to know when to scale.