Caching layers: ElastiCache patterns that scale, The Cloud Ledger

Our API's p99 latency was 380ms and climbing, and the database CPU graph told the whole story: the same product-catalog query was running thousands of times a second for data that changed maybe once an hour. We reached for ElastiCache, and within a day p99 dropped under 40ms. But getting caching right is less about adding Redis and more about choosing a pattern that survives traffic spikes and stale data.

Here are the ElastiCache patterns I actually use in production, when each applies, and the failure modes that bite you at scale.

Cache-aside is the default, and where it breaks

Cache-aside (lazy loading) is the pattern most teams start with: check the cache, on a miss read the database and populate the cache. It is simple and only caches data that is actually requested.

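import json

import redis

# Pass ssl=True as well if in-transit encryption is enabled on the cluster.
r = redis.Redis(host="my-cluster.cache.amazonaws.com", port=6379)

def get_product(product_id: str) -> dict:
    """Cache-aside read: try the cache first, fall back to the database."""
    key = f"product:{product_id}"
    cached = r.get(key)
    if cached is not None:                   # only None means a miss
        return json.loads(cached)
    # `db` stands for the application's data-access layer, defined elsewhere.
    product = db.query_product(product_id)   # cache miss: hit the DB
    r.set(key, json.dumps(product), ex=3600) # TTL of 1 hour
    return product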

The failure mode is the thundering herd (also called a cache stampede). When a hot key expires, every concurrent request misses at once and stampedes the database, which is exactly the spike you added caching to prevent. Two mitigations:

Add jitter to TTLs so keys do not all expire at the same instant.
Use a short-lived lock so only one request rebuilds the key while the others briefly serve stale data or wait.
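Both mitigations can be sketched in a few lines. This is a minimal illustration, not a hardened implementation: `jittered_ttl` and `get_with_rebuild_lock` are hypothetical helper names, `cache` is assumed to be any redis-py style client exposing `get`, `set(..., nx=True, ex=...)` and `delete`, and `loader` is an assumed callback that reads the database.

```python
import json
import random
import time
from typing import Any, Callable


def jittered_ttl(base_seconds: int, spread: float = 0.1) -> int:
    """Return base_seconds shifted randomly by up to +/- spread (10% by default)."""
    delta = base_seconds * spread
    return max(1, int(base_seconds + random.uniform(-delta, delta)))


def get_with_rebuild_lock(
    cache: Any,                   # redis-py style client (assumption)
    key: str,
    loader: Callable[[], dict],   # hypothetical callback that reads the database
    ttl_seconds: int = 3600,
    lock_seconds: int = 10,
    wait_seconds: float = 0.05,
    max_waits: int = 20,
) -> dict:
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)

    lock_key = f"lock:{key}"
    # SET with NX and EX: only one caller creates the lock, and it expires
    # on its own if that caller crashes mid-rebuild.
    if cache.set(lock_key, "1", nx=True, ex=lock_seconds):
        try:
            value = loader()
            cache.set(key, json.dumps(value), ex=jittered_ttl(ttl_seconds))
            return value
        finally:
            # Simplification: if the rebuild outlives lock_seconds this may
            # delete a lock that another caller has since acquired.
            cache.delete(lock_key)

    # Another caller is rebuilding: poll briefly for the fresh value.
    for _ in range(max_waits):
        time.sleep(wait_seconds)
        cached = cache.get(key)
        if cached is not None:
            return json.loads(cached)

    # Gave up waiting; read the database rather than fail the request.
    return loader()
```

The waiters in this sketch poll instead of serving stale data. Serving stale would require keeping the old value around past its logical expiry, which is a separate design choice.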

Write-through for read-heavy, correctness-sensitive data

When stale reads are unacceptable (account balances, inventory counts), write-through updates the cache synchronously as part of every write to the database. Reads stay fresh as long as every write goes through that path; writes pay a small latency tax.

Write-through trades write latency for read freshness. Use it when a stale read causes a real problem, not for data where a one-hour-old value is fine.

The trap is caching data that is rarely read. You pay the write cost on every update but get few cache hits in return. Pair write-through with a TTL so cold data eventually expires rather than living in memory forever.
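A write-through write with a TTL might look roughly like the following. The names are illustrative assumptions: `write_through` is a hypothetical helper, `persist` stands in for whatever function commits the value to your database, and `cache` is any redis-py style client with `set(key, value, ex=...)`.

```python
import json
from typing import Any, Callable


def write_through(
    cache: Any,                        # redis-py style client (assumption)
    key: str,
    value: dict,
    persist: Callable[[dict], None],   # hypothetical function that commits to the DB
    ttl_seconds: int = 86_400,
) -> None:
    # Database first: if persist raises, the cache keeps the previous value
    # instead of holding one that was never committed.
    persist(value)
    # Then refresh the cache, with a TTL so rarely-read keys expire.
    cache.set(key, json.dumps(value), ex=ttl_seconds)
```

If the cache write fails after the database commit, the cached value stays stale until the TTL expires. That is one more reason to keep the TTL in place even for write-through keys.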

Redis vs Memcached, and cluster mode

ElastiCache offers both engines. The choice is usually straightforward:

Need	Engine
Data structures, sorted sets, pub/sub, persistence, replication	Redis (ElastiCache for Redis / Valkey)
Pure key-value, multi-threaded, simplest possible cache	Memcached

For anything beyond a flat key-value store, I default to Redis. Cluster mode shards the keyspace for horizontal scale (and needs a cluster-aware client); high availability comes from giving each shard a replica in another AZ and turning on Multi-AZ with automatic failover. A single-node cache is a single point of failure: when it dies, every request falls through to the database and can take it down too.

resource "aws_elasticache_replication_group" "cache" {
  replication_group_id       = "api-cache"
  description                = "Product API cache"
  engine                     = "redis"
  node_type                  = "cache.r7g.large"
  num_node_groups            = 3 # shards for horizontal scale
  replicas_per_node_group    = 1 # one replica per shard
  automatic_failover_enabled = true
  multi_az_enabled           = true
  at_rest_encryption_enabled = true
  transit_encryption_enabled = true
}
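With more than one shard, it helps to know where a key will land. Redis Cluster assigns each key to one of 16384 hash slots (CRC16 of the key, modulo 16384), and a {hash tag} in the key restricts the hash to the text inside the braces. The sketch below reproduces that mapping in plain Python so you can check whether related keys share a slot. `hash_slot` is an illustrative helper rather than part of any client library, and real cluster-aware clients compute this for you.

```python
import binascii


def hash_slot(key: str) -> int:
    """Approximate the Redis Cluster slot for a key, honoring {hash tags}."""
    data = key.encode()
    start = data.find(b"{")
    if start != -1:
        end = data.find(b"}", start + 1)
        # Only a non-empty tag counts; "{}" means the whole key is hashed.
        if end > start + 1:
            data = data[start + 1:end]
    # crc_hqx with an initial value of 0 is the CRC16 (XMODEM) variant.
    return binascii.crc_hqx(data, 0) % 16384


# Keys that share a hash tag map to the same slot, so multi-key
# operations across them stay on one shard.
same_shard = hash_slot("{user:1}:cart") == hash_slot("{user:1}:profile")
```

Keys without a shared tag can land on different shards, which is why multi-key commands across unrelated keys fail in cluster mode.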

What to cache, and what not to

Caching everything is as wrong as caching nothing. My rules:

Cache expensive, frequently-read, slowly-changing data: catalog pages, config, computed aggregates.
Do not cache per-user data with low reuse, or anything where a stale value is a correctness or security problem (permissions, prices at checkout).
Always set a TTL. An unbounded cache fills memory and starts evicting unpredictably; an explicit TTL makes staleness a decision, not an accident.

Sizing and eviction

Pick an eviction policy deliberately. allkeys-lru is right for a pure cache. volatile-lru (evict only keys with a TTL) is right when the instance also holds data you must not lose. Watch the DatabaseMemoryUsagePercentage and Evictions CloudWatch metrics together with your hit rate. Some evictions are normal once an LRU cache is full; rising evictions alongside a falling hit rate mean the working set no longer fits, and you should scale up the node type or add shards.
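One way to read evictions and hit rate together is to compare two measurement windows. This is a rough sketch with assumed names and thresholds: `CacheSample` and `looks_undersized` are hypothetical, the counts are assumed to come from the CacheHits, CacheMisses and Evictions metrics, and the 2-point drop is an arbitrary starting value to tune.

```python
from dataclasses import dataclass


@dataclass
class CacheSample:
    hits: int        # cache hits in the window
    misses: int      # cache misses in the window
    evictions: int   # evictions in the window


def hit_rate(sample: CacheSample) -> float:
    """Fraction of lookups served from the cache; 0.0 when there was no traffic."""
    total = sample.hits + sample.misses
    return sample.hits / total if total else 0.0


def looks_undersized(
    previous: CacheSample,
    current: CacheSample,
    min_drop: float = 0.02,   # hypothetical threshold: 2 percentage points
) -> bool:
    """True when evictions rose and the hit rate fell by at least min_drop."""
    evictions_rising = current.evictions > previous.evictions
    hit_rate_falling = hit_rate(previous) - hit_rate(current) >= min_drop
    return evictions_rising and hit_rate_falling
```

Evictions that rise while the hit rate holds steady usually just mean the cache is full and the policy is discarding cold keys, so this check deliberately ignores that case.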

Takeaways

Start with cache-aside, but defend the hot keys with TTL jitter and a rebuild lock to avoid thundering herds.
Use write-through only for correctness-sensitive data where stale reads cause real problems, and still pair it with a TTL.
Default to Redis with cluster mode, cross-AZ replicas, and automatic failover so the cache is not a single point of failure.
Always set a TTL and pick an eviction policy on purpose; watch Evictions alongside hit rate to know when to scale.