Multi-region architectures: what you really need
Active-active, active-passive, or just backups? Matching resilience to actual requirements.
"We need to be multi-region" is one of those phrases that sounds like a requirement but is almost always a wish. When I dig in, the actual need is usually something narrower: survive an AZ failure, meet a regional data-residency law, or shave 80ms off latency for users in Europe. Each of those has a different, and usually cheaper, answer than full active-active.
So before anyone touches a second region, I make the team write down two numbers and answer one question. That exercise kills most multi-region projects, and saves the ones that are real.
Start with RTO and RPO, not regions
Everything flows from two targets:
- RTO (Recovery Time Objective): how long can you be down?
- RPO (Recovery Point Objective): how much data can you afford to lose?
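If your RTO is "4 hours" and RPO is "15 minutes," you do not need active-active. You need backups frequent enough to meet that RPO (point-in-time recovery or log shipping rather than nightly snapshots alone) and a tested restore. People reach for the most expensive pattern when their own targets call for the cheapest one.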
| Pattern | RTO | RPO | Relative cost |
|---|---|---|---|
| Backup & restore | Hours | Hours | ~1x |
| Pilot light | Tens of minutes | Minutes | ~1.2x |
| Warm standby | Minutes | Seconds | ~1.5x |
| Active-active | Near zero | Near zero | ~2x+ |
The real cost of multi-region isn't the duplicate infrastructure; it's the permanent engineering tax of keeping two regions in sync and tested. Active-active doubles your operational surface forever.
The hard part is always state
Stateless compute is trivially multi-region: you just deploy it twice. The difficulty is data, and AWS gives you a few honest options:
- DynamoDB global tables: multi-active, with last-writer-wins (LWW) conflict resolution. Genuinely active-active, but you must design for the LWW semantics.
- Aurora Global Database: one writer region and read replicas elsewhere, with typically sub-second replication lag and managed failover in about a minute. This is active-passive, full stop.
- S3 Cross-Region Replication: asynchronous; great for assets and backups, not for strong consistency.
The uncomfortable truth: unless your datastore is DynamoDB global tables, you almost certainly have a single writer region, which means you're active-passive whether you admit it or not. Pretending otherwise leads to split-brain.
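For the DynamoDB case, a replica region is declared on the table itself. The following is a minimal sketch rather than a production definition: the table name `sessions`, the key `session_id`, and the replica region `eu-west-1` are all hypothetical, and it assumes the AWS provider's `replica` block on `aws_dynamodb_table`, which requires streams with new and old images.

```hcl
# Hypothetical table; names and regions are placeholders for illustration.
resource "aws_dynamodb_table" "sessions" {
  name         = "sessions"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "session_id"

  # Global table replicas depend on streams carrying both images.
  stream_enabled   = true
  stream_view_type = "NEW_AND_OLD_IMAGES"

  attribute {
    name = "session_id"
    type = "S"
  }

  # Each replica block adds a writable copy of the table in another region.
  replica {
    region_name = "eu-west-1"
  }
}
```

Declaring the replica is the easy part. With LWW, two regions writing the same item at nearly the same time means one write is silently discarded, so it likely pays to keep writes for a given item homed in one region or to make them idempotent.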
Failover with Route 53 health checks
For active-passive, Route 53 failover routing does the heavy lifting. You attach a health check to the primary and Route 53 swings DNS to the secondary when it fails:
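```hcl
# Health check that probes the primary region's endpoint directly.
resource "aws_route53_health_check" "primary" {
  fqdn              = "api-use1.example.com"
  type              = "HTTPS"
  resource_path     = "/healthz"
  failure_threshold = 3  # consecutive failures before the endpoint is marked unhealthy
  request_interval  = 10 # seconds between checks (Route 53 accepts 10 or 30)
}

# PRIMARY half of the failover pair; served while the health check passes.
resource "aws_route53_record" "primary" {
  zone_id         = var.zone_id
  name            = "api.example.com"
  type            = "A"
  set_identifier  = "primary-use1"
  health_check_id = aws_route53_health_check.primary.id

  failover_routing_policy {
    type = "PRIMARY"
  }

  alias {
    name                   = aws_lb.use1.dns_name
    zone_id                = aws_lb.use1.zone_id
    evaluate_target_health = true
  }
}
```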
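The configuration above defines only the PRIMARY record. Failover routing needs a matching SECONDARY record with the same name and type, so here is a sketch of what that might look like. The load balancer `aws_lb.euw1` and the identifier `secondary-euw1` are assumptions standing in for whatever lives in your standby region.

```hcl
# Hypothetical SECONDARY record; aws_lb.euw1 is an assumed load balancer
# in the standby region, not something defined in this post.
resource "aws_route53_record" "secondary" {
  zone_id        = var.zone_id
  name           = "api.example.com"
  type           = "A"
  set_identifier = "secondary-euw1"

  failover_routing_policy {
    type = "SECONDARY"
  }

  alias {
    name                   = aws_lb.euw1.dns_name
    zone_id                = aws_lb.euw1.zone_id
    evaluate_target_health = true
  }
}
```

With `evaluate_target_health = true`, the secondary only counts as healthy once the standby load balancer has healthy targets behind it, which matters if the standby region normally runs with compute scaled down.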
Remember that DNS TTLs add to your real-world RTO. A 60-second TTL means clients can keep hitting the dead region for up to a minute after Route 53 fails over, so budget for it.
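To make that budget concrete, the arithmetic can sit next to the rest of the configuration. This is a rough sketch using the health check values from this post (3 failures at a 10-second interval) and the 60-second TTL; the local names are made up, and the result is an approximation of the DNS portion only, not a guarantee.

```hcl
locals {
  # Values mirror the health check and TTL discussed in this post.
  failure_threshold_count  = 3
  request_interval_seconds = 10
  dns_ttl_seconds          = 60

  # Rough time for Route 53 to declare the primary unhealthy: 3 * 10 = 30 s.
  detection_seconds = local.failure_threshold_count * local.request_interval_seconds

  # Worst case before cached answers expire: 30 + 60 = 90 s.
  dns_failover_seconds = local.detection_seconds + local.dns_ttl_seconds
}

output "dns_failover_seconds" {
  description = "Approximate DNS-only share of the RTO, in seconds."
  value       = local.dns_failover_seconds
}
```

Roughly 90 seconds pass before the last client moves over, and that is before any time spent promoting a database or scaling up compute in the standby region.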
Pilot light: the pattern I recommend most
For teams whose RTO is tens of minutes, pilot light hits the sweet spot. You keep data continuously replicated to the standby region (Aurora Global, DynamoDB, S3 CRR) but keep compute scaled to zero or near-zero. On disaster, you scale up the pre-baked infrastructure, defined in the same Terraform, and flip Route 53.
It costs roughly 20% more than single-region (you pay for storage replication and a minimal footprint) versus 2x+ for active-active, and the recovery is fully scripted. The one rule: you must test the failover on a schedule. An untested DR plan is a fiction.
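One way to keep the standby compute "defined in the same Terraform" is a single switch that sets its capacity. The sketch below assumes an EC2 Auto Scaling group; the variable `dr_active`, the provider alias `aws.standby`, the capacity numbers, and the referenced subnets, launch template, and target group are all hypothetical placeholders.

```hcl
# Hypothetical switch: false in normal operation, true during a disaster or drill.
variable "dr_active" {
  description = "Scale the standby region up to serve production traffic."
  type        = bool
  default     = false
}

resource "aws_autoscaling_group" "api_standby" {
  provider            = aws.standby # assumed provider alias for the standby region
  name                = "api-standby"
  vpc_zone_identifier = var.standby_subnet_ids # assumed variable

  # Pilot light: zero instances until the switch is flipped.
  min_size         = var.dr_active ? 2 : 0
  desired_capacity = var.dr_active ? 4 : 0
  max_size         = 10

  target_group_arns = [aws_lb_target_group.standby.arn] # assumed target group

  launch_template {
    id      = aws_launch_template.api.id # assumed launch template
    version = "$Latest"
  }
}
```

A failover drill then becomes `terraform apply -var="dr_active=true"` followed by a check that the standby actually serves traffic. Running that same command on a schedule is what turns the plan into something you can trust.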
Takeaways
- Define RTO and RPO first; they dictate the pattern. Most "multi-region" needs are satisfied by backup/restore or pilot light.
- State is the hard problem: unless you're on DynamoDB global tables, you have a single writer and you're active-passive.
- Account for DNS TTLs in your real RTO; failover isn't instant.
- Pilot light gives strong recovery at ~1.2x cost, but only if you actually test the failover regularly.