Observability on AWS: metrics, logs, and traces together
Wiring CloudWatch, X-Ray, and OpenTelemetry into one coherent view.
The first time I debugged a latency spike across a microservice fleet on AWS, I had three browser tabs open: CloudWatch metrics in one, a Logs Insights query in another, and X-Ray in a third. I spent more time correlating timestamps by hand than actually finding the problem. Metrics told me that p99 had doubled, logs told me what errored, and traces told me where the time went, but nothing tied them together.
Getting the three pillars to work as one system, not three silos, is what separates real observability from a dashboard collection. Here's how I stitch metrics, logs, and traces together on AWS.
The three pillars and where they live
- Metrics: CloudWatch metrics and, increasingly, Amazon Managed Service for Prometheus (AMP) for high-cardinality workloads.
- Logs: CloudWatch Logs, queried with Logs Insights, or shipped to OpenSearch for heavier analytics.
- Traces: AWS X-Ray, or OpenTelemetry traces collected via the ADOT (AWS Distro for OpenTelemetry) Collector.
The trap is treating each as a destination. The win is treating them as three views of the same request, joined by a shared trace ID.
Correlation is the whole game
If a log line and a trace span don't share an identifier, you can't pivot between them. The fix is to inject the trace ID into every structured log line. With X-Ray on Lambda, the ID is available in the _X_AMZN_TRACE_ID environment variable; with OTel, you pull it from the active span context.
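import json
import logging

from opentelemetry import trace

def current_trace_id():
    # Hex trace ID of the active span, or "-" when no span is active (trace_id == 0).
    ctx = trace.get_current_span().get_span_context()
    return format(ctx.trace_id, "032x") if ctx.trace_id else "-"

class TraceContextFilter(logging.Filter):
    # Also expose the ID as record.trace_id so a formatter can use %(trace_id)s.
    def filter(self, record):
        record.trace_id = current_trace_id()
        return True

logger = logging.getLogger("app")
logger.addFilter(TraceContextFilter())

def emit(msg, **fields):
    # One JSON object per line, so Logs Insights can discover the fields.
    logger.info(json.dumps({"msg": msg, "trace_id": current_trace_id(), **fields}))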
Now a Logs Insights query can find every log line for a slow trace:
fields @timestamp, msg, trace_id, latency_ms
| filter trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"
| sort @timestamp asc
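One wrinkle when pivoting the other way: the logs carry the 32-character hex ID, but X-Ray displays trace IDs in its own 1-xxxxxxxx-xxxxxxxxxxxxxxxxxxxxxxxx form, and on Lambda the _X_AMZN_TRACE_ID value is a Root=...;Parent=...;Sampled=... string. As far as I can tell the mapping is purely mechanical (version prefix, first 8 hex characters, remaining 24). Here is a minimal sketch of the conversions; the helper names are my own and not part of any SDK.

```python
import re
from typing import Optional

_HEX32 = re.compile(r"^[0-9a-f]{32}$")


def otel_to_xray(trace_id_hex: str) -> str:
    """Convert a 32-char hex OTel trace ID to X-Ray's 1-<8 hex>-<24 hex> form."""
    tid = trace_id_hex.lower()
    if not _HEX32.match(tid):
        raise ValueError(f"not a 32-character hex trace ID: {trace_id_hex!r}")
    return f"1-{tid[:8]}-{tid[8:]}"


def xray_to_otel(xray_id: str) -> str:
    """Inverse mapping: drop the version prefix and the dashes."""
    version, head, tail = xray_id.split("-")
    if version != "1" or len(head) != 8 or len(tail) != 24:
        raise ValueError(f"unexpected X-Ray trace ID: {xray_id!r}")
    return (head + tail).lower()


def root_from_header(value: str) -> Optional[str]:
    """Extract the Root trace ID from a Root=...;Parent=...;Sampled=... string."""
    for part in value.split(";"):
        key, _, val = part.strip().partition("=")
        if key == "Root":
            return val
    return None
```

With these, the ID from a log line can be pasted into the X-Ray console, and an ID copied from X-Ray can be dropped into the Logs Insights filter above.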
Collect once with OpenTelemetry
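Rather than instrument three times, I run the ADOT Collector and fan out from a single config. The collector receives OTLP, then exports traces to X-Ray and metrics to AMP; logs get a pipeline of the same shape, pointed at a logs exporter. The config below shows the traces and metrics halves.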
extensions:
  # Signs remote-write requests to AMP with SigV4; must also be listed under service.extensions.
  sigv4auth:
    region: us-east-1
    service: aps

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch:
    timeout: 10s

exporters:
  awsxray:
    region: us-east-1
  prometheusremotewrite:
    endpoint: "https://aps-workspaces.us-east-1.amazonaws.com/workspaces/ws-abc/api/v1/remote_write"
    auth:
      authenticator: sigv4auth

service:
  extensions: [sigv4auth]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [awsxray]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheusremotewrite]
One agent, one config, three destinations. Applications speak OTLP and stay vendor-neutral.
You don't have an observability problem until you try to answer a question you didn't dashboard in advance. Design for the unknown question, not the known graph.
What to actually alarm on
Metrics are cheap to over-collect and expensive to alert on noisily. I anchor alerts on the four golden signals (latency, traffic, errors, and saturation) and use CloudWatch composite alarms to suppress downstream noise when an upstream dependency is already firing.
| Signal | Source | Alarm basis |
|---|---|---|
| Latency | X-Ray / ALB metrics | p99 over rolling 5 min |
| Errors | Logs metric filter | 5xx rate > 1% |
| Saturation | CloudWatch / AMP | CPU/mem/queue depth |
| Traffic | RequestCount | anomaly detection band |
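The suppression logic lives in the composite alarm's rule expression, which combines ALARM("name") terms with AND, OR, and NOT. Here is a small sketch of building such a rule: page when any service alarm fires, unless an upstream alarm is already in ALARM. The function and the alarm names are hypothetical.

```python
from typing import Sequence


def suppressed_rule(service_alarms: Sequence[str], upstream_alarms: Sequence[str]) -> str:
    """Build a composite alarm rule: any service alarm firing AND no upstream alarm firing."""
    if not service_alarms:
        raise ValueError("need at least one service alarm")

    def any_of(names: Sequence[str]) -> str:
        return "(" + " OR ".join(f'ALARM("{n}")' for n in names) + ")"

    rule = any_of(service_alarms)
    if upstream_alarms:
        # Stay quiet while the dependency's own alarm is already paging someone.
        rule += " AND NOT " + any_of(upstream_alarms)
    return rule


rule = suppressed_rule(
    ["checkout-p99-latency", "checkout-5xx-rate"],  # hypothetical alarm names
    ["payments-db-unavailable"],
)
# The string would then be passed as AlarmRule, e.g. with boto3:
#   cloudwatch.put_composite_alarm(AlarmName="checkout-page", AlarmRule=rule)
```

Only the composite alarm gets a paging action; the underlying metric alarms stay action-free so they never page on their own.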
Cost discipline
Observability bills sneak up on you. Log ingestion is the usual culprit. I set retention policies aggressively (7 days hot, archive to S3 via subscription filter), sample traces at 5-10% with a higher rate for errors, and drop debug-level logs in production. That alone cut my CloudWatch Logs bill by more than half without losing the signal I needed for incidents.
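Error-biased sampling can be sketched in a few lines. The version below is illustrative rather than what any particular sampler does internally: it derives the decision from the trace ID, so every service reaches the same verdict for the same trace, and it applies a higher rate when the trace contains an error. Note that knowing about the error means deciding after the request has finished, not when it starts. The function name and default rates are my assumptions.

```python
def keep_trace(trace_id_hex: str, is_error: bool,
               base_rate: float = 0.05, error_rate: float = 1.0) -> bool:
    """Decide deterministically whether to keep a trace, favoring errors."""
    rate = error_rate if is_error else base_rate
    # Treat the low 64 bits of the trace ID as a uniform random number and
    # compare against an integer bound, avoiding float rounding at the edges.
    low64 = int(trace_id_hex[-16:], 16)
    bound = int(rate * 2**64)
    return low64 < bound


# Same trace ID, different outcome depending on whether it carried an error.
trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"
keep_ok = keep_trace(trace_id, is_error=False)
keep_err = keep_trace(trace_id, is_error=True)
```

Because the decision is a pure function of the trace ID, a trace is never half-kept: either every hop retains it or none does.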
Takeaways
- Inject a shared trace ID into structured logs so you can pivot from a metric spike to the exact log lines and spans.
- Run a single OpenTelemetry/ADOT Collector and fan out to X-Ray, AMP, and CloudWatch instead of instrumenting three times.
- Alarm on the four golden signals and use composite alarms to cut downstream noise.
- Control cost with short log retention, S3 archival, and error-biased trace sampling.