Logs vs Metrics vs Traces: When to Use Each
Observability signals are often confused because they overlap in what they can technically capture. But they serve different purposes, and using the wrong signal leads to slow queries, expensive storage, or blind spots in production. This guide explains when to use each.
This article is part of our broader approach to reliability and observability. For how we apply these principles operationally, see our Reliability & Availability documentation.
Logs
What they’re best for
Logs are event snapshots with full context. Use them when you need to debug specific failures, understand user actions, or reconstruct exactly what happened at a moment in time.
Examples:
- “Why did this payment fail?”
- “What error did user X see?”
- “Show me the stack trace for this crash”
Logs excel at answering what happened with high fidelity. They include error messages, stack traces, request IDs, user metadata—everything you’d want when debugging a specific incident.
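To make that context searchable, logs are usually emitted as structured events rather than free text. Below is a minimal sketch using Python's standard logging module; the field names (request_id, user_id, order_id) and the JSON layout are illustrative assumptions, not a prescribed schema.

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object so fields stay queryable."""
    def format(self, record):
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
            "logger": record.name,
        }
        # Attach structured context passed via `extra=...` (hypothetical fields).
        for key in ("request_id", "user_id", "order_id"):
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        if record.exc_info:
            payload["stack_trace"] = self.formatException(record.exc_info)
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("payments")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

try:
    raise RuntimeError("card declined by issuer")
except RuntimeError:
    # Full context for one specific failure: who, what, and the stack trace.
    logger.exception(
        "payment failed",
        extra={"request_id": "req-8f3a", "user_id": "u-1029", "order_id": "ord-55"},
    )
```

Emitting one JSON object per event is what lets you later ask "what error did user X see?" without grepping free-form strings.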
What they’re bad at
Logs don’t aggregate well. Answering “How many errors happened today?” by scanning millions of log entries is slow and expensive.
One common misuse
Using logs for trend analysis. Queries like “Count errors over the last week” should use metrics, not log aggregation. Log-based analytics tools work, but they’re slower and costlier than purpose-built metric systems.
Metrics
What they’re best for
Metrics are pre-aggregated counters, gauges, and histograms. Use them for dashboards, alerts, and trend analysis where you care about patterns, not individual events.
Examples:
- “How many requests per second?”
- “What’s the 95th percentile latency?”
- “Alert when CPU > 80%”
Metrics are cheap to store and query because they discard detail. Instead of saving every request, you store a count. Instead of every latency value, you store percentiles.
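As a concrete sketch of that trade-off, here is roughly what instrumenting a request path looks like with the prometheus_client library; the metric names, label values, and port are assumptions for illustration.

```python
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

# A counter discards individual events and keeps only a running total per label set.
REQUESTS = Counter(
    "http_requests_total", "Total HTTP requests", ["endpoint", "status"]
)

# A histogram discards individual latencies and keeps bucketed counts,
# from which percentiles (e.g. p95) are computed at query time.
LATENCY = Histogram(
    "http_request_duration_seconds", "Request latency", ["endpoint"]
)

def handle_checkout():
    with LATENCY.labels(endpoint="/checkout").time():
        time.sleep(random.uniform(0.01, 0.2))  # simulated work
    REQUESTS.labels(endpoint="/checkout", status="200").inc()

if __name__ == "__main__":
    start_http_server(8000)  # expose /metrics for the scraper
    while True:
        handle_checkout()
```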
What they’re bad at
Metrics don’t tell you why something happened. If your error rate spikes, metrics show the spike but not which endpoints failed or what the errors were. You need logs or traces to debug.
One common misuse
Adding too many cardinality dimensions. Tagging metrics with user IDs, request IDs, or transaction IDs creates millions of unique time series, exploding storage costs and query times. Metrics work best with low-cardinality tags (service name, endpoint, region).
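The difference is easy to see in instrumentation code. A short sketch, again assuming prometheus_client, contrasting low-cardinality labels with the per-user anti-pattern:

```python
from prometheus_client import Counter

# Low cardinality: a handful of label values -> a handful of time series.
errors_ok = Counter(
    "api_errors_total", "API errors", ["service", "endpoint", "region"]
)
errors_ok.labels(service="checkout", endpoint="/pay", region="eu-west-1").inc()

# High cardinality (anti-pattern): a new time series per user and per request,
# so storage and query cost grow with traffic instead of staying flat.
errors_bad = Counter(
    "api_errors_by_user_total", "API errors per user", ["user_id", "request_id"]
)
errors_bad.labels(user_id="u-1029", request_id="req-8f3a").inc()  # don't do this
```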
Traces
What they’re best for
Traces show request flow across services. Use them to debug latency, understand dependencies, or find where a distributed request failed.
Examples:
- “Why is this endpoint slow?”
- “Which service is causing timeouts?”
- “Show me the call path for request X”
Traces answer where in a distributed system something went wrong. They link spans across services, showing you the exact sequence of calls and their durations.
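A minimal sketch of what that instrumentation looks like with the OpenTelemetry Python SDK; the service, span, and attribute names are illustrative, and a real setup would export to a collector rather than the console.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Export spans to stdout for the sketch; production setups send them to a collector.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("checkout-service")

def handle_checkout():
    # Parent span for the incoming request; child spans capture each dependency call.
    with tracer.start_as_current_span("POST /checkout") as span:
        span.set_attribute("order.id", "ord-55")
        with tracer.start_as_current_span("inventory.reserve"):
            ...  # call to the inventory service
        with tracer.start_as_current_span("payments.charge"):
            ...  # call to the payment provider

handle_checkout()
```

Each span records its own duration, so the trace shows not just the call path but where the time went.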
What they’re bad at
Traces are expensive at scale. Running analytics like “Find all traces with latency > 1s” requires sampling or indexing.
One common misuse
Tracing everything without sampling. Full tracing at high request volumes creates massive data pipelines. Most teams sample traces (1-10%) and only keep slow or failed requests at 100%.
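Ratio-based head sampling is usually a one-line SDK configuration. A sketch using OpenTelemetry's ParentBased and TraceIdRatioBased samplers, assuming a 5% keep rate:

```python
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Head-based sampling: the keep/drop decision is made when the trace starts.
# ParentBased makes downstream services follow the root span's decision,
# so a trace is either captured end-to-end or not at all.
sampler = ParentBased(root=TraceIdRatioBased(0.05))  # keep roughly 5% of traces
provider = TracerProvider(sampler=sampler)
```

Keeping 100% of errors and slow requests is a separate, tail-based decision made after a trace completes, typically in a collector; there is a sketch of that rule under Minimal Sane Setup below.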
Decision Guide
| Question | Use This |
|---|---|
| How many errors happened? | Metrics |
| Why did this request fail? | Logs |
| Which service is slow? | Traces |
| Is the database overloaded? | Metrics |
| What did user X do? | Logs |
| Where is latency coming from? | Traces |
| Alert on high CPU | Metrics |
| Debug a crash | Logs |
| Visualize request flow | Traces |
Minimal Sane Setup
Most teams should start with:
Metrics
- Request rate, error rate, latency (per service, per endpoint)
- Infrastructure: CPU, memory, disk, network
- Business metrics: signups, purchases, active users
- Alert on rate-of-change, not absolute thresholds
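The rate-of-change idea is simple enough to sketch directly. The class below compares the current window of error-rate samples against the previous one and fires only on sharp growth; the window size and growth factor are assumptions, and in practice this logic usually lives in your alerting rules rather than in application code.

```python
from collections import deque

class RateOfChangeAlert:
    """Fire when the error rate grows sharply versus the previous window,
    rather than when it crosses a fixed absolute threshold."""

    def __init__(self, window: int = 5, growth_factor: float = 2.0):
        self.window = window                      # samples per window
        self.samples = deque(maxlen=window * 2)   # previous + current window
        self.growth_factor = growth_factor

    def observe(self, error_rate: float) -> bool:
        self.samples.append(error_rate)
        if len(self.samples) < self.window * 2:
            return False  # not enough history yet
        previous = list(self.samples)[: self.window]
        current = list(self.samples)[self.window:]
        prev_avg = sum(previous) / self.window
        curr_avg = sum(current) / self.window
        # Alert when the recent window is, say, 2x the previous one.
        return prev_avg > 0 and curr_avg / prev_avg >= self.growth_factor

alert = RateOfChangeAlert()
for rate in [0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.03, 0.04, 0.05]:
    if alert.observe(rate):
        print("error rate climbing:", rate)
```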
Logs
- Errors and exceptions (always)
- Authentication events (login, logout, failures)
- Critical business events (payments, orders)
- Sample debug logs at 1-10% in production
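Debug-log sampling can be done with a plain logging filter. A sketch using Python's standard logging module, assuming a 5% keep rate for DEBUG records:

```python
import logging
import random

class DebugSampler(logging.Filter):
    """Pass all INFO-and-above records, but only a fraction of DEBUG records."""

    def __init__(self, sample_rate: float = 0.05):
        super().__init__()
        self.sample_rate = sample_rate

    def filter(self, record: logging.LogRecord) -> bool:
        if record.levelno > logging.DEBUG:
            return True  # never drop warnings, errors, etc.
        return random.random() < self.sample_rate

logger = logging.getLogger("checkout")
logger.setLevel(logging.DEBUG)
handler = logging.StreamHandler()
handler.addFilter(DebugSampler(0.05))  # ~5% of debug lines reach the handler
logger.addHandler(handler)
```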
Traces
- Sample 1-5% of successful requests
- Trace 100% of errors and slow requests (> P95 latency); see the sketch after this list
- Focus on critical paths (checkout, API requests, auth flows)
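Keeping every error and every slow request while sampling the rest is a tail-based decision, made after a trace finishes; it usually lives in a collector rather than in the SDK, but the rule itself is small. A sketch with a hypothetical CompletedTrace type and an assumed P95 threshold:

```python
import random
from dataclasses import dataclass

@dataclass
class CompletedTrace:
    duration_ms: float
    has_error: bool

# Assumed values for the sketch: current p95 latency and the baseline keep rate.
P95_LATENCY_MS = 250.0
BASELINE_SAMPLE_RATE = 0.05  # 5% of ordinary traces

def should_keep(trace: CompletedTrace) -> bool:
    """Tail-based sampling rule: always keep errors and slow traces,
    keep only a small random sample of everything else."""
    if trace.has_error:
        return True
    if trace.duration_ms > P95_LATENCY_MS:
        return True
    return random.random() < BASELINE_SAMPLE_RATE

kept = should_keep(CompletedTrace(duration_ms=420.0, has_error=False))  # True: slow
```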
This baseline gives you dashboards (metrics), debugging context (logs), and latency insight (traces) without overwhelming your systems or budget.
Avoid these early mistakes:
- Logging everything at DEBUG level in production (expensive, noisy)
- Metrics with unbounded cardinality (user IDs, session IDs)
- Tracing 100% of traffic without sampling
When to Use Multiple Signals Together
The signals complement each other:
- Metric spike → Check logs for error details → Use traces to find the slow service
- Slow trace → Check logs for database queries → Metrics show DB connection pool exhaustion
- Error logs → Metrics show which endpoints → Traces show upstream service timeouts
Think of metrics as your dashboard, logs as your debugger, and traces as your map. Each signal serves different questions.
Common Patterns
High error rate (metrics) + specific errors (logs)
Metrics alert you that errors spiked. Logs show you the actual error messages and stack traces.
Slow endpoint (metrics) + request path (traces)
Metrics show P95 latency increased. Traces show which downstream service is adding latency.
Failed deployment (all three)
Metrics show request rate dropped, error rate spiked. Logs show new exceptions. Traces show timeouts to a new service version.
Failure Modes
Logs-only systems
You can debug individual failures but lack trend visibility for alerting and capacity planning.
Metrics-only systems
You see that something is wrong but not what or why; debugging requires additional instrumentation such as logs or traces.
Traces-only systems
You see latency and call paths but lack error context for root cause analysis.
What’s the difference between logs and traces?
Logs capture individual events with full context (errors, messages, stack traces), while traces track requests across multiple services showing call paths and latency. Use logs to debug “why it failed,” traces to debug “where it’s slow.”
Should I use metrics or logs for error tracking?
Use both. Metrics alert you when error rates spike (fast, cheap queries). Logs give you the actual error messages and stack traces needed to fix the issue.
How much should I sample traces?
Start with 1-5% sampling for normal requests, 100% for errors and slow requests (above P95 latency). Adjust based on traffic volume and storage costs.
Can I skip metrics and just use logs?
No. Aggregating logs for dashboards and alerts is slow and expensive. Metrics are purpose-built for trend analysis, with orders of magnitude better performance.
Final Thoughts
Observability isn’t about choosing one signal; it’s about using the right signal for the question.
- Metrics for trends and alerts
- Logs for debugging and context
- Traces for distributed latency
Start with the minimal setup, sample aggressively, and add detail only where you need it. Many teams over-log and under-invest in metrics early. Balancing all three signals leads to faster debugging and lower operational costs.