Logs vs Metrics vs Traces: When to Use Each

Observability signals are often confused because they overlap in what they can technically capture. But they serve different purposes, and using the wrong signal leads to slow queries, expensive storage, or blind spots in production. This guide explains when to use each.

This article is part of our broader approach to reliability and observability. For how we apply these principles operationally, see our Reliability & Availability documentation.

Logs

What they’re best for

Logs are event snapshots with full context. Use them when you need to debug specific failures, understand user actions, or reconstruct exactly what happened at a moment in time.

Examples:

  • “Why did this payment fail?”
  • “What error did user X see?”
  • “Show me the stack trace for this crash”

Logs excel at answering what happened with high fidelity. They include error messages, stack traces, request IDs, user metadata—everything you’d want when debugging a specific incident.
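
To make this concrete, here is a minimal sketch of structured logging using only the Python standard library; the payment function, field names, and user metadata are hypothetical stand-ins.

```python
import json
import logging
import traceback
import uuid

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("payments")

def charge_card(amount_cents: int) -> None:
    # Stand-in for a real payment call; it fails so the example has something to log.
    raise RuntimeError("card declined by issuer")

try:
    charge_card(4999)
except Exception as exc:
    # One event, captured with the full context you would want later:
    # an ID to correlate with other systems, the error, and the stack trace.
    log.error(json.dumps({
        "event": "payment_failed",
        "request_id": str(uuid.uuid4()),
        "user_id": "user-123",
        "amount_cents": 4999,
        "error": str(exc),
        "stack": traceback.format_exc(),
    }))
```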

What they’re bad at

Logs don’t aggregate well. Answering “How many errors happened today?” by scanning millions of log entries is slow and expensive.

One common misuse

Using logs for trend analysis. Queries like “Count errors over the last week” should use metrics, not log aggregation. Log-based analytics tools work, but they’re slower and costlier than purpose-built metric systems.

Metrics

What they’re best for

Metrics are pre-aggregated counters, gauges, and histograms. Use them for dashboards, alerts, and trend analysis where you care about patterns, not individual events.

Examples:

  • “How many requests per second?”
  • “What’s the 95th percentile latency?”
  • “Alert when CPU > 80%”

Metrics are cheap to store and query because they discard detail. Instead of saving every request, you store a count. Instead of every latency value, you store percentiles.
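
As a rough sketch of what this looks like in code, assuming the prometheus_client Python library as the metrics tooling: individual requests are never stored, only counts and latency buckets that a backend scrapes from /metrics.

```python
from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter(
    "http_requests_total", "Total HTTP requests",
    ["endpoint", "status"],          # low-cardinality labels only
)
LATENCY = Histogram(
    "http_request_latency_seconds", "Request latency in seconds",
    ["endpoint"],
)

def handle_request(endpoint: str) -> None:
    with LATENCY.labels(endpoint=endpoint).time():   # records one latency observation
        # ... real handler work would run here ...
        REQUESTS.labels(endpoint=endpoint, status="200").inc()

start_http_server(8000)      # exposes the aggregates at /metrics for scraping
handle_request("/checkout")
```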

What they’re bad at

Metrics don’t tell you why something happened. If your error rate spikes, metrics show the spike but not which endpoints failed or what the errors were. You need logs or traces to debug.

One common misuse

Adding high-cardinality dimensions. Tagging metrics with user IDs, request IDs, or transaction IDs creates millions of unique time series, exploding storage costs and query times. Metrics work best with low-cardinality tags (service name, endpoint, region).
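
A short sketch of the difference, again assuming prometheus_client; the metric and label names are illustrative.

```python
from prometheus_client import Counter

# Bounded label values: a handful of time series in total.
CHECKOUT_ERRORS = Counter(
    "checkout_errors_total", "Checkout errors",
    ["service", "endpoint", "region"],
)
CHECKOUT_ERRORS.labels(service="payments", endpoint="/charge", region="us-east-1").inc()

# Unbounded label values: one new time series per user and per request.
# Put these in logs or traces instead.
# BAD = Counter("checkout_errors_by_user_total", "Checkout errors by user",
#               ["user_id", "request_id"])
```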

Traces

What they’re best for

Traces show request flow across services. Use them to debug latency, understand dependencies, or find where a distributed request failed.

Examples:

  • “Why is this endpoint slow?”
  • “Which service is causing timeouts?”
  • “Show me the call path for request X”

Traces answer where in a distributed system something went wrong. They link spans across services, showing you the exact sequence of calls and their durations.
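
Here is a minimal sketch using the OpenTelemetry Python SDK (an assumed tooling choice; service and span names are illustrative). Nested spans record the sequence of calls and how long each one took.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("checkout-service")

def checkout() -> None:
    with tracer.start_as_current_span("checkout"):                # root span
        with tracer.start_as_current_span("reserve-inventory"):   # child span
            pass  # call to the inventory service would go here
        with tracer.start_as_current_span("charge-card"):         # child span
            pass  # call to the payment service would go here

checkout()
```

In a real deployment, context propagation carries the trace ID across service boundaries so spans emitted by different services join into a single trace.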

What they’re bad at

Traces are expensive at scale. Running analytics like “Find all traces with latency > 1s” requires sampling or indexing.

One common misuse

Tracing everything without sampling. Full tracing at high request volumes creates massive data pipelines. Most teams sample traces at 1-10% and keep 100% only of slow or failed requests.
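
A minimal sketch of head-based sampling with the OpenTelemetry Python SDK (an assumed tooling choice): keep roughly 5% of new traces and let downstream services follow the parent's decision. Keeping 100% of errors and slow requests on top of this generally requires tail-based sampling, typically done in a collector.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Sample ~5% of traces at the entry point; child services follow the parent's decision.
provider = TracerProvider(sampler=ParentBased(TraceIdRatioBased(0.05)))
trace.set_tracer_provider(provider)
```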

Question                       | Use This
How many errors happened?      | Metrics
Why did this request fail?     | Logs
Which service is slow?         | Traces
Is the database overloaded?    | Metrics
What did user X do?            | Logs
Where is latency coming from?  | Traces
Alert on high CPU              | Metrics
Debug a crash                  | Logs
Visualize request flow         | Traces

Most teams should start with:

Metrics

  • Request rate, error rate, latency (per service, per endpoint)
  • Infrastructure: CPU, memory, disk, network
  • Business metrics: signups, purchases, active users
  • Alert on rate-of-change, not absolute thresholds
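
For the last bullet above, a hypothetical sketch of the rate-of-change idea; the window size and tripling threshold are illustrative, and real systems usually express this as an alerting rule in the metrics backend rather than in application code.

```python
def should_alert(per_minute_errors: list[int], window: int = 5) -> bool:
    """Fire when errors in the latest window at least triple the prior window."""
    if len(per_minute_errors) < 2 * window:
        return False
    recent = sum(per_minute_errors[-window:])
    previous = sum(per_minute_errors[-2 * window:-window])
    return previous > 0 and recent >= 3 * previous

# Example: a steady baseline followed by a spike in the last five minutes.
print(should_alert([2, 3, 2, 2, 3, 9, 11, 10, 12, 9]))  # True
```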

Logs

  • Errors and exceptions (always)
  • Authentication events (login, logout, failures)
  • Critical business events (payments, orders)
  • Sample debug logs at 1-10% in production
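
For the last bullet above, a minimal sketch of debug-log sampling using a standard library logging.Filter; the 5% rate is illustrative.

```python
import logging
import random

class DebugSampler(logging.Filter):
    """Pass every record at INFO and above; keep only a fraction of DEBUG records."""

    def __init__(self, rate: float = 0.05) -> None:
        super().__init__()
        self.rate = rate

    def filter(self, record: logging.LogRecord) -> bool:
        if record.levelno > logging.DEBUG:
            return True
        return random.random() < self.rate

logger = logging.getLogger("app")
logger.setLevel(logging.DEBUG)
handler = logging.StreamHandler()
handler.addFilter(DebugSampler(0.05))
logger.addHandler(handler)
```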

Traces

  • Sample 1-5% of successful requests
  • Trace 100% of errors and slow requests (> P95 latency)
  • Focus on critical paths (checkout, API requests, auth flows)
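
The sampling policy above can be written as a small hypothetical decision function (the latency threshold and base rate are illustrative). In practice this requires tail-based sampling, because a trace's duration and error status are only known after it completes, so the decision usually lives in a collector rather than in application code.

```python
import random

P95_LATENCY_MS = 400.0  # illustrative per-service threshold

def should_keep(duration_ms: float, had_error: bool, base_rate: float = 0.05) -> bool:
    if had_error or duration_ms > P95_LATENCY_MS:
        return True                        # keep 100% of errors and slow traces
    return random.random() < base_rate     # sample ~5% of everything else
```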

This baseline gives you dashboards (metrics), debugging context (logs), and latency insight (traces) without overwhelming your systems or budget.

Avoid these early mistakes:

  • Logging everything at DEBUG level in production (expensive, noisy)
  • Metrics with unbounded cardinality (user IDs, session IDs)
  • Tracing 100% of traffic without sampling

The signals complement each other:

  1. Metric spike → Check logs for error details → Use traces to find the slow service
  2. Slow trace → Check logs for database queries → Metrics show DB connection pool exhaustion
  3. Error logs → Metrics show which endpoints → Traces show upstream service timeouts

Think of metrics as your dashboard, logs as your debugger, and traces as your map. Each signal serves different questions.

High error rate (metrics) + specific errors (logs)

Metrics alert you that errors spiked. Logs show you the actual error messages and stack traces.

Slow endpoint (metrics) + request path (traces)

Metrics show P95 latency increased. Traces show which downstream service is adding latency.

Failed deployment (all three)

Metrics show request rate dropped, error rate spiked. Logs show new exceptions. Traces show timeouts to a new service version.

Logs-only systems

You can debug individual failures but lack trend visibility for alerting and capacity planning.

Metrics-only systems

You see that something is wrong, but not what or why; debugging requires additional instrumentation.

Traces-only systems

You see latency and call paths but lack error context for root cause analysis.

What’s the difference between logs and traces?

Logs capture individual events with full context (errors, messages, stack traces), while traces track requests across multiple services showing call paths and latency. Use logs to debug “why it failed,” traces to debug “where it’s slow.”

Should I use metrics or logs for error tracking?

Use both. Metrics alert you when error rates spike (fast, cheap queries). Logs give you the actual error messages and stack traces needed to fix the issue.

How much should I sample traces?

Start with 1-5% sampling for normal requests, 100% for errors and slow requests (above P95 latency). Adjust based on traffic volume and storage costs.

Can I skip metrics and just use logs?

No. Aggregating logs for dashboards and alerts is slow and expensive. Metrics are purpose-built for trend analysis, with orders of magnitude better performance.

Observability isn’t about choosing one signal—it’s about using the right signal for the question.

  • Metrics for trends and alerts
  • Logs for debugging and context
  • Traces for distributed latency

Start with the minimal setup, sample aggressively, and add detail only where you need it. Many teams over-log and under-invest in metrics early. Balancing all three signals leads to faster debugging and lower operational costs.

Olga S.

Founder of NewsDataHub
