Trace a Request with Grafana Tempo
Goal
Find and analyze a specific request trace using Grafana Tempo to debug latency issues, identify bottlenecks, or investigate errors across microservices.
Prerequisites
Before you begin, ensure you have:
- [ ] Application instrumented with OpenTelemetry (see Distributed Tracing)
- [ ] Access to Grafana UI (https://grafana.127.0.0.1.nip.io)
- [ ] Tempo data source configured in Grafana
- [ ] Trace ID from application logs (for direct lookup) OR time window of the issue
Steps
Method 1: Find Trace by Trace ID (Direct Lookup)
Use this method when you have a specific trace ID from application logs.
1. Locate Trace ID in Application Logs
Application logs should include trace IDs. Example log entry:
{
"timestamp": "2024-12-06T10:30:45Z",
"level": "ERROR",
"message": "Payment processing failed",
"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
"span_id": "00f067aa0ba902b7",
"service": "payment-service",
"error": "Database connection timeout"
}
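If you prefer to pull trace IDs straight from the cluster, you can filter them out of the structured logs from the command line. A minimal sketch, assuming the pod emits JSON log lines like the entry above and using a hypothetical deployment name my-service in the my-service-dev namespace:
# Pull trace IDs out of recent ERROR log lines (field name and spacing as in the example above)
kubectl logs -n my-service-dev deployment/my-service --since=15m \
  | grep '"level": "ERROR"' \
  | grep -oE '"trace_id": *"[0-9a-f]+"' \
  | sort -u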
2. Open Grafana Explore
- Navigate to Grafana: https://grafana.127.0.0.1.nip.io
- Click Explore (compass icon) in the left sidebar
- Select Tempo from the data source dropdown
3. Search by Trace ID
- In the query builder, select Search tab
- Enter the trace ID in the Trace ID field: 4bf92f3577b34da6a3ce929d0e0e4736
- Click Run query (or press Shift+Enter)
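The same lookup also works from the command line against Tempo's HTTP API, which is handy for scripting. This is a sketch that assumes the Tempo query frontend runs as the tempo service in the monitoring namespace and listens on its default HTTP port 3200:
# Port-forward the Tempo query frontend locally (service name and port are assumptions)
kubectl port-forward -n monitoring svc/tempo 3200:3200 &
# Fetch the trace by ID; an empty or 404 response means Tempo does not have this trace
curl -s http://localhost:3200/api/traces/4bf92f3577b34da6a3ce929d0e0e4736 | jq .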
4. Analyze the Trace
The trace view displays:
- Service Map: Visual representation of service calls
- Timeline: Waterfall view showing span duration
- Span Details: Click any span to see:
- Span name and duration
- HTTP method, status code, route
- Database queries
- Error messages
- Custom attributes
Method 2: Search Traces Using TraceQL
Use this method to find traces matching specific criteria (errors, slow requests, specific service).
1. Open Tempo in Grafana Explore
- Navigate to Explore in Grafana
- Select Tempo data source
2. Write TraceQL Query
Use TraceQL to search for traces. Common queries:
Find all error traces in the last hour:
{status = error}
Find slow requests (> 1 second):
{duration > 1s}
Find traces for a specific service:
{resource.service.name = "payment-service"}
Find traces with database errors:
{span.db.system = "postgresql" && status = error}
Find traces for a specific HTTP route:
{span.http.route =~ "/api/orders/.*" && span.http.method = "POST"}
Combine multiple conditions:
{
resource.service.name = "payment-service" &&
duration > 500ms &&
span.http.status_code = 500
}
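These queries can also be run outside the UI through Tempo's search API, for example to script checks. A sketch that reuses the port-forward from Method 1 and assumes GNU date for the time range:
# Search the last hour for slow payment-service traces using TraceQL
curl -s -G http://localhost:3200/api/search \
  --data-urlencode 'q={resource.service.name = "payment-service" && duration > 500ms}' \
  --data-urlencode "start=$(date -d '1 hour ago' +%s)" \
  --data-urlencode "end=$(date +%s)" \
  | jq '.traces[]? | {traceID, rootServiceName, durationMs}'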
3. Execute Query
- Enter your TraceQL query in the query editor
- Set the time range (e.g., "Last 1 hour")
- Click Run query
4. Review Search Results
Results show:
- Trace ID
- Root service name
- Trace duration
- Start time
- Number of spans
- Number of errors
Click on a trace to view details.
Method 3: Trace-to-Logs Correlation
Use this method to jump from logs to traces.
1. Open Logs in Grafana
- Navigate to Explore in Grafana
- Select OpenSearch (or Loki) data source
2. Query Application Logs
Example query for recent errors (Loki LogQL syntax; use Lucene query syntax if you selected OpenSearch):
{namespace="my-service-dev"} |~ "ERROR"
3. Click Trace ID Link
In the log results:
- Find a log entry with a trace_id field
- Click on the trace ID link (appears as a blue hyperlink)
- Grafana automatically switches to Tempo and loads the trace
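If the trace ID does not render as a link (the logs data source may not have trace-to-Tempo linking configured), you can still correlate manually from the command line. A sketch using the trace ID from Method 1 and a hypothetical deployment name:
# Find every log line that carries a given trace ID
kubectl logs -n my-service-dev deployment/my-service --since=1h \
  | grep 4bf92f3577b34da6a3ce929d0e0e4736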
Method 4: Trace from Service Map
Use this method to explore service dependencies and find problematic traces.
1. Open Service Map
- In Grafana, navigate to Explore
- Select Tempo data source
- Click on Service Graph tab
2. Identify Problematic Service
The service map shows:
- Nodes: Each service in your architecture
- Edges: Request flow between services
- Colors: Green (healthy), Yellow (warnings), Red (errors)
- Size: Proportional to request volume
3. Search Traces for That Service
- Click on a service node showing high latency or errors
- The query builder auto-populates with the service name
- Add additional filters (e.g., errors only)
- Click Run query
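The Service Graph tab is only populated when service graph metrics are being generated, typically by Tempo's metrics-generator with the service-graphs processor enabled. If the graph is empty, a quick check against the tempo-config ConfigMap referenced in Troubleshooting below might look like:
# Look for the metrics_generator block and its processors in Tempo's configuration
kubectl get configmap tempo-config -n monitoring -o yaml | grep -A 10 'metrics_generator'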
Method 5: Search by Custom Attributes
Use custom span attributes to find specific user journeys or business flows.
1. Build Query with Custom Attributes
Example queries using custom attributes:
Find traces for a specific user:
{span.user.id = "alice@example.com"}
Find traces for a specific order:
{span.order.id = "ORD-12345"}
Find traces with feature flag enabled:
{span.feature.flag.new_checkout = "true"}
2. Combine with Other Filters
{
span.user.id = "alice@example.com" &&
resource.service.name = "checkout-service" &&
duration > 2s
}
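If you are unsure which custom attributes your services actually record, Tempo's tag endpoints can list them before you write a query. A sketch reusing the earlier port-forward (endpoint paths per Tempo's search API):
# List attribute names Tempo has indexed recently
curl -s http://localhost:3200/api/search/tags | jq .
# List observed values for one attribute, e.g. the user.id attribute used above
curl -s http://localhost:3200/api/search/tag/user.id/values | jq .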
Verification
1. Verify Trace Completeness
A complete trace should show:
- ✅ All services involved in the request path
- ✅ Parent-child relationships between spans
- ✅ No missing spans (gaps in the timeline)
- ✅ Consistent trace context propagation
Check for missing spans:
- Look for gaps in the waterfall view
- Verify each service call has a corresponding child span
- Check for orphaned spans (no parent)
2. Identify Bottlenecks
Find the slowest operation in the trace:
- Sort spans by duration (longest first)
- Look for spans consuming >50% of total trace time
- Common bottlenecks:
- Slow database queries
- External API calls
- Expensive computation
- Network latency between services
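To confirm that a suspected bottleneck is systemic rather than a one-off, you can search for traces in which that span type alone is slow. A sketch reusing the earlier port-forward, with postgresql as an example database system (add start/end parameters as in Method 2 to widen the time range):
# Find traces containing PostgreSQL spans that are themselves slower than 500ms
curl -s -G http://localhost:3200/api/search \
  --data-urlencode 'q={span.db.system = "postgresql" && duration > 500ms}' \
  | jq -r '.traces[]? | .traceID'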
3. Analyze Error Attribution
For traces with errors:
- Find the first span with status = error
- Check span attributes for error details:
  - error.type: Exception class name
  - error.message: Error description
  - error.stack: Stack trace
- Trace backwards to find the root cause
4. Verify Span Attributes
Inspect key span attributes:
Service: payment-service
Operation: POST /api/payments
Duration: 1.2s
Attributes:
http.method: POST
http.route: /api/payments
http.status_code: 500
http.url: https://api.example.com/api/payments
db.system: postgresql
db.statement: SELECT * FROM payments WHERE id = $1
db.statement.duration: 850ms
error.type: ConnectionTimeoutError
error.message: Database connection timeout after 1000ms
user.id: alice@example.com
payment.amount: 99.99
payment.currency: USD
Understanding Trace Visualization
Waterfall View
The waterfall view shows spans in chronological order:
┌─────────────────────────────────────────────────────────────┐
│ payment-service: POST /api/payments [1.2s] ▓▓▓▓▓▓▓ │
│ ├─ validate-payment [50ms] ▓ │
│ ├─ check-fraud [200ms] ▓▓ │
│ ├─ database: SELECT [850ms] ▓▓▓▓▓▓ │
│ └─ send-confirmation [100ms] ▓ │
└─────────────────────────────────────────────────────────────┘
Time →
Key indicators:
- Width: Span duration (wider = slower)
- Color: Status (green = success, red = error)
- Nesting: Parent-child relationships
- Gaps: Network latency or waiting time
Service Dependency Graph
Shows the flow of requests across services:
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Frontend │ ─────▶ │ API GW │ ─────▶ │ Payment │
│ │ │ │ │ Service │
└──────────────┘ └──────────────┘ └───────┬──────┘
│
▼
┌──────────────┐
│ PostgreSQL │
└──────────────┘
Span Timeline
Detailed view of a single span:
Span: database-query
Duration: 850ms
Start: 2024-12-06 10:30:45.123
End: 2024-12-06 10:30:45.973
Timeline:
0ms ──────────▶ Connection acquired
50ms ─────────▶ Query sent to database
800ms ────────▶ Results received
850ms ────────▶ Span complete
Troubleshooting
No Traces Found for Trace ID
Cause: Trace ID not in Tempo, or retention period expired.
Solution:
# Check Tempo health
kubectl get pods -n monitoring -l app.kubernetes.io/name=tempo
# Verify trace retention period
kubectl get configmap tempo-config -n monitoring -o yaml | grep retention
# Check OpenTelemetry Collector is sending traces
kubectl logs -n monitoring -l app.kubernetes.io/name=opentelemetry-collector | grep -i tempo
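If the pods look healthy but lookups still fail, you can rule out the query path by hitting Tempo's own endpoints directly. A sketch reusing the port-forward from Method 1:
# The query endpoint answers the string "echo" when it is reachable
curl -s http://localhost:3200/api/echo
# Overall readiness of the Tempo process
curl -s http://localhost:3200/ready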
Trace Shows Incomplete Spans
Cause: Missing OpenTelemetry instrumentation or broken trace context propagation.
Solution:
- Verify all services are instrumented with OpenTelemetry
- Check for proper trace context propagation:
# Logs should include trace_id in each service
kubectl logs -n my-service-dev deployment/my-service | grep trace_id
- Verify HTTP headers are propagated (traceparent, tracestate)
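To test propagation end to end, you can send a request with an explicit W3C traceparent header and confirm the same trace ID shows up in each service's logs. A sketch with an assumed port and path, assuming curl is available inside the pod (any 32-hex-character trace ID works):
# traceparent format: 00-<trace-id>-<parent-span-id>-<flags>
kubectl exec -n my-service-dev deploy/my-service -- \
  curl -s -H 'traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01' \
  http://localhost:8080/api/orders
# Then check that downstream services logged the same trace_id
kubectl logs -n my-service-dev deployment/my-service --since=5m | grep 4bf92f3577b34da6a3ce929d0e0e4736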
TraceQL Query Returns No Results
Cause: Incorrect query syntax or no matching traces.
Solution:
# Start with broad query
{ } # Returns all traces in time range
# Gradually narrow down
{resource.service.name = "payment-service"}
# Check attribute spelling and case sensitivity
{span.http.method = "POST"} # Correct
{span.HTTP.Method = "POST"} # Incorrect (case-sensitive)
Trace Shows High Latency but No Bottleneck
Cause: Network latency or waiting between spans.
Solution:
- Look for gaps in the waterfall view (time between spans)
- Check for asynchronous operations or queued tasks
- Verify inter-service network latency:
# Test network latency between pods
kubectl exec -n namespace1 pod1 -- curl -w "@curl-format.txt" -o /dev/null -s http://service2.namespace2.svc.cluster.local
Next Steps
After mastering request tracing:
- View DORA Metrics in DevLake - Analyze deployment performance
- Configure Alerts - Set up proactive monitoring
- Optimize Performance - Use trace data to improve latency
- Instrument Custom Spans - Add custom tracing
Related Documentation
- Distributed Tracing with OpenTelemetry and Tempo - Architecture and setup
- ADR-013: Distributed Tracing - Technical decisions
- Centralized Logging - Trace-to-logs correlation
- Brown Belt Module: Observability - Hands-on training