Distributed Tracing with OpenTelemetry and Tempo
Overview
Fawkes implements centralized distributed tracing for all platform services and applications using OpenTelemetry for instrumentation and Grafana Tempo for trace storage. This enables end-to-end request visibility across service boundaries, performance analysis, and rapid root cause identification.
Reference: See ADR-013 Distributed Tracing for architectural decisions.
Architecture
┌────────────────────────────────────────────────────────────────────┐
│ Applications & Platform Services │
│ │
│ ┌────────────┐ ┌────────────┐ ┌────────────┐ ┌──────────────┐ │
│ │ Backstage │ │ Jenkins │ │ ArgoCD │ │ Custom Apps │ │
│ │ (Node.js) │ │ (Java) │ │ (Go) │ │ (Any Lang) │ │
│ └─────┬──────┘ └─────┬──────┘ └─────┬──────┘ └──────┬───────┘ │
│ │ │ │ │ │
│ │ OpenTelemetry SDK / Auto-instrumentation │ │
│ └───────────────┴───────────────┴────────────────┘ │
│ │ │
│ │ OTLP (gRPC/HTTP) │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ OpenTelemetry Collector DaemonSet │ │
│ │ - OTLP receiver: Accepts traces from applications │ │
│ │ - K8s attributes: Enriches with pod/namespace metadata │ │
│ │ - Sampling: Configurable probabilistic sampling │ │
│ │ - Security: Scrubs sensitive data (auth headers, etc.) │ │
│ │ - Batching: Efficient export to Tempo │ │
│ └───────────────────────────┬─────────────────────────────────┘ │
│ │ OTLP/gRPC │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ Grafana Tempo │ │
│ │ - Trace storage and querying │ │
│ │ - TraceQL for advanced queries │ │
│ │ - Service dependency graphs │ │
│ │ - Metrics generation (RED metrics from traces) │ │
│ └───────────────────────────┬─────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ Grafana │ │
│ │ - Trace visualization and flame graphs │ │
│ │ - Trace-to-logs correlation (OpenSearch) │ │
│ │ - Trace-to-metrics correlation (Prometheus) │ │
│ │ - Service dependency node graph │ │
│ └─────────────────────────────────────────────────────────────┘ │
└────────────────────────────────────────────────────────────────────┘
Key Features
1. Trace Generation and Collection
The OpenTelemetry Collector receives traces from instrumented applications over OTLP:
- Protocol: OTLP over gRPC (port 4317) and HTTP (port 4318)
- Format: W3C Trace Context standard (traceparent, tracestate headers)
- Enrichment: Automatic Kubernetes metadata (pod, namespace, deployment)
2. Cross-Service Propagation
Traces are automatically propagated across service boundaries using W3C Trace Context:
| Header | Purpose |
|---|---|
| traceparent | Contains trace ID, span ID, and trace flags |
| tracestate | Vendor-specific trace context |
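To make the header layout concrete, the following is a minimal sketch of parsing a traceparent value using only the Python standard library. The function name parse_traceparent is hypothetical and not part of Fawkes or OpenTelemetry; the sketch handles only the version 00 layout (version, trace ID, parent span ID, flags) and is meant for debugging, not as a replacement for the SDK propagators.

```python
import re

# W3C Trace Context, version 00: version-traceid-parentid-flags, lowercase hex.
_TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def parse_traceparent(header):
    """Return the fields of a traceparent header, or None if it is invalid."""
    match = _TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    # Version ff is forbidden; all-zero trace and span IDs are invalid.
    if version == "ff" or set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None
    return {
        "version": version,
        "trace_id": trace_id,
        "span_id": span_id,
        # Bit 0 of the flags byte is the "sampled" flag.
        "sampled": bool(int(flags, 16) & 0x01),
    }
```

Calling parse_traceparent on a header copied from a request log is a quick way to confirm that the trace ID an upstream service sent matches the one shown in Grafana.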
3. Trace-Log Correlation
Application logs carry the identifiers of the active trace, so every log line can be tied back to its trace:
| Attribute | Description |
|---|---|
| trace_id | 32-character hexadecimal trace identifier |
| span_id | 16-character hexadecimal span identifier |
Click on a trace ID in logs to jump directly to the trace in Grafana.
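As an illustration of how the two attributes can end up in log output, here is a small sketch built on the standard logging module. The classes TraceContextFilter and JsonFormatter and the get_context callable are assumptions made for this example; in a real service the IDs would come from the OpenTelemetry SDK (or its logging instrumentation) rather than from a hand-written callable.

```python
import json
import logging


class TraceContextFilter(logging.Filter):
    """Attach trace_id and span_id to every record passing through the logger."""

    def __init__(self, get_context):
        super().__init__()
        # get_context is a hypothetical callable returning (trace_id, span_id)
        # as integers, e.g. read from the currently active span.
        self._get_context = get_context

    def filter(self, record):
        trace_id, span_id = self._get_context()
        record.trace_id = format(trace_id, "032x")  # 32 hex characters
        record.span_id = format(span_id, "016x")    # 16 hex characters
        return True


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line, including the trace identifiers."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "trace_id": record.trace_id,
            "span_id": record.span_id,
        })
```

Attaching the filter to a handler and setting JsonFormatter as its formatter yields log lines that a log store can index by trace_id.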
4. Performance Visibility
Traces include key performance attributes:
- HTTP method, status code, and route
- Database query duration and statement (hashed)
- External service call latency
- Custom span attributes
5. Sampling Strategy
The development environment samples 100% of traces for full visibility. Production can be configured for:
- Head-based sampling: 10% of all requests
- Always sample: Errors and requests slower than 1 second (this needs tail-based sampling, because the outcome is not known when a head-based decision is made)
- Tail-based sampling: Keep interesting traces after collection
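The interaction between these options can be sketched as a single decision function. This is an illustration only: the function names are hypothetical, and the modulo bucketing below is a simplification, not the algorithm used by the collector's probabilistic_sampler or tail sampling processors.

```python
def head_sample(trace_id_hex, percentage):
    """Deterministic head-style decision: keep a fixed share of trace IDs.

    Every service that sees the same trace ID reaches the same decision,
    so a trace is kept or dropped as a whole.
    """
    bucket = int(trace_id_hex, 16) % 10_000  # 0..9999
    return bucket < percentage * 100


def keep_trace(trace_id_hex, has_error, duration_seconds, percentage=10.0):
    """Tail-style decision, made once the whole trace has been observed."""
    # Errors and slow requests are always kept.
    if has_error or duration_seconds > 1.0:
        return True
    # Everything else falls back to the percentage-based decision.
    return head_sample(trace_id_hex, percentage)
```

The error and latency checks only work after the request has finished, which is why they belong in the tail-based stage rather than in the SDK.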
Configuration
OpenTelemetry Collector
The traces pipeline is configured in:
platform/apps/opentelemetry/otel-collector-application.yaml
Key configuration:
# Traces pipeline configuration
traces:
  receivers:
    - otlp
  processors:
    - memory_limiter
    - probabilistic_sampler
    - k8sattributes
    - resourcedetection
    - attributes/traces
    - transform/traces
    - batch/traces
  exporters:
    - otlp/tempo
Grafana Tempo
Tempo is deployed as the trace storage backend:
platform/apps/tempo/tempo-application.yaml
Key features:
- OTLP ingestion on ports 4317 (gRPC) and 4318 (HTTP)
- 7-day trace retention
- Metrics generation for RED metrics
- TraceQL query support
Grafana Data Sources
Grafana is configured with trace correlation:
platform/apps/grafana/helm-release.yml
Data sources configured:
- Tempo: Trace storage and visualization
- Prometheus: Trace-to-metrics correlation
- OpenSearch: Trace-to-logs correlation
Application Instrumentation
Java Applications (Spring Boot, Jenkins)
# Download the OpenTelemetry Java agent
wget https://github.com/open-telemetry/opentelemetry-java-instrumentation/releases/latest/download/opentelemetry-javaagent.jar
# Add to JVM arguments. Agent 2.x defaults to http/protobuf, so select gRPC explicitly for port 4317.
java -javaagent:opentelemetry-javaagent.jar \
  -Dotel.service.name=my-java-app \
  -Dotel.exporter.otlp.protocol=grpc \
  -Dotel.exporter.otlp.endpoint=http://otel-collector.monitoring.svc.cluster.local:4317 \
  -jar my-app.jar
Python Applications (FastAPI, Django)
# Install OpenTelemetry packages
pip install opentelemetry-distro opentelemetry-exporter-otlp
# Install instrumentation for the libraries found in the environment
opentelemetry-bootstrap -a install
# Run the application with auto-instrumentation
opentelemetry-instrument \
  --service_name my-python-app \
  --exporter_otlp_endpoint http://otel-collector.monitoring.svc.cluster.local:4317 \
  python app.py
Node.js Applications (Backstage, Express)
// tracing.js - load this before any other module so auto-instrumentation can patch them
const { NodeSDK } = require("@opentelemetry/sdk-node");
const { OTLPTraceExporter } = require("@opentelemetry/exporter-trace-otlp-grpc");
const { getNodeAutoInstrumentations } = require("@opentelemetry/auto-instrumentations-node");

const sdk = new NodeSDK({
  serviceName: "my-nodejs-app",
  traceExporter: new OTLPTraceExporter({
    // The gRPC exporter expects an http(s):// URL; http means no TLS.
    url: "http://otel-collector.monitoring.svc.cluster.local:4317",
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});
sdk.start();
Go Applications (ArgoCD, Custom Controllers)
import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.17.0"
)

// initTracer configures a global tracer provider that exports spans to the collector over gRPC.
func initTracer() (*sdktrace.TracerProvider, error) {
	exporter, err := otlptracegrpc.New(context.Background(),
		otlptracegrpc.WithEndpoint("otel-collector.monitoring.svc.cluster.local:4317"),
		otlptracegrpc.WithInsecure(), // plaintext inside the cluster
	)
	if err != nil {
		return nil, err
	}
	tp := sdktrace.NewTracerProvider(
		sdktrace.WithBatcher(exporter),
		sdktrace.WithResource(resource.NewWithAttributes(
			semconv.SchemaURL,
			semconv.ServiceName("my-go-app"),
		)),
	)
	otel.SetTracerProvider(tp)
	return tp, nil
}
Usage
Querying Traces in Grafana
Access Grafana at http://grafana.127.0.0.1.nip.io and navigate to Explore → Tempo.
Find traces by service
{resource.service.name="backstage"}
Find slow traces (> 1 second)
{duration > 1s}
Find traces with errors
{status = error}
Find database queries
{span.db.system = "postgresql"}
Find traces by HTTP route
{span.http.route =~ "/api/catalog/.*"}
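The same TraceQL queries can also be run outside Grafana against Tempo's HTTP search endpoint (/api/search with a q parameter). The sketch below uses only the standard library; the helper names build_search_url and search_traces are hypothetical, and the base URL is assumed to be reachable, for example through a port-forward to the Tempo service on port 3200.

```python
import json
import urllib.parse
import urllib.request


def build_search_url(base_url, traceql, limit=20):
    """Build a Tempo search URL for a TraceQL query."""
    params = urllib.parse.urlencode({"q": traceql, "limit": limit})
    return f"{base_url.rstrip('/')}/api/search?{params}"


def search_traces(base_url, traceql, limit=20, timeout=10):
    """Run the query against Tempo and return the decoded JSON response."""
    url = build_search_url(base_url, traceql, limit)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)
```

For example, search_traces("http://localhost:3200", "{duration > 1s}") would return the matching traces as a dictionary, assuming Tempo is reachable at that address.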
Trace-to-Logs Correlation
- Open a trace in Grafana Tempo
- Click on any span
- Click "Logs for this span" to jump to OpenSearch logs
- Logs are filtered by trace ID and time range
Trace-to-Metrics Correlation
- Open a Prometheus dashboard
- Click on a data point with exemplars
- Click "View Trace" to jump to the associated trace
Environment Variables
| Variable | Description | Default |
|---|---|---|
| TEMPO_URL | Tempo API endpoint | http://tempo.monitoring.svc.cluster.local:3200 |
| TEMPO_OTLP_GRPC_ENDPOINT | OTLP gRPC ingestion endpoint | tempo.monitoring.svc.cluster.local:4317 |
| TEMPO_OTLP_HTTP_ENDPOINT | OTLP HTTP ingestion endpoint | http://tempo.monitoring.svc.cluster.local:4318 |
| TRACE_SAMPLING_PERCENTAGE | Sampling rate (1-100) | 100 (development) |
| TRACE_CLUSTER_NAME | Cluster identifier | fawkes-dev |
| TRACE_ENVIRONMENT | Environment label | development |
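For tooling that needs these values, a loader along the following lines may be useful. It is a sketch under stated assumptions: the TraceSettings class and load_trace_settings function are hypothetical helpers, and the defaults are simply copied from the table above.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class TraceSettings:
    tempo_url: str
    otlp_grpc_endpoint: str
    otlp_http_endpoint: str
    sampling_percentage: int
    cluster_name: str
    environment: str


def load_trace_settings(env=None):
    """Read the tracing variables, falling back to the documented defaults."""
    env = os.environ if env is None else env
    percentage = int(env.get("TRACE_SAMPLING_PERCENTAGE", "100"))
    if not 1 <= percentage <= 100:
        raise ValueError("TRACE_SAMPLING_PERCENTAGE must be between 1 and 100")
    return TraceSettings(
        tempo_url=env.get("TEMPO_URL", "http://tempo.monitoring.svc.cluster.local:3200"),
        otlp_grpc_endpoint=env.get(
            "TEMPO_OTLP_GRPC_ENDPOINT", "tempo.monitoring.svc.cluster.local:4317"),
        otlp_http_endpoint=env.get(
            "TEMPO_OTLP_HTTP_ENDPOINT", "http://tempo.monitoring.svc.cluster.local:4318"),
        sampling_percentage=percentage,
        cluster_name=env.get("TRACE_CLUSTER_NAME", "fawkes-dev"),
        environment=env.get("TRACE_ENVIRONMENT", "development"),
    )
```

Passing a plain dictionary instead of relying on os.environ keeps the function easy to exercise in unit tests.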
Monitoring
Tempo Health
Check Tempo health via API:
kubectl port-forward -n monitoring svc/tempo 3200:3200
curl http://localhost:3200/ready
Collector Metrics
The OpenTelemetry Collector exposes metrics at port 8888:
- otelcol_receiver_accepted_spans: Spans received
- otelcol_exporter_sent_spans: Spans exported to Tempo
- otelcol_exporter_send_failed_spans: Export failures
- otelcol_processor_batch_batch_send_size: Batch sizes
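One way to use these counters is to compute the share of spans that failed to export. The sketch below parses the Prometheus text format returned by the metrics endpoint; the function names are hypothetical, and it assumes sample lines carry no trailing timestamp. Some collector releases append a _total suffix to counter names, so both spellings are accepted.

```python
def sum_metric(metrics_text, name):
    """Sum all samples of one metric across its label sets."""
    total = 0.0
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        series, _, value = line.rpartition(" ")
        metric_name = series.split("{", 1)[0]
        if metric_name in (name, name + "_total"):
            total += float(value)
    return total


def export_failure_ratio(metrics_text):
    """Return failed / (sent + failed) spans, or 0.0 if nothing was exported."""
    sent = sum_metric(metrics_text, "otelcol_exporter_sent_spans")
    failed = sum_metric(metrics_text, "otelcol_exporter_send_failed_spans")
    attempted = sent + failed
    return failed / attempted if attempted else 0.0
```

A ratio that stays above zero usually points at connectivity or capacity problems between the collector and Tempo.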
ZPages for Debugging
Access collector internal diagnostics:
kubectl port-forward -n monitoring daemonset/otel-collector 55679:55679
# Open http://localhost:55679/debug/tracez
Troubleshooting
Traces Not Appearing in Tempo
- Check collector pods are running:
kubectl get pods -n monitoring -l app.kubernetes.io/name=opentelemetry-collector
- Check collector logs for export errors:
kubectl logs -n monitoring -l app.kubernetes.io/name=opentelemetry-collector --tail=100 | grep -i error
- Verify Tempo connectivity:
kubectl exec -n monitoring -it <collector-pod> -- wget -qO- http://tempo.monitoring.svc.cluster.local:3200/ready
Missing Kubernetes Attributes
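Ensure the collector service account has RBAC permissions to get, list, and watch pods and namespaces (and replicasets, if deployment names are extracted). The k8sattributes processor needs these permissions to read pod metadata.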
Application Not Sending Traces
- Verify OTEL SDK is properly initialized
- Check the OTEL_EXPORTER_OTLP_ENDPOINT environment variable
- Verify network connectivity to collector (port 4317/4318)
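The connectivity check in the last step can be scripted from inside the application pod. This is a minimal sketch with hypothetical helper names; it only verifies that a TCP connection can be opened, not that the collector accepts OTLP on that port.

```python
import socket
from urllib.parse import urlsplit


def parse_otlp_endpoint(endpoint, default_port=4317):
    """Split an OTLP endpoint into (host, port); the scheme is optional."""
    parts = urlsplit(endpoint if "://" in endpoint else f"//{endpoint}")
    return parts.hostname, parts.port or default_port


def collector_reachable(endpoint, timeout=3.0):
    """Return True if a TCP connection to the endpoint can be opened."""
    host, port = parse_otlp_endpoint(endpoint)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Running collector_reachable(os.environ["OTEL_EXPORTER_OTLP_ENDPOINT"]) (after importing os) checks the exact value the SDK will use.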
High Trace Volume / Cost
- Reduce TRACE_SAMPLING_PERCENTAGE (e.g., 10 for production)
- Configure tail-based sampling for error/latency-focused collection
- Review span cardinality and reduce high-cardinality attributes
Best Practices
Span Naming
- Use semantic, action-based names: GET /api/users/{id}, db.query
- Avoid high-cardinality values in span names (user IDs, timestamps)
- Be consistent across services
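As an illustration of the naming guidance, the sketch below turns a concrete request path into a low-cardinality span name. The span_name helper is hypothetical, and the rule it applies (replace numeric and UUID path segments with {id}) is an assumption; web frameworks that expose the matched route template make this unnecessary.

```python
import re

_UUID = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$", re.I
)


def span_name(method, path):
    """Build a low-cardinality span name such as 'GET /api/users/{id}'."""
    segments = []
    # Drop the query string, then replace ID-like segments with a placeholder.
    for segment in path.split("?", 1)[0].split("/"):
        if segment.isdigit() or _UUID.match(segment):
            segments.append("{id}")
        else:
            segments.append(segment)
    return f"{method.upper()} {'/'.join(segments)}"
```

With this rule, requests for different users share one span name, which keeps service graphs and latency breakdowns readable.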
Span Attributes
- Include relevant context: user.id, order.id, feature.flag
- Avoid sensitive data: passwords, tokens, PII
- Use OpenTelemetry semantic conventions
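A defensive pattern is to scrub attributes in the application before they are attached to a span, in addition to any scrubbing done by the collector. The sketch below is illustrative: the scrub_attributes helper and the SENSITIVE_MARKERS list are assumptions for this example and would need tuning for a real service.

```python
# Substrings that mark an attribute key as sensitive (assumed list).
SENSITIVE_MARKERS = ("password", "secret", "token", "authorization", "cookie")


def scrub_attributes(attributes):
    """Return a copy with the values of sensitive-looking keys replaced."""
    cleaned = {}
    for key, value in attributes.items():
        if any(marker in key.lower() for marker in SENSITIVE_MARKERS):
            cleaned[key] = "[REDACTED]"
        else:
            cleaned[key] = value
    return cleaned
```

The function returns a new dictionary and leaves its input untouched, so it can be applied right before attributes are set on a span.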
Error Handling
// Record the error as a span event, then mark the span itself as failed.
span.RecordError(err)
span.SetStatus(codes.Error, err.Error())
Custom Spans
Create spans for significant operations:
ctx, span := tracer.Start(ctx, "process-order",
trace.WithAttributes(
attribute.String("order.id", orderID),
attribute.Int("order.items", len(items)),
),
)
defer span.End()