ADR-013: Distributed Tracing for Platform and Applications
Status
Accepted
Context
The Fawkes platform consists of multiple interconnected services where a single user request may traverse numerous components:
Platform Request Flows:
- Developer Portal Access: User → NGINX Ingress → Backstage → PostgreSQL → GitHub API → ArgoCD API
- CI/CD Pipeline: Git Push → Jenkins Webhook → Jenkins Build → Harbor Push → ArgoCD Sync → Kubernetes Deployment
- Dojo Learning: User → Backstage → Lab Provisioning Service → Terraform → AWS API → Kubernetes API
- Collaboration: User → Mattermost → PostgreSQL → S3 (file storage) → Elasticsearch (search)
- Deployment: Developer → ArgoCD UI → Kubernetes API → Application Pods → Database
Application Request Flows:
- Microservice architectures with service-to-service calls
- Database queries across multiple services
- External API integrations
- Message queue processing
- Asynchronous job execution
Troubleshooting Challenges Without Tracing:
- Latency Attribution: "Why is this request slow?" - Which service is the bottleneck?
- Error Root Cause: "Where did this error originate?" - Which service in the chain failed?
- Dependency Mapping: "What services does this request touch?" - Understanding call paths
- Performance Optimization: "Which database queries are slow?" - Query-level visibility
- Cascading Failures: "Why are all services degraded?" - Tracing failure propagation
- Cross-Team Debugging: Multiple teams own services in a request path
DORA Metrics Requirements:
- Lead Time for Changes: Trace deployment pipeline from commit to production
- Change Failure Rate: Correlate failed deployments with application errors
- Mean Time to Recovery: Quickly identify root cause of incidents
- Deployment Frequency: Understand deployment pipeline performance
Technical Requirements:
- Support for multiple programming languages (Java, Python, Node.js, Go)
- Low overhead (<5% CPU, <100MB memory per service)
- Sampling strategies to control data volume
- Integration with existing observability (metrics, logs)
- Correlation IDs linking traces to logs
- Support for synchronous and asynchronous operations
- Trace context propagation across HTTP, gRPC, message queues
- Long-term trace storage for trend analysis
- Real-time trace querying for active troubleshooting
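To make the context-propagation requirement concrete: OTLP-based stacks propagate trace context between services via the W3C Trace Context `traceparent` header, which has the form `version-traceid-spanid-flags`. A stdlib-only illustrative sketch (the function name and validation details are ours, but the header format follows the W3C spec):

```python
import re

# W3C Trace Context "traceparent": version-traceid-spanid-flags
# e.g. 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
TRACEPARENT_RE = re.compile(
    r"^(?P<version>[0-9a-f]{2})-"
    r"(?P<trace_id>[0-9a-f]{32})-"
    r"(?P<span_id>[0-9a-f]{16})-"
    r"(?P<flags>[0-9a-f]{2})$"
)

def parse_traceparent(header: str) -> dict:
    """Split a traceparent header into its components, or raise ValueError."""
    m = TRACEPARENT_RE.match(header.strip())
    if not m:
        raise ValueError(f"malformed traceparent: {header!r}")
    parts = m.groupdict()
    # An all-zero trace-id or span-id is invalid per the spec.
    if set(parts["trace_id"]) == {"0"} or set(parts["span_id"]) == {"0"}:
        raise ValueError("all-zero trace or span id")
    return parts

hdr = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
ctx = parse_traceparent(hdr)
print(ctx["trace_id"])  # 4bf92f3577b34da6a3ce929d0e0e4736
```

In practice the OpenTelemetry SDKs handle this parsing and injection automatically; the sketch only shows what "context propagation across HTTP, gRPC, message queues" means on the wire.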
Operational Requirements:
- Automatic instrumentation where possible (minimal code changes)
- Manual instrumentation for custom spans
- Scalable backend (handle 100K+ spans/second)
- Data retention policy (7 days detailed, 30 days sampled, 90 days aggregated)
- Multi-tenancy (namespace/team isolation)
- Integration with Grafana for visualization
- Alert on trace-based SLIs (P95 latency, error rates)
Security Requirements:
- Sensitive data scrubbing (PII, credentials)
- Access control for trace data
- Encryption in transit and at rest
- Compliance with data retention policies
Learning & Dojo Requirements:
- Learners should understand distributed tracing concepts
- Hands-on labs demonstrating tracing implementation
- Troubleshooting exercises using traces
- Integration with Brown Belt (Observability & SRE) curriculum
Decision
We will use Grafana Tempo as the distributed tracing backend, integrated with OpenTelemetry for instrumentation and trace collection.
Architecture
┌──────────────────────────────────────────────────────────────────┐
│ Applications & Platform Services │
│ ┌────────────┐ ┌────────────┐ ┌────────────┐ ┌──────────┐ │
│ │ Backstage │ │ Jenkins │ │ ArgoCD │ │ Custom │ │
│ │ (Node.js) │ │ (Java) │ │ (Go) │ │ Apps │ │
│ └─────┬──────┘ └─────┬──────┘ └─────┬──────┘ └────┬─────┘ │
│ │ │ │ │ │
│ │ OpenTelemetry SDK instrumentation │ │
│ └───────────────┴───────────────┴──────────────┘ │
│ │ │
└──────────────────────────────┼───────────────────────────────────┘
│ OTLP (gRPC/HTTP)
│
┌──────────────────────────────┼───────────────────────────────────┐
│ Kubernetes Cluster │ │
│ ▼ │
│ ┌────────────────────────────────────────────────┐ │
│ │ OpenTelemetry Collector (DaemonSet/Deployment) │ │
│ │ - Receives traces via OTLP │ │
│ │ - Batching and buffering │ │
│ │ - Sampling strategies │ │
│ │ - Tail-based sampling (intelligent) │ │
│ │ - Attribute processing and enrichment │ │
│ │ - Sensitive data scrubbing │ │
│ └──────────────────────┬─────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────┐ │
│ │ Grafana Tempo (Trace Storage) │ │
│ │ - Object storage backend (S3/MinIO) │ │
│ │ - Block-based columnar storage │ │
│ │ - TraceQL query language │ │
│ │ - Multi-tenancy support │ │
│ └──────────────────────┬───────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────┐ │
│ │ Grafana (Visualization) │ │
│ │ - Trace search and visualization │ │
│ │ - Trace-to-logs correlation │ │
│ │ - Trace-to-metrics correlation │ │
│ │ - Service dependency graphs │ │
│ │ - RED metrics from traces │ │
│ └──────────────────────────────────────┘ │
└───────────────────────────────────────────────────────────────────┘
┌───────────────────────────────────────────────────────────────────┐
│ Integration with Observability Stack │
│ │
│ Tempo ←→ Prometheus (Exemplars link metrics to traces) │
│ Tempo ←→ Loki (Trace IDs in logs for correlation) │
│ Tempo ←→ Grafana (Unified visualization) │
└───────────────────────────────────────────────────────────────────┘
Technology Stack
Instrumentation: OpenTelemetry SDKs
- Java: opentelemetry-java-instrumentation (auto-instrumentation agent)
- Python: opentelemetry-distro with opentelemetry-instrumentation
- Node.js: @opentelemetry/sdk-node with auto-instrumentation
- Go: go.opentelemetry.io/otel with manual instrumentation
Collection: OpenTelemetry Collector
- Deployed as DaemonSet (node-level collection)
- Deployed as Deployment (centralized processing)
- OTLP receivers (gRPC and HTTP)
- Tail-based sampling processor
- Batch processor for efficiency
Storage: Grafana Tempo
- Backend: S3-compatible object storage (AWS S3, MinIO)
- Retention: 7 days full detail, 30 days sampled
- Compression: Snappy/LZ4 for cost efficiency
- Ingestion rate: 100K+ spans/second
Visualization: Grafana
- Tempo data source integration
- TraceQL query builder
- Node graph visualization
- Trace comparison tools
OpenTelemetry Instrumentation Strategy
Automatic Instrumentation (Preferred for rapid adoption):
Java Applications (Jenkins plugins, Spring Boot apps):
# Download OpenTelemetry Java agent
wget https://github.com/open-telemetry/opentelemetry-java-instrumentation/releases/latest/download/opentelemetry-javaagent.jar
# Add to JVM arguments
java -javaagent:opentelemetry-javaagent.jar \
-Dotel.service.name=jenkins \
-Dotel.exporter.otlp.endpoint=http://otel-collector:4317 \
-jar jenkins.war
Python Applications (FastAPI, Django):
# Install OpenTelemetry distro
pip install opentelemetry-distro opentelemetry-exporter-otlp
# Auto-instrument
opentelemetry-bootstrap -a install
opentelemetry-instrument \
--service_name my-python-app \
--exporter_otlp_endpoint http://otel-collector:4317 \
python app.py
Node.js Applications (Backstage, Express):
// app.js - Add at the very top
const { NodeSDK } = require("@opentelemetry/sdk-node");
const { OTLPTraceExporter } = require("@opentelemetry/exporter-trace-otlp-http");
const { getNodeAutoInstrumentations } = require("@opentelemetry/auto-instrumentations-node");

const sdk = new NodeSDK({
  serviceName: "backstage",
  traceExporter: new OTLPTraceExporter({
    url: "http://otel-collector:4318/v1/traces",
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});
sdk.start();
Manual Instrumentation (For custom spans):
Go Example (ArgoCD, custom controllers):
import (
    "context"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/codes"
    "go.opentelemetry.io/otel/trace"
)

func processDeployment(ctx context.Context, app string) error {
    tracer := otel.Tracer("argocd")
    ctx, span := tracer.Start(ctx, "process-deployment",
        trace.WithAttributes(
            attribute.String("app.name", app),
            attribute.String("namespace", "default"),
        ),
    )
    defer span.End()

    // Business logic here
    if err := syncApplication(ctx, app); err != nil {
        span.RecordError(err)
        span.SetStatus(codes.Error, err.Error())
        return err
    }
    return nil
}
OpenTelemetry Collector Configuration
DaemonSet Deployment (for application traces):
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-config
  namespace: fawkes-observability
data:
  config.yaml: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318

    processors:
      batch:
        timeout: 10s
        send_batch_size: 1024

      # Tail-based sampling - keep all errors, sample successes
      tail_sampling:
        decision_wait: 10s
        num_traces: 50000  # traces held in memory awaiting a decision
        expected_new_traces_per_sec: 1000
        policies:
          - name: errors
            type: status_code
            status_code: {status_codes: [ERROR]}
          - name: slow-requests
            type: latency
            latency: {threshold_ms: 1000}
          - name: probabilistic-sampling
            type: probabilistic
            probabilistic: {sampling_percentage: 10}

      # Add Kubernetes metadata
      k8sattributes:
        auth_type: "serviceAccount"
        passthrough: false
        extract:
          metadata:
            - k8s.pod.name
            - k8s.pod.uid
            - k8s.deployment.name
            - k8s.namespace.name
            - k8s.node.name

      # Scrub sensitive data
      attributes:
        actions:
          - key: http.request.header.authorization
            action: delete
          - key: http.request.header.cookie
            action: delete
          - key: db.statement
            action: hash

    exporters:
      otlp:
        endpoint: tempo:4317
        tls:
          insecure: true
      # Also export to Prometheus for exemplars
      prometheus:
        endpoint: 0.0.0.0:8889

    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [k8sattributes, tail_sampling, attributes, batch]
          exporters: [otlp]
        metrics:
          receivers: [otlp]
          processors: [batch]
          exporters: [prometheus]
Grafana Tempo Configuration
Deployment Manifest:
apiVersion: v1
kind: ConfigMap
metadata:
  name: tempo-config
  namespace: fawkes-observability
data:
  tempo.yaml: |
    server:
      http_listen_port: 3200

    distributor:
      receivers:
        otlp:
          protocols:
            grpc:
              endpoint: 0.0.0.0:4317
            http:
              endpoint: 0.0.0.0:4318

    ingester:
      trace_idle_period: 10s
      max_block_bytes: 1_000_000
      max_block_duration: 5m

    compactor:
      compaction:
        block_retention: 168h  # 7 days

    storage:
      trace:
        backend: s3
        s3:
          bucket: fawkes-tempo-traces
          endpoint: s3.amazonaws.com
          region: us-east-1
        wal:
          path: /var/tempo/wal
        pool:
          max_workers: 100
          queue_depth: 10000

    overrides:
      defaults:
        metrics_generator:
          processors: [service-graphs, span-metrics]

    metrics_generator:
      storage:
        path: /var/tempo/generator/wal
        remote_write:
          - url: http://prometheus:9090/api/v1/write
            send_exemplars: true
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: tempo
  namespace: fawkes-observability
spec:
  serviceName: tempo
  replicas: 3
  selector:
    matchLabels:
      app: tempo
  template:
    metadata:
      labels:
        app: tempo
    spec:
      containers:
        - name: tempo
          image: grafana/tempo:latest  # pin a specific version in production
          args:
            - -config.file=/etc/tempo/tempo.yaml
          ports:
            - containerPort: 3200
              name: http
            - containerPort: 4317
              name: otlp-grpc
            - containerPort: 4318
              name: otlp-http
          volumeMounts:
            - name: config
              mountPath: /etc/tempo
            - name: storage
              mountPath: /var/tempo
          resources:
            requests:
              cpu: 500m
              memory: 1Gi
            limits:
              cpu: 2000m
              memory: 4Gi
      volumes:
        - name: config
          configMap:
            name: tempo-config
  volumeClaimTemplates:
    - metadata:
        name: storage
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 100Gi
Trace Correlation with Logs
Log Entry with Trace Context:
{
"timestamp": "2024-12-07T10:30:45Z",
"level": "ERROR",
"message": "Failed to sync application",
"service": "argocd",
"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
"span_id": "00f067aa0ba902b7",
"trace_flags": "01",
"namespace": "production",
"app_name": "payment-service"
}
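As a hedged sketch of how an application could emit such a log line, here is a small stdlib-only JSON formatter that attaches trace context when present (the class name and field handling are illustrative, not a prescribed library; OpenTelemetry logging instrumentation can inject these fields automatically):

```python
import json
import logging
from datetime import datetime, timezone

class TraceContextFormatter(logging.Formatter):
    """Render log records as JSON, attaching trace context when present."""
    def format(self, record):
        entry = {
            "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "message": record.getMessage(),
            "service": getattr(record, "service", "unknown"),
        }
        # Trace fields arrive via logging's `extra` mechanism.
        for field in ("trace_id", "span_id", "trace_flags"):
            value = getattr(record, field, None)
            if value is not None:
                entry[field] = value
        return json.dumps(entry)

logger = logging.getLogger("argocd")
handler = logging.StreamHandler()
handler.setFormatter(TraceContextFormatter())
logger.addHandler(handler)
logger.error(
    "Failed to sync application",
    extra={"service": "argocd",
           "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
           "span_id": "00f067aa0ba902b7",
           "trace_flags": "01"},
)
```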
Loki Configuration for Trace Correlation:
# Grafana data source provisioning
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    uid: loki
    url: http://loki:3100
    jsonData:
      derivedFields:
        - datasourceUid: tempo
          # Match the trace_id field in the JSON log lines shown above
          matcherRegex: '"trace_id":\s*"(\w+)"'
          name: TraceID
          url: "$${__value.raw}"
Grafana Dashboard Configuration
Service Dependency Graph:
- Automatically generated from trace data
- Shows request flow between services
- Color-coded by error rate
- Size proportional to request volume
TraceQL Queries:
Find slow database queries:
{span.db.system="postgresql" && duration > 1s}
Find all traces with errors in production:
{status=error && resource.k8s.namespace.name="production"}
Find traces for a specific user (assumes the app sets a user.id span attribute):
{resource.service.name="backstage" && span.http.route=~"/api/catalog/.*" && span.user.id="alice@example.com"}
Deployment trace (commit to production):
{span.git.commit.sha="abc123def" && (resource.service.name="jenkins" || resource.service.name="argocd")}
Sampling Strategies
Head-Based Sampling (at application):
- 100% of errors
- 100% of requests > 1 second
- 10% of successful requests < 1 second
Tail-Based Sampling (at collector):
- Keep all traces with errors
- Keep all traces with latency > P95
- Keep traces matching specific criteria (user ID, request path)
- Sample remaining traces at 10%
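The tail-based policy order above can be sketched as a simple decision function (illustrative Python only; in production the collector's tail_sampling processor makes this decision per completed trace, and the user.id criterion is a hypothetical example of "specific criteria"):

```python
import random

def keep_trace(status: str, duration_ms: float, attributes: dict,
               p95_ms: float = 1000.0, sample_rate: float = 0.10,
               rng=random.random) -> bool:
    """Tail-sampling decision: errors and slow traces are always kept,
    traces matching specific criteria are kept, the rest are sampled."""
    if status == "ERROR":
        return True                   # keep all traces with errors
    if duration_ms > p95_ms:
        return True                   # keep all traces slower than P95
    if attributes.get("user.id") in {"alice@example.com"}:
        return True                   # keep traces matching specific criteria
    return rng() < sample_rate        # sample the remainder at 10%
```

Because the decision is made after the whole trace is assembled (decision_wait in the collector config), errors and slow requests are never lost to random sampling.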
Cost Optimization:
- Production: 10% sampling → ~50GB/day traces
- Staging: 50% sampling → ~20GB/day traces
- Development: 100% sampling → ~5GB/day traces
DORA Metrics Integration
Lead Time for Changes Tracing:
- Git commit event → creates trace context
- Jenkins build → child span with commit SHA
- Docker build → child span with image tag
- Harbor push → child span with artifact metadata
- ArgoCD sync → child span with deployment details
- Application start → final span with health check
Query to calculate lead time:
{span.git.commit.sha="abc123"} >> {name="deploy-to-production"}
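Conceptually, lead time falls out of the trace as the gap between the commit-tagged span and the end of the production deployment span. A minimal sketch under that assumption (span names and the dict shape are illustrative, not a Tempo API):

```python
from datetime import datetime

def lead_time_seconds(spans: list) -> float:
    """Lead time for changes: from the earliest commit-tagged span start
    to the end of the deploy-to-production span."""
    commit_starts = [s["start"] for s in spans
                     if "git.commit.sha" in s.get("attributes", {})]
    deploy_ends = [s["end"] for s in spans
                   if s["name"] == "deploy-to-production"]
    if not commit_starts or not deploy_ends:
        raise ValueError("trace does not cover commit-to-production")
    return (max(deploy_ends) - min(commit_starts)).total_seconds()

spans = [
    {"name": "jenkins-build", "attributes": {"git.commit.sha": "abc123"},
     "start": datetime(2024, 12, 7, 10, 0), "end": datetime(2024, 12, 7, 10, 5)},
    {"name": "deploy-to-production", "attributes": {},
     "start": datetime(2024, 12, 7, 10, 20), "end": datetime(2024, 12, 7, 10, 25)},
]
print(lead_time_seconds(spans))  # 1500.0 (25 minutes)
```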
Change Failure Rate:
- Trace deployments with deployment.status=failed
- Correlate with application error traces
- Generate failure rate dashboard
Mean Time to Recovery:
- Incident start → trace ID in alert
- Trace investigation → linked troubleshooting actions
- Incident resolution → final span
- Calculate MTTR from trace duration
Security & Privacy
Sensitive Data Scrubbing:
# OpenTelemetry Collector processors
processors:
  attributes:
    actions:
      # Remove authorization headers
      - key: http.request.header.authorization
        action: delete
      # Hash SQL statements (preserve structure, hide values)
      - key: db.statement
        action: hash
      # Redact email addresses
      - key: user.email
        action: update
        value: "REDACTED"
  # Mask credit card numbers wherever they appear (regex-based masking
  # is handled by the redaction processor, not the attributes processor)
  redaction:
    allow_all_keys: true
    blocked_values:
      - '\d{4}-\d{4}-\d{4}-\d{4}'
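The same scrubbing rules can be expressed as a small function; this is an illustrative Python sketch mirroring the processor actions above, not the collector's implementation (the collector's hash action has its own algorithm; SHA-256 here is an assumption for the sketch):

```python
import hashlib
import re

CARD_RE = re.compile(r"\d{4}-\d{4}-\d{4}-\d{4}")

def scrub(attrs: dict) -> dict:
    """Apply delete / hash / redact rules to a span's attributes."""
    out = dict(attrs)
    # delete: drop credentials outright
    for key in ("http.request.header.authorization",
                "http.request.header.cookie"):
        out.pop(key, None)
    # hash: preserve equality of SQL statements without exposing values
    if "db.statement" in out:
        out["db.statement"] = hashlib.sha256(
            out["db.statement"].encode()).hexdigest()
    # redact: replace PII with fixed placeholders
    if "user.email" in out:
        out["user.email"] = "REDACTED"
    if "http.url" in out:
        out["http.url"] = CARD_RE.sub("XXXX-XXXX-XXXX-XXXX", out["http.url"])
    return out
```

Hashing rather than deleting db.statement keeps identical queries groupable in TraceQL while hiding literal values.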
Access Control:
- Grafana RBAC for trace viewing
- Namespace-based trace isolation
- Tempo multi-tenancy (tenant ID from namespace)
Performance Impact
Benchmarks (per-service overhead):
- CPU: 2-5% increase
- Memory: 50-100MB increase
- Network: ~1KB per span (compressed)
- Latency: <1ms per instrumented operation
Production Optimizations:
- Use tail-based sampling
- Batch span exports (10s intervals)
- Compress spans before export
- Use gRPC for OTLP (more efficient than HTTP)
Consequences
Positive
- Root Cause Analysis: Quickly identify bottlenecks across distributed services
- Performance Optimization: Data-driven latency improvements
- Dependency Visualization: Automatic service dependency mapping
- Error Attribution: Precise identification of failing services
- DORA Metrics: End-to-end deployment pipeline visibility
- Cross-Team Collaboration: Shared visibility into request flows
- Unified Observability: Traces linked to metrics and logs
- Cost-Effective: Tempo uses object storage (~$23/TB/month vs. $100+ for commercial solutions)
- Vendor-Neutral: OpenTelemetry standard, not locked to Tempo
- Learning-Friendly: Clear visualization helps dojo learners understand distributed systems
Negative
- Learning Curve: Teams must understand tracing concepts and instrumentation
- Storage Costs: Trace data requires significant storage (~50GB/day for production)
- Instrumentation Effort: Applications require SDK integration
- Sampling Complexity: Tail-based sampling configuration requires tuning
- Performance Overhead: 2-5% CPU overhead per service
- Cardinality Challenges: High-cardinality attributes can degrade performance
Neutral
- Tempo Maturity: Tempo is newer than Jaeger/Zipkin but rapidly maturing
- TraceQL Learning: New query language to learn (though similar to LogQL/PromQL)
- Object Storage Dependency: Requires S3-compatible storage
Alternatives Considered
Alternative 1: Jaeger
Pros:
- CNCF graduated project, very mature
- Excellent UI with service dependency graphs
- Strong community and documentation
- Battle-tested in production
- Built-in sampling strategies
- Supports multiple storage backends (Cassandra, Elasticsearch, Badger)
Cons:
- Higher operational complexity (requires Elasticsearch or Cassandra for scale)
- Higher storage costs (~5x more than Tempo for equivalent data)
- Less integration with Grafana (requires separate UI)
- No native metrics generation from traces
- Elasticsearch/Cassandra adds infrastructure overhead
Reason for Rejection: Operational complexity and storage costs. Tempo's integration with Grafana provides unified observability, and object storage backend is significantly cheaper than Elasticsearch. Jaeger's maturity is valuable, but Tempo's simplicity better suits Fawkes' goals.
Alternative 2: Zipkin
Pros:
- Original distributed tracing system, very mature
- Simple architecture, easy to deploy
- Low resource requirements
- Multiple language SDK support
- Compatible with OpenTelemetry
Cons:
- Less feature-rich than modern alternatives
- No native Grafana integration
- Limited query capabilities
- In-memory storage default (poor retention)
- Requires Elasticsearch for production (added complexity)
- Smaller active community compared to CNCF projects
Reason for Rejection: While simple, Zipkin lacks modern features like TraceQL queries, Grafana integration, and metrics generation. Tempo provides better long-term value with similar deployment simplicity.
Alternative 3: AWS X-Ray
Pros:
- Fully managed service (no infrastructure to maintain)
- Native AWS integration (Lambda, ECS, EC2)
- Automatic instrumentation for AWS services
- Low operational overhead
- Pay-per-use pricing
Cons:
- Cloud vendor lock-in (AWS only)
- No multi-cloud support
- Limited customization
- Higher costs at scale (~$5/million traces)
- Cannot run on-premises or in dojo labs
- Separate UI from other observability tools
Reason for Rejection: Violates Fawkes' cloud-agnostic principle. Learners need portable skills, not cloud-specific tools. X-Ray's managed benefits don't outweigh lock-in costs.
Alternative 4: Elastic APM
Pros:
- Integrated with Elastic Stack (logs, metrics, traces in one UI)
- Excellent UI and visualization
- Strong Java and Node.js support
- Machine learning for anomaly detection
- Good documentation
Cons:
- Requires Elasticsearch cluster (high resource usage)
- Complex scaling and tuning
- Higher costs (compute + storage)
- OpenTelemetry support is secondary (Elastic APM agents preferred)
- Not CNCF/vendor-neutral
Reason for Rejection: Elasticsearch operational complexity and costs. Fawkes already uses Grafana for visualization, so Elastic APM adds redundancy. Tempo + Grafana provides equivalent value with lower operational burden.
Alternative 5: Lightstep / Honeycomb / Datadog APM (Commercial SaaS)
Pros:
- Best-in-class UX and query capabilities
- Advanced sampling and trace analysis
- Fully managed (zero operational burden)
- Superior support and documentation
- Advanced features (BubbleUp, trace comparison, service catalog)
Cons:
- Very high costs ($100-500/month per host)
- SaaS-only (not self-hosted)
- Data sent to third-party
- Cannot use in air-gapped or on-premises environments
- Not suitable for learner environments (cost prohibitive)
Reason for Rejection: Cost prohibitive for open-source platform. Learners cannot use these tools without organization sponsorship. Self-hosted Tempo provides 90% of functionality at 5% of cost.
Alternative 6: No Distributed Tracing (Logs + Metrics Only)
Pros:
- Lower complexity
- Smaller operational footprint
- Existing tools (Loki, Prometheus) sufficient
- No additional instrumentation required
Cons:
- Cannot trace requests across services (critical gap)
- Debugging distributed systems is extremely difficult
- No service dependency visualization
- Cannot calculate accurate DORA lead time metrics
- Poor learner experience (can't see end-to-end flows)
Reason for Rejection: Distributed tracing is essential for modern microservices platforms. DORA State of DevOps research shows observability (including tracing) strongly correlates with elite performance. Omitting tracing would cripple platform effectiveness.
Implementation Plan
Phase 1: Foundation (Week 6, Days 1-2)
Day 1: OpenTelemetry Collector Deployment [4 hours]
- Deploy OpenTelemetry Collector as DaemonSet
- Configure OTLP receivers (gRPC + HTTP)
- Set up batch and tail-sampling processors
- Test with sample trace data
- Verify Prometheus metrics export
Day 2: Grafana Tempo Deployment [4 hours]
- Deploy Tempo StatefulSet (3 replicas)
- Configure S3/MinIO backend storage
- Set up retention policies
- Configure Grafana data source
- Test trace ingestion and querying
Phase 2: Platform Service Instrumentation (Week 6, Days 3-5)
Day 3: Backstage Tracing [3 hours]
- Add OpenTelemetry Node.js SDK to Backstage
- Configure auto-instrumentation
- Test trace collection for catalog API calls
- Add custom spans for plugin operations
- Verify Grafana visualization
Day 4: Jenkins & ArgoCD Tracing [4 hours]
- Jenkins: Add OpenTelemetry Java agent to JVM
- Configure trace export for pipeline execution
- ArgoCD: Manual Go instrumentation for sync operations
- Test deployment trace (Jenkins → ArgoCD)
- Create Grafana dashboard for CI/CD traces
Day 5: Ingress & Database Tracing [3 hours]
- NGINX Ingress: Configure trace propagation headers
- PostgreSQL: Add pg_stat_statements for query tracing
- Test end-to-end trace (User → NGINX → Backstage → PostgreSQL)
- Validate trace-to-log correlation
Phase 3: Application Instrumentation Templates (Week 7, Days 1-2)
Day 1: Create Language-Specific Templates [4 hours]
- Java Spring Boot template with OTel auto-instrumentation
- Python FastAPI template with OTel SDK
- Node.js Express template with OTel SDK
- Go template with manual instrumentation
- Document instrumentation patterns
Day 2: Golden Path Integration [4 hours]
- Update Backstage templates with OTel dependencies
- Add Dockerfile entries for OTel agents
- Update Helm charts with OTel environment variables
- Create CI/CD pipeline checks for instrumentation
- Test end-to-end application deployment with tracing
Phase 4: Observability Integration (Week 7, Days 3-5)
Day 3: Trace-to-Metrics Integration [3 hours]
- Configure Tempo metrics generator
- Send exemplars to Prometheus
- Create Grafana dashboard linking metrics to traces
- Add "View Trace" links from Prometheus alerts
Day 4: Trace-to-Logs Integration [3 hours]
- Configure Loki derived fields for trace IDs
- Update application logging to include trace context
- Test correlation (click trace ID in logs → opens trace in Tempo)
- Create unified dashboard with logs + traces
Day 5: DORA Metrics Tracing [4 hours]
- Instrument deployment pipeline with trace context
- Track commit SHA through build → deploy → release
- Calculate lead time from traces
- Create DORA dashboard with trace links
Phase 5: Performance & Optimization (Week 8, Days 1-2)
Day 1: Sampling Optimization [3 hours]
- Analyze trace volumes and costs
- Tune tail-based sampling policies
- Implement adaptive sampling based on load
- Verify sample representativeness
- Document sampling strategies
Day 2: Performance Testing [3 hours]
- Load test with tracing enabled (measure overhead)
- Benchmark trace ingestion rates
- Test Tempo query performance
- Optimize collector configurations
- Document performance baseline
Phase 6: Documentation & Training (Week 8, Days 3-5)
Day 3: Platform Documentation [4 hours]
- Architecture overview with data flow diagrams
- Instrumentation guide for each language
- TraceQL query examples and cookbook
- Troubleshooting guide
- Runbook for Tempo operations
Day 4: Dojo Module - Brown Belt [4 hours]
- Module: "Distributed Tracing for Microservices"
- Theory: Trace context, spans, sampling
- Hands-on lab: Instrument sample app, query traces
- Troubleshooting exercise: Debug slow request
- Assessment quiz on tracing concepts
Day 5: Dashboard & Playbooks [4 hours]
- Create standard Grafana dashboards:
- Service dependency graph
- Request latency heatmap
- Error rate by service
- Deployment trace timeline
- Create troubleshooting playbooks
- Document common trace patterns
- Create video walkthrough (15 minutes)
Dojo Integration
Brown Belt - Module 6: "Distributed Tracing & Request Flow Analysis"
Learning Objectives:
- Understand distributed tracing concepts (spans, traces, context propagation)
- Implement OpenTelemetry instrumentation in applications
- Query traces using TraceQL
- Correlate traces with metrics and logs
- Debug performance issues using distributed tracing
- Calculate DORA lead time metrics from traces
Hands-On Lab (90 minutes):
Part 1: Instrument a Microservice (30 min)
- Deploy sample 3-tier app (frontend → API → database)
- Add OpenTelemetry SDK to each service
- Configure trace export to Tempo
- Generate traffic and view traces in Grafana
- Observe request flow across services
Part 2: Advanced Querying (30 min)
- Write TraceQL queries to find:
- Slow database queries
- Requests with errors
- Specific user journeys
- Deployment traces
- Create custom Grafana dashboard
- Set up trace-based alerts
Part 3: Troubleshooting Exercise (30 min)
- Scenario: Application experiencing intermittent slowness
- Task: Use traces to identify:
- Which service is the bottleneck
- Slow database queries
- External API latency
- Network issues
- Deliverable: Root cause analysis report with trace evidence
Assessment:
- Quiz: 10 questions on tracing concepts
- Practical: Instrument new service and create dashboard
- Troubleshooting: Debug broken trace (missing context propagation)
Time: 2+ hours (30 min theory + 90 min hands-on, plus assessment)
Monitoring & Observability
Tempo Health Metrics
Key Metrics to Monitor:
- tempo_ingester_bytes_received_total - Trace ingestion rate
- tempo_ingester_blocks_flushed_total - Block flush rate
- tempo_query_frontend_result_metrics_inspected_bytes - Query performance
- tempo_distributor_spans_received_total - Span reception rate
- tempo_compactor_blocks_compacted_total - Compaction health
Grafana Dashboard: "Tempo Operations"
- Ingestion rate by tenant
- Query latency (P50, P95, P99)
- Storage usage and growth
- Compaction lag
- Error rates
Alerting Rules:
groups:
  - name: tempo_alerts
    rules:
      - alert: TempoHighIngestionErrors
        expr: rate(tempo_distributor_spans_received_total{status="error"}[5m]) > 100
        for: 10m
        annotations:
          summary: "High trace ingestion error rate"
      - alert: TempoHighQueryLatency
        expr: histogram_quantile(0.95, rate(tempo_query_frontend_duration_seconds_bucket[5m])) > 10
        for: 5m
        annotations:
          summary: "Tempo query latency P95 > 10s"
      - alert: TempoStorageUsageHigh
        expr: tempo_ingester_bytes_metric_total > 100e9  # 100GB
        annotations:
          summary: "Tempo ingester storage usage high"
OpenTelemetry Collector Health
Key Metrics:
- otelcol_receiver_accepted_spans - Spans received
- otelcol_receiver_refused_spans - Spans refused (backpressure)
- otelcol_processor_batch_batch_send_size - Batch sizes
- otelcol_exporter_sent_spans - Spans successfully exported
- otelcol_exporter_send_failed_spans - Export failures
Dashboard: "OpenTelemetry Collector Health"
- Span throughput (in/out)
- Processor queue depth
- Export success/failure rates
- Memory usage by component
Security Considerations
Data Privacy
Sensitive Data Scrubbing:
- Remove authorization headers
- Hash SQL statements
- Redact PII from URLs and headers
- Obfuscate API keys in trace attributes
Access Control:
- Grafana RBAC for trace viewing
- Namespace-based trace isolation (teams only see their traces)
- Audit logging for trace access
Compliance
Data Retention:
- 7 days detailed traces (compliance with GDPR "right to be forgotten")
- 30 days sampled traces
- Automated deletion after retention period
Encryption:
- TLS for all trace transmission (OTLP over gRPC/HTTPS)
- S3 server-side encryption for stored traces
- No plaintext credentials in trace attributes
Cost Analysis
Storage Costs (Production)
Assumptions:
- 100 services generating traces
- 1,000 requests/second average per service
- 10 spans per trace average
- 10% sampling rate (tail-based)
- ~1KB per span (uncompressed)
Daily Trace Volume:
- Raw spans: 100 services × 1,000 req/s × 10 spans × 86,400 s = 86.4 billion spans/day
- After sampling: 8.64 billion spans/day
- Storage: 8.64B spans × 1KB = 8.64 TB/day (before compression)
- After compression (10:1): ~864 GB/day
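The arithmetic above can be checked directly (reading 1,000 req/s as per service and the 1KB span size as pre-compression):

```python
# Trace volume estimate matching the assumptions above
services = 100
req_per_sec = 1_000          # per service
spans_per_trace = 10
sampling = 0.10              # tail-based, 10%
span_kb = 1.0                # uncompressed size per span
compression_ratio = 10       # assumed 10:1

raw_spans_per_day = services * req_per_sec * spans_per_trace * 86_400
sampled_spans = raw_spans_per_day * sampling
storage_tb = sampled_spans * span_kb / 1e9   # KB -> TB (decimal units)
compressed_gb = storage_tb * 1e3 / compression_ratio

print(raw_spans_per_day)   # 86400000000 (86.4 billion)
print(storage_tb)          # 8.64 TB/day before compression
print(compressed_gb)       # 864.0 GB/day after compression
```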
Monthly Storage (7-day retention):
- 864 GB/day × 7 days = 6 TB active storage
- S3 Standard: $0.023/GB = $138/month
- S3 Intelligent-Tiering (after 7 days): $0.0125/GB = $75/month
Total Monthly Cost: ~$213/month (vs. $5,000+/month for commercial APM at 100 services)
Cost Optimization:
- Increase sampling rate in non-production (less cost-sensitive)
- Use S3 Lifecycle policies to move old traces to Glacier
- Tune tail-based sampling to focus on valuable traces
- Compress spans aggressively (e.g., zstd where CPU allows, rather than Snappy/LZ4)
Infrastructure Costs
Tempo Pods:
- 3 replicas × 2 CPU × $0.04/CPU/hour = $5.76/day = $173/month
- 3 replicas × 4GB RAM × $0.005/GB/hour = $1.44/day = $43/month
OpenTelemetry Collector:
- DaemonSet (1 per node, 10 nodes): 10 × 0.1 CPU × $0.04 = $0.96/day = $29/month
- DaemonSet memory: 10 × 128MB × $0.005/GB/hour = negligible
Total Infrastructure: ~$245/month
Grand Total: ~$458/month for distributed tracing (100 services, production scale)
Documentation Structure
For Platform Teams
1. Architecture & Design
   - Trace collection flow
   - Sampling strategies explained
   - Storage architecture (S3 layout)
   - Query performance optimization
2. Deployment Guide
   - Helm chart installation (Tempo, OTel Collector)
   - Cloud-specific configurations (AWS, Azure, GCP)
   - Scaling guidelines
   - Backup and disaster recovery
3. Operations Runbook
   - Common troubleshooting scenarios
   - Tempo upgrade procedures
   - Storage management (compaction, cleanup)
   - Performance tuning guide
For Application Teams
1. Instrumentation Guide
   - Language-specific SDKs (Java, Python, Node.js, Go)
   - Auto vs. manual instrumentation
   - Custom span creation
   - Best practices (span naming, attributes)
2. Querying & Troubleshooting
   - TraceQL query cookbook
   - Common debugging patterns
   - Grafana dashboard usage
   - Trace-to-logs/metrics correlation
3. Performance Impact
   - Overhead benchmarks
   - Sampling recommendations
   - Optimization techniques
For Dojo Learners
1. Concepts Tutorial
   - What is distributed tracing?
   - Spans, traces, and context propagation
   - When to use tracing vs. logs/metrics
   - Real-world use cases
2. Hands-On Labs
   - Lab 1: Instrument a simple app
   - Lab 2: Query traces with TraceQL
   - Lab 3: Debug performance issue
   - Lab 4: Trace a deployment pipeline
3. Reference Materials
   - OpenTelemetry SDK quick reference
   - TraceQL cheat sheet
   - Common trace patterns
   - Troubleshooting decision tree
Related Decisions
- ADR-011: Centralized Log Management (Loki) - Trace-to-log correlation via trace IDs
- ADR-012: Metrics Monitoring (Prometheus/Grafana) - Exemplars link metrics to traces
- ADR-002: Backstage for Developer Portal - Primary UI instrumentation
- ADR-004: Jenkins for CI/CD - Pipeline trace instrumentation
- ADR-003: ArgoCD for GitOps - Deployment trace instrumentation
- Future ADR: Service Mesh (Istio) - Alternative instrumentation via sidecar proxies
References
- OpenTelemetry Documentation: https://opentelemetry.io/docs/
- Grafana Tempo Documentation: https://grafana.com/docs/tempo/
- TraceQL Language Reference: https://grafana.com/docs/tempo/latest/traceql/
- CNCF Distributed Tracing Best Practices: https://github.com/cncf/tag-observability
- Distributed Tracing in Practice (O'Reilly book) by Austin Parker et al.
- OpenTelemetry Best Practices: https://opentelemetry.io/docs/concepts/instrumentation/
Notes
Production Readiness Checklist:
- [ ] Tempo deployed with 3+ replicas for HA
- [ ] S3/MinIO backend configured with proper retention
- [ ] OpenTelemetry Collector deployed (DaemonSet + Deployment)
- [ ] Tail-based sampling configured and tested
- [ ] Grafana data source configured with Tempo
- [ ] Trace-to-logs correlation working (Loki derived fields)
- [ ] Trace-to-metrics correlation working (Prometheus exemplars)
- [ ] Sensitive data scrubbing verified
- [ ] Performance benchmarks completed (overhead < 5%)
- [ ] Monitoring dashboards created
- [ ] Alerting rules configured
- [ ] Documentation complete
- [ ] Team trained on querying and troubleshooting
Learner Environment Considerations:
- Use in-memory Tempo backend for short-lived labs
- Pre-instrument sample applications
- Provide TraceQL query examples
- Include broken traces for troubleshooting practice
- Show real-world debugging scenarios
- Integrate with DORA metrics curriculum
Future Enhancements (Post-MVP):
- Service mesh integration (Istio sidecar auto-instrumentation)
- Trace-based SLO monitoring
- Anomaly detection from trace patterns
- Cost attribution per service from trace data
- Automated performance recommendations
- Trace replay for testing
Last Updated
December 7, 2024 - Initial version documenting Grafana Tempo + OpenTelemetry for distributed tracing