Unified Telemetry: The OpenTelemetry Standard
Context
Modern applications emit three types of telemetry data:
- Metrics - Numeric measurements (CPU usage, request count, error rate)
- Logs - Event records (application logs, audit logs)
- Traces - Request flow across services (distributed tracing)
Historically, each observability vendor provided its own agent to collect this data:
- Datadog Agent (proprietary)
- New Relic Agent (proprietary)
- Dynatrace OneAgent (proprietary)
- APM Insight (proprietary)
The Problem: Vendor lock-in, duplication, and complexity.
Fawkes adopted OpenTelemetry (OTel) as the single standard for telemetry collection, replacing vendor-specific agents. This document explains why OpenTelemetry is a strategic architectural decision, not just a tool choice.
The Problem: Vendor Agent Chaos
The Old World: One Agent Per Vendor
graph TB
    A[Application Pod] --> B[Datadog Agent]
    A --> C[New Relic Agent]
    A --> D[Prometheus Exporter]
    A --> E[Fluentd Sidecar]
    B --> F[Datadog Backend]
    C --> G[New Relic Backend]
    D --> H[Prometheus]
    E --> I[Elasticsearch]
    style A fill:#ff6b6b
What's Wrong Here?
1. Resource Duplication
   - 4 agents per pod = 4x memory/CPU overhead
   - Each agent has its own buffer, config, and credentials
   - Cost: 200-400MB of RAM per pod just for agents
2. Vendor Lock-In
   - Proprietary instrumentation libraries
   - Can't switch backends without re-instrumenting code
   - Risk: vendor pricing changes, service degradation, or shutdown
3. Inconsistent Data
   - Each vendor collects slightly differently
   - Metric names don't match across tools
   - Problem: "Datadog shows 500 req/sec, but New Relic shows 520. Which is right?"
4. Configuration Sprawl
   - 4 agents = 4 config files per service
   - Updating the sampling rate? Touch all 4 configs
   - Toil: the platform team maintains multiple agent configs
5. Security Surface
   - Each agent needs credentials for its backend
   - 4 agents = 4 secret rotation processes
   - Attack surface: more agents = more potential vulnerabilities
Real-World Vendor Lock-In Example
2019: Company X uses Datadog, instruments all apps with Datadog APM libraries
2022: Datadog raises prices 40%. Company X evaluates alternatives.
Discovery:
- 200 microservices instrumented with the dd-trace library
- Custom dashboards use Datadog-specific query language
- Alerts depend on Datadog metric names
- Estimated Migration Cost: 6 months, $500K+ engineering time
Outcome: Stuck with Datadog despite cost increase (lock-in worked as designed)
The Solution: OpenTelemetry as the Standard
What is OpenTelemetry?
OpenTelemetry (OTel) is a vendor-neutral, open-source observability framework that provides:
- Standardized APIs - Uniform way to instrument code
- SDKs for all languages - Java, Python, Go, Node.js, .NET, Ruby, PHP, etc.
- Automatic instrumentation - Zero-code instrumentation for frameworks
- Collector - Vendor-agnostic telemetry pipeline
- Export to any backend - Prometheus, Grafana, Datadog, Jaeger, etc.
Created by: the CNCF community (a merger of OpenTracing and OpenCensus)
Status: CNCF Incubating (among the most active CNCF projects)
Backed by: Google, Microsoft, AWS, Splunk, Datadog, New Relic, etc.
The Key Insight: Get vendors to agree on a standard, then compete on backends, not agents.
The OpenTelemetry Architecture
graph TB
    A[Application Code] -->|OTel SDK| B[OTel Collector]
    B --> C[Prometheus]
    B --> D[Grafana Tempo]
    B --> E[Loki]
    B --> F[Optional: Datadog]
    B --> G[Optional: New Relic]
    style B fill:#4CAF50
    style C fill:#2196F3
    style D fill:#2196F3
    style E fill:#2196F3
How It Works:
1. Instrument Once
   - App uses the OTel SDK (vendor-neutral)
   - Emits metrics, logs, and traces in OTLP format (the OpenTelemetry Protocol)
2. Centralized Collection
   - The OTel Collector receives all telemetry
   - Runs as a DaemonSet (one per node) or as a sidecar
3. Flexible Export
   - The Collector transforms and forwards telemetry to backends
   - Want to switch from Datadog to Prometheus? Change the collector config, not app code
   - Want to send to multiple backends? Configure multiple exporters, as in the sketch below
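As a concrete sketch (exporter names and endpoints here are illustrative, not Fawkes' actual config), fanning the same traces out to a second backend is a one-line change in the pipeline definition:

```yaml
# Illustrative sketch: the same traces flow to two backends at once.
# "otlp/tempo" and "otlp/vendor" are arbitrary named instances of the
# OTLP exporter; swap or remove entries here without touching app code.
exporters:
  otlp/tempo:
    endpoint: "tempo:4317"
  otlp/vendor:
    endpoint: "vendor-otlp.example.com:4317"

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/tempo, otlp/vendor] # add/remove backends here
```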
Benefits:
- ✅ Single agent - OTel Collector replaces all vendor agents
- ✅ No vendor lock-in - Switch backends without code changes
- ✅ Consistent data - Same telemetry regardless of backend
- ✅ Future-proof - Industry standard, not vendor-specific
Fawkes Telemetry Stack
- Metrics: Prometheus (via OTel Collector)
- Logs: Loki (via OTel Collector)
- Traces: Grafana Tempo (via OTel Collector)
- Visualization: Grafana (unified dashboard)
All data flows through OTel Collector:
# OTel Collector config
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 10s
    send_batch_size: 1024
  # Remove sensitive data
  attributes:
    actions:
      - key: password
        action: delete
      - key: token
        action: delete

exporters:
  prometheus:
    endpoint: "0.0.0.0:8889" # exposes a scrape endpoint for Prometheus
  otlp/tempo:
    endpoint: "tempo:4317"
    tls:
      insecure: false
  loki:
    endpoint: "http://loki:3100/loki/api/v1/push"

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch, attributes]
      exporters: [prometheus]
    traces:
      receivers: [otlp]
      processors: [batch, attributes]
      exporters: [otlp/tempo]
    logs:
      receivers: [otlp]
      processors: [batch, attributes]
      exporters: [loki]
Effect: Single configuration controls all telemetry routing.
Trade-Offs: Standardization vs. Vendor Features
What OpenTelemetry Gives You
| Benefit | Impact |
|---|---|
| Vendor Neutrality | Switch backends without re-instrumenting code |
| Cost Control | Evaluate vendors on backend merit, not ecosystem lock-in |
| Single Agent | Reduce resource overhead by 50-75% (1 collector vs. 4 agents) |
| Consistent Data Model | Metrics/logs/traces use same attributes, same naming |
| Community Support | CNCF backing means long-term sustainability |
| Auto-Instrumentation | Zero-code instrumentation for popular frameworks |
| Cross-Cloud | Works on AWS, Azure, GCP, on-prem identically |
What OpenTelemetry Costs You
| Challenge | Mitigation |
|---|---|
| Missing Vendor Features | Vendor agents have proprietary features (e.g., Datadog Live Profiling). Mitigation: Use OTel for core observability, add vendor agent only for specific features if justified |
| Configuration Complexity | OTel Collector config can be verbose. Mitigation: Fawkes provides templates and Helm charts for common patterns |
| Learning Curve | New standard to learn vs. familiar vendor tools. Mitigation: Dojo Module 13: Observability covers OTel |
| Maturity Gaps | Some language SDKs less mature than vendor equivalents. Mitigation: Java, Go, Python, Node.js are production-ready (Fawkes primary languages) |
| Backend Performance | Prometheus/Tempo/Loki may not match Datadog polish initially. Mitigation: Acceptable trade-off for avoiding lock-in; Grafana ecosystem rapidly improving |
When to Use Vendor Agents Anyway
OpenTelemetry is the default, but vendor agents allowed for specific use cases:
Use OTel When (90% of cases):
- ✅ Standard metrics, logs, traces
- ✅ Application performance monitoring
- ✅ Distributed tracing
- ✅ Custom instrumentation
Use Vendor Agent When:
- ⚠️ Need proprietary features (e.g., Datadog Real User Monitoring, synthetic tests)
- ⚠️ Backend vendor requires their agent for full functionality
- ⚠️ Compliance/legal requirement mandates specific vendor
Hybrid Approach: Run OTel Collector for core telemetry, add vendor sidecar only where needed.
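A hedged sketch of what that hybrid can look like at the collector level, when an exporter suffices instead of a full sidecar (the datadog exporter ships in the collector-contrib distribution; the key handling shown is illustrative):

```yaml
# Illustrative hybrid: core traces go to Tempo; the same traces are also
# exported to Datadog only where its proprietary features justify it.
exporters:
  otlp/tempo:
    endpoint: "tempo:4317"
  datadog:
    api:
      key: ${env:DD_API_KEY} # illustrative secret injection

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/tempo, datadog]
```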
Practical Implications
For Developers
Q: How do I instrument my application?
A: Auto-Instrumentation (Preferred)
Most frameworks supported automatically:
Java (Spring Boot, Quarkus, Micronaut):
# Download the OTel Java agent
curl -L https://github.com/open-telemetry/opentelemetry-java-instrumentation/releases/latest/download/opentelemetry-javaagent.jar -o agent.jar

# Run the app with the agent attached
java -javaagent:agent.jar \
  -Dotel.service.name=my-service \
  -Dotel.exporter.otlp.endpoint=http://otel-collector:4317 \
  -jar app.jar
Node.js (Express, Fastify, NestJS):
// tracing.js
const { NodeSDK } = require("@opentelemetry/sdk-node");
const { getNodeAutoInstrumentations } = require("@opentelemetry/auto-instrumentations-node");
const { OTLPTraceExporter } = require("@opentelemetry/exporter-trace-otlp-grpc");

const sdk = new NodeSDK({
  serviceName: "my-service",
  traceExporter: new OTLPTraceExporter({
    url: "http://otel-collector:4317",
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();
Python (Flask, Django, FastAPI):
# Install OTel packages
pip install opentelemetry-distro opentelemetry-exporter-otlp

# Install auto-instrumentation for detected libraries
opentelemetry-bootstrap -a install

# Run with instrumentation
opentelemetry-instrument \
  --service_name=my-service \
  --exporter_otlp_endpoint=http://otel-collector:4317 \
  python app.py
No code changes required! Framework calls are traced automatically.
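Across all three languages, the SDKs also honor the same standard OTEL_* environment variables, so the endpoint and service name can live in deployment config rather than in flags (values shown are examples):

```bash
# Standard OTel SDK environment variables (same names in Java, Node.js, Python)
export OTEL_SERVICE_NAME=my-service
export OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
export OTEL_RESOURCE_ATTRIBUTES=deployment.environment=production
python app.py  # or: java -javaagent:agent.jar -jar app.jar
```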
Q: What if I need custom spans?
A: Manual Instrumentation
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

def process_order(order):
    with tracer.start_as_current_span("process_order") as span:
        span.set_attribute("order.id", order.id)
        span.set_attribute("order.amount", order.total)

        # Business logic (application-defined helpers)
        validate_order(order.id)
        charge_payment(order.id)
        ship_order(order.id)

        span.set_attribute("order.status", "completed")
Result: Custom span appears in Grafana Tempo with attributes.
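A related sketch worth knowing: marking the span as failed when the business logic throws, using the same SDK. This is a minimal, self-contained example; charge_payment is a hypothetical stand-in for the helpers above:

```python
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer(__name__)

def charge_payment(order_id):
    # Hypothetical stand-in for the business helper used above
    raise RuntimeError("payment gateway unavailable")

def process_order(order_id):
    with tracer.start_as_current_span("process_order") as span:
        try:
            charge_payment(order_id)
        except Exception as exc:
            # Record the exception and mark the span failed so it
            # surfaces in Tempo and matches the error-sampling policy
            span.record_exception(exc)
            span.set_status(Status(StatusCode.ERROR))
            raise
```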
For Platform Engineers
Q: How do I deploy the OTel Collector?
A: DaemonSet (Recommended)
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: otel-collector
  namespace: observability
spec:
  selector:
    matchLabels:
      app: otel-collector
  template:
    metadata:
      labels:
        app: otel-collector
    spec:
      containers:
        - name: otel-collector
          image: otel/opentelemetry-collector-contrib:0.88.0
          # Point the collector at the mounted ConfigMap
          # (assumes the ConfigMap has a key named config.yaml)
          args: ["--config=/etc/otel/config.yaml"]
          ports:
            - containerPort: 4317 # OTLP gRPC
            - containerPort: 4318 # OTLP HTTP
          volumeMounts:
            - name: config
              mountPath: /etc/otel
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 500m
              memory: 512Mi
      volumes:
        - name: config
          configMap:
            name: otel-collector-config
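One wiring detail the DaemonSet pattern implies: pods should reach the collector on their own node. A minimal sketch, assuming the DaemonSet additionally exposes port 4317 via hostPort (not shown above); OTEL_EXPORTER_OTLP_ENDPOINT is the standard SDK variable:

```yaml
# Illustrative app Deployment fragment: route telemetry to the
# node-local collector via the downward API.
env:
  - name: NODE_IP
    valueFrom:
      fieldRef:
        fieldPath: status.hostIP
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
    value: "http://$(NODE_IP):4317"
```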
Q: How do I handle high cardinality?
A: Sampling and Aggregation
# Tail-based sampling (keep interesting traces)
processors:
  tail_sampling:
    policies:
      - name: errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: slow-requests
        type: latency
        latency:
          threshold_ms: 1000
      - name: probabilistic
        type: probabilistic
        probabilistic:
          sampling_percentage: 10
Effect: Keep 100% of errors/slow requests, sample 10% of fast successful requests.
For Consultants
Client Question: "Why not just use Datadog/New Relic? It's easier."
Response:
1. Short-Term vs. Long-Term
   - Short-term: a vendor agent is easier (pre-built dashboards, support)
   - Long-term: vendor lock-in, pricing pressure, limited flexibility
2. Cost Control
   - Datadog pricing: $15-31/host/month + $1.27/million spans
   - 100 hosts, 10M spans/day = $50K+/year
   - OTel + Grafana Cloud: $0-8K/year (or self-hosted = $0)
3. Strategic Flexibility
   - Want to switch vendors? OTel makes it a config change
   - Vendor raises prices? No re-instrumentation needed
   - Multi-cloud? Same telemetry stack everywhere
4. Industry Momentum
   - A CNCF project backed by all major vendors
   - Even Datadog supports OTel (they know it's the future)
Analogy: "Would you write SQL queries in Oracle PL/SQL or ANSI SQL? ANSI SQL works across databases. OTel is the ANSI SQL of observability."
Objection Handling:
- "OTel is immature" → Graduated project, 100+ companies contributing
- "We lose vendor features" → Hybrid approach: OTel + vendor agent only where needed
- "Our team knows Datadog" → Training investment pays off via flexibility
Historical Context: From Vendor Agents to OTel
Fawkes 1.0 (2022): Prometheus + Vendor Sprawl
- Metrics: Prometheus exporters (custom per service)
- Logs: Fluentd + Elasticsearch
- Traces: Jaeger (limited adoption)
Problems:
- Developers didn't instrument traces (too complex)
- Logs disconnected from metrics/traces
- No unified view of system state
- MTTR: High (couldn't correlate data across tools)
Fawkes 2.0 (2023): OpenTelemetry Adoption
Decision: Bet on OpenTelemetry as standard
Migration:
- Deploy OTel Collector (DaemonSet)
- Auto-instrument new services with OTel SDKs
- Migrate existing services incrementally
- Sunset Jaeger and the direct Prometheus exporters
Results:
- Unified data model: Metrics, logs, traces use same attributes
- Correlation: Trace ID in logs, link from Grafana to Tempo
- Developer adoption: Auto-instrumentation = zero friction
- MTTR: ⬇️ 40% (faster root cause analysis)
- Cost: ⬇️ $30K/year (avoided Datadog expansion)
Fawkes 3.0 (2024-Present): Grafana LGTM Stack
Full Observability Stack:
- Loki (Logs)
- Grafana (Visualization)
- Tempo (Traces)
- Mimir/Prometheus (Metrics)
All powered by OTel Collector as the telemetry ingestion layer.
The Mental Model: OTel as the Telemetry Layer
Think of OpenTelemetry like the OSI Network Model:
| Layer | Network Model | Telemetry Model |
|---|---|---|
| Application | HTTP, gRPC | Business Logic |
| Presentation | TLS, Compression | OTel SDK (instrumentation) |
| Session | Authentication | Trace Context Propagation |
| Transport | TCP, UDP | OTel Collector (routing, batching) |
| Network | IP | OTLP Protocol |
| Data Link | Ethernet | Backend Protocol (Prometheus, Tempo) |
| Physical | Cables | Storage (Prometheus TSDB, Tempo blocks) |
Key Insight: You don't write TCP socket code to make HTTP requests—you use a library. Similarly, you don't use vendor-specific agents—you use OTel SDK and let the collector handle backend details.
This is abstraction working as intended.
Related Reading
- How-To: Trace Requests with Tempo
- How-To: View DORA Metrics
- ADR: ADR-013: Distributed Tracing
- ADR: ADR-011: Centralized Log Management
- ADR: ADR-012: Metrics Monitoring
- Tutorial: Module 13: Observability
- Reference: OpenTelemetry Documentation
Conclusion
OpenTelemetry is not just a tool—it's a strategic architectural decision that prevents vendor lock-in while enabling world-class observability.
The Core Principle: Instrument once with a vendor-neutral standard, then send telemetry to any backend. This gives you:
- Flexibility: Switch backends without code changes
- Cost Control: Vendors compete on backend quality, not ecosystem lock-in
- Future-Proofing: Industry standard, backed by CNCF and all major vendors
- Simplicity: One agent (OTel Collector) instead of many vendor agents
The Trade-Off: You may lose some proprietary vendor features, but you gain strategic independence. For Fawkes, that's the right trade.
The Future: As Datadog, New Relic, and others adopt OTel (they already are), the ecosystem converges on the standard. Choosing OTel today means you're ahead of the curve, not behind it.
Unified Telemetry = OpenTelemetry. This is the way.