Centralized Log Management with OpenTelemetry

Overview

Fawkes implements centralized, structured logging for all Kubernetes workloads using OpenTelemetry Collector and OTLP. This enables reliable correlation of log events with traces and metrics, accelerating Mean Time to Resolution (MTTR) for application and platform incidents.

Reference: See ADR-011 Centralized Log Management for architectural decisions.

Architecture

┌────────────────────────────────────────────────────────────────┐
│  Kubernetes Cluster                                            │
│                                                                 │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐        │
│  │ Application  │  │ Application  │  │ Platform     │        │
│  │ Pod          │  │ Pod          │  │ Service Pod  │        │
│  │              │  │              │  │              │        │
│  │ stdout/stderr│  │ stdout/stderr│  │ stdout/stderr│        │
│  └──────┬───────┘  └──────┬───────┘  └──────┬───────┘        │
│         │                  │                  │                 │
│         ▼                  ▼                  ▼                 │
│  ┌──────────────────────────────────────────────────────┐     │
│  │ /var/log/containers/*.log                            │     │
│  └──────────────────────────────────────────────────────┘     │
│         │                                                       │
│         ▼                                                       │
│  ┌─────────────────────────────────────────────────────┐      │
│  │ OpenTelemetry Collector DaemonSet                   │      │
│  │ - filelog receiver: Tail container logs             │      │
│  │ - k8sattributes: Add Kubernetes metadata            │      │
│  │ - otlphttp exporter: Send to OpenSearch             │      │
│  │ - memory_limiter: Buffer during outages             │      │
│  └───────────────────┬─────────────────────────────────┘      │
│                      │ OTLP/HTTP                                │
└──────────────────────┼──────────────────────────────────────────┘
                       │
                       ▼
┌──────────────────────────────────────────────────────────────────┐
│  OpenSearch Cluster                                              │
│  - Centralized log storage and search                           │
│  - OpenSearch Dashboards for visualization                      │
└──────────────────────────────────────────────────────────────────┘

Key Features

1. Log Forwarding

The OpenTelemetry Collector Agent runs as a DaemonSet on every node and collects logs from all containers:

Source: /var/log/containers/*.log
Format: Kubernetes container log format (CRI/Docker JSON)
Receiver: filelog receiver with custom operators

2. Kubernetes Context Enrichment

Every log record is enriched with Kubernetes metadata via the k8sattributes processor:

Attribute	Description
`k8s.pod.name`	Pod name
`k8s.namespace.name`	Namespace name
`k8s.container.name`	Container name
`k8s.deployment.name`	Deployment name (if applicable)
`k8s.node.name`	Node where pod is running
`cluster`	Cluster identifier
`environment`	Environment (development, staging, production)

3. Trace Correlation

Logs are automatically correlated with traces when applications use OpenTelemetry instrumentation:

traceId: 32-character hexadecimal trace identifier
spanId: 16-character hexadecimal span identifier

This enables seamless navigation from logs to traces in observability tools.

4. Failure Handling

The collector is configured for resilience during backend unavailability:

Memory Limiter: 800MiB limit with 200MiB spike limit
Batch Processor: Queue-based batching for efficient export
Retry on Failure: Automatic retries with exponential backoff
Sending Queue: 5000-item queue for 5+ minute buffering

Configuration

OpenTelemetry Collector

The collector configuration is defined in:

platform/apps/opentelemetry/otel-collector-application.yaml

Key configuration sections:

Filelog Receiver

filelog:
  include:
    - /var/log/containers/*.log
  exclude:
    - /var/log/containers/*otel-collector*.log
  start_at: end
  include_file_path: true

K8s Attributes Processor

k8sattributes:
  extract:
    metadata:
      - k8s.namespace.name
      - k8s.pod.name
      - k8s.container.name
      - k8s.deployment.name
      # ... more attributes

OpenSearch Exporter

opensearch:
  http:
    endpoint: "http://opensearch-cluster-master.logging.svc.cluster.local:9200"
  logs_index: "otel-logs"
  retry:
    enabled: true
    max_elapsed_time: 300s
  sending_queue:
    enabled: true
    queue_size: 5000

OpenSearch Index Templates

Index templates are defined in:

platform/apps/opensearch/index-template.yaml

Templates ensure proper mapping for:

OTLP log format (otel-logs-*)
Application logs (application-logs-*)
Kubernetes logs (kubernetes-logs-*)

Usage

Searching Logs in OpenSearch

Access OpenSearch Dashboards and use these query patterns:

Find logs from a specific namespace:

resource.attributes.k8s.namespace.name: "my-namespace"

Find logs with a specific trace ID:

traceId: "<your-32-character-trace-id>"

Find error logs from a deployment:

resource.attributes.k8s.deployment.name: "my-app" AND severityText: "ERROR"

Structured Logging Best Practices

For optimal trace correlation, applications should emit structured JSON logs:

{
  "@timestamp": "2024-12-07T10:30:00.123Z",
  "level": "INFO",
  "message": "User login successful",
  "traceId": "<32-character-hex-trace-id>",
  "spanId": "<16-character-hex-span-id>",
  "userId": "user-123"
}

Required Fields

Field	Type	Description
`level`	keyword	Log level (ERROR, WARN, INFO, DEBUG)
`message`	text	Human-readable message
`traceId`	keyword	OpenTelemetry trace ID (optional)
`spanId`	keyword	OpenTelemetry span ID (optional)

Environment Variables

The following environment variables are used by the collector:

Variable	Description
`K8S_NODE_NAME`	Current node name (auto-populated)
`NODE_IP`	Node IP address (auto-populated)

Monitoring

Collector Health

Check collector health:

kubectl port-forward -n monitoring daemonset/otel-collector 13133:13133
curl http://localhost:13133/

ZPages for Debugging

Access collector internal diagnostics:

kubectl port-forward -n monitoring daemonset/otel-collector 55679:55679
# Open http://localhost:55679/debug/tracez

Metrics

The collector exposes Prometheus metrics at port 8888:

otelcol_receiver_accepted_log_records: Logs received
otelcol_exporter_sent_log_records: Logs exported
otelcol_exporter_queue_size: Current queue size

Troubleshooting

Logs Not Appearing in OpenSearch

Check collector pods are running:

kubectl get pods -n monitoring -l app.kubernetes.io/name=opentelemetry-collector

Check collector logs:

kubectl logs -n monitoring -l app.kubernetes.io/name=opentelemetry-collector --tail=100

Verify OpenSearch connectivity:

kubectl exec -n monitoring -it <collector-pod> -- curl http://opensearch-cluster-master.logging.svc.cluster.local:9200/_cluster/health

Missing Kubernetes Attributes

Ensure the collector service account has proper RBAC permissions to read pod metadata.

High Memory Usage

If the collector is using too much memory, adjust the memory_limiter settings or increase resource limits.