ADR-011: Centralized Log Management

Status

Accepted

Context

The Fawkes platform requires comprehensive log management for both platform services and applications deployed by teams:

Platform Service Logs:

Kubernetes control plane (API server, scheduler, controller manager, etcd)
NGINX Ingress Controller (access logs, error logs)
ArgoCD (deployment events, sync operations, application health)
Jenkins (build logs, pipeline execution, agent activities)
Backstage (catalog operations, template scaffolding, API requests)
Mattermost (user activity, integrations, webhooks)
Harbor (registry operations, image scanning, vulnerability reports)
Grafana (dashboard access, alert notifications, data source queries)
Prometheus (scrape operations, rule evaluations, alert firing)
PostgreSQL (query logs, connection logs, errors)
External Secrets Operator (secret synchronization, errors)

Application Logs (from teams using Fawkes):

Microservice application logs (structured and unstructured)
Container stdout/stderr
Application performance metrics
Error and exception tracking
Audit logs for security and compliance
Business event logs

Logging Requirements:

Centralized Storage: All logs in one searchable location
Long-Term Retention: 30 days hot storage, 90+ days cold storage
Fast Search: Sub-second queries across billions of log entries
Structured Logging: Support for JSON and structured formats
Log Correlation: Trace ID correlation across services
Multi-Tenancy: Team-level log isolation and access control
Real-Time Streaming: Live log tailing for debugging
Alerting: Trigger alerts based on log patterns
Visualization: Dashboard creation from log data
Cost Efficiency: Minimize storage and compute costs

Security & Compliance Requirements:

Encryption at rest and in transit
Role-based access control (RBAC)
Audit trail of log access
PII/sensitive data masking
Retention policies for compliance (GDPR, SOC 2)
Immutable log storage (tamper-proof)
Log integrity verification

Operational Requirements:

Cloud-agnostic (works on AWS, Azure, GCP, on-premises)
Low operational overhead (minimal maintenance)
Automatic log collection (no application code changes)
Handling high throughput (10,000+ logs/second)
Graceful degradation (buffering during outages)
Easy troubleshooting (logs about logging)
GitOps-compatible deployment

Integration Requirements:

Kubernetes native (DaemonSet for log collection)
Prometheus metrics integration
Grafana dashboard integration
Alert manager integration
SIEM integration capabilities
OpenTelemetry compatibility

Dojo Learning Requirements:

Simple enough for learners to understand
Clear troubleshooting workflows
Hands-on labs for log analysis
Integration with DORA metrics (deployment events, incident response)

Decision

We will use OpenSearch as the centralized log storage and search engine, with Fluent Bit as the lightweight log collector.

Architecture

┌────────────────────────────────────────────────────────────────┐
│  Kubernetes Cluster                                            │
│                                                                 │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐        │
│  │ Application  │  │ Application  │  │ Platform     │        │
│  │ Pod          │  │ Pod          │  │ Service Pod  │        │
│  │              │  │              │  │              │        │
│  │ stdout/stderr│  │ stdout/stderr│  │ stdout/stderr│        │
│  └──────┬───────┘  └──────┬───────┘  └──────┬───────┘        │
│         │                  │                  │                 │
│         ▼                  ▼                  ▼                 │
│  ┌──────────────────────────────────────────────────────┐     │
│  │ /var/log/containers/*.log                            │     │
│  │ (Kubernetes logs each container to host filesystem)  │     │
│  └──────────────────────────────────────────────────────┘     │
│         │                                                       │
│         ▼                                                       │
│  ┌─────────────────────────────────────────────────────┐      │
│  │ Fluent Bit DaemonSet (runs on every node)          │      │
│  │ - Tail container logs                               │      │
│  │ - Parse and enrich (add metadata)                   │      │
│  │ - Filter and transform                              │      │
│  │ - Buffer during outages                             │      │
│  │ - Forward to OpenSearch                             │      │
│  └───────────────────┬─────────────────────────────────┘      │
│                      │                                          │
└──────────────────────┼──────────────────────────────────────────┘
                       │ HTTPS (with buffering)
                       │
                       ▼
┌──────────────────────────────────────────────────────────────────┐
│  OpenSearch Cluster                                              │
│                                                                   │
│  ┌───────────────────┐  ┌───────────────────┐  ┌─────────────┐ │
│  │ Master Node       │  │ Master Node       │  │ Master Node │ │
│  │ (Cluster mgmt)    │  │ (Cluster mgmt)    │  │ (Cluster mgmt)│
│  └───────────────────┘  └───────────────────┘  └─────────────┘ │
│                                                                   │
│  ┌───────────────────┐  ┌───────────────────┐  ┌─────────────┐ │
│  │ Data Node         │  │ Data Node         │  │ Data Node   │ │
│  │ - Log storage     │  │ - Log storage     │  │ - Log storage│ │
│  │ - Indexing        │  │ - Indexing        │  │ - Indexing  │ │
│  │ - Query execution │  │ - Query execution │  │ - Query exec│ │
│  └───────────────────┘  └───────────────────┘  └─────────────┘ │
│                                                                   │
│  Index Management:                                                │
│  - Hot tier (last 7 days): SSD storage, fast queries            │
│  - Warm tier (7-30 days): HDD storage, slower queries           │
│  - Cold tier (30-90 days): S3/object storage, archive           │
│                                                                   │
└───────────────────────────────────────────────────────────────────┘
                       │
                       │ Query Interface
                       │
                       ▼
┌──────────────────────────────────────────────────────────────────┐
│  Visualization & Query Interfaces                                │
│                                                                   │
│  ┌───────────────────┐  ┌───────────────────┐  ┌─────────────┐ │
│  │ OpenSearch        │  │ Grafana           │  │ CLI Tools   │ │
│  │ Dashboards        │  │ (Loki datasource) │  │ (kubectl)   │ │
│  │ - Log search      │  │ - Unified view    │  │             │ │
│  │ - Dashboards      │  │ - Log + metrics   │  │             │ │
│  │ - Alerting        │  │ - Correlations    │  │             │ │
│  └───────────────────┘  └───────────────────┘  └─────────────┘ │
└──────────────────────────────────────────────────────────────────┘

Log Flow

Collection: Fluent Bit DaemonSet collects logs from /var/log/containers/
Enrichment: Add Kubernetes metadata (namespace, pod, container, labels)
Parsing: Parse JSON logs, multiline logs (stack traces), timestamps
Filtering: Filter out noisy logs, health checks, debug messages (configurable)
Buffering: Buffer logs during OpenSearch unavailability
Forwarding: Send to OpenSearch via HTTPS with authentication
Indexing: OpenSearch indexes logs by date and namespace
Retention: Automatically move logs to warm/cold tiers based on age
Querying: Users search via OpenSearch Dashboards or Grafana

OpenSearch Configuration

Cluster Sizing (Production):

Master Nodes: 3 replicas
  - CPU: 2 cores
  - Memory: 4 GB
  - Storage: 20 GB (minimal, only metadata)
  - Purpose: Cluster coordination, no data

Data Nodes (Hot): 3 replicas
  - CPU: 4 cores
  - Memory: 16 GB (50% heap)
  - Storage: 500 GB SSD
  - Purpose: Recent logs (last 7 days)

Data Nodes (Warm): 2 replicas (optional for MVP)
  - CPU: 2 cores
  - Memory: 8 GB
  - Storage: 1 TB HDD
  - Purpose: Older logs (7-30 days)

Index Template:

{
  "index_patterns": ["fawkes-logs-*"],
  "template": {
    "settings": {
      "number_of_shards": 3,
      "number_of_replicas": 1,
      "index.refresh_interval": "30s",
      "index.lifecycle.name": "fawkes-log-policy"
    },
    "mappings": {
      "properties": {
        "@timestamp": { "type": "date" },
        "kubernetes": {
          "properties": {
            "namespace": { "type": "keyword" },
            "pod_name": { "type": "keyword" },
            "container_name": { "type": "keyword" },
            "labels": { "type": "object" }
          }
        },
        "log": { "type": "text" },
        "level": { "type": "keyword" },
        "message": { "type": "text" },
        "trace_id": { "type": "keyword" },
        "span_id": { "type": "keyword" }
      }
    }
  }
}

Index Lifecycle Policy (ILM):

{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": {
          "rollover": {
            "max_age": "1d",
            "max_size": "50gb"
          }
        }
      },
      "warm": {
        "min_age": "7d",
        "actions": {
          "shrink": { "number_of_shards": 1 },
          "forcemerge": { "max_num_segments": 1 }
        }
      },
      "cold": {
        "min_age": "30d",
        "actions": {
          "searchable_snapshot": {
            "snapshot_repository": "fawkes-logs-s3"
          }
        }
      },
      "delete": {
        "min_age": "90d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}

Fluent Bit Configuration

DaemonSet Deployment:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluent-bit
  namespace: fawkes-logging
spec:
  selector:
    matchLabels:
      app: fluent-bit
  template:
    metadata:
      labels:
        app: fluent-bit
    spec:
      serviceAccountName: fluent-bit
      containers:
        - name: fluent-bit
          image: fluent/fluent-bit:2.2
          resources:
            limits:
              cpu: 200m
              memory: 256Mi
            requests:
              cpu: 100m
              memory: 128Mi
          volumeMounts:
            - name: varlog
              mountPath: /var/log
              readOnly: true
            - name: varlibdockercontainers
              mountPath: /var/lib/docker/containers
              readOnly: true
            - name: fluent-bit-config
              mountPath: /fluent-bit/etc/
      volumes:
        - name: varlog
          hostPath:
            path: /var/log
        - name: varlibdockercontainers
          hostPath:
            path: /var/lib/docker/containers
        - name: fluent-bit-config
          configMap:
            name: fluent-bit-config

Fluent Bit Pipeline Configuration:

[SERVICE]
    Flush         5
    Daemon        off
    Log_Level     info
    Parsers_File  parsers.conf

[INPUT]
    Name              tail
    Path              /var/log/containers/*.log
    Parser            docker
    Tag               kube.*
    Refresh_Interval  5
    Mem_Buf_Limit     5MB
    Skip_Long_Lines   On

[FILTER]
    Name                kubernetes
    Match               kube.*
    Kube_URL            https://kubernetes.default.svc:443
    Kube_CA_File        /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    Kube_Token_File     /var/run/secrets/kubernetes.io/serviceaccount/token
    Kube_Tag_Prefix     kube.var.log.containers.
    Merge_Log           On
    Keep_Log            Off
    K8S-Logging.Parser  On
    K8S-Logging.Exclude On
    Labels              On
    Annotations         Off

[FILTER]
    Name         modify
    Match        *
    Add          cluster_name fawkes-production

[FILTER]
    Name         nest
    Match        *
    Operation    lift
    Nested_under kubernetes
    Add_prefix   k8s_

[FILTER]
    Name         grep
    Match        *
    Exclude      log /healthz|/readyz|/livez

[OUTPUT]
    Name            opensearch
    Match           *
    Host            opensearch.fawkes-logging.svc.cluster.local
    Port            9200
    Index           fawkes-logs
    Type            _doc
    Logstash_Format On
    Logstash_Prefix fawkes-logs
    Logstash_DateFormat %Y.%m.%d
    Suppress_Type_Name On
    TLS             On
    TLS.Verify      On
    HTTP_User       ${OPENSEARCH_USER}
    HTTP_Passwd     ${OPENSEARCH_PASSWORD}
    Retry_Limit     5
    Buffer_Size     False

Parsing Configuration (parsers.conf):

[PARSER]
    Name         docker
    Format       json
    Time_Key     time
    Time_Format  %Y-%m-%dT%H:%M:%S.%LZ
    Time_Keep    On

[PARSER]
    Name         json
    Format       json
    Time_Key     timestamp
    Time_Format  %Y-%m-%dT%H:%M:%S.%LZ

[PARSER]
    Name         java_multiline
    Format       regex
    Regex        /^(?<time>\d{4}-\d{2}-\d{2}\s+\d{2}:\d{2}:\d{2}.\d{3})\s+(?<level>[A-Z]+)\s+(?<message>.*)/
    Time_Key     time
    Time_Format  %Y-%m-%d %H:%M:%S.%L

Multi-Tenancy & Access Control

Namespace-Based Log Isolation:

apiVersion: v1
kind: Role
metadata:
  name: log-reader
  namespace: team-alpha
rules:
  - apiGroups: [""]
    resources: ["pods", "pods/log"]
    verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: team-alpha-log-readers
  namespace: team-alpha
subjects:
  - kind: Group
    name: team-alpha
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: log-reader
  apiGroup: rbac.authorization.k8s.io

OpenSearch Role-Based Access:

{
  "team-alpha-logs": {
    "cluster_permissions": [],
    "index_permissions": [
      {
        "index_patterns": ["fawkes-logs-*"],
        "dls": "{\"term\": {\"k8s_namespace\": \"team-alpha\"}}",
        "fls": [],
        "masked_fields": [],
        "allowed_actions": ["read"]
      }
    ]
  }
}

Grafana Integration

Loki Datasource Configuration (simulated via OpenSearch):

apiVersion: 1
datasources:
  - name: OpenSearch Logs
    type: grafana-opensearch-datasource
    access: proxy
    url: http://opensearch.fawkes-logging.svc.cluster.local:9200
    basicAuth: true
    basicAuthUser: grafana
    secureJsonData:
      basicAuthPassword: ${OPENSEARCH_GRAFANA_PASSWORD}
    jsonData:
      timeField: "@timestamp"
      esVersion: "7.10.0"
      logMessageField: log
      logLevelField: level
      database: "fawkes-logs-*"

Structured Logging Best Practices

Recommended Log Format (JSON):

{
  "@timestamp": "2024-12-07T10:30:00.123Z",
  "level": "INFO",
  "logger": "com.example.UserService",
  "message": "User login successful",
  "trace_id": "a1b2c3d4e5f6",
  "span_id": "1234567890",
  "user_id": "user-123",
  "ip_address": "192.168.1.100",
  "duration_ms": 45,
  "kubernetes": {
    "namespace": "team-alpha",
    "pod": "user-service-abc123",
    "container": "user-service"
  }
}

Fields to Always Include:

@timestamp: ISO 8601 timestamp
level: Log level (ERROR, WARN, INFO, DEBUG)
message: Human-readable message
trace_id: Distributed tracing ID (OpenTelemetry)
span_id: Span ID for correlation
Context fields: user_id, request_id, correlation_id

Common Query Patterns

Search for Errors in Namespace:

k8s_namespace:"team-alpha" AND level:"ERROR"

Find Logs with Trace ID:

trace_id:"a1b2c3d4e5f6"

Logs from Specific Pod:

k8s_pod_name:"jenkins-agent-*"

Slow Requests (Duration > 1000ms):

duration_ms:>1000

Deployment Events:

message:"deployment" AND k8s_namespace:"production"

Alerting Rules

High Error Rate:

{
  "trigger": {
    "schedule": { "interval": "5m" },
    "condition": {
      "script": {
        "source": "ctx.results[0].hits.total.value > 100"
      }
    }
  },
  "input": {
    "search": {
      "request": {
        "indices": ["fawkes-logs-*"],
        "body": {
          "query": {
            "bool": {
              "must": [{ "term": { "level": "ERROR" } }, { "range": { "@timestamp": { "gte": "now-5m" } } }]
            }
          }
        }
      }
    }
  },
  "actions": {
    "slack": {
      "webhook": {
        "url": "https://hooks.slack.com/...",
        "body": "High error rate detected: {{ctx.results.0.hits.total.value}} errors in last 5 minutes"
      }
    }
  }
}

Service Unavailable:

{
  "trigger": {
    "schedule": { "interval": "1m" },
    "condition": {
      "script": {
        "source": "ctx.results[0].hits.total.value > 0"
      }
    }
  },
  "input": {
    "search": {
      "request": {
        "indices": ["fawkes-logs-*"],
        "body": {
          "query": {
            "bool": {
              "must": [
                { "match": { "message": "connection refused" } },
                { "term": { "k8s_namespace": "production" } },
                { "range": { "@timestamp": { "gte": "now-1m" } } }
              ]
            }
          }
        }
      }
    }
  }
}

Consequences

Positive

Cloud Agnostic: OpenSearch works identically across AWS, Azure, GCP, on-premises
Open Source: No licensing costs, Apache 2.0 license, community-driven
Scalable: Handles billions of log entries, horizontal scaling via data nodes
Fast Search: Sub-second queries across large datasets, optimized inverted indices
Rich Query Language: SQL and DSL query support, aggregations, complex filters
Multi-Tenancy: Document-level security for team isolation
Cost Efficient: Tiered storage (hot/warm/cold) reduces costs significantly
Integration Rich: Grafana, Prometheus, SIEM tools, OpenTelemetry
Lightweight Collection: Fluent Bit minimal resource footprint (~100MB memory)
GitOps Compatible: Declarative configuration, ArgoCD-managed

Negative

Operational Complexity: OpenSearch cluster requires careful sizing, monitoring, tuning
Resource Intensive: Data nodes need significant CPU/memory/storage
Learning Curve: Query DSL, index management, cluster operations require training
Index Management Overhead: Need to configure ILM policies, monitor shard distribution
No Native Multi-Line Support: Requires Fluent Bit parser configuration
Storage Costs: Hot storage on SSD can be expensive (mitigated by tiered storage)
Backup Complexity: Snapshot repository setup, restoration testing required

Neutral

Elasticsearch Compatibility: OpenSearch fork maintains compatibility (mostly)
Alternative to ELK: Uses OpenSearch instead of Elasticsearch (licensing differences)
Dashboards vs. Kibana: OpenSearch Dashboards UI similar to Kibana
S3 Cold Storage: Requires object storage configuration for cold tier

Alternatives Considered

Alternative 1: Grafana Loki

Pros:

Designed for Kubernetes logging (cloud-native)
Very cost-efficient (indexes only metadata, not log content)
Tight Grafana integration (unified metrics + logs)
Simple deployment and operation
LogQL query language (similar to PromQL)
Good for high-volume, short-retention use cases

Cons:

Limited full-text search capabilities (no inverted index)
Less powerful query language than OpenSearch DSL
Smaller community and ecosystem than OpenSearch/Elasticsearch
Less suitable for compliance (long-term retention, complex queries)
Fewer visualization options in dashboards
Multi-tenancy requires Loki enterprise features

Reason for Rejection: While Loki is excellent for operational logging, Fawkes requires robust full-text search for debugging, compliance auditing, and security investigations. OpenSearch's powerful query DSL and proven scalability better support DORA metrics analysis and incident post-mortems. However, Loki remains a strong alternative for cost-sensitive deployments.

Alternative 2: Elastic Cloud (ELK Stack)

Pros:

Industry standard (Elasticsearch, Logstash, Kibana)
Massive ecosystem and community
Extremely powerful search and analytics
Best-in-class visualization (Kibana)
Mature machine learning features
Extensive documentation and training

Cons:

Licensing concerns (Elastic License 2.0, not fully open source)
High cost for managed service ($50-500+/month)
Vendor lock-in potential
Self-hosted ELK requires significant operational expertise
Logstash resource-heavy (replaced by Fluent Bit in modern stacks)
Complex licensing tiers (basic, gold, platinum)

Reason for Rejection: Elastic's move away from Apache 2.0 license conflicts with Fawkes' open-source principles. O