ADR-011: Centralized Log Management
Status
Accepted
Context
The Fawkes platform requires comprehensive log management for both platform services and applications deployed by teams:
Platform Service Logs:
- Kubernetes control plane (API server, scheduler, controller manager, etcd)
- NGINX Ingress Controller (access logs, error logs)
- ArgoCD (deployment events, sync operations, application health)
- Jenkins (build logs, pipeline execution, agent activities)
- Backstage (catalog operations, template scaffolding, API requests)
- Mattermost (user activity, integrations, webhooks)
- Harbor (registry operations, image scanning, vulnerability reports)
- Grafana (dashboard access, alert notifications, data source queries)
- Prometheus (scrape operations, rule evaluations, alert firing)
- PostgreSQL (query logs, connection logs, errors)
- External Secrets Operator (secret synchronization, errors)
Application Logs (from teams using Fawkes):
- Microservice application logs (structured and unstructured)
- Container stdout/stderr
- Application performance metrics
- Error and exception tracking
- Audit logs for security and compliance
- Business event logs
Logging Requirements:
- Centralized Storage: All logs in one searchable location
- Long-Term Retention: 30 days hot storage, 90+ days cold storage
- Fast Search: Sub-second queries across billions of log entries
- Structured Logging: Support for JSON and structured formats
- Log Correlation: Trace ID correlation across services
- Multi-Tenancy: Team-level log isolation and access control
- Real-Time Streaming: Live log tailing for debugging
- Alerting: Trigger alerts based on log patterns
- Visualization: Dashboard creation from log data
- Cost Efficiency: Minimize storage and compute costs
Security & Compliance Requirements:
- Encryption at rest and in transit
- Role-based access control (RBAC)
- Audit trail of log access
- PII/sensitive data masking
- Retention policies for compliance (GDPR, SOC 2)
- Immutable log storage (tamper-proof)
- Log integrity verification
Operational Requirements:
- Cloud-agnostic (works on AWS, Azure, GCP, on-premises)
- Low operational overhead (minimal maintenance)
- Automatic log collection (no application code changes)
- High throughput (10,000+ logs/second)
- Graceful degradation (buffering during outages)
- Easy troubleshooting (logs about logging)
- GitOps-compatible deployment
Integration Requirements:
- Kubernetes native (DaemonSet for log collection)
- Prometheus metrics integration
- Grafana dashboard integration
- Alertmanager integration
- SIEM integration capabilities
- OpenTelemetry compatibility
Dojo Learning Requirements:
- Simple enough for learners to understand
- Clear troubleshooting workflows
- Hands-on labs for log analysis
- Integration with DORA metrics (deployment events, incident response)
Decision
We will use OpenSearch as the centralized log storage and search engine, with Fluent Bit as the lightweight log collector.
Architecture
┌────────────────────────────────────────────────────────────────┐
│ Kubernetes Cluster │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Application │ │ Application │ │ Platform │ │
│ │ Pod │ │ Pod │ │ Service Pod │ │
│ │ │ │ │ │ │ │
│ │ stdout/stderr│ │ stdout/stderr│ │ stdout/stderr│ │
│ └──────┬───────┘ └──────┬───────┘ └──────┬───────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ /var/log/containers/*.log │ │
│ │ (Kubernetes logs each container to host filesystem) │ │
│ └──────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Fluent Bit DaemonSet (runs on every node) │ │
│ │ - Tail container logs │ │
│ │ - Parse and enrich (add metadata) │ │
│ │ - Filter and transform │ │
│ │ - Buffer during outages │ │
│ │ - Forward to OpenSearch │ │
│ └───────────────────┬─────────────────────────────────┘ │
│ │ │
└──────────────────────┼──────────────────────────────────────────┘
│ HTTPS (with buffering)
│
▼
┌──────────────────────────────────────────────────────────────────┐
│ OpenSearch Cluster │
│ │
│ ┌───────────────────┐ ┌───────────────────┐ ┌─────────────┐ │
│ │ Master Node │ │ Master Node │ │ Master Node │ │
│ │ (Cluster mgmt) │ │ (Cluster mgmt) │ │ (Cluster mgmt)│
│ └───────────────────┘ └───────────────────┘ └─────────────┘ │
│ │
│ ┌───────────────────┐ ┌───────────────────┐ ┌─────────────┐ │
│ │ Data Node │ │ Data Node │ │ Data Node │ │
│ │ - Log storage │ │ - Log storage │ │ - Log storage│ │
│ │ - Indexing │ │ - Indexing │ │ - Indexing │ │
│ │ - Query execution │ │ - Query execution │ │ - Query exec│ │
│ └───────────────────┘ └───────────────────┘ └─────────────┘ │
│ │
│ Index Management: │
│ - Hot tier (last 7 days): SSD storage, fast queries │
│ - Warm tier (7-30 days): HDD storage, slower queries │
│ - Cold tier (30-90 days): S3/object storage, archive │
│ │
└───────────────────────────────────────────────────────────────────┘
│
│ Query Interface
│
▼
┌──────────────────────────────────────────────────────────────────┐
│ Visualization & Query Interfaces │
│ │
│ ┌───────────────────┐ ┌───────────────────┐ ┌─────────────┐ │
│ │ OpenSearch │ │ Grafana │ │ CLI Tools │ │
│ │ Dashboards │ │ (OpenSearch datasource) │ │ (kubectl) │ │
│ │ - Log search │ │ - Unified view │ │ │ │
│ │ - Dashboards │ │ - Log + metrics │ │ │ │
│ │ - Alerting │ │ - Correlations │ │ │ │
│ └───────────────────┘ └───────────────────┘ └─────────────┘ │
└──────────────────────────────────────────────────────────────────┘
Log Flow
- Collection: Fluent Bit DaemonSet collects logs from /var/log/containers/
- Enrichment: Add Kubernetes metadata (namespace, pod, container, labels)
- Parsing: Parse JSON logs, multiline logs (stack traces), timestamps
- Filtering: Filter out noisy logs, health checks, debug messages (configurable)
- Buffering: Buffer logs during OpenSearch unavailability
- Forwarding: Send to OpenSearch via HTTPS with authentication
- Indexing: OpenSearch indexes logs by date and namespace
- Retention: Automatically move logs to warm/cold tiers based on age
- Querying: Users search via OpenSearch Dashboards or Grafana
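The enrichment, filtering, and index-routing steps above can be sketched in a few lines of Python. This is an illustration only, not how Fluent Bit is implemented; the function names (`enrich`, `keep`, `index_name`, `process`) are hypothetical, and the health-check pattern and daily index naming are taken from the Fluent Bit configuration later in this ADR.

```python
import re
from datetime import datetime, timezone

# Same pattern as the grep filter in the Fluent Bit pipeline configuration.
HEALTH_CHECK = re.compile(r"/healthz|/readyz|/livez")


def enrich(record: dict, namespace: str, pod: str, container: str) -> dict:
    """Attach Kubernetes metadata, as the enrichment step does."""
    metadata = {"namespace": namespace, "pod": pod, "container": container}
    return {**record, "kubernetes": metadata}


def keep(record: dict) -> bool:
    """Drop health-check noise; keep everything else."""
    return HEALTH_CHECK.search(record.get("log", "")) is None


def index_name(timestamp: datetime, prefix: str = "fawkes-logs") -> str:
    """Daily index name in UTC, e.g. fawkes-logs-2024.12.07."""
    return f"{prefix}-{timestamp.astimezone(timezone.utc):%Y.%m.%d}"


def process(records, namespace, pod, container):
    """Filter, enrich, and route records; yields (index, document) pairs."""
    for record in records:
        if not keep(record):
            continue
        enriched = enrich(record, namespace, pod, container)
        yield index_name(enriched["@timestamp"]), enriched
```

Each record is expected to carry a timezone-aware `@timestamp`; in the real pipeline Fluent Bit derives it from the container log line.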
OpenSearch Configuration
Cluster Sizing (Production):
Master Nodes: 3 replicas
- CPU: 2 cores
- Memory: 4 GB
- Storage: 20 GB (minimal, only metadata)
- Purpose: Cluster coordination, no data
Data Nodes (Hot): 3 replicas
- CPU: 4 cores
- Memory: 16 GB (50% heap)
- Storage: 500 GB SSD
- Purpose: Recent logs (last 7 days)
Data Nodes (Warm): 2 replicas (optional for MVP)
- CPU: 2 cores
- Memory: 8 GB
- Storage: 1 TB HDD
- Purpose: Older logs (7-30 days)
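As a rough sanity check on the hot-tier sizing, the arithmetic below estimates how much raw log data seven days of ingest would occupy. The 10,000 logs/second rate comes from the operational requirements and the single replica from the index template; the 500-byte average entry size is purely an assumption, and index overhead and compression are not modeled.

```python
def hot_tier_bytes(logs_per_second: int, avg_log_bytes: int,
                   hot_days: int, replicas: int) -> int:
    """Raw bytes the hot tier must hold: daily ingest x retention x copies."""
    seconds_per_day = 24 * 60 * 60
    daily_bytes = logs_per_second * avg_log_bytes * seconds_per_day
    return daily_bytes * hot_days * (1 + replicas)


GB = 10**9

# avg_log_bytes=500 is an assumed figure, not a measured one.
required = hot_tier_bytes(logs_per_second=10_000, avg_log_bytes=500,
                          hot_days=7, replicas=1)
provisioned = 3 * 500 * GB  # three hot data nodes with 500 GB SSD each

print(f"required:    {required / GB:,.0f} GB")
print(f"provisioned: {provisioned / GB:,.0f} GB")
```

With these inputs the estimate is 6,048 GB against 1,500 GB provisioned, so the sizing above holds only if the sustained rate or the average entry size is well below the assumed figures, or if on-disk compression is substantial. Measured values should replace the assumptions before the cluster is sized for production.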
Index Template:
{
  "index_patterns": ["fawkes-logs-*"],
  "template": {
    "settings": {
      "number_of_shards": 3,
      "number_of_replicas": 1,
      "index.refresh_interval": "30s",
      "plugins.index_state_management.rollover_alias": "fawkes-logs"
    },
    "mappings": {
      "properties": {
        "@timestamp": { "type": "date" },
        "k8s_namespace_name": { "type": "keyword" },
        "k8s_pod_name": { "type": "keyword" },
        "k8s_container_name": { "type": "keyword" },
        "k8s_labels": { "type": "object" },
        "log": { "type": "text" },
        "level": { "type": "keyword" },
        "message": { "type": "text" },
        "trace_id": { "type": "keyword" },
        "span_id": { "type": "keyword" }
      }
    }
  }
}
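Index State Management (ISM) Policy (fawkes-log-policy):
{
  "policy": {
    "description": "Fawkes log retention: hot, warm at 7d, cold at 30d, delete at 90d",
    "default_state": "hot",
    "states": [
      {
        "name": "hot",
        "actions": [
          { "rollover": { "min_index_age": "1d", "min_size": "50gb" } }
        ],
        "transitions": [
          { "state_name": "warm", "conditions": { "min_index_age": "7d" } }
        ]
      },
      {
        "name": "warm",
        "actions": [
          { "shrink": { "num_new_shards": 1 } },
          { "force_merge": { "max_num_segments": 1 } }
        ],
        "transitions": [
          { "state_name": "cold", "conditions": { "min_index_age": "30d" } }
        ]
      },
      {
        "name": "cold",
        "actions": [
          { "snapshot": { "repository": "fawkes-logs-s3", "snapshot": "fawkes-logs" } }
        ],
        "transitions": [
          { "state_name": "delete", "conditions": { "min_index_age": "90d" } }
        ]
      },
      {
        "name": "delete",
        "actions": [{ "delete": {} }],
        "transitions": []
      }
    ],
    "ism_template": {
      "index_patterns": ["fawkes-logs-*"],
      "priority": 100
    }
  }
}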
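The age thresholds in the lifecycle policy above (7, 30, and 90 days) reduce to a simple lookup. The helper below is a hypothetical illustration of which tier an index of a given age should be in; it is not part of OpenSearch.

```python
def tier_for_age(age_days: float) -> str:
    """Map index age in days to the storage tier defined by the policy."""
    if age_days < 0:
        raise ValueError("age_days must be non-negative")
    if age_days < 7:
        return "hot"
    if age_days < 30:
        return "warm"
    if age_days < 90:
        return "cold"
    return "delete"
```

For example, `tier_for_age(12)` returns `"warm"`, and an index that is exactly 90 days old is due for deletion.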
Fluent Bit Configuration
DaemonSet Deployment:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluent-bit
  namespace: fawkes-logging
spec:
  selector:
    matchLabels:
      app: fluent-bit
  template:
    metadata:
      labels:
        app: fluent-bit
    spec:
      serviceAccountName: fluent-bit
      containers:
        - name: fluent-bit
          image: fluent/fluent-bit:2.2
          resources:
            limits:
              cpu: 200m
              memory: 256Mi
            requests:
              cpu: 100m
              memory: 128Mi
          volumeMounts:
            - name: varlog
              mountPath: /var/log
              readOnly: true
            - name: varlibdockercontainers
              mountPath: /var/lib/docker/containers
              readOnly: true
            - name: fluent-bit-config
              mountPath: /fluent-bit/etc/
      volumes:
        - name: varlog
          hostPath:
            path: /var/log
        - name: varlibdockercontainers
          hostPath:
            path: /var/lib/docker/containers
        - name: fluent-bit-config
          configMap:
            name: fluent-bit-config
Fluent Bit Pipeline Configuration:
[SERVICE]
    Flush         5
    Daemon        off
    Log_Level     info
    Parsers_File  parsers.conf

[INPUT]
    Name              tail
    Path              /var/log/containers/*.log
    Parser            docker
    Tag               kube.*
    Refresh_Interval  5
    Mem_Buf_Limit     5MB
    Skip_Long_Lines   On

[FILTER]
    Name                 kubernetes
    Match                kube.*
    Kube_URL             https://kubernetes.default.svc:443
    Kube_CA_File         /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    Kube_Token_File      /var/run/secrets/kubernetes.io/serviceaccount/token
    Kube_Tag_Prefix      kube.var.log.containers.
    Merge_Log            On
    Keep_Log             Off
    K8S-Logging.Parser   On
    K8S-Logging.Exclude  On
    Labels               On
    Annotations          Off

[FILTER]
    Name   modify
    Match  *
    Add    cluster_name fawkes-production

[FILTER]
    Name          nest
    Match         *
    Operation     lift
    Nested_under  kubernetes
    Add_prefix    k8s_

[FILTER]
    Name     grep
    Match    *
    Exclude  log /healthz|/readyz|/livez

[OUTPUT]
    Name                 opensearch
    Match                *
    Host                 opensearch.fawkes-logging.svc.cluster.local
    Port                 9200
    Index                fawkes-logs
    Type                 _doc
    Logstash_Format      On
    Logstash_Prefix      fawkes-logs
    Logstash_DateFormat  %Y.%m.%d
    Suppress_Type_Name   On
    TLS                  On
    TLS.Verify           On
    HTTP_User            ${OPENSEARCH_USER}
    HTTP_Passwd          ${OPENSEARCH_PASSWORD}
    Retry_Limit          5
    Buffer_Size          False
Parsing Configuration (parsers.conf):
[PARSER]
    Name         docker
    Format       json
    Time_Key     time
    Time_Format  %Y-%m-%dT%H:%M:%S.%LZ
    Time_Keep    On

[PARSER]
    Name         json
    Format       json
    Time_Key     timestamp
    Time_Format  %Y-%m-%dT%H:%M:%S.%LZ

[PARSER]
    Name         java_multiline
    Format       regex
    Regex        /^(?<time>\d{4}-\d{2}-\d{2}\s+\d{2}:\d{2}:\d{2}\.\d{3})\s+(?<level>[A-Z]+)\s+(?<message>.*)/
    Time_Key     time
    Time_Format  %Y-%m-%d %H:%M:%S.%L
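To see what the java_multiline regex captures, the sketch below applies an equivalent pattern in Python. Fluent Bit uses Ruby-style `(?<name>...)` groups, which Python writes as `(?P<name>...)`; the `parse_line` helper and the sample lines are hypothetical. Stack-trace continuation lines do not start with a timestamp, so they do not match, which is why multiline handling is needed on top of this parser.

```python
import re
from datetime import datetime

# Python translation of the java_multiline parser regex.
LINE = re.compile(
    r"^(?P<time>\d{4}-\d{2}-\d{2}\s+\d{2}:\d{2}:\d{2}\.\d{3})\s+"
    r"(?P<level>[A-Z]+)\s+(?P<message>.*)"
)


def parse_line(line: str):
    """Return the parsed fields, or None if the line is not a log header."""
    match = LINE.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    # %f accepts the three-digit millisecond fraction.
    fields["time"] = datetime.strptime(fields["time"], "%Y-%m-%d %H:%M:%S.%f")
    return fields


header = parse_line("2024-12-07 10:30:00.123 ERROR Connection refused")
continuation = parse_line("    at com.example.UserService.login(UserService.java:42)")
print(header["level"], header["message"])
print(continuation)
```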
Multi-Tenancy & Access Control
Namespace-Based Log Isolation:
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: log-reader
  namespace: team-alpha
rules:
  - apiGroups: [""]
    resources: ["pods", "pods/log"]
    verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: team-alpha-log-readers
  namespace: team-alpha
subjects:
  - kind: Group
    name: team-alpha
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: log-reader
  apiGroup: rbac.authorization.k8s.io
OpenSearch Role-Based Access:
{
  "team-alpha-logs": {
    "cluster_permissions": [],
    "index_permissions": [
      {
        "index_patterns": ["fawkes-logs-*"],
        "dls": "{\"term\": {\"k8s_namespace_name\": \"team-alpha\"}}",
        "fls": [],
        "masked_fields": [],
        "allowed_actions": ["read"]
      }
    ]
  }
}
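Since every team needs the same role with a different namespace filter, the definitions could be generated rather than written by hand. The sketch below is one way to do that; `team_role` is a hypothetical helper, and it assumes each team's namespace is named after the team, as in the team-alpha example.

```python
import json


def team_role(team: str, namespace_field: str,
              index_pattern: str = "fawkes-logs-*") -> dict:
    """Build a read-only role restricted to one team's namespace."""
    dls_query = {"term": {namespace_field: team}}
    return {
        f"{team}-logs": {
            "cluster_permissions": [],
            "index_permissions": [
                {
                    "index_patterns": [index_pattern],
                    # The DLS query is stored as a JSON-encoded string.
                    "dls": json.dumps(dls_query),
                    "fls": [],
                    "masked_fields": [],
                    "allowed_actions": ["read"],
                }
            ],
        }
    }
```

The `namespace_field` argument must match the field name that the Fluent Bit pipeline actually writes for the namespace; a mismatch yields a role that matches no documents.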
Grafana Integration
OpenSearch Datasource Configuration (Grafana):
apiVersion: 1
datasources:
  - name: OpenSearch Logs
    type: grafana-opensearch-datasource
    access: proxy
    url: https://opensearch.fawkes-logging.svc.cluster.local:9200
    basicAuth: true
    basicAuthUser: grafana
    secureJsonData:
      basicAuthPassword: ${OPENSEARCH_GRAFANA_PASSWORD}
    jsonData:
      timeField: "@timestamp"
      esVersion: "7.10.0"
      logMessageField: log
      logLevelField: level
      database: "fawkes-logs-*"
Structured Logging Best Practices
Recommended Log Format (JSON):
{
  "@timestamp": "2024-12-07T10:30:00.123Z",
  "level": "INFO",
  "logger": "com.example.UserService",
  "message": "User login successful",
  "trace_id": "a1b2c3d4e5f6",
  "span_id": "1234567890",
  "user_id": "user-123",
  "ip_address": "192.168.1.100",
  "duration_ms": 45,
  "kubernetes": {
    "namespace": "team-alpha",
    "pod": "user-service-abc123",
    "container": "user-service"
  }
}
Fields to Always Include:
- @timestamp: ISO 8601 timestamp
- level: Log level (ERROR, WARN, INFO, DEBUG)
- message: Human-readable message
- trace_id: Distributed tracing ID (OpenTelemetry)
- span_id: Span ID for correlation
- Context fields: user_id, request_id, correlation_id
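A minimal sketch of how an application might emit these fields using only Python's standard `logging` module. The `JsonFormatter` class and the `user-service` logger name are hypothetical; a real service would more likely take trace and span IDs from its OpenTelemetry context than pass them by hand. Note that Python names the level `WARNING` rather than `WARN`.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    CONTEXT_FIELDS = ("trace_id", "span_id", "user_id",
                      "request_id", "correlation_id")

    def format(self, record: logging.LogRecord) -> str:
        timestamp = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "@timestamp": timestamp.isoformat(timespec="milliseconds")
                                   .replace("+00:00", "Z"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Values passed via `extra=` become attributes on the record.
        for field in self.CONTEXT_FIELDS:
            value = getattr(record, field, None)
            if value is not None:
                entry[field] = value
        return json.dumps(entry)


handler = logging.StreamHandler()  # writes to stderr, which the runtime captures
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("user-service")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("User login successful",
            extra={"trace_id": "a1b2c3d4e5f6", "span_id": "1234567890"})
```

Because the output is a single JSON object per line on stderr, the Fluent Bit kubernetes filter can merge it into the record without any application-side shipping code.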
Common Query Patterns
Search for Errors in Namespace:
k8s_namespace_name:"team-alpha" AND level:"ERROR"
Find Logs with Trace ID:
trace_id:"a1b2c3d4e5f6"
Logs from Specific Pod:
k8s_pod_name:jenkins-agent-*
Slow Requests (Duration > 1000ms):
duration_ms:>1000
Deployment Events:
message:"deployment" AND k8s_namespace_name:"production"
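The query-string patterns above are convenient in OpenSearch Dashboards; scripts and services usually send the equivalent query DSL to the `_search` endpoint instead. The sketch below builds such a body for the "errors in namespace" case. `errors_in_namespace` is a hypothetical helper, and the 15-minute window and result size are arbitrary defaults.

```python
def errors_in_namespace(namespace: str, namespace_field: str,
                        since: str = "now-15m", size: int = 100) -> dict:
    """Build a search body for recent ERROR logs in one namespace."""
    return {
        "query": {
            "bool": {
                # Filter clauses skip scoring and can be cached.
                "filter": [
                    {"term": {namespace_field: namespace}},
                    {"term": {"level": "ERROR"}},
                    {"range": {"@timestamp": {"gte": since}}},
                ]
            }
        },
        "sort": [{"@timestamp": {"order": "desc"}}],
        "size": size,
    }
```

The returned dictionary can be serialized with `json.dumps` and posted to `fawkes-logs-*/_search`; `namespace_field` should be whichever keyword field the pipeline stores the namespace in.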
Alerting Rules
High Error Rate:
{
  "trigger": {
    "schedule": { "interval": "5m" },
    "condition": {
      "script": {
        "source": "ctx.results[0].hits.total.value > 100"
      }
    }
  },
  "input": {
    "search": {
      "request": {
        "indices": ["fawkes-logs-*"],
        "body": {
          "query": {
            "bool": {
              "must": [
                { "term": { "level": "ERROR" } },
                { "range": { "@timestamp": { "gte": "now-5m" } } }
              ]
            }
          }
        }
      }
    }
  },
  "actions": {
    "slack": {
      "webhook": {
        "url": "https://hooks.slack.com/...",
        "body": "High error rate detected: {{ctx.results.0.hits.total.value}} errors in last 5 minutes"
      }
    }
  }
}
Service Unavailable:
{
  "trigger": {
    "schedule": { "interval": "1m" },
    "condition": {
      "script": {
        "source": "ctx.results[0].hits.total.value > 0"
      }
    }
  },
  "input": {
    "search": {
      "request": {
        "indices": ["fawkes-logs-*"],
        "body": {
          "query": {
            "bool": {
              "must": [
                { "match": { "message": "connection refused" } },
                { "term": { "k8s_namespace_name": "production" } },
                { "range": { "@timestamp": { "gte": "now-1m" } } }
              ]
            }
          }
        }
      }
    }
  }
}
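Both rules follow the same shape: run a search, then compare the hit count with a threshold. The sketch below mimics that trigger condition in Python so it can be unit-tested against canned search responses; `high_error_rate` is a hypothetical helper, and `results` stands in for the `ctx.results` list the condition script sees.

```python
from typing import Optional


def high_error_rate(results: list, threshold: int = 100) -> Optional[str]:
    """Return the alert message if the first result exceeds the threshold."""
    total = results[0]["hits"]["total"]["value"]
    if total > threshold:
        return f"High error rate detected: {total} errors in last 5 minutes"
    return None


# Canned response shaped like an OpenSearch search result.
sample = [{"hits": {"total": {"value": 250, "relation": "eq"}, "hits": []}}]
print(high_error_rate(sample))
```

The comparison is strict, so exactly 100 errors in the window does not fire the alert.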
Consequences
Positive
- Cloud Agnostic: OpenSearch works identically across AWS, Azure, GCP, on-premises
- Open Source: No licensing costs, Apache 2.0 license, community-driven
- Scalable: Handles billions of log entries, horizontal scaling via data nodes
- Fast Search: Sub-second queries across large datasets, optimized inverted indices
- Rich Query Language: SQL and DSL query support, aggregations, complex filters
- Multi-Tenancy: Document-level security for team isolation
- Cost Efficient: Tiered storage (hot/warm/cold) reduces costs significantly
- Integration Rich: Grafana, Prometheus, SIEM tools, OpenTelemetry
- Lightweight Collection: Fluent Bit has a minimal resource footprint (~100MB memory per node)
- GitOps Compatible: Declarative configuration, ArgoCD-managed
Negative
- Operational Complexity: OpenSearch cluster requires careful sizing, monitoring, tuning
- Resource Intensive: Data nodes need significant CPU/memory/storage
- Learning Curve: Query DSL, index management, cluster operations require training
- Index Management Overhead: Need to configure ILM policies, monitor shard distribution
- No Native Multi-Line Support: Requires Fluent Bit parser configuration
- Storage Costs: Hot storage on SSD can be expensive (mitigated by tiered storage)
- Backup Complexity: Snapshot repository setup, restoration testing required
Neutral
- Elasticsearch Compatibility: OpenSearch was forked from Elasticsearch 7.10.2 and remains largely API-compatible with that version
- Alternative to ELK: Uses OpenSearch instead of Elasticsearch (licensing differences)
- Dashboards vs. Kibana: OpenSearch Dashboards UI similar to Kibana
- S3 Cold Storage: Requires object storage configuration for cold tier
Alternatives Considered
Alternative 1: Grafana Loki
Pros:
- Designed for Kubernetes logging (cloud-native)
- Very cost-efficient (indexes only metadata, not log content)
- Tight Grafana integration (unified metrics + logs)
- Simple deployment and operation
- LogQL query language (similar to PromQL)
- Good for high-volume, short-retention use cases
Cons:
- Limited full-text search capabilities (no inverted index)
- Less powerful query language than OpenSearch DSL
- Smaller community and ecosystem than OpenSearch/Elasticsearch
- Less suitable for compliance (long-term retention, complex queries)
- Fewer visualization options in dashboards
- Multi-tenancy is header-based (X-Scope-OrgID) with no built-in authentication, so team isolation needs an authenticating proxy in front of Loki
Reason for Rejection: While Loki is excellent for operational logging, Fawkes requires robust full-text search for debugging, compliance auditing, and security investigations. OpenSearch's powerful query DSL and proven scalability better support DORA metrics analysis and incident post-mortems. However, Loki remains a strong alternative for cost-sensitive deployments.
Alternative 2: Elastic Cloud (ELK Stack)
Pros:
- Industry standard (Elasticsearch, Logstash, Kibana)
- Massive ecosystem and community
- Extremely powerful search and analytics
- Best-in-class visualization (Kibana)
- Mature machine learning features
- Extensive documentation and training
Cons:
- Licensing concerns (Elastic License 2.0, not fully open source)
- High cost for managed service ($50-500+/month)
- Vendor lock-in potential
- Self-hosted ELK requires significant operational expertise
- Logstash resource-heavy (replaced by Fluent Bit in modern stacks)
- Complex licensing tiers (basic, gold, platinum)
Reason for Rejection: Elastic's move away from Apache 2.0 license conflicts with Fawkes' open-source principles. O