
Fawkes Dojo Module 13: Monitoring, Observability & DORA Metrics

Module Overview

Duration: 3-4 hours | Level: Advanced | Prerequisites: Modules 1-4, a working Fawkes deployment, and a basic understanding of Kubernetes and CI/CD

Learning Objectives

By the end of this module, you will be able to:

  1. Implement comprehensive monitoring and observability for your Fawkes platform
  2. Configure and customize dashboards for platform health and performance
  3. Measure and track the Four Key DORA metrics
  4. Set up alerting and incident response workflows
  5. Use observability data to drive continuous improvement
  6. Implement distributed tracing for application performance monitoring

Part 1: Understanding Observability in Platform Engineering

The Three Pillars of Observability

Metrics: Numerical measurements over time

  • Infrastructure metrics (CPU, memory, disk, network)
  • Application metrics (request rate, error rate, latency)
  • Business metrics (deployments, lead time, failure rate)

Logs: Event records from systems and applications

  • Structured vs. unstructured logs
  • Log aggregation and centralization
  • Log levels and filtering

Traces: Request flows through distributed systems

  • Distributed tracing concepts
  • Span and trace relationships
  • Performance bottleneck identification

Why Observability Matters for DORA

The Four Key Metrics require robust observability:

  1. Deployment Frequency: Track deployments through CI/CD events
  2. Lead Time for Changes: Measure from commit to production
  3. Change Failure Rate: Monitor deployment failures and rollbacks
  4. Mean Time to Restore (MTTR): Detect and measure incident resolution time

Part 2: Fawkes Monitoring Stack

Components Overview

Fawkes includes an integrated monitoring stack:

┌─────────────────────────────────────────────┐
│           Grafana Dashboards                │
│        (Visualization & Alerting)           │
└──────────────┬──────────────────────────────┘
               │
       ┌───────┴────────┐
       │                │
┌──────▼──────┐  ┌─────▼──────┐
│ Prometheus  │  │    Loki    │
│  (Metrics)  │  │   (Logs)   │
└──────┬──────┘  └─────┬──────┘
       │                │
┌──────▼────────────────▼──────┐
│      Node Exporters          │
│   Application Exporters      │
│      Fluent Bit/Promtail     │
└──────────────┬────────────────┘
               │
    ┌──────────▼───────────┐
    │  Kubernetes Cluster  │
    │    Applications      │
    └──────────────────────┘

Included Tools

  • Prometheus: Metrics collection and storage
  • Grafana: Dashboard visualization and alerting
  • Loki: Log aggregation (lightweight alternative to ELK)
  • Tempo: Distributed tracing (optional)
  • AlertManager: Alert routing and notification
  • Node Exporter: Infrastructure metrics
  • kube-state-metrics: Kubernetes object metrics

Part 3: Hands-On Lab - Deploying the Monitoring Stack

Lab Setup

Scenario: You have a running Fawkes platform on Kubernetes. Now you'll deploy the full monitoring stack and configure dashboards.

Step 1: Deploy Monitoring Components

# Navigate to the platform monitoring directory
cd fawkes/platform/monitoring

# Deploy Prometheus Operator and stack
kubectl create namespace monitoring
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Install kube-prometheus-stack (includes Prometheus, Grafana, AlertManager)
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --set prometheus.prometheusSpec.retention=30d \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi \
  --set grafana.adminPassword=admin123 \
  -f values/prometheus-values.yaml

Step 2: Deploy Loki for Log Aggregation

# Install Loki
helm repo add grafana https://grafana.github.io/helm-charts
helm install loki grafana/loki-stack \
  --namespace monitoring \
  --set promtail.enabled=true \
  --set loki.persistence.enabled=true \
  --set loki.persistence.size=50Gi

Step 3: Configure Data Sources in Grafana

# Get Grafana admin password
kubectl get secret --namespace monitoring prometheus-grafana -o jsonpath="{.data.admin-password}" | base64 --decode

# Port-forward to access Grafana
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80

Visit http://localhost:3000 and log in with admin credentials.

Add Loki Data Source:

  1. Go to Configuration → Data Sources
  2. Add data source → Loki
  3. URL: http://loki:3100
  4. Save & Test

Step 4: Verify Metrics Collection

# Check Prometheus targets
kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090

# Open http://localhost:9090/targets
# Verify all targets are "UP"

Expected targets:

  • kubernetes-apiservers
  • kubernetes-nodes
  • kubernetes-pods
  • kubernetes-service-endpoints
  • kube-state-metrics
  • node-exporter
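
If you prefer to verify targets from a script (for example, as part of a platform smoke test), a minimal sketch using the Prometheus targets API is shown below. It assumes the port-forward to localhost:9090 from the command above; the job names depend on your ServiceMonitors.

import requests

# Fail fast if any scrape target is down (assumes the port-forward above).
targets = requests.get("http://localhost:9090/api/v1/targets", timeout=10).json()
down = sorted({
    t["labels"].get("job", "unknown")
    for t in targets["data"]["activeTargets"]
    if t["health"] != "up"
})
if down:
    raise SystemExit(f"Targets down: {down}")
print("All Prometheus targets are up")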

Part 4: Configuring DORA Metrics Dashboards

Creating Custom Metrics

To track DORA metrics, we need to instrument our CI/CD pipeline to emit custom metrics.

Deployment Frequency Metric

Add to your CI/CD pipeline (e.g., Jenkins, GitLab CI, ArgoCD):

# Example: Prometheus metrics endpoint in your deployment controller
apiVersion: v1
kind: ConfigMap
metadata:
  name: deployment-metrics
  namespace: fawkes-platform
data:
  record-deployment.sh: |
    #!/bin/bash
    # Record deployment event
    cat <<EOF | curl --data-binary @- http://prometheus-pushgateway:9091/metrics/job/deployments
    # TYPE deployment_total counter
    # HELP deployment_total Total number of deployments
    deployment_total{environment="$ENV",application="$APP",status="$STATUS"} 1
    EOF
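
If the pipeline step runs Python rather than a shell script, a roughly equivalent sketch using prometheus_client is shown below. It assumes the same prometheus-pushgateway service and the ENV/APP/STATUS variables set by the pipeline; the grouping key keeps pushes from different applications and environments from overwriting one another.

import os

from prometheus_client import CollectorRegistry, Counter, push_to_gateway

registry = CollectorRegistry()
deployments = Counter(
    "deployment",  # exposed as deployment_total (the client appends _total to counters)
    "Total number of deployments",
    ["environment", "application", "status"],
    registry=registry,
)

deployments.labels(
    environment=os.environ["ENV"],
    application=os.environ["APP"],
    status=os.environ.get("STATUS", "success"),
).inc()

push_to_gateway(
    "prometheus-pushgateway:9091",
    job="deployments",
    registry=registry,
    grouping_key={"application": os.environ["APP"], "environment": os.environ["ENV"]},
)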

Lead Time for Changes

Track commit-to-deployment time:

# Example Python script to calculate lead time
from datetime import datetime

from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

registry = CollectorRegistry()
lead_time_gauge = Gauge(
    'lead_time_seconds',
    'Time from commit to deployment',
    ['application', 'environment'],
    registry=registry,
)

def record_lead_time(commit_timestamp, deploy_timestamp, app, env):
    lead_time = (deploy_timestamp - commit_timestamp).total_seconds()
    lead_time_gauge.labels(application=app, environment=env).set(lead_time)
    push_to_gateway('prometheus-pushgateway:9091', job='lead_time', registry=registry)
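
In a pipeline, the commit timestamp usually comes from the SHA being deployed. A hedged example of wiring the helper above into a deploy job, assuming git is available in the job and the CI system exposes the commit as GIT_COMMIT:

import os
import subprocess
from datetime import datetime, timezone

commit_sha = os.environ["GIT_COMMIT"]  # hypothetical variable; most CI systems expose the SHA
epoch = subprocess.check_output(
    ["git", "show", "-s", "--format=%ct", commit_sha], text=True
).strip()
commit_time = datetime.fromtimestamp(int(epoch), tz=timezone.utc)

record_lead_time(commit_time, datetime.now(timezone.utc), app="sample-app", env="production")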

Change Failure Rate

Monitor deployment failures and rollbacks:

# In your deployment script, record success/failure
deployment_status="success"  # or "failure"

cat <<EOF | curl --data-binary @- http://prometheus-pushgateway:9091/metrics/job/deployment_results
deployment_result{application="$APP",environment="$ENV",status="$deployment_status"} 1
EOF

MTTR (Mean Time to Restore)

Use AlertManager and incident tracking:

# PromQL query for MTTR
rate(alert_duration_seconds_sum[7d]) / rate(alert_duration_seconds_count[7d])
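
Neither Prometheus nor AlertManager emits alert_duration_seconds out of the box. One hedged way to produce it is to record incident durations from your incident-management workflow when an incident is resolved, for example:

from prometheus_client import CollectorRegistry, Histogram, push_to_gateway

registry = CollectorRegistry()
# Bucket boundaries (5m, 30m, 1h, 4h, 1d) are illustrative only.
incident_duration = Histogram(
    "alert_duration_seconds",
    "Time from incident detection to restoration",
    ["severity"],
    buckets=[300, 1800, 3600, 14400, 86400],
    registry=registry,
)

def record_restore(detected_epoch: float, restored_epoch: float, severity: str) -> None:
    """Call from your incident tooling's resolve webhook."""
    incident_duration.labels(severity=severity).observe(restored_epoch - detected_epoch)
    push_to_gateway("prometheus-pushgateway:9091", job="incidents", registry=registry)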

Import DORA Dashboard

Create a Grafana dashboard (dora-metrics-dashboard.json):

{
  "dashboard": {
    "title": "DORA Four Key Metrics",
    "panels": [
      {
        "title": "Deployment Frequency",
        "targets": [
          {
            "expr": "sum(rate(deployment_total[1d])) by (environment)"
          }
        ],
        "type": "graph"
      },
      {
        "title": "Lead Time for Changes (Average)",
        "targets": [
          {
            "expr": "avg(lead_time_seconds) by (application)"
          }
        ],
        "type": "stat"
      },
      {
        "title": "Change Failure Rate",
        "targets": [
          {
            "expr": "sum(rate(deployment_result{status='failure'}[7d])) / sum(rate(deployment_result[7d])) * 100"
          }
        ],
        "type": "gauge"
      },
      {
        "title": "Mean Time to Restore (MTTR)",
        "targets": [
          {
            "expr": "avg(alert_duration_seconds) by (severity)"
          }
        ],
        "type": "stat"
      }
    ]
  }
}

Import into Grafana:

# Import dashboard
curl -X POST http://admin:admin123@localhost:3000/api/dashboards/db \
  -H "Content-Type: application/json" \
  -d @dora-metrics-dashboard.json
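
The import can also be scripted. A sketch using the Grafana HTTP API, assuming the port-forward from Part 3 is still running and the admin password set at install time:

import json

import requests

with open("dora-metrics-dashboard.json") as f:
    payload = json.load(f)
payload["overwrite"] = True  # replace any existing dashboard with the same uid/title

resp = requests.post(
    "http://localhost:3000/api/dashboards/db",
    json=payload,
    auth=("admin", "admin123"),
    timeout=10,
)
resp.raise_for_status()
print(resp.json())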

Part 5: Alerting and Incident Response

Configuring AlertManager

Edit AlertManager configuration:

# alertmanager-config.yaml
global:
  resolve_timeout: 5m
  slack_api_url: "https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK"

route:
  group_by: ["alertname", "cluster", "service"]
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: "default"
  routes:
    - match:
        severity: critical
      receiver: "pagerduty-critical"
    - match:
        severity: warning
      receiver: "slack-warnings"

receivers:
  - name: "default"
    slack_configs:
      - channel: "#alerts"
        title: "Alert: {{ .GroupLabels.alertname }}"
        text: "{{ range .Alerts }}{{ .Annotations.description }}{{ end }}"

  - name: "pagerduty-critical"
    pagerduty_configs:
      - service_key: "YOUR_PAGERDUTY_KEY"

  - name: "slack-warnings"
    slack_configs:
      - channel: "#warnings"
        title: "Warning: {{ .GroupLabels.alertname }}"

Apply the configuration. With kube-prometheus-stack, the operator builds AlertManager's configuration from the chart's alertmanager.config value; a standalone secret is only used if the Alertmanager resource references it through spec.configSecret:

kubectl create secret generic alertmanager-config \
  --from-file=alertmanager.yaml=alertmanager-config.yaml \
  -n monitoring

# Reference the secret via the Alertmanager resource's spec.configSecret
# (or merge the contents into the chart's alertmanager.config value and helm upgrade),
# then restart AlertManager so it reloads:
kubectl rollout restart statefulset/alertmanager-prometheus-kube-prometheus-alertmanager -n monitoring

Creating Alert Rules

# prometheus-rules.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: fawkes-platform-alerts
  namespace: monitoring
spec:
  groups:
    - name: platform_health
      interval: 30s
      rules:
        - alert: HighPodCrashRate
          expr: rate(kube_pod_container_status_restarts_total[15m]) > 0.1
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High pod crash rate detected"
            description: "Pod {{ $labels.pod }} is crash-looping"

        - alert: DeploymentFailed
          expr: increase(deployment_result{status="failure"}[5m]) > 0
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "Deployment failure detected"
            description: "Deployment for {{ $labels.application }} failed"

        - alert: HighChangeFailureRate
          expr: |
            sum(rate(deployment_result{status="failure"}[7d]))
            / sum(rate(deployment_result[7d])) * 100 > 15
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Change failure rate exceeds 15%"
            description: "Current CFR: {{ $value }}%"

        - alert: LowDeploymentFrequency
          expr: sum(increase(deployment_total[10d])) < 1
          for: 1h
          labels:
            severity: warning
          annotations:
            summary: "Deployment frequency is low"
            description: "Fewer than one deployment in the last 10 days"

        - alert: HighLeadTime
          expr: avg(lead_time_seconds) > 86400
          for: 1h
          labels:
            severity: warning
          annotations:
            summary: "Lead time exceeds 24 hours"
            description: "Average lead time: {{ $value }}s"

Apply rules:

kubectl apply -f prometheus-rules.yaml
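
To confirm the rules loaded and see which alerts are firing, you can query the Prometheus HTTP API. A sketch, assuming the port-forward to localhost:9090 from Part 3:

import requests

base = "http://localhost:9090"

# Loaded alerting rules and their states (inactive / pending / firing)
rules = requests.get(f"{base}/api/v1/rules", timeout=10).json()
for group in rules["data"]["groups"]:
    for rule in group["rules"]:
        if rule["type"] == "alerting":
            print(f'{rule["name"]}: {rule["state"]}')

# Currently active alerts
alerts = requests.get(f"{base}/api/v1/alerts", timeout=10).json()
for alert in alerts["data"]["alerts"]:
    print(alert["labels"]["alertname"], alert["state"])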

Part 6: Application Performance Monitoring with Tracing

Deploy Tempo for Distributed Tracing

# Install Tempo
helm install tempo grafana/tempo \
  --namespace monitoring \
  --set persistence.enabled=true

Instrument Your Application

Example using OpenTelemetry (Java Spring Boot):

<!-- pom.xml -->
<dependency>
    <groupId>io.opentelemetry</groupId>
    <artifactId>opentelemetry-api</artifactId>
    <version>1.32.0</version>
</dependency>
<!-- opentelemetry-sdk provides SdkTracerProvider and BatchSpanProcessor used below -->
<dependency>
    <groupId>io.opentelemetry</groupId>
    <artifactId>opentelemetry-sdk</artifactId>
    <version>1.32.0</version>
</dependency>
<dependency>
    <groupId>io.opentelemetry</groupId>
    <artifactId>opentelemetry-exporter-otlp</artifactId>
    <version>1.32.0</version>
</dependency>

// Application configuration
import io.opentelemetry.api.OpenTelemetry;
import io.opentelemetry.exporter.otlp.trace.OtlpGrpcSpanExporter;
import io.opentelemetry.sdk.OpenTelemetrySdk;
import io.opentelemetry.sdk.trace.SdkTracerProvider;
import io.opentelemetry.sdk.trace.export.BatchSpanProcessor;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class TracingConfig {
    @Bean
    public OpenTelemetry openTelemetry() {
        // Export spans to Tempo's OTLP gRPC endpoint
        OtlpGrpcSpanExporter spanExporter = OtlpGrpcSpanExporter.builder()
            .setEndpoint("http://tempo:4317")
            .build();

        SdkTracerProvider tracerProvider = SdkTracerProvider.builder()
            .addSpanProcessor(BatchSpanProcessor.builder(spanExporter).build())
            .build();

        return OpenTelemetrySdk.builder()
            .setTracerProvider(tracerProvider)
            .buildAndRegisterGlobal();
    }
}
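
For services that are not on the JVM, the same pattern in Python with opentelemetry-python, a sketch assuming the opentelemetry-sdk and opentelemetry-exporter-otlp-proto-grpc packages and the same tempo:4317 OTLP endpoint:

from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Send spans to Tempo's OTLP gRPC endpoint (same endpoint as the Java example).
provider = TracerProvider(resource=Resource.create({"service.name": "sample-app"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="tempo:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("handle-request"):
    pass  # application work happens inside the span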

Configure Tempo in Grafana

  1. Add Tempo data source in Grafana
  2. URL: http://tempo:3100
  3. Enable trace to logs correlation with Loki

Part 7: Log Analysis and Troubleshooting

Effective Log Queries with LogQL

Find errors in the last hour:

{namespace="fawkes-platform"} |= "ERROR" | json | line_format "{{.timestamp}} {{.level}} {{.message}}"

Track deployment events:

{job="deployment-controller"} |= "deployment" | json | status="success"

Analyze slow requests:

{app="api-gateway"} | json | duration > 1000 | line_format "Slow request: {{.path}} took {{.duration}}ms"

Creating Log-Based Alerts

# Loki ruler rule: alert directly on log content (fires through AlertManager)
- alert: HighErrorRate
  expr: |
    sum(rate({namespace="fawkes-platform"} |= "ERROR" [5m]))
    > 10
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High error rate in platform logs"

Part 8: Dashboarding Best Practices

Dashboard Design Principles

  1. Top-down approach: Overall health → Specific components
  2. RED method for services: Rate, Errors, Duration
  3. USE method for resources: Utilization, Saturation, Errors
  4. Actionable metrics: Every panel should inform decisions
  5. Consistent time ranges: Synchronize across panels

Example Platform Health Dashboard Structure

┌─────────────────────────────────────────────┐
│  Overall Platform Health (Single stat)      │
│  ● Cluster Status  ● Deployments  ● Alerts │
└─────────────────────────────────────────────┘

┌──────────────────────┐ ┌───────────────────┐
│  Deployment Freq.    │ │  Lead Time Trend  │
│  (Time series)       │ │  (Time series)    │
└──────────────────────┘ └───────────────────┘

┌──────────────────────┐ ┌───────────────────┐
│  Change Failure %    │ │  MTTR (Avg)       │
│  (Gauge)             │ │  (Stat)           │
└──────────────────────┘ └───────────────────┘

┌─────────────────────────────────────────────┐
│  Recent Deployments (Table)                 │
│  Time | App | Env | Status | Duration       │
└─────────────────────────────────────────────┘

┌─────────────────────────────────────────────┐
│  Active Alerts (Table)                      │
│  Severity | Alert | Time | Status           │
└─────────────────────────────────────────────┘

Part 9: Practical Exercise

Exercise: Complete Observability Implementation

Objective: Implement end-to-end observability for a sample application deployed on Fawkes.

Steps:

  1. Deploy Sample Application

     kubectl apply -f exercises/sample-app/

  2. Configure Application Metrics

       • Expose Prometheus metrics endpoint
       • Add custom business metrics
       • Verify scraping in Prometheus

  3. Set Up Logging

       • Ensure structured JSON logs
       • Verify logs appear in Loki
       • Create useful log queries

  4. Create Dashboard

       • Import base dashboard template
       • Add custom panels for your app
       • Configure variables for filtering

  5. Configure Alerts

       • Create alert for high error rate
       • Create alert for deployment failures
       • Test alert firing and resolution

  6. Implement Tracing

       • Add OpenTelemetry instrumentation
       • Generate sample traces
       • Correlate traces with logs

  7. Measure DORA Metrics

       • Deploy multiple times
       • Introduce a failure
       • Calculate all four metrics
       • Identify improvement opportunities

Validation Checklist:

  • [ ] Application metrics visible in Prometheus
  • [ ] Logs searchable in Grafana/Loki
  • [ ] Dashboard shows real-time data
  • [ ] Alerts fire and resolve correctly
  • [ ] Traces show request flows
  • [ ] DORA metrics calculated and displayed

Part 10: Advanced Topics

Cost Optimization

Reduce metric cardinality:

# Drop metrics you don't need (here: Go runtime metrics) to cut series count
metric_relabel_configs:
  - source_labels: [__name__]
    regex: "go_.*"
    action: drop

Pre-aggregate high-volume metrics with recording rules so dashboards query cheaper, pre-computed series:

# Recording rule that stores an aggregated series
- record: aggregated:deployment_total:sum
  expr: sum(rate(deployment_total[5m])) by (environment)

High Availability Setup

# Prometheus HA with Thanos
prometheus:
  prometheusSpec:
    replicas: 2
    thanos:
      image: quay.io/thanos/thanos:v0.32.0
      objectStorageConfig:
        secret: thanos-objstore-config

Multi-Cluster Monitoring

# Use Thanos or Cortex for cross-cluster metrics
helm install thanos bitnami/thanos \
  --set query.enabled=true \
  --set storegateway.enabled=true

Part 11: Troubleshooting Common Issues

Prometheus Not Scraping Targets

Symptom: Targets show as "DOWN" in Prometheus

Solution:

# Check ServiceMonitor configuration
kubectl get servicemonitors -n monitoring

# Verify service selector matches
kubectl describe servicemonitor <name> -n monitoring

# Check network policies
kubectl get networkpolicies -n monitoring

High Cardinality Problems

Symptom: Prometheus using excessive memory

Solution:

# Identify high-cardinality metrics
curl http://localhost:9090/api/v1/status/tsdb | jq .

# Drop or aggregate problematic metrics
# Add to prometheus-values.yaml
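
To see which metrics contribute most, the TSDB status endpoint reports series counts per metric name. A sketch, assuming the port-forward to localhost:9090:

import requests

stats = requests.get("http://localhost:9090/api/v1/status/tsdb", timeout=10).json()
# Top offenders by number of series
for entry in stats["data"]["seriesCountByMetricName"][:10]:
    print(f'{entry["value"]:>8}  {entry["name"]}')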

Missing Logs in Loki

Symptom: No logs appearing in Grafana

Solution:

# Check Promtail is running
kubectl get pods -n monitoring -l app=promtail

# Verify Promtail configuration
kubectl logs -n monitoring -l app=promtail

# Check Loki for ingestion errors
kubectl logs -n monitoring -l app=loki

Part 12: Summary and Next Steps

Key Takeaways

  1. Observability is critical for platform reliability and DORA metrics
  2. The three pillars (metrics, logs, traces) provide complementary insights
  3. Automation of metric collection reduces manual work
  4. Dashboards should be actionable and inform decisions
  5. Alerting requires tuning to avoid fatigue
  6. DORA metrics drive continuous improvement

Measuring Success

After completing this module, you should have:

  • ✅ Working Prometheus + Grafana + Loki stack
  • ✅ Custom dashboards for DORA metrics
  • ✅ Configured alerts for platform health
  • ✅ Application instrumentation for tracing
  • ✅ Log aggregation and search capability
  • ✅ Understanding of observability best practices

Continuous Improvement

Weekly Activities:

  • Review DORA metrics trends (a report-script sketch follows this list)
  • Analyze alert patterns
  • Optimize slow queries
  • Update dashboards based on team feedback
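
A hedged sketch of automating that weekly review: query Prometheus for the four metrics using the custom metric names defined in Part 4 (deployment_total, lead_time_seconds, deployment_result, alert_duration_seconds), assuming a port-forward to localhost:9090:

import requests

PROM = "http://localhost:9090/api/v1/query"

QUERIES = {
    "Deployment frequency (per day)": "sum(increase(deployment_total[7d])) / 7",
    "Lead time for changes (avg, seconds)": "avg(lead_time_seconds)",
    "Change failure rate (%)": (
        "sum(increase(deployment_result{status='failure'}[7d]))"
        " / sum(increase(deployment_result[7d])) * 100"
    ),
    "MTTR (avg, seconds)": (
        "sum(increase(alert_duration_seconds_sum[7d]))"
        " / sum(increase(alert_duration_seconds_count[7d]))"
    ),
}

for name, promql in QUERIES.items():
    result = requests.get(PROM, params={"query": promql}, timeout=10).json()["data"]["result"]
    value = float(result[0]["value"][1]) if result else float("nan")
    print(f"{name}: {value:.2f}")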

Monthly Activities:

  • Review and adjust alert thresholds
  • Archive old metrics data
  • Update documentation
  • Train team members on new features



Module Assessment

Knowledge Check Questions

  1. What are the three pillars of observability?
  2. How do you calculate the Change Failure Rate?
  3. What's the difference between metrics and traces?
  4. When should you use Prometheus vs. Loki?
  5. What is metric cardinality and why does it matter?
  6. How do you correlate traces with logs in Grafana?
  7. What's the purpose of AlertManager's grouping?
  8. How can you reduce monitoring costs?

Practical Assessment

Complete the following tasks:

  1. Deploy a complete monitoring stack
  2. Create a custom dashboard with DORA metrics
  3. Configure three meaningful alerts
  4. Instrument an application with tracing
  5. Write five useful LogQL queries
  6. Generate a weekly DORA metrics report

Bonus Challenge

Implement a complete observability solution for a multi-service application that:

  • Tracks deployments across three environments
  • Correlates traces across microservices
  • Provides SLO/SLA dashboards
  • Alerts on DORA metric degradation
  • Exports metrics to external systems

Appendix A: Metric Examples Reference

Infrastructure Metrics

# Node CPU usage
100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Node memory usage
100 * (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)

# Disk usage
100 - (node_filesystem_avail_bytes / node_filesystem_size_bytes * 100)

Kubernetes Metrics

# Pod restart rate
rate(kube_pod_container_status_restarts_total[1h])

# Deployment replicas available
kube_deployment_status_replicas_available / kube_deployment_spec_replicas

# Node readiness
kube_node_status_condition{condition="Ready",status="true"}

Application Metrics

# Request rate (RED method)
sum(rate(http_requests_total[5m])) by (service)

# Error rate
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))

# Request duration (p95)
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

Appendix B: Dashboard JSON Templates

See the Fawkes repository for complete dashboard templates:

  • dashboards/platform-overview.json
  • dashboards/dora-metrics.json
  • dashboards/application-health.json
  • dashboards/infrastructure.json

Feedback and Contributions

Have feedback on this module? Found errors or want to suggest improvements?

  • Open an issue: https://github.com/paruff/fawkes/issues
  • Submit a PR: https://github.com/paruff/fawkes/pulls
  • Join discussions: https://github.com/paruff/fawkes/discussions

Module 13 Complete! 🎉

You now have the knowledge to implement comprehensive observability and measure DORA metrics for your Fawkes platform. Continue to the next module for advanced platform operations and troubleshooting.

Next Module Preview: Platform Operations & Advanced Troubleshooting