Monitoring and Observability Pattern
Monitoring tells you when something is wrong. Observability tells you why. Elite engineering teams invest in both to detect problems before users do and to understand the root cause quickly when incidents occur.
The Four Golden Signals
Google SRE defined four golden signals that, when measured for every service, give a complete picture of health:
| Signal | What It Measures | Example Metric |
|---|---|---|
| Latency | Time to service a request | P95 HTTP response time |
| Traffic | Demand on the system | Requests per second |
| Errors | Rate of failed requests | HTTP 5xx error rate |
| Saturation | How "full" the service is | CPU utilisation, queue depth |
How Fawkes Implements Monitoring
All workloads in Fawkes emit the four golden signals automatically via OpenTelemetry
auto-instrumentation. Prometheus scrapes /metrics endpoints; Grafana renders dashboards.
# PrometheusRule (platform/apps/monitoring/rules.yaml)
- alert: HighErrorRate
expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.05
for: 2m
labels:
severity: warning
Alerting Strategy
- Page on symptoms, not causes — Alert on user-visible SLO violations (error rate, latency), not on intermediate metrics (CPU) that may not affect users.
- Every alert needs a runbook — Link the alert annotation to the relevant runbook
in
docs/runbooks/. - Silence noisy alerts — A silenced alert is better than alert fatigue that causes on-call engineers to ignore pages.
Distributed Tracing
Requests that span multiple services are traced end-to-end with OpenTelemetry. Traces are stored in Tempo and visualised in Grafana. Use the trace ID from logs to jump directly to the relevant span waterfall.
Log Aggregation
Fluent Bit collects structured JSON logs from all pods and ships them to Loki. Use LogQL in Grafana to search and correlate logs across services.