Monitoring and Observability Pattern
Monitoring answers "is the system working?"; observability answers "why is the system not working?" An observable system exposes enough internal state that you can diagnose a failure from the outside, including one you did not anticipate, without adding new instrumentation every time something goes wrong.
The Three Pillars
Metrics
Metrics are numeric time-series measurements (e.g., request rate, error count,
memory usage). In Fawkes, Prometheus scrapes metrics from every service via the
/metrics endpoint exposed by its OpenTelemetry instrumentation.
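To make the scrape target concrete, the following is a minimal sketch of the kind of text a /metrics endpoint returns, written with only the Python standard library. It illustrates the Prometheus text exposition format and is not how the OpenTelemetry exporter is implemented; the helper `render_counter`, the metric name `http_requests_total`, and its labels are hypothetical.

```python
from typing import Mapping, Sequence, Tuple


def render_counter(
    name: str,
    help_text: str,
    samples: Sequence[Tuple[Mapping[str, str], float]],
) -> str:
    """Render one counter in the Prometheus text exposition format."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples:
        # Labels become key="value" pairs inside braces.
        # This sketch does not escape quotes or backslashes in label values.
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical samples: request counts split by method and status code.
payload = render_counter(
    "http_requests_total",
    "Total HTTP requests handled.",
    [
        ({"method": "GET", "status": "200"}, 1027),
        ({"method": "GET", "status": "500"}, 3),
    ],
)
print(payload)
```

Each distinct label combination produces its own sample line, which is why the choice of labels matters for storage cost.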
Logs
Logs record discrete events. Fawkes requires structured JSON logs (structlog for
Python, zap for Go, logback for Java). Fluent Bit ships logs to Loki, where
they can be queried with LogQL.
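import structlog
# Render each event as a single JSON object; the field values below are placeholders.
structlog.configure(processors=[structlog.processors.JSONRenderer()])
log = structlog.get_logger()
log.info("request_processed", user_id="u-123", duration_ms=42.7)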
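Structured logs pay off at query time, because each field can be filtered without regex parsing. The sketch below uses only the standard library to do, in miniature, roughly what a LogQL pipeline such as `| json | duration_ms > 500` does in Loki. The sample log lines and the helper `slow_requests` are hypothetical.

```python
import json

# Hypothetical log lines, one JSON object per line.
raw_lines = [
    '{"event": "request_processed", "user_id": "u-1", "duration_ms": 42}',
    '{"event": "request_processed", "user_id": "u-2", "duration_ms": 913}',
    '{"event": "cache_miss", "key": "profile:u-2"}',
]


def slow_requests(lines, threshold_ms):
    """Return parsed events whose duration_ms exceeds threshold_ms."""
    events = (json.loads(line) for line in lines)
    # Events without a duration_ms field are treated as 0 and filtered out.
    return [e for e in events if e.get("duration_ms", 0) > threshold_ms]


slow = slow_requests(raw_lines, 500)
print([e["user_id"] for e in slow])  # ['u-2']
```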
Traces
Distributed traces track a request's journey across service boundaries. The trace ID connects a user-facing error to the exact microservice call chain that caused it. OpenTelemetry auto-instrumentation captures spans for HTTP, database, and messaging calls with no code changes.
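OpenTelemetry handles context propagation automatically, but a small sketch helps show what "the trace ID connects" means in practice. The code below follows the shape of the W3C Trace Context `traceparent` header (version, trace ID, parent span ID, flags). The class `SpanContext` and the helper functions are hypothetical and are not the OpenTelemetry API.

```python
import secrets
from dataclasses import dataclass


@dataclass(frozen=True)
class SpanContext:
    trace_id: str  # 32 hex chars, shared by every span in one request
    span_id: str   # 16 hex chars, unique per span

    def to_traceparent(self) -> str:
        # Version "00" and flags "01" (sampled).
        return f"00-{self.trace_id}-{self.span_id}-01"


def new_root() -> SpanContext:
    """Start a new trace at the edge of the system."""
    return SpanContext(secrets.token_hex(16), secrets.token_hex(8))


def child_from_header(traceparent: str) -> SpanContext:
    """Continue the caller's trace: keep its trace ID, mint a new span ID."""
    _version, trace_id, _parent_span_id, _flags = traceparent.split("-")
    return SpanContext(trace_id, secrets.token_hex(8))


# Service A starts the trace and sends the header with its outgoing request.
root = new_root()
header = root.to_traceparent()

# Service B reads the header and creates its own span in the same trace.
child = child_from_header(header)
print(child.trace_id == root.trace_id)  # True
```

Because every service reuses the incoming trace ID, a trace explorer can reassemble the full call chain from spans emitted independently by each service.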
Observability vs Monitoring
| Aspect | Monitoring | Observability |
|---|---|---|
| Asks | Is it working? | Why is it broken? |
| Requires | Known failure modes | Arbitrary exploration |
| Tooling | Dashboards, alerts | Trace explorer, log search |
| Instrumentation | Predefined metrics | Rich structured data |
Cardinality Considerations
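High-cardinality labels (such as user IDs or request IDs) in Prometheus metrics cause a cardinality explosion: every distinct combination of label values becomes its own time series, so memory use and query cost grow with the number of values. Use traces for high-cardinality data; reserve metrics for low-cardinality aggregates.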
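The arithmetic behind this is simple multiplication. As a rough sketch, the worst-case number of time series for one metric is the product of the distinct values each label can take. The label names and counts below are made-up illustrations, and `series_count` is a hypothetical helper.

```python
from math import prod
from typing import Mapping


def series_count(label_cardinalities: Mapping[str, int]) -> int:
    """Upper bound on time series for one metric: product of distinct values per label."""
    return prod(label_cardinalities.values())


# Low-cardinality labels only.
low = series_count({"method": 5, "status": 6, "service": 20})

# The same metric with a user_id label added (assuming 100,000 users).
high = series_count({"method": 5, "status": 6, "service": 20, "user_id": 100_000})

print(low)   # 600
print(high)  # 60000000
```

A single unbounded label turns a few hundred series into tens of millions, which is the case for moving that identifier into a trace attribute instead.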
SLOs and Error Budgets
Define Service Level Objectives (SLOs) for each user-facing service. The error budget is the amount of downtime or errors the SLO permits over its measurement window, that is, 1 minus the SLO target; spending more than the budget breaches the SLO. Fawkes tracks SLOs in Grafana using Prometheus recording rules.
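A short worked example makes the budget tangible. The sketch below is plain arithmetic, not the Prometheus recording rules Fawkes uses; the 99.9% target, the 30-day window, and the request counts are assumptions chosen for illustration, and both function names are hypothetical.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of full unavailability the SLO allows within the window."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    return (1 - slo_target) * window_days * 24 * 60


def budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of a request-based error budget still unspent (negative once breached)."""
    allowed_failures = (1 - slo_target) * total_requests
    return 1 - failed_requests / allowed_failures


# A 99.9% availability SLO over 30 days allows about 43.2 minutes of downtime.
print(round(error_budget_minutes(0.999), 1))  # 43.2

# 250 failures out of 1,000,000 requests uses a quarter of the budget.
print(round(budget_remaining(0.999, 1_000_000, 250), 2))  # 0.75
```

Tracking the remaining fraction rather than the raw error count gives teams one number for deciding whether to ship risky changes or prioritize reliability work.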