
Incident Response Pattern

Effective incident response minimises Mean Time to Restore (MTTR) — the time from when a production incident starts until normal service is restored. DORA identifies MTTR as one of the four key metrics of software delivery performance. Elite teams restore service in less than one hour.
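As a rough illustration of the metric, MTTR can be computed as the average of restore time minus start time over a set of incidents. The incident timestamps below are hypothetical, and `mean_time_to_restore` is an illustrative helper name, not part of any tool named on this page.

```python
from datetime import datetime, timedelta


def mean_time_to_restore(incidents):
    """Average of (restored - started) over a list of (started, restored) pairs."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum((restored - started for started, restored in incidents), timedelta())
    return total / len(incidents)


# Hypothetical incidents, for illustration only.
incidents = [
    (datetime(2024, 3, 1, 10, 0), datetime(2024, 3, 1, 10, 40)),   # 40 minutes
    (datetime(2024, 3, 9, 22, 15), datetime(2024, 3, 9, 23, 15)),  # 60 minutes
]

mttr = mean_time_to_restore(incidents)
print(mttr)                       # 0:50:00
print(mttr < timedelta(hours=1))  # True: under the one-hour mark cited above
```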

Incident Severity Levels

| Severity | Definition | Response Time | Examples |
|---|---|---|---|
| P1 (Critical) | Complete service outage, data loss | Immediate | Cluster down, database corruption |
| P2 (High) | Major feature unavailable | 15 minutes | Login broken, payments failing |
| P3 (Medium) | Degraded performance | 1 hour | Slow responses, partial outage |
| P4 (Low) | Minor issue | Next business day | UI glitch, non-critical error |
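The response targets can be turned into a concrete deadline for a given detection time. This is a minimal sketch under stated assumptions: "Immediate" is modelled as a zero delay, "Next business day" is taken to mean 09:00 on the next Monday-to-Friday day, and public holidays are ignored. The function names are hypothetical.

```python
from datetime import datetime, time, timedelta

# Targets taken from the severity levels above; "Immediate" is modelled as zero delay.
RESPONSE_TARGETS = {
    "P1": timedelta(0),
    "P2": timedelta(minutes=15),
    "P3": timedelta(hours=1),
}


def next_business_day(moment, start_of_day=time(9, 0)):
    """Start of the next weekday after `moment` (assumed 09:00, holidays ignored)."""
    day = moment.date() + timedelta(days=1)
    while day.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        day += timedelta(days=1)
    return datetime.combine(day, start_of_day)


def response_deadline(severity, detected_at):
    """Latest time by which someone should be responding to the incident."""
    if severity == "P4":
        return next_business_day(detected_at)
    return detected_at + RESPONSE_TARGETS[severity]


detected = datetime(2024, 3, 8, 17, 30)   # a Friday evening
print(response_deadline("P2", detected))  # 2024-03-08 17:45:00
print(response_deadline("P4", detected))  # 2024-03-11 09:00:00
```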

Response Process

1. Detect

Grafana alerts fire when SLO thresholds are breached. The alert routes to the on-call engineer via Alertmanager → PagerDuty → Mattermost.
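To make "SLO threshold breached" concrete, the check below compares an observed error ratio against the error budget implied by an availability target. It is only a sketch of the idea, not how Grafana evaluates alert rules; the 99.9% target and the request counts are assumed values.

```python
def error_ratio(failed, total):
    """Fraction of requests that failed; 0.0 when there was no traffic."""
    return failed / total if total else 0.0


def breaches_slo(failed, total, slo_target=0.999):
    """True when the observed error ratio exceeds the error budget (1 - target)."""
    return error_ratio(failed, total) > (1 - slo_target)


# Hypothetical request counts over one evaluation window.
print(breaches_slo(failed=5, total=1000))  # True: 0.5% errors against a 0.1% budget
print(breaches_slo(failed=0, total=1000))  # False
```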

2. Declare

The on-call engineer declares an incident in the #incidents Mattermost channel with severity, impact scope, and initial hypothesis.
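A declaration carrying the three required fields could be assembled as follows. The sketch only builds a JSON body of the shape accepted by Mattermost incoming webhooks (`text` plus an optional `channel`) and sends nothing. The message wording, the `IncidentDeclaration` class, and the example values are assumptions.

```python
import json
from dataclasses import dataclass

SEVERITIES = ("P1", "P2", "P3", "P4")


@dataclass
class IncidentDeclaration:
    severity: str
    impact: str
    hypothesis: str

    def to_message(self):
        """Render the declaration as plain text for the channel."""
        if self.severity not in SEVERITIES:
            raise ValueError(f"unknown severity: {self.severity}")
        return (
            f"Incident declared: {self.severity}\n"
            f"Impact: {self.impact}\n"
            f"Initial hypothesis: {self.hypothesis}"
        )


def webhook_payload(declaration, channel="incidents"):
    """JSON body for an incoming webhook; the caller is responsible for posting it."""
    return json.dumps({"channel": channel, "text": declaration.to_message()})


# Hypothetical example values.
declaration = IncidentDeclaration(
    severity="P2",
    impact="Login failing for all users",
    hypothesis="Auth service deployment from this morning",
)
print(webhook_payload(declaration))
```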

3. Diagnose

Use the runbook linked in the alert annotation. Check:

- Grafana dashboards: error rate, latency, saturation
- Loki logs: filter to the affected service and time window
- Tempo traces: identify which service call is failing
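Filtering Loki logs to one service and time window might look like the sketch below, which builds a request URL for Loki's `query_range` HTTP endpoint without sending it. The `service` label name, the `error` line filter, and the base URL are assumptions; the labels in a real deployment may differ.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode


def to_unix_nanos(moment):
    """Whole-second Unix timestamp in nanoseconds, as accepted by Loki."""
    return int(moment.timestamp()) * 1_000_000_000


def loki_query_range_url(base_url, service, start, end, contains="error", limit=100):
    """URL for a log query limited to one service and one time window."""
    # Assumed label name "service"; the line filter keeps lines containing `contains`.
    logql = f'{{service="{service}"}} |= "{contains}"'
    params = {
        "query": logql,
        "start": to_unix_nanos(start),
        "end": to_unix_nanos(end),
        "limit": limit,
    }
    return f"{base_url}/loki/api/v1/query_range?{urlencode(params)}"


# Hypothetical incident window and Loki address.
window_start = datetime(2024, 3, 1, 10, 0, tzinfo=timezone.utc)
window_end = window_start + timedelta(minutes=30)
url = loki_query_range_url("http://loki.example:3100", "checkout", window_start, window_end)
print(url)
```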

4. Mitigate

Prioritise restoring service over finding root cause:

- Roll back the most recent deployment (argocd app rollback <app>)
- Scale up replicas to absorb load
- Use a feature flag to disable the failing feature
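The mitigation options can be sketched as a small helper that picks an action and builds the matching command without running it. The rollback command is the one named above. The `kubectl scale` invocation, the order in which options are tried, and the application name are assumptions for illustration, not a prescribed policy.

```python
def rollback_command(app):
    """Roll back an Argo CD application, as named in the mitigation list."""
    return ["argocd", "app", "rollback", app]


def scale_command(deployment, replicas, namespace):
    """Assumed kubectl equivalent of 'scale up replicas'."""
    return [
        "kubectl", "scale", f"deployment/{deployment}",
        f"--replicas={replicas}", "--namespace", namespace,
    ]


def choose_mitigation(deployed_recently, saturated):
    """One possible ordering: undo a recent change first, then add capacity."""
    if deployed_recently:
        return "rollback"
    if saturated:
        return "scale"
    return "feature-flag"


# Hypothetical incident: errors began shortly after a deployment of "checkout".
action = choose_mitigation(deployed_recently=True, saturated=False)
if action == "rollback":
    print(" ".join(rollback_command("checkout")))
```

Returning the command as an argument list keeps the sketch free of side effects; a caller could pass it to `subprocess.run` once the on-call engineer has confirmed the action.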

5. Resolve and Learn

After service is restored, write a blameless post-mortem within 48 hours. Document:

- Timeline of events
- Root cause
- Contributing factors
- Action items with owners and due dates

Post-mortems are stored in docs/runbooks/post-mortems/ and shared with the team.
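A post-mortem skeleton with the four required sections, plus the 48-hour deadline, could be generated along these lines. The rendering format, the `ActionItem` class, and the example values are assumptions, not an established template.

```python
from dataclasses import dataclass
from datetime import date, datetime, timedelta

SECTIONS = ("Timeline of events", "Root cause", "Contributing factors", "Action items")


@dataclass
class ActionItem:
    description: str
    owner: str
    due: date


def postmortem_due(restored_at):
    """Deadline for the write-up: 48 hours after service was restored."""
    return restored_at + timedelta(hours=48)


def render_postmortem(title, action_items):
    """Plain-text skeleton; every action item must name an owner."""
    lines = [title, ""]
    for section in SECTIONS[:-1]:
        lines += [section, "", "(to be written)", ""]
    lines += [SECTIONS[-1], ""]
    for item in action_items:
        if not item.owner:
            raise ValueError(f"action item has no owner: {item.description}")
        lines.append(f"- {item.description} (owner: {item.owner}, due: {item.due.isoformat()})")
    return "\n".join(lines)


# Hypothetical example values.
restored = datetime(2024, 3, 1, 11, 0)
items = [ActionItem("Add alert on auth error rate", "alice", date(2024, 3, 15))]
print(postmortem_due(restored))  # 2024-03-03 11:00:00
print(render_postmortem("Post-mortem: login outage", items))
```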

Runbooks

Every P1/P2 alert must link to a runbook in docs/runbooks/. Runbooks are tested quarterly — an untested runbook is unreliable under pressure.
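Both rules can be checked mechanically. The sketch below assumes alerts are available as dictionaries with `name`, `severity`, and `runbook` keys and that the last test date of each runbook is recorded somewhere; both data shapes are hypothetical. "Quarterly" is approximated as 92 days.

```python
from datetime import date, timedelta


def runbook_problems(alerts, last_tested, today, max_age=timedelta(days=92)):
    """List P1/P2 alerts whose runbook is missing or has not been tested recently."""
    problems = []
    for alert in alerts:
        if alert["severity"] not in ("P1", "P2"):
            continue  # the rule only covers P1/P2 alerts
        runbook = alert.get("runbook")
        if not runbook or not runbook.startswith("docs/runbooks/"):
            problems.append(f"{alert['name']}: no runbook under docs/runbooks/")
            continue
        tested = last_tested.get(runbook)
        if tested is None or today - tested > max_age:
            problems.append(f"{alert['name']}: runbook not tested in the last quarter")
    return problems


# Hypothetical alert inventory and test log.
alerts = [
    {"name": "ClusterDown", "severity": "P1", "runbook": "docs/runbooks/cluster-down.md"},
    {"name": "LoginBroken", "severity": "P2", "runbook": None},
    {"name": "PaymentsFailing", "severity": "P2", "runbook": "docs/runbooks/payments.md"},
    {"name": "UiGlitch", "severity": "P4", "runbook": None},
]
last_tested = {
    "docs/runbooks/cluster-down.md": date(2024, 2, 1),
    "docs/runbooks/payments.md": date(2023, 9, 1),
}

for problem in runbook_problems(alerts, last_tested, today=date(2024, 3, 1)):
    print(problem)
```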

See Also