Incident Response Pattern
Effective incident response minimises Mean Time to Restore (MTTR) — the time from when a production incident starts until normal service is restored. DORA identifies MTTR as one of the four key metrics of software delivery performance. Elite teams restore service in less than one hour.
Incident Severity Levels
| Severity | Definition | Response Time | Examples |
|---|---|---|---|
| P1 (Critical) | Complete service outage, data loss | Immediate | Cluster down, database corruption |
| P2 (High) | Major feature unavailable | 15 minutes | Login broken, payments failing |
| P3 (Medium) | Degraded performance | 1 hour | Slow responses, partial outage |
| P4 (Low) | Minor issue | Next business day | UI glitch, non-critical error |
Response Process
1. Detect
Grafana alerts fire when SLO thresholds are breached. The alert routes to the on-call engineer via Alertmanager → PagerDuty → Mattermost.
2. Declare
The on-call engineer declares an incident in the #incidents Mattermost channel with
severity, impact scope, and initial hypothesis.
3. Diagnose
Use the runbook linked in the alert annotation. Check: - Grafana dashboards — error rate, latency, saturation - Loki logs — filter to the affected service and time window - Tempo traces — identify which service call is failing
4. Mitigate
Prioritise restoring service over finding root cause:
- Rollback recent deployment (argocd app rollback <app>)
- Scale up replicas to absorb load
- Enable a feature flag to disable the failing feature
5. Resolve and Learn
After service is restored, write a blameless post-mortem within 48 hours. Document: - Timeline of events - Root cause - Contributing factors - Action items with owners and due dates
Post-mortems are stored in docs/runbooks/post-mortems/ and shared with the team.
Runbooks
Every P1/P2 alert must link to a runbook in docs/runbooks/. Runbooks are tested
quarterly — an untested runbook is unreliable under pressure.