Playbook: DORA Metrics Implementation
Estimated Duration: 4-8 hours Complexity: ⭐⭐ Medium Target Audience: Platform Engineers / DevOps Engineers / Consultants
I. Business Objective
Diátaxis: Explanation / Conceptual
This section defines the "why"—the risk mitigated, compliance goal achieved, and value delivered.
What We're Solving
Organizations often struggle to objectively measure their software delivery performance. Without data, improvement efforts are based on intuition rather than evidence, making it impossible to identify true bottlenecks, demonstrate progress to stakeholders, or justify investment in engineering improvements.
DORA (DevOps Research and Assessment) research has identified five key metrics that reliably predict software delivery and organizational performance. This playbook implements automated collection and visualization of these metrics within the Fawkes platform.
Risk Mitigation
| Risk | Impact Without Action | How This Playbook Helps |
|---|---|---|
| Invisible bottlenecks | Teams waste effort on wrong improvements | Data reveals actual constraints |
| Unable to demonstrate improvement | Stakeholders lose confidence in engineering | Dashboards show measurable progress |
| Slow incident response | Prolonged outages damage customer trust | MTTR tracking drives faster recovery |
| High change failure rate | Quality issues erode user satisfaction | Early detection enables proactive fixes |
| High rework rate | Teams patching silently rather than improving | Rework tracking exposes fix-deploy cycles |
Expected Outcomes
- ✅ Automated collection of all five DORA metrics
- ✅ Real-time dashboards showing current performance levels
- ✅ Historical trend analysis for improvement tracking
- ✅ Team-level breakdowns for targeted interventions
- ✅ Alerts for performance degradation
Business Value
| Metric | Before | After | Improvement |
|---|---|---|---|
| Visibility into delivery performance | None/Manual | Automated, Real-time | ∞ improvement |
| Time to identify bottlenecks | Days/Weeks | Minutes | 90%+ reduction |
| Engineering productivity discussions | Opinion-based | Data-driven | Qualitative shift |
| Stakeholder confidence | Low | High | Measurable progress |
II. Technical Prerequisites
Diátaxis: Reference
This section lists required Fawkes components, versions, and environment specifications.
Required Fawkes Components
| Component | Minimum Version | Required | Documentation |
|---|---|---|---|
| Kubernetes | 1.28+ | ✅ | See Getting Started |
| Prometheus | 2.47+ | ✅ | See Prometheus Tool |
| Grafana | 10.2+ | ✅ | See Observability |
| Jenkins | 2.426+ | ✅ | See Jenkins Tool |
| ArgoCD | 2.9+ | ✅ | See GitOps Module |
| DevLake | 0.19+ | ⬜ Optional | See DevLake ADR |
Environment Requirements
# Minimum cluster resources for DORA metrics stack
nodes: 3
cpu_per_node: 4 cores
memory_per_node: 16 GB
storage: 50 GB (for metrics retention)
# Network requirements
ingress_controller: nginx or traefik
external_access: required for dashboards
Access Requirements
- [ ] Cluster admin access to Kubernetes
- [ ] Git repository webhook configuration rights
- [ ] Jenkins admin access for plugin installation
- [ ] ArgoCD admin access for webhook configuration
Pre-Implementation Checklist
- [ ] CI/CD pipeline (Jenkins) is operational
- [ ] GitOps (ArgoCD) is deployed and managing applications
- [ ] Prometheus and Grafana are running in the cluster
- [ ] At least one application is being deployed via GitOps
- [ ] Stakeholder approval for metrics collection obtained
III. Implementation Steps
Diátaxis: How-to Guide (Core)
This is the core of the playbook—step-by-step procedures using Fawkes components.
Step 1: Configure Deployment Event Collection
Objective: Capture deployment events from ArgoCD to measure deployment frequency and lead time.
Estimated Time: 45 minutes
- Create the DORA metrics namespace:
kubectl create namespace dora-metrics
- Deploy the deployment event collector:
# dora-deployment-collector.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: dora-collector-config
namespace: dora-metrics
data:
config.yaml: |
collectors:
- type: argocd
webhook_path: /webhooks/argocd
metrics:
- deployment_frequency
- lead_time_for_changes
- type: jenkins
webhook_path: /webhooks/jenkins
metrics:
- build_time
- pipeline_success_rate
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: dora-collector
namespace: dora-metrics
spec:
replicas: 1
selector:
matchLabels:
app: dora-collector
template:
metadata:
labels:
app: dora-collector
spec:
containers:
- name: collector
image: fawkes/dora-collector:v1.0.0
ports:
- containerPort: 8080
volumeMounts:
- name: config
mountPath: /etc/dora
volumes:
- name: config
configMap:
name: dora-collector-config
---
apiVersion: v1
kind: Service
metadata:
name: dora-collector
namespace: dora-metrics
spec:
selector:
app: dora-collector
ports:
- port: 8080
targetPort: 8080
- Apply the configuration:
kubectl apply -f dora-deployment-collector.yaml
Verification: Check that the collector pod is running:
kubectl get pods -n dora-metrics -l app=dora-collector
Expected Output
NAME READY STATUS RESTARTS AGEE
dora-collector-7d9f8b6c4f-x2k9j 1/1 Running 0 30s
Step 2: Configure ArgoCD Webhooks
Objective: Connect ArgoCD deployment events to the DORA collector.
Estimated Time: 30 minutes
- Get the DORA collector service endpoint:
kubectl get svc dora-collector -n dora-metrics -o jsonpath='{.spec.clusterIP}'
- Configure ArgoCD notifications to send deployment events:
# argocd-notifications-cm.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: argocd-notifications-cm
namespace: argocd
data:
service.webhook.dora: |
url: http://dora-collector.dora-metrics:8080/webhooks/argocd
headers:
- name: Content-Type
value: application/json
template.deployment-event: |
webhook:
dora:
method: POST
body: |
{
"application": "{{.app.metadata.name}}",
"status": "{{.app.status.sync.status}}",
"revision": "{{.app.status.sync.revision}}",
"timestamp": "{{.app.status.operationState.finishedAt}}"
}
trigger.on-sync-succeeded: |
- when: app.status.sync.status == 'Synced'
send: [deployment-event]
- Apply the notification configuration:
kubectl apply -f argocd-notifications-cm.yaml
Verification: Trigger a deployment and check for webhook delivery.
Step 3: Configure Jenkins Pipeline Metrics
Objective: Capture build and pipeline metrics from Jenkins.
Estimated Time: 45 minutes
- Install the Prometheus metrics plugin in Jenkins:
// In Jenkins shared library
// vars/doraMetrics.groovy
def recordDeployment(Map config) {
def startTime = config.startTime ?: currentBuild.startTimeInMillis
def endTime = System.currentTimeMillis()
def leadTime = endTime - startTime
sh """
curl -X POST http://dora-collector.dora-metrics:8080/metrics/deployment \\
-H 'Content-Type: application/json' \\
-d '{
"service": "${config.service}",
"environment": "${config.environment}",
"commit_sha": "${config.commitSha}",
"lead_time_ms": ${leadTime},
"status": "${currentBuild.result ?: 'SUCCESS'}"
}'
"""
}
def recordFailure(Map config) {
sh """
curl -X POST http://dora-collector.dora-metrics:8080/metrics/failure \\
-H 'Content-Type: application/json' \\
-d '{
"service": "${config.service}",
"environment": "${config.environment}",
"type": "${config.type}",
"detected_at": "${new Date().toInstant()}"
}'
"""
}
- Update pipelines to emit DORA metrics:
// Example Jenkinsfile integration
@Library('fawkes-shared-library') _
pipeline {
agent any
stages {
stage('Deploy') {
steps {
script {
// Deployment logic here
sh 'kubectl apply -f manifests/'
// Record deployment for DORA metrics
doraMetrics.recordDeployment(
service: 'my-service',
environment: 'production',
commitSha: env.GIT_COMMIT
)
}
}
}
}
post {
failure {
script {
doraMetrics.recordFailure(
service: 'my-service',
environment: 'production',
type: 'deployment_failure'
)
}
}
}
}
Verification: Run a pipeline and verify metrics are being collected.
Common Pitfall
Ensure the Jenkins service account has network access to the dora-collector service. If using network policies, create appropriate rules.
Step 4: Configure Incident Tracking for MTTR
Objective: Track incident detection and resolution for Mean Time to Restore measurement.
Estimated Time: 30 minutes
- Configure Prometheus alerting to record incidents:
# prometheus-dora-rules.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: dora-incident-rules
namespace: monitoring
spec:
groups:
- name: dora-incidents
rules:
- alert: ServiceDown
expr: up{job=~".*production.*"} == 0
for: 1m
labels:
severity: critical
dora_incident: "true"
annotations:
summary: "Service {{ $labels.job }} is down"
- alert: HighErrorRate
expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.1
for: 2m
labels:
severity: warning
dora_incident: "true"
annotations:
summary: "High error rate for {{ $labels.service }}"
- Configure Alertmanager to notify DORA collector:
# alertmanager-dora-config.yaml
receivers:
- name: dora-collector
webhook_configs:
- url: "http://dora-collector.dora-metrics:8080/webhooks/alertmanager"
send_resolved: true
route:
receiver: dora-collector
routes:
- match:
dora_incident: "true"
receiver: dora-collector
Verification: Trigger a test alert and verify it's recorded.
Step 5: Deploy DORA Metrics Dashboard
Objective: Create visualization dashboards for all four DORA metrics.
Estimated Time: 30 minutes
- Deploy the DORA Grafana dashboard:
# dora-dashboard-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: dora-dashboard
namespace: monitoring
labels:
grafana_dashboard: "true"
data:
dora-metrics.json: |
{
"title": "DORA Metrics Dashboard",
"panels": [
{
"title": "Deployment Frequency",
"type": "stat",
"targets": [{
"expr": "sum(increase(dora_deployments_total[7d]))"
}]
},
{
"title": "Lead Time for Changes",
"type": "gauge",
"targets": [{
"expr": "avg(dora_lead_time_seconds)"
}]
},
{
"title": "Change Failure Rate",
"type": "gauge",
"targets": [{
"expr": "sum(dora_deployment_failures_total) / sum(dora_deployments_total) * 100"
}]
},
{
"title": "Mean Time to Restore",
"type": "gauge",
"targets": [{
"expr": "avg(dora_mttr_seconds)"
}]
}
]
}
- Apply the dashboard:
kubectl apply -f dora-dashboard-configmap.yaml
Verification: Access Grafana and verify the DORA dashboard is visible.
Step 6: Configure Deployment Rework Rate Collection
Objective: Detect and record deployments that require a follow-up fix deployment within a configurable time window.
Estimated Time: 30 minutes
A deployment is counted as rework if a second deployment to the same service occurs within the rework window (default: 24 hours) and is tagged or flagged as a fix using any of the following signals:
- Conventional commit prefix:
fix:,hotfix:, orrevert: - Deployment labelled
dora.dev/rework: "true"in the ArgoCD Application spec -
Pull request title or body contains
[rework]or[hotfix] -
Add the rework detection configuration to the DORA collector:
# dora-rework-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: dora-collector-config
namespace: dora-metrics
data:
config.yaml: |
collectors:
- type: argocd
webhook_path: /webhooks/argocd
metrics:
- deployment_frequency
- lead_time_for_changes
- deployment_rework_rate
rework_detection:
window_hours: 24 # configurable rework window (default: 24 h)
fix_commit_patterns:
- "^fix:"
- "^hotfix:"
- "^revert:"
fix_label: "dora.dev/rework" # ArgoCD Application label to force-tag a rework
fix_pr_patterns:
- "\\[rework\\]"
- "\\[hotfix\\]"
- Annotate fix deployments explicitly (optional — for non-conventional commits):
# In your ArgoCD Application spec, add the label to the next sync:
metadata:
labels:
dora.dev/rework: "true"
- Apply the updated configuration:
kubectl apply -f dora-rework-config.yaml
- Verify rework detection is active:
kubectl exec -n dora-metrics deploy/dora-collector -- \
curl -s localhost:8080/metrics | grep dora_deployment_rework
Expected Result: dora_deployment_rework_total metric is present (value may be 0
until a rework deployment occurs).
Verification: Trigger a deployment followed by a fix: commit deployment within
24 hours and confirm the rework count increments.
IV. Validation & Success Metrics
Diátaxis: How-to Guide / Reference
Instructions for verifying the implementation and measuring success.
Functional Validation
Test 1: Deployment Frequency Collection
# Trigger a test deployment
kubectl set image deployment/test-app app=nginx:latest -n test
# Check metrics endpoint
kubectl exec -n dora-metrics deploy/dora-collector -- \
curl -s localhost:8080/metrics | grep dora_deployments_total
Expected Result: Metric should show at least 1 deployment recorded.
Test 2: Lead Time Calculation
# Query Prometheus for lead time
kubectl exec -n monitoring deploy/prometheus -- \
promtool query instant 'avg(dora_lead_time_seconds)'
Expected Result: Returns a valid duration in seconds.
Test 3: MTTR Recording
# Simulate an incident resolution
curl -X POST http://dora-collector.dora-metrics:8080/incidents/resolve \
-d '{"incident_id": "test-123", "resolved_at": "'$(date -Iseconds)'"}'
# Verify MTTR metric
kubectl exec -n monitoring deploy/prometheus -- \
promtool query instant 'dora_mttr_seconds'
Expected Result: MTTR metric is populated.
Success Metrics
| Metric | How to Measure | Target Value | Dashboard Link |
|---|---|---|---|
| Data Collection | Check dora_* metrics exist |
All 5 metrics present | /grafana/dora |
| Dashboard Load | Grafana dashboard loads | < 3 seconds | /grafana/dora |
| Historical Data | Query 7-day data | Data available | /grafana/dora |
Verification Checklist
- [ ] All five DORA metrics are being collected
- [ ] Grafana dashboard displays correctly
- [ ] Team-level filtering works
- [ ] Historical trend data is accumulating
- [ ] Alerts trigger correctly for metric degradation
DORA Metrics Impact
This playbook establishes the foundation for measuring DORA metrics. After 2-4 weeks of data collection, you'll be able to:
| DORA Metric | Initial Baseline | Typical Elite Target |
|---|---|---|
| Deployment Frequency | Measured | Multiple per day |
| Lead Time for Changes | Measured | < 1 hour |
| Change Failure Rate | Measured | 0-15% |
| Time to Restore | Measured | < 1 hour |
| Deployment Rework Rate | Measured | < 5% |
V. Client Presentation Talking Points
Diátaxis: Explanation / Conceptual
Ready-to-use business language for communicating success to client executives.
Executive Summary
We've implemented automated measurement of software delivery performance using the industry-standard DORA metrics framework. Your organization now has real-time visibility into deployment frequency, lead time, change failure rate, recovery time, and deployment rework rate—the five metrics that DORA 2025 research proves correlate with business performance. This data-driven foundation enables targeted improvements and demonstrates engineering progress to stakeholders.
Key Messages for Stakeholders
For Technical Leaders (CTO, VP Engineering)
- "We've implemented automated DORA metrics collection that tracks all five key performance indicators across your delivery pipeline"
- "This positions your organization to identify bottlenecks with data rather than intuition—teams can now see exactly where time is spent in the delivery process"
- "Elite performers in the DORA research deploy on-demand, have lead times under one hour, fail less than 15% of the time, and recover in under one hour. You now have the data to benchmark and improve toward these targets."
For Business Leaders (CEO, CFO)
- "This investment gives you visibility into engineering productivity for the first time—you'll see exactly how fast features move from idea to customer"
- "DORA research proves that organizations with elite software delivery performance are 2x more likely to exceed their business goals. These metrics are your leading indicator."
- "Faster, more reliable software delivery translates directly to faster time-to-market and reduced risk of outages that impact customers and revenue"
Demonstration Script
-
Open: "Let me show you the DORA Metrics Dashboard in Grafana. This gives us real-time visibility into four key performance indicators..."
-
Show Deployment Frequency: "This shows how often we're deploying to production. Elite performers deploy on-demand, multiple times per day. Our current rate is [X]..."
-
Show Lead Time: "Lead time measures the time from when a developer commits code to when it's running in production. Elite performers achieve this in under one hour. We're currently at [X]..."
-
Show Change Failure Rate: "This shows what percentage of our deployments cause problems. Elite teams are below 15%. We're at [X%]..."
-
Show MTTR: "When something does go wrong, how quickly do we recover? Elite teams restore service in under an hour. Our current mean time to restore is [X]..."
-
Show Deployment Rework Rate: "This is our rework rate—the percentage of deployments that needed a follow-up fix within 24 hours. A high rework rate often means silent patching instead of process improvement. We're currently at [X%], targeting below 5%..."
-
Connect to value: "By tracking these five metrics, we can identify exactly where to focus improvement efforts and demonstrate progress to the business."
Common Executive Questions & Answers
How does this compare to industry benchmarks?
According to the 2023 State of DevOps Report from DORA, elite performers deploy on demand (often multiple times per day), have lead times under one hour, a change failure rate of 0-15%, and recover from failures in under one hour. Your current metrics place you in the [Elite/High/Medium/Low] performance category, which aligns with approximately [X%] of organizations studied.
What's the ROI on this implementation?
The primary ROI is in enabling data-driven improvement. Organizations that improve from Medium to Elite performance see 2x improvement in organizational performance goals according to DORA research. Additionally, by identifying bottlenecks, we typically see 20-30% improvement in developer productivity within the first quarter.
What's the risk if we don't maintain this?
Without continued attention, metrics data quality may degrade as systems change. We recommend quarterly reviews of data collection configuration as part of normal platform maintenance. The cost of maintenance is minimal compared to the value of continued visibility.
What's the next step after implementing metrics?
With baseline metrics established, the next step is to identify your primary bottleneck. Typically, this is either deployment frequency (solved by automation) or lead time (solved by pipeline optimization). We can run a focused improvement sprint targeting your biggest constraint.
Follow-Up Actions
| Action | Owner | Timeline |
|---|---|---|
| Review baseline metrics after 2 weeks | Engineering Lead | +2 weeks |
| Identify primary improvement opportunity | Platform Team | +3 weeks |
| Begin targeted improvement playbook | Consultant/Team | +4 weeks |
| Schedule stakeholder review | Consultant | +6 weeks |
Appendix
Related Resources
- Module 2: DORA Metrics - Conceptual background on the five key metrics
- DORA Dashboard Demo - Five-metric dashboard overview
- Prometheus Tool Reference - Metrics collection details
- DORA Metrics Guide - Detailed DORA implementation guide
Troubleshooting
| Issue | Possible Cause | Resolution |
|---|---|---|
| Metrics not appearing | Webhook not configured | Verify ArgoCD/Jenkins webhook settings |
| Dashboard empty | Prometheus not scraping | Check Prometheus targets and scrape config |
| Incorrect lead time | Clock skew | Ensure NTP sync across nodes |
| Missing deployments | Network policy blocking | Add network policy for dora-metrics namespace |
Change Log
| Date | Version | Changes |
|---|---|---|
| 2024-01-15 | 1.0 | Initial release |