ADR-016: DevLake for DORA Metrics Visualization
Status
Accepted
Context
The Fawkes platform requires a unified way to collect, calculate, and display the five core DORA metrics to enable teams to assess their service performance, identify bottlenecks, and ensure adherence to Golden Path standards.
Current State
IDP concepts (CI/CD, secrets, infrastructure provisioning, policy enforcement) are disconnected, residing across various URLs (Jenkins, Vault UI, Kubernetes CLI). There is no integrated interface for developers to interact with the platform or view consolidated DORA metrics.
The existing ADR-012 (Metrics Monitoring and Management) defines a custom DORA Metrics Service for webhook-based event collection. While this approach provides flexibility, it requires significant custom development effort.
Requirements
- Deployment Frequency: Track successful production deployments per day/week
- Lead Time for Changes: Measure commit time to production deployment time
- Change Failure Rate (CFR): Calculate failed deployments / total deployments
- Mean Time to Restore (MTTR): Measure incident detection to resolution time
- Operational Performance: Track SLO/SLI adherence for reliability metrics
Evaluation Criteria
- Open source with active community
- Native DORA metrics calculation
- Integration with Git, Jenkins, and incident management
- Backstage integration capability
- Kubernetes-native deployment
- Minimal operational overhead
Decision
We will deploy Apache DevLake as the DORA metrics collection, calculation, and visualization platform for Fawkes.
Why Apache DevLake
Advantages:
- Purpose-built for DORA: Native support for all four DORA metrics
- Multi-source Integration: Pre-built collectors for GitHub, GitLab, Jenkins, Jira, and more
- Open Source: Apache 2.0 license, active CNCF community
- Extensible: Plugin architecture for custom data sources
- Grafana Integration: Pre-built dashboards for DORA visualization
- Low Custom Development: Reduces need for custom webhook handlers
Trade-offs:
- Additional Component: Adds DevLake stack (API, UI, collector workers)
- Database Requirement: Requires MySQL/PostgreSQL for data storage
- Learning Curve: Teams need to understand DevLake configuration
Architecture
┌─────────────────────────────────────────────────────────────────────────────┐
│ DevLake DORA Metrics Layer │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ Data Sources (Collectors) │ │
│ │ │ │
│ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ │
│ │ │ GitHub │ │ Jenkins │ │ Observability│ │ Custom │ │ │
│ │ │ Plugin │ │ Plugin │ │ Webhook │ │ Webhook │ │ │
│ │ │ │ │ │ │ │ │ │ │ │
│ │ │ • Commits │ │ • Builds │ │ • Incidents │ │ • CFR Events│ │ │
│ │ │ • PRs │ │ • Deploys │ │ • Restores │ │ • MTTR Data │ │ │
│ │ │ • Branches │ │ • Pipelines │ │ • Alerts │ │ │ │ │
│ │ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ │ │
│ │ │ │ │ │ │ │
│ └─────────┼─────────────────┼─────────────────┼────────────────┼──────────┘ │
│ │ │ │ │ │
│ └─────────────────┴─────────────────┴────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ DevLake Core Services │ │
│ │ │ │
│ │ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │ │
│ │ │ DevLake API │ │ DevLake UI │ │ Lake Workers │ │ │
│ │ │ (Config/Data) │ │ (Dashboard) │ │ (Collectors) │ │ │
│ │ └────────┬────────┘ └─────────────────┘ └────────┬────────┘ │ │
│ │ │ │ │ │
│ │ └────────────────┬─────────────────────────┘ │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ ┌─────────────────────────────────────────────────────────────────────┐│ │
│ │ │ MySQL Database ││ │
│ │ │ (Raw data, domain models, DORA calculations) ││ │
│ │ └─────────────────────────────────────────────────────────────────────┘│ │
│ └─────────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ │ Prometheus Metrics Export │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ Visualization Layer │ │
│ │ │ │
│ │ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │ │
│ │ │ Grafana │ │ Backstage │ │ DevLake UI │ │ │
│ │ │ DORA Dashboard │ │ Plugin Widget │ │ (Native UI) │ │ │
│ │ └─────────────────┘ └─────────────────┘ └─────────────────┘ │ │
│ └─────────────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
DORA Metrics Calculation
DevLake calculates DORA metrics using the appropriate data sources for a GitOps architecture where ArgoCD handles deployments:
| Metric | Primary Data Source | Secondary Source | Calculation |
|---|---|---|---|
| Deployment Frequency | ArgoCD sync events | GitHub deployments | Successful production syncs per time period |
| Lead Time for Changes | Git commits + ArgoCD | - | Time from first commit to ArgoCD sync completion |
| Change Failure Rate | ArgoCD + Incidents | Observability alerts | Failed syncs / Total syncs |
| Mean Time to Restore | Incident records + ArgoCD | - | Time from incident to restore sync |
| Operational Performance | Prometheus SLO/SLI data | - | Uptime, latency, error rate adherence |
Note on Jenkins Role: In a GitOps architecture, Jenkins handles CI (build, test, scan) but does not perform deployments. Jenkins updates the GitOps repository, which ArgoCD then reconciles. Therefore:
- ArgoCD is the source of truth for deployment events
- Jenkins provides build/test metrics and rework tracking (build failures, test reruns, quality gate failures)
Jenkins Rework Metrics
Jenkins contributes to developer productivity metrics beyond core DORA:
| Metric | Description | Calculation |
|---|---|---|
| Build Success Rate | Percentage of successful builds | Passed builds / Total builds |
| Test Flakiness | Frequency of non-deterministic test failures | Flaky runs / Total runs |
| Quality Gate Pass Rate | SonarQube gate success | Passed gates / Total scans |
| Rework Rate | Builds triggered by same commit | Rebuilds / Unique commits |
| Build Duration Trend | Average build time over time | Mean/P95 build duration |
Deployment Configuration
Namespace Isolation:
- DevLake deployed in
fawkes-devlakenamespace - Secrets managed via Vault/External Secrets
- Network policies restrict access to approved sources
Resource Requirements:
devlake-api:
requests:
cpu: 200m
memory: 512Mi
limits:
cpu: 1000m
memory: 2Gi
devlake-ui:
requests:
cpu: 100m
memory: 128Mi
limits:
cpu: 500m
memory: 512Mi
mysql:
requests:
cpu: 500m
memory: 1Gi
limits:
cpu: 2000m
memory: 4Gi
Data Source Configuration
GitHub/Git Integration:
- OAuth app for repository access
- Webhook receiver for real-time events
- Collect: commits, PRs, branches
ArgoCD Integration (Primary deployment source):
- API access for sync event collection
- Deployment frequency from successful syncs
- Lead time correlation with commit timestamps
- Failure detection from sync failures
Jenkins Integration (CI and rework metrics):
- API token for build data access
- Build success/failure tracking
- Quality gate results
- Rework metrics: build retries, flaky tests, repeated failures
Observability Integration:
- Webhook endpoint for incident events
- Alert-to-incident correlation
- Resolution tracking for MTTR
Backstage Integration
DevLake will integrate with Backstage via:
- Iframe Plugin: Embed DevLake dashboards in Backstage
- API Integration: Fetch DORA metrics for service cards
- Entity Annotation: Link catalog entities to DevLake projects
Edge Case Handling
| Scenario | Behavior |
|---|---|
| No deployments for service | Display "N/A" instead of 0 |
| Partial data (missing incidents) | Calculate available metrics, show warnings |
| High cardinality (many services) | Use pagination and caching |
| Data source unavailable | Show last known values with staleness indicator |
Consequences
Positive
- Immediate DORA visibility: Pre-built dashboards reduce time to value
- Standardized metrics: Consistent calculation across all teams
- Reduced custom development: Leverage DevLake's tested collectors
- Extensibility: Plugin architecture for future data sources
- Community support: Active Apache project with regular updates
Negative
- Additional infrastructure: MySQL database and DevLake services
- Configuration complexity: Multiple data source configurations
- Storage requirements: Historical data requires database capacity
Risks and Mitigations
| Risk | Mitigation |
|---|---|
| DevLake version incompatibility | Pin to stable versions, test upgrades |
| Data collection failures | Alerting on collector health, retry logic |
| Dashboard performance | Enable caching, limit historical queries |
| Security exposure | Vault-managed credentials, network policies |
Alternatives Considered
1. Custom DORA Metrics Service (ADR-012 Approach)
Description: Build custom Go/Python service with webhook receivers
Rejected because:
- Higher development and maintenance cost
- Need to implement all collectors from scratch
- No pre-built visualization
2. Sleuth.io (Commercial)
Description: SaaS DORA metrics platform
Rejected because:
- Commercial licensing costs
- External data residency concerns
- Vendor lock-in
3. Faros AI (Commercial)
Description: Engineering metrics platform
Rejected because:
- Commercial licensing
- More complex than needed
- Over-featured for core DORA requirements
4. Haystack (Open Source)
Description: Engineering analytics platform
Rejected because:
- Smaller community than DevLake
- Less mature DORA implementation
- Limited documentation
Implementation Plan
Phase 1: Core Deployment (Week 1)
- [x] Create ADR-016 for DevLake strategy
- [ ] Deploy DevLake Helm chart via ArgoCD
- [ ] Configure MySQL persistence
- [ ] Set up ingress for DevLake UI
Phase 2: Data Source Integration (Week 2)
- [ ] Configure GitHub/Git collector
- [ ] Configure Jenkins collector
- [ ] Set up incident webhook receiver
- [ ] Create initial projects in DevLake
Phase 3: Visualization (Week 3)
- [ ] Import DORA Grafana dashboards
- [ ] Create Backstage DevLake plugin
- [ ] Configure team-level views
- [ ] Set up alerting for metric degradation
Phase 4: Documentation & Training (Week 4)
- [ ] Update architecture documentation
- [ ] Create DORA metrics definition guide
- [ ] Add Dojo learning module
- [ ] Create runbooks for operations