Prometheus
Prometheus is an open-source monitoring and alerting toolkit designed for reliability and scalability in modern cloud environments.
Overview
Prometheus provides essential monitoring capabilities:
- Time-Series Database: store and query metrics data
- PromQL: a powerful query language for metrics analysis
- Alertmanager: handle alerts and notifications
- Service Discovery: automatic target discovery
Key Features
- Pull-based metrics gathering
- Flexible query language
- Configurable alert rules
- Auto-discover targets
Integration with Fawkes
Prerequisites
- Kubernetes cluster
- Helm v3
- kubectl configured with cluster access
Installation
# Add Prometheus Helm repository
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
# Install Prometheus Stack
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --values prometheus-values.yaml
Example prometheus-values.yaml:
prometheus:
  prometheusSpec:
    retention: 15d
    storageSpec:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 50Gi
grafana:
  enabled: true
  persistence:
    enabled: true
    size: 10Gi
alertmanager:
  config:
    global:
      resolve_timeout: 5m
    route:
      group_by: ['job']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 12h
Configuring Prometheus Rules
Basic Recording Rules
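groups:
  - name: fawkes-recording-rules
    rules:
      # Per-job request rate over the last 5 minutes; the "job:" prefix in the
      # rule name signals aggregation to the job level, hence sum by (job)
      - record: job:http_requests_total:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
      # Per-job error rate over the last 5 minutes
      - record: job:http_errors_total:rate5m
        expr: sum by (job) (rate(http_errors_total[5m]))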
Alert Rules
groups:
  - name: fawkes-alerts
    rules:
      - alert: HighErrorRate
        expr: job:http_errors_total:rate5m / job:http_requests_total:rate5m > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: High error rate detected
          description: Error rate is above 10% for 5 minutes
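Because the installation above uses the kube-prometheus-stack chart, rules are usually delivered to Prometheus as PrometheusRule custom resources rather than as standalone rule files. The following is a minimal sketch of wrapping the alert group in such a resource. The resource name and the `release: prometheus` label are assumptions: the label is chosen to match the Helm release name from the install command, which the chart's default rule selector typically looks for, and it may differ if your values override that selector.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: fawkes-alerts            # hypothetical resource name
  namespace: monitoring
  labels:
    release: prometheus          # assumption: matches the chart's default ruleSelector
spec:
  groups:
    - name: fawkes-alerts
      rules:
        - alert: HighErrorRate
          expr: job:http_errors_total:rate5m / job:http_requests_total:rate5m > 0.1
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: High error rate detected
            description: Error rate is above 10% for 5 minutes
```

Saved to a file, this can be applied with `kubectl apply -f <file>`; the operator should then pick up the rule without restarting Prometheus.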
Monitoring DORA Metrics
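Deployment Frequency and Lead Time
groups:
  - name: dora-metrics
    rules:
      # Deployments in the last 24h; assumes deployment_success_total is a counter,
      # so increase() is used (count_over_time would count scraped samples instead)
      - record: dora:deployment_frequency:count24h
        expr: sum(increase(deployment_success_total[24h]))
      # Mean lead time over the last 24h; assumes deployment_lead_time_seconds is a gauge
      - record: dora:lead_time_seconds:avg24h
        expr: avg_over_time(deployment_lead_time_seconds[24h])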
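As an alternative to a separate manifest, rule groups can also be shipped in the Helm values file. The sketch below uses the chart's `additionalPrometheusRulesMap` value and shows only the lead-time rule for brevity; further rules go in the same list. The map key `fawkes-dora` is a hypothetical name, and it is worth checking the key against the values reference of your chart version.

```yaml
# Fragment for prometheus-values.yaml
additionalPrometheusRulesMap:
  fawkes-dora:                   # hypothetical key; becomes part of the generated rule name
    groups:
      - name: dora-metrics
        rules:
          - record: dora:lead_time_seconds:avg24h
            expr: avg_over_time(deployment_lead_time_seconds[24h])
```

Keeping the rules in the values file means they are versioned and rolled out together with the rest of the monitoring stack on `helm upgrade`.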
Best Practices
- Data Retention
  - Set appropriate retention periods
  - Use persistent storage
  - Implement data compaction
- Query Optimization
  - Use recording rules for complex queries
  - Limit the use of high-cardinality labels
  - Cache frequently used queries
- Alerting
  - Define clear alerting thresholds
  - Implement proper alert routing
  - Avoid alert fatigue
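The alert routing advice above can be sketched as an extension of the `alertmanager.config` section of the values file. This is only an illustration under assumptions: the receiver names `default` and `oncall` and the webhook URL are placeholders, and the example routes alerts labelled `severity: critical` to a separate receiver while everything else falls through to the default.

```yaml
# Fragment for prometheus-values.yaml
alertmanager:
  config:
    route:
      receiver: default                  # fallback for alerts that match no child route
      group_by: ['job']
      routes:
        - receiver: oncall               # hypothetical receiver name
          matchers:
            - 'severity = "critical"'
          repeat_interval: 1h            # re-notify more often than the 12h default above
    receivers:
      - name: default                    # no integrations configured: alerts are not forwarded
      - name: oncall
        webhook_configs:
          - url: http://alert-gateway.example.internal/hook   # placeholder URL
```

Grouping by `job` and using a longer `repeat_interval` for non-critical alerts are two simple levers against alert fatigue. Note that Helm replaces lists such as `receivers` and `routes` wholesale, so any chart defaults you still need must be restated.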
Troubleshooting
Common issues and solutions:
Issue | Solution |
---|---|
High memory usage | Reduce the number of active series (for example, drop high-cardinality labels) and review retention and storage settings |
Slow queries | Review and optimize PromQL expressions |
Missing metrics | Check service discovery configuration |
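For the missing-metrics case, a common cause with kube-prometheus-stack is that no ServiceMonitor selects the application's Service, or that its labels do not match what Prometheus is configured to discover. A minimal sketch of a ServiceMonitor follows. The names `fawkes-app` and `fawkes`, the port name `metrics`, and the `release: prometheus` label are all assumptions to adapt to your setup.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: fawkes-app               # hypothetical name
  namespace: monitoring
  labels:
    release: prometheus          # assumption: matches the chart's default serviceMonitorSelector
spec:
  namespaceSelector:
    matchNames:
      - fawkes                   # hypothetical namespace where the application runs
  selector:
    matchLabels:
      app: fawkes-app            # must match labels on the Service, not on the Pod
  endpoints:
    - port: metrics              # name of the Service port, not its number
      path: /metrics
      interval: 30s
```

If the target still does not appear, compare the labels on the Service with `selector.matchLabels`, and confirm the Service port is named as referenced under `endpoints`.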
Grafana Dashboard Examples
{
  "dashboard": {
    "id": null,
    "title": "Fawkes DORA Metrics",
    "panels": [
      {
        "title": "Deployment Frequency",
        "type": "graph",
        "targets": [
          {
            "expr": "dora:deployment_frequency:count24h"
          }
        ]
      }
    ]
  }
}
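The JSON above is in the shape accepted by the Grafana HTTP API, with the dashboard model wrapped in a `dashboard` key. When Grafana is deployed through kube-prometheus-stack, dashboards are often loaded from ConfigMaps by a sidecar instead, in which case the file typically holds the dashboard model itself without the wrapper. A sketch under those assumptions follows; the ConfigMap name, the file name, and the `grafana_dashboard: "1"` label (believed to be the chart's default sidecar label) are assumptions to verify against your values.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: fawkes-dora-dashboard    # hypothetical name
  namespace: monitoring
  labels:
    grafana_dashboard: "1"       # assumption: label the Grafana sidecar watches for
data:
  fawkes-dora.json: |
    {
      "title": "Fawkes DORA Metrics",
      "panels": [
        {
          "title": "Deployment Frequency",
          "type": "graph",
          "targets": [
            { "expr": "dora:deployment_frequency:count24h" }
          ]
        }
      ]
    }
```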