Prometheus

Prometheus Logo

Prometheus is an open-source monitoring and alerting toolkit designed for reliability and scalability in modern cloud environments.

Overview

Prometheus provides essential monitoring capabilities: - Time-Series Database - Store and query metrics data - PromQL - Powerful query language for metrics analysis - Alert Manager - Handle alerts and notifications - Service Discovery - Automatic target discovery

Key Features

Feature	Description
Metrics Collection	Pull-based metrics gathering
PromQL	Flexible query language
Alerting	Configurable alert rules
Service Discovery	Auto-discover targets

Integration with Fawkes

Prerequisites

Kubernetes cluster
Helm v3
kubectl configured with cluster access

Installation

# Add Prometheus Helm repository
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Install Prometheus Stack
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --values prometheus-values.yaml

Example prometheus-values.yaml:

prometheus:
  prometheusSpec:
    retention: 15d
    storageSpec:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 50Gi

grafana:
  enabled: true
  persistence:
    enabled: true
    size: 10Gi

alertmanager:
  config:
    global:
      resolve_timeout: 5m
    route:
      group_by: ['job']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 12h

Configuring Prometheus Rules

Basic Recording Rules

groups:
- name: fawkes-recording-rules
  rules:
  - record: job:http_requests_total:rate5m
    expr: rate(http_requests_total[5m])
  - record: job:http_errors_total:rate5m
    expr: rate(http_errors_total[5m])

Alert Rules

groups:
- name: fawkes-alerts
  rules:
  - alert: HighErrorRate
    expr: job:http_errors_total:rate5m / job:http_requests_total:rate5m > 0.1
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: High error rate detected
      description: Error rate is above 10% for 5 minutes

Monitoring DORA Metrics

Deployment Frequency

groups:
- name: dora-metrics
  rules:
  - record: dora:deployment_frequency:count24h
    expr: count_over_time(deployment_success_total[24h])
  - record: dora:lead_time_seconds:avg24h
    expr: avg_over_time(deployment_lead_time_seconds[24h])

Best Practices

Data Retention
Set appropriate retention periods
Use persistent storage
Implement data compaction
Query Optimization
Use recording rules for complex queries
Limit the use of high-cardinality labels
Cache frequently used queries
Alerting
Define clear alerting thresholds
Implement proper alert routing
Avoid alert fatigue

Troubleshooting

Common issues and solutions:

Issue	Solution
High memory usage	Adjust retention period and storage
Slow queries	Review and optimize PromQL expressions
Missing metrics	Check service discovery configuration

Grafana Dashboard Examples

{
  "dashboard": {
    "id": null,
    "title": "Fawkes DORA Metrics",
    "panels": [
      {
        "title": "Deployment Frequency",
        "type": "graph",
        "targets": [
          {
            "expr": "dora:deployment_frequency:count24h"
          }
        ]
      }
    ]
  }
}

Additional Resources

Configure Prometheus View Dashboards