Skip to content

Prometheus

Prometheus Logo

Prometheus is an open-source monitoring and alerting toolkit designed for reliability and scalability in modern cloud environments.

Overview

Prometheus provides essential monitoring capabilities:

  • Time-Series Database - Store and query metrics data
  • PromQL - Powerful query language for metrics analysis
  • Alert Manager - Handle alerts and notifications
  • Service Discovery - Automatic target discovery

Key Features

Feature Description
Metrics Collection Pull-based metrics gathering
PromQL Flexible query language
Alerting Configurable alert rules
Service Discovery Auto-discover targets

Integration with Fawkes

Prerequisites

  • Kubernetes cluster
  • Helm v3
  • kubectl configured with cluster access

Installation

# Add Prometheus Helm repository
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Install Prometheus Stack
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --values prometheus-values.yaml

Example prometheus-values.yaml:

prometheus:
  prometheusSpec:
    retention: 15d
    storageSpec:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 50Gi

grafana:
  enabled: true
  persistence:
    enabled: true
    size: 10Gi

alertmanager:
  config:
    global:
      resolve_timeout: 5m
    route:
      group_by: ["job"]
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 12h

Configuring Prometheus Rules

Basic Recording Rules

groups:
  - name: fawkes-recording-rules
    rules:
      - record: job:http_requests_total:rate5m
        expr: rate(http_requests_total[5m])
      - record: job:http_errors_total:rate5m
        expr: rate(http_errors_total[5m])

Alert Rules

groups:
  - name: fawkes-alerts
    rules:
      - alert: HighErrorRate
        expr: job:http_errors_total:rate5m / job:http_requests_total:rate5m > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: High error rate detected
          description: Error rate is above 10% for 5 minutes

Monitoring DORA Metrics

Deployment Frequency

groups:
  - name: dora-metrics
    rules:
      - record: dora:deployment_frequency:count24h
        expr: count_over_time(deployment_success_total[24h])
      - record: dora:lead_time_seconds:avg24h
        expr: avg_over_time(deployment_lead_time_seconds[24h])

Best Practices

  1. Data Retention

  2. Set appropriate retention periods

  3. Use persistent storage
  4. Implement data compaction

  5. Query Optimization

  6. Use recording rules for complex queries

  7. Limit the use of high-cardinality labels
  8. Cache frequently used queries

  9. Alerting

  10. Define clear alerting thresholds
  11. Implement proper alert routing
  12. Avoid alert fatigue

Troubleshooting

Common issues and solutions:

Issue Solution
High memory usage Adjust retention period and storage
Slow queries Review and optimize PromQL expressions
Missing metrics Check service discovery configuration

Grafana Dashboard Examples

{
  "dashboard": {
    "id": null,
    "title": "Fawkes DORA Metrics",
    "panels": [
      {
        "title": "Deployment Frequency",
        "type": "graph",
        "targets": [
          {
            "expr": "dora:deployment_frequency:count24h"
          }
        ]
      }
    ]
  }
}

Additional Resources

Configure Prometheus