Skip to content

Prometheus

Prometheus Logo

Prometheus is an open-source monitoring and alerting toolkit designed for reliability and scalability in modern cloud environments.

Overview

Prometheus provides essential monitoring capabilities: - Time-Series Database - Store and query metrics data - PromQL - Powerful query language for metrics analysis - Alert Manager - Handle alerts and notifications - Service Discovery - Automatic target discovery

Key Features

Feature Description
Metrics Collection Pull-based metrics gathering
PromQL Flexible query language
Alerting Configurable alert rules
Service Discovery Auto-discover targets

Integration with Fawkes

Prerequisites

  • Kubernetes cluster
  • Helm v3
  • kubectl configured with cluster access

Installation

# Add Prometheus Helm repository
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Install Prometheus Stack
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --values prometheus-values.yaml

Example prometheus-values.yaml:

prometheus:
  prometheusSpec:
    retention: 15d
    storageSpec:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 50Gi

grafana:
  enabled: true
  persistence:
    enabled: true
    size: 10Gi

alertmanager:
  config:
    global:
      resolve_timeout: 5m
    route:
      group_by: ['job']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 12h

Configuring Prometheus Rules

Basic Recording Rules

groups:
- name: fawkes-recording-rules
  rules:
  - record: job:http_requests_total:rate5m
    expr: rate(http_requests_total[5m])
  - record: job:http_errors_total:rate5m
    expr: rate(http_errors_total[5m])

Alert Rules

groups:
- name: fawkes-alerts
  rules:
  - alert: HighErrorRate
    expr: job:http_errors_total:rate5m / job:http_requests_total:rate5m > 0.1
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: High error rate detected
      description: Error rate is above 10% for 5 minutes

Monitoring DORA Metrics

Deployment Frequency

groups:
- name: dora-metrics
  rules:
  - record: dora:deployment_frequency:count24h
    expr: count_over_time(deployment_success_total[24h])
  - record: dora:lead_time_seconds:avg24h
    expr: avg_over_time(deployment_lead_time_seconds[24h])

Best Practices

  1. Data Retention
  2. Set appropriate retention periods
  3. Use persistent storage
  4. Implement data compaction

  5. Query Optimization

  6. Use recording rules for complex queries
  7. Limit the use of high-cardinality labels
  8. Cache frequently used queries

  9. Alerting

  10. Define clear alerting thresholds
  11. Implement proper alert routing
  12. Avoid alert fatigue

Troubleshooting

Common issues and solutions:

Issue Solution
High memory usage Adjust retention period and storage
Slow queries Review and optimize PromQL expressions
Missing metrics Check service discovery configuration

Grafana Dashboard Examples

{
  "dashboard": {
    "id": null,
    "title": "Fawkes DORA Metrics",
    "panels": [
      {
        "title": "Deployment Frequency",
        "type": "graph",
        "targets": [
          {
            "expr": "dora:deployment_frequency:count24h"
          }
        ]
      }
    ]
  }
}

Additional Resources

Configure Prometheus View Dashboards