Epic 1: Platform Operations Runbook

Version: 1.0 Last Updated: December 2024 Status: Production Ready Target Audience: Platform Engineers, SREs, DevOps Engineers

Overview

This runbook provides operational procedures for the Epic 1 platform components, including:

Infrastructure: 4-node Kubernetes cluster
GitOps: ArgoCD
Developer Portal: Backstage
CI/CD: Jenkins
Security: SonarQube, Trivy, Vault, Kyverno
Observability: Prometheus, Grafana, OpenTelemetry, Fluent Bit
Registry: Harbor
DORA Metrics: DevLake
Supporting Services: cert-manager, ingress-nginx, External Secrets Operator

Component Status Checks

Quick Health Check (All Components)

# Check all platform namespaces
kubectl get namespaces | grep -E 'argocd|backstage|jenkins|sonarqube|prometheus|grafana|harbor|devlake|vault|kyverno'

# Check pod status across all platform components
kubectl get pods -A | grep -E 'argocd|backstage|jenkins|sonarqube|prometheus|grafana|harbor|devlake|vault|kyverno'

# Check all critical services
kubectl get svc -A | grep -E 'argocd-server|backstage|jenkins|sonarqube|prometheus|grafana|harbor|devlake|vault'

Kubernetes Cluster

# Check node status
kubectl get nodes
kubectl top nodes

# Check cluster resource utilization (should be <70%)
kubectl top nodes --no-headers | awk '{print $3}' | sed 's/%//' | awk '{sum+=$1; count++} END {print "Average CPU: " sum/count "%"}'
kubectl top nodes --no-headers | awk '{print $5}' | sed 's/%//' | awk '{sum+=$1; count++} END {print "Average Memory: " sum/count "%"}'

# Check for pending pods
kubectl get pods -A | grep Pending

# Check for failed pods
kubectl get pods -A | grep -E 'Error|CrashLoopBackOff|ImagePullBackOff'

ArgoCD (GitOps)

# Check ArgoCD health
kubectl get pods -n argocd
kubectl get applications -n argocd

# Check application sync status
argocd app list

# Check for out-of-sync applications
argocd app list | grep -v Synced

# Access ArgoCD UI
kubectl port-forward svc/argocd-server -n argocd 8080:443
# Navigate to https://localhost:8080

Backstage (Developer Portal)

# Check Backstage health
kubectl get pods -n backstage
kubectl logs -n backstage -l app=backstage --tail=50

# Check PostgreSQL (Backstage database)
kubectl get pods -n backstage -l cnpg.io/cluster=db-backstage-dev

# Check Backstage service
kubectl get svc -n backstage backstage

# Access Backstage UI
kubectl port-forward svc/backstage -n backstage 7007:7007
# Navigate to http://localhost:7007

Jenkins (CI/CD)

# Check Jenkins health
kubectl get pods -n jenkins
kubectl logs -n jenkins -l app.kubernetes.io/component=jenkins-controller --tail=50

# Check Jenkins agents
kubectl get pods -n jenkins -l jenkins/label

# Get Jenkins admin password
kubectl get secret -n jenkins jenkins -o jsonpath="{.data.jenkins-admin-password}" | base64 --decode

# Access Jenkins UI
kubectl port-forward svc/jenkins -n jenkins 8080:8080
# Navigate to http://localhost:8080

SonarQube (Code Quality)

# Check SonarQube health
kubectl get pods -n sonarqube
kubectl logs -n sonarqube -l app=sonarqube --tail=50

# Check SonarQube database
kubectl get pods -n sonarqube -l app=postgresql

# Access SonarQube UI
kubectl port-forward svc/sonarqube-sonarqube -n sonarqube 9000:9000
# Navigate to http://localhost:9000
# Default credentials: admin/admin (change immediately)

Vault (Secrets Management)

# Check Vault health
kubectl get pods -n vault
kubectl exec -n vault vault-0 -- vault status

# Check all Vault pods are unsealed
for i in 0 1 2; do
  echo "Checking vault-$i:"
  kubectl exec -n vault vault-$i -- vault status | grep "Sealed"
done

# Check Vault operator logs
kubectl logs -n vault -l app.kubernetes.io/name=vault --tail=50

Kyverno (Policy Engine)

# Check Kyverno health
kubectl get pods -n kyverno
kubectl get clusterpolicies

# Check policy reports
kubectl get policyreports -A

# View policy violations
kubectl get policyreports -A -o json | jq '.items[] | select(.results[].result=="fail") | {namespace: .metadata.namespace, name: .metadata.name, violations: .results}'

Prometheus & Grafana (Observability)

# Check Prometheus health
kubectl get pods -n prometheus
kubectl get svc -n prometheus

# Check Grafana health
kubectl get pods -n grafana
kubectl get svc -n grafana

# Access Prometheus UI
kubectl port-forward svc/prometheus-kube-prometheus-prometheus -n prometheus 9090:9090
# Navigate to http://localhost:9090

# Access Grafana UI
kubectl port-forward svc/grafana -n grafana 3000:80
# Navigate to http://localhost:3000
# Default credentials: admin/prom-operator

Harbor (Container Registry)

# Check Harbor health
kubectl get pods -n harbor
kubectl logs -n harbor -l app=harbor --tail=50

# Check Harbor services
kubectl get svc -n harbor

# Access Harbor UI
kubectl port-forward svc/harbor -n harbor 8443:443
# Navigate to https://localhost:8443
# Default credentials: admin/Harbor12345

DevLake (DORA Metrics)

# Check DevLake health
kubectl get pods -n devlake
kubectl logs -n devlake -l app.kubernetes.io/name=devlake --tail=50

# Check MySQL database
kubectl get pods -n devlake -l app.kubernetes.io/name=mysql

# Access DevLake UI
kubectl port-forward svc/devlake-ui -n devlake 4000:4000
# Navigate to http://localhost:4000

Common Operations

Restarting a Component

# Restart ArgoCD
kubectl rollout restart deployment argocd-server -n argocd
kubectl rollout restart deployment argocd-repo-server -n argocd

# Restart Backstage
kubectl rollout restart deployment backstage -n backstage

# Restart Jenkins
kubectl rollout restart deployment jenkins -n jenkins

# Restart SonarQube
kubectl rollout restart deployment sonarqube-sonarqube -n sonarqube

# Restart Grafana
kubectl rollout restart deployment grafana -n grafana

# Restart Prometheus (statefulset)
kubectl rollout restart statefulset prometheus-kube-prometheus-prometheus -n prometheus

Scaling Components

# Scale Backstage
kubectl scale deployment backstage -n backstage --replicas=3

# Scale Jenkins agents (not the controller)
# Note: Jenkins agents are ephemeral and scale automatically

# Scale Grafana
kubectl scale deployment grafana -n grafana --replicas=2

# Note: Some components like Prometheus, Vault are StatefulSets
# and require special procedures for scaling

Updating Component Configuration

# Update ArgoCD configuration
kubectl edit configmap argocd-cm -n argocd
kubectl rollout restart deployment argocd-server -n argocd

# Update Backstage configuration
kubectl edit configmap backstage-app-config -n backstage
kubectl rollout restart deployment backstage -n backstage

# Update Jenkins configuration (JCasC)
kubectl edit configmap jenkins-casc-config -n jenkins
kubectl rollout restart deployment jenkins -n jenkins

Checking Logs

# Stream logs from a component
kubectl logs -n <namespace> -l <label-selector> -f

# Get recent logs from all replicas
kubectl logs -n <namespace> -l <label-selector> --tail=100 --all-containers=true

# Get logs from a specific pod
kubectl logs -n <namespace> <pod-name> -c <container-name>

# Export logs for analysis
kubectl logs -n <namespace> -l <label-selector> --since=1h > component-logs.txt

Viewing Events

# Get recent events for a namespace
kubectl get events -n <namespace> --sort-by='.lastTimestamp'

# Watch events in real-time
kubectl get events -n <namespace> -w

# Filter events by type
kubectl get events -n <namespace> --field-selector type=Warning

Troubleshooting

Pod Not Starting

Symptoms: Pod stuck in Pending, ContainerCreating, or CrashLoopBackOff state

Diagnosis:

# Check pod status and events
kubectl describe pod <pod-name> -n <namespace>

# Check pod logs
kubectl logs <pod-name> -n <namespace>

# Check previous container logs if pod restarted
kubectl logs <pod-name> -n <namespace> --previous

Common Causes:

Insufficient Resources: Node doesn't have enough CPU/memory

# Check node resources
kubectl top nodes
kubectl describe nodes | grep -A 5 "Allocated resources"

Solution: Scale down other workloads or add nodes

Image Pull Errors: Cannot pull container image

# Check image pull secrets
kubectl get secrets -n <namespace>

Solution: Verify Harbor credentials or image name

Configuration Errors: Invalid ConfigMap or Secret references

# Verify ConfigMap exists
kubectl get configmap -n <namespace>

# Verify Secret exists
kubectl get secret -n <namespace>

Solution: Create missing resources or fix references

Policy Violations: Kyverno blocking pod creation
```
# Check policy reports
kubectl get policyreports -n <namespace>
```
Solution: Fix policy violations or update policies

Service Not Accessible

Symptoms: Cannot access service via port-forward or ingress

Diagnosis:

# Check service exists
kubectl get svc -n <namespace>

# Check endpoints
kubectl get endpoints -n <namespace>

# Check ingress
kubectl get ingress -n <namespace>

# Check ingress controller logs
kubectl logs -n ingress-nginx -l app.kubernetes.io/component=controller

Common Causes:

No Healthy Pods: Service has no ready endpoints

kubectl get pods -n <namespace> -o wide

Solution: Fix pod issues first

Incorrect Service Selector: Service not matching pods

kubectl describe svc <service-name> -n <namespace>

Solution: Update service selector to match pod labels

Ingress Misconfiguration: TLS or path issues
```
kubectl describe ingress <ingress-name> -n <namespace>
```
Solution: Verify hostname, paths, and TLS certificates

ArgoCD Application Out of Sync

Symptoms: Application shows OutOfSync status in ArgoCD

Diagnosis:

# Check application status
argocd app get <app-name>

# Check differences
argocd app diff <app-name>

# Check sync history
argocd app history <app-name>

Solutions:

# Manual sync
argocd app sync <app-name>

# Force sync (if resources deleted)
argocd app sync <app-name> --force

# Sync with prune (remove extra resources)
argocd app sync <app-name> --prune

# Hard refresh (clear cache)
argocd app get <app-name> --hard-refresh

Jenkins Build Failures

Symptoms: Builds failing consistently

Diagnosis:

# Check Jenkins agent pods
kubectl get pods -n jenkins -l jenkins/label

# Check Jenkins logs
kubectl logs -n jenkins -l app.kubernetes.io/component=jenkins-controller --tail=200

# Access Jenkins and check build console output

Common Causes:

Agent Connection Issues: Agents can't connect to controller Solution: Check network policies and service endpoints
Insufficient Agent Resources: Agents running out of memory/CPU

kubectl top pods -n jenkins -l jenkins/label

Solution: Increase agent resource requests/limits

Quality Gate Failures: SonarQube or security scan blocking build Solution: Review scan results and fix issues or adjust quality gates

Prometheus Metrics Missing

Symptoms: Dashboards show "No data" or metrics not being collected

Diagnosis:

# Check Prometheus targets
kubectl port-forward svc/prometheus-kube-prometheus-prometheus -n prometheus 9090:9090
# Navigate to http://localhost:9090/targets

# Check ServiceMonitor resources
kubectl get servicemonitors -A

# Check Prometheus logs
kubectl logs -n prometheus prometheus-kube-prometheus-prometheus-0

Solutions:

# Verify ServiceMonitor exists for component
kubectl get servicemonitor -n <namespace>

# Verify pod has metrics port and prometheus annotations
kubectl describe pod <pod-name> -n <namespace>

# Restart Prometheus to reload configuration
kubectl delete pod -n prometheus prometheus-kube-prometheus-prometheus-0

Vault Sealed

Symptoms: Vault pods show "Sealed" status

Diagnosis:

# Check Vault status
kubectl exec -n vault vault-0 -- vault status

Solution:

# Unseal Vault (requires unseal keys)
kubectl exec -n vault vault-0 -- vault operator unseal <key-1>
kubectl exec -n vault vault-0 -- vault operator unseal <key-2>
kubectl exec -n vault vault-0 -- vault operator unseal <key-3>

# Repeat for each Vault pod (vault-0, vault-1, vault-2)

Prevention: Configure auto-unseal with cloud KMS for production

Certificate Errors

Symptoms: TLS errors when accessing services

Diagnosis:

# Check certificates
kubectl get certificates -A

# Check certificate requests
kubectl get certificaterequests -A

# Check cert-manager logs
kubectl logs -n cert-manager -l app=cert-manager

Solutions:

# Check certificate status
kubectl describe certificate <cert-name> -n <namespace>

# Force certificate renewal
kubectl delete secret <tls-secret-name> -n <namespace>
kubectl delete certificaterequest -n <namespace> --all

# Restart cert-manager
kubectl rollout restart deployment cert-manager -n cert-manager

Maintenance Procedures

Regular Maintenance Schedule

Task	Frequency	Procedure
Check cluster resource usage	Daily	Run health checks, ensure <70% utilization
Review failed pods/jobs	Daily	Check for CrashLoopBackOff, fix issues
Check ArgoCD sync status	Daily	Ensure all apps synced
Review Prometheus alerts	Daily	Check Grafana for active alerts
Update component configurations	Weekly	Apply configuration changes via GitOps
Review security scan results	Weekly	Check SonarQube and Trivy reports
Rotate Vault secrets	Monthly	Follow secret rotation procedures
Update platform components	Monthly	Apply security patches and updates
Review DORA metrics	Weekly	Check DevLake dashboards
Backup critical data	Daily	Backup databases and persistent volumes

Backup Procedures

Backup Vault Data

# Backup Vault data
kubectl exec -n vault vault-0 -- vault operator raft snapshot save /tmp/vault-backup.snap
kubectl cp vault/vault-0:/tmp/vault-backup.snap ./vault-backup-$(date +%Y%m%d).snap

Backup Backstage Database

# Backup PostgreSQL (Backstage)
kubectl exec -n backstage db-backstage-dev-1 -- pg_dump -U postgres backstage > backstage-backup-$(date +%Y%m%d).sql

Backup ArgoCD Configuration

# Export ArgoCD applications
argocd app list -o yaml > argocd-apps-backup-$(date +%Y%m%d).yaml

# Backup ArgoCD secrets
kubectl get secrets -n argocd -o yaml > argocd-secrets-backup-$(date +%Y%m%d).yaml

Backup Jenkins Configuration

# Export Jenkins configuration (JCasC)
kubectl get configmap jenkins-casc-config -n jenkins -o yaml > jenkins-config-backup-$(date +%Y%m%d).yaml

# Backup Jenkins jobs (if not in Git)
kubectl exec -n jenkins jenkins-0 -- tar czf /tmp/jenkins-jobs.tar.gz /var/jenkins_home/jobs
kubectl cp jenkins/jenkins-0:/tmp/jenkins-jobs.tar.gz ./jenkins-jobs-backup-$(date +%Y%m%d).tar.gz

Update Procedures

Update Platform Components via ArgoCD

# 1. Update image tag in Git repository
cd gitops-repo
vi platform/<component>/values.yaml
# Update image.tag to new version

# 2. Commit and push
git add .
git commit -m "Update <component> to version <version>"
git push

# 3. Sync ArgoCD application
argocd app sync <app-name>

# 4. Monitor rollout
kubectl rollout status deployment/<deployment-name> -n <namespace>

# 5. Verify health
kubectl get pods -n <namespace>

Emergency Rollback

# Rollback via ArgoCD
argocd app rollback <app-name> <revision-id>

# Or rollback via kubectl
kubectl rollout undo deployment/<deployment-name> -n <namespace>

# Verify rollback
kubectl rollout status deployment/<deployment-name> -n <namespace>

Emergency Response

Severity Levels

Severity	Definition	Response Time	Escalation
P0 - Critical	Platform completely down, no users can access	Immediate	Page on-call engineer
P1 - High	Major feature broken, significant user impact	< 30 minutes	Notify team lead
P2 - Medium	Minor feature issue, workaround available	< 4 hours	Create ticket
P3 - Low	Cosmetic issue, no functional impact	Next business day	Backlog

Emergency Contacts

# Update with actual contact information
Platform Team Lead: <name> <email> <phone>
SRE On-Call: <name> <email> <phone>
Security Team: <name> <email> <phone>
Cloud Provider Support: <phone> <portal-link>

Incident Response Playbook

Step 1: Assess Severity

# Check overall cluster health
kubectl get nodes
kubectl get pods -A | grep -v Running

# Check critical services
kubectl get pods -n argocd
kubectl get pods -n backstage
kubectl get pods -n jenkins

Step 2: Gather Information

# Collect logs
kubectl logs -n <namespace> <pod-name> > incident-logs.txt

# Collect events
kubectl get events -A --sort-by='.lastTimestamp' > incident-events.txt

# Check resource usage
kubectl top nodes > incident-resources.txt
kubectl top pods -A >> incident-resources.txt

Step 3: Immediate Mitigation

# If a component is causing issues, scale it down
kubectl scale deployment <deployment-name> -n <namespace> --replicas=0

# If resource exhaustion, delete completed jobs/pods
kubectl delete pod --field-selector status.phase=Succeeded -A
kubectl delete pod --field-selector status.phase=Failed -A

# If storage full, clean up old data
kubectl exec -n <namespace> <pod-name> -- df -h
# Identify and remove large files

Step 4: Communicate

# Incident notification template

Subject: [P0/P1/P2] <Brief Description> - Fawkes Platform

Status: Investigating/Mitigating/Resolved
Impact: <Description of user impact>
Start Time: <timestamp>
Current Status: <Current state>
Next Update: <Expected time>

Actions Taken:

- <Action 1>
- <Action 2>

Next Steps:

- <Next step 1>
- <Next step 2>

Step 5: Restore Service

# Apply fix (rollback, restart, scale up, etc.)
# Verify fix
kubectl get pods -n <namespace>
kubectl logs -n <namespace> <pod-name>

# Run smoke tests
# Access UI and verify functionality

Step 6: Post-Incident

# Post-incident review template

1. Timeline of events
2. Root cause analysis
3. Impact assessment
4. Actions taken
5. Lessons learned
6. Action items to prevent recurrence

Health Checks

Automated Health Check Script

Save this as scripts/health-check.sh:

#!/bin/bash

echo "=== Fawkes Platform Health Check ==="
echo "Timestamp: $(date)"
echo

# Cluster health
echo "--- Cluster Health ---"
kubectl get nodes
echo

# Resource utilization
echo "--- Resource Utilization ---"
kubectl top nodes
echo

# Critical namespaces
echo "--- Critical Namespaces ---"
kubectl get namespaces | grep -E 'argocd|backstage|jenkins|prometheus|grafana|vault'
echo

# Pod health
echo "--- Pod Health (Non-Running) ---"
kubectl get pods -A | grep -v Running | grep -v Completed
echo

# ArgoCD sync status
echo "--- ArgoCD Applications ---"
kubectl get applications -n argocd -o custom-columns=NAME:.metadata.name,SYNC:.status.sync.status,HEALTH:.status.health.status
echo

# Prometheus targets
echo "--- Prometheus Targets (Down) ---"
kubectl exec -n prometheus prometheus-kube-prometheus-prometheus-0 -- \
  wget -qO- 'http://localhost:9090/api/v1/targets' | \
  jq -r '.data.activeTargets[] | select(.health!="up") | "\(.labels.job) - \(.health)"'
echo

# Certificate expiry
echo "--- Certificates Expiring Soon ---"
kubectl get certificates -A -o json | \
  jq -r '.items[] | select(.status.notAfter) | "\(.metadata.namespace)/\(.metadata.name): \(.status.notAfter)"'
echo

# Vault seal status
echo "--- Vault Seal Status ---"
for i in 0 1 2; do
  echo -n "vault-$i: "
  kubectl exec -n vault vault-$i -- vault status | grep "Sealed" || echo "Check failed"
done
echo

echo "=== Health Check Complete ==="

Make it executable:

chmod +x scripts/health-check.sh

Run daily or on-demand:

./scripts/health-check.sh

Change Log

Date	Version	Changes	Author
2024-12	1.0	Initial Epic 1 runbook	Platform Team

Epic 1: Platform Operations Runbook

Table of Contents

Overview

Component Status Checks

Quick Health Check (All Components)

Kubernetes Cluster

ArgoCD (GitOps)

Backstage (Developer Portal)

Jenkins (CI/CD)

SonarQube (Code Quality)

Vault (Secrets Management)

Kyverno (Policy Engine)

Prometheus & Grafana (Observability)

Harbor (Container Registry)

DevLake (DORA Metrics)

Common Operations

Restarting a Component

Scaling Components

Updating Component Configuration

Checking Logs

Viewing Events

Troubleshooting

Pod Not Starting

Service Not Accessible

ArgoCD Application Out of Sync

Jenkins Build Failures

Prometheus Metrics Missing

Vault Sealed

Certificate Errors

Maintenance Procedures

Regular Maintenance Schedule

Backup Procedures

Backup Vault Data

Backup Backstage Database

Backup ArgoCD Configuration

Backup Jenkins Configuration

Update Procedures

Update Platform Components via ArgoCD

Emergency Rollback

Emergency Response

Severity Levels

Emergency Contacts

Incident Response Playbook

Step 1: Assess Severity

Step 2: Gather Information

Step 3: Immediate Mitigation

Step 4: Communicate

Step 5: Restore Service

Step 6: Post-Incident

Health Checks

Automated Health Check Script

Related Documentation

Change Log