ADR-010: Ingress Controller for Service Access
Status
Accepted
Context
The Fawkes platform integrates multiple services that require external access:
Core Services:
- Backstage (Developer Portal & Dojo Hub) - Primary user interface
- Mattermost (Team Collaboration) - Chat and collaboration
- Focalboard (Project Management) - Kanban boards and planning
- ArgoCD (GitOps UI) - Deployment visualization
- Jenkins (CI/CD UI) - Pipeline management
- Grafana (Observability) - Metrics and dashboards
- Harbor (Container Registry UI) - Image management
- SonarQube (Code Quality) - Security and quality reports
Access Requirements:
- Single entry point with consistent domain structure
- TLS/SSL encryption for all services
- Path-based or subdomain-based routing
- Rate limiting and DDoS protection
- Authentication integration (OIDC/OAuth2)
- Certificate management automation
- Load balancing across replicas
- WebSocket support (Mattermost, ArgoCD, Jenkins)
- Health check integration
- Request/response logging for security auditing
Technical Constraints:
- Must work across AWS, Azure, GCP, and on-premises environments
- Should integrate with cert-manager for automated certificate provisioning
- Must support both path-based (/backstage) and subdomain-based (backstage.fawkes.example.com) routing
- Should minimize cloud provider lock-in
- Must support learner environments with dynamic provisioning
- Should provide observability (request metrics, tracing)
Security Requirements:
- Force HTTPS/TLS for all traffic
- Support for custom certificates and Let's Encrypt
- Web Application Firewall (WAF) capabilities
- Rate limiting per service and per IP
- DDoS protection
- Security headers (HSTS, CSP, X-Frame-Options)
- IP whitelisting capabilities for sensitive services
Operational Requirements:
- Easy configuration via annotations or CRDs
- Automatic service discovery
- Rolling updates without downtime
- Clear error pages and troubleshooting
- Integration with platform monitoring
Decision
We will use NGINX Ingress Controller as the primary ingress solution for the Fawkes platform.
Architecture
Internet
|
v
┌─────────────────────────────────────────┐
│ Cloud Load Balancer (AWS NLB/ALB) │
│ (Optional, for production) │
└─────────────────────────────────────────┘
|
v
┌─────────────────────────────────────────┐
│ NGINX Ingress Controller │
│ - TLS Termination │
│ - Path/Subdomain Routing │
│ - Rate Limiting │
│ - Authentication (OAuth2 Proxy) │
└─────────────────────────────────────────┘
|
├──> Backstage Service (/)
├──> Mattermost Service (/mattermost)
├──> Focalboard Service (/focalboard)
├──> ArgoCD Service (/argocd)
├──> Jenkins Service (/jenkins)
├──> Grafana Service (/grafana)
├──> Harbor Service (/harbor)
└──> SonarQube Service (/sonarqube)
Routing Strategy
Primary: Subdomain-Based Routing (Production)
https://backstage.fawkes.example.com → Backstage
https://chat.fawkes.example.com → Mattermost
https://boards.fawkes.example.com → Focalboard
https://cd.fawkes.example.com → ArgoCD
https://ci.fawkes.example.com → Jenkins
https://metrics.fawkes.example.com → Grafana
https://registry.fawkes.example.com → Harbor
https://quality.fawkes.example.com → SonarQube
Alternative: Path-Based Routing (Development/Learning)
https://fawkes.example.com/ → Backstage
https://fawkes.example.com/chat → Mattermost
https://fawkes.example.com/boards → Focalboard
https://fawkes.example.com/cd → ArgoCD
https://fawkes.example.com/ci → Jenkins
https://fawkes.example.com/metrics → Grafana
https://fawkes.example.com/registry → Harbor
https://fawkes.example.com/quality → SonarQube
Certificate Management
Integration with cert-manager for automated certificate provisioning:
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
name: letsencrypt-prod
spec:
acme:
server: https://acme-v02.api.letsencrypt.org/directory
email: platform-team@example.com
privateKeySecretRef:
name: letsencrypt-prod-key
solvers:
- http01:
ingress:
class: nginx
Example Ingress Configuration
Backstage (Primary Portal):
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: backstage
namespace: fawkes-core
annotations:
cert-manager.io/cluster-issuer: "letsencrypt-prod"
nginx.ingress.kubernetes.io/force-ssl-redirect: "true"
nginx.ingress.kubernetes.io/ssl-protocols: "TLSv1.2 TLSv1.3"
nginx.ingress.kubernetes.io/proxy-body-size: "50m"
nginx.ingress.kubernetes.io/rate-limit: "100"
spec:
ingressClassName: nginx
tls:
- hosts:
- backstage.fawkes.example.com
secretName: backstage-tls
rules:
- host: backstage.fawkes.example.com
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: backstage
port:
number: 7007
Mattermost (WebSocket Support):
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: mattermost
namespace: fawkes-collaboration
annotations:
cert-manager.io/cluster-issuer: "letsencrypt-prod"
nginx.ingress.kubernetes.io/force-ssl-redirect: "true"
nginx.ingress.kubernetes.io/proxy-read-timeout: "600"
nginx.ingress.kubernetes.io/proxy-send-timeout: "600"
nginx.ingress.kubernetes.io/websocket-services: "mattermost"
spec:
ingressClassName: nginx
tls:
- hosts:
- chat.fawkes.example.com
secretName: mattermost-tls
rules:
- host: chat.fawkes.example.com
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: mattermost
port:
number: 8065
Security Configuration
Rate Limiting:
annotations:
nginx.ingress.kubernetes.io/limit-rps: "10"
nginx.ingress.kubernetes.io/limit-connections: "20"
IP Whitelisting (for admin services):
annotations:
nginx.ingress.kubernetes.io/whitelist-source-range: "10.0.0.0/8,172.16.0.0/12"
Security Headers:
annotations:
nginx.ingress.kubernetes.io/configuration-snippet: |
more_set_headers "X-Frame-Options: DENY";
more_set_headers "X-Content-Type-Options: nosniff";
more_set_headers "X-XSS-Protection: 1; mode=block";
more_set_headers "Strict-Transport-Security: max-age=31536000; includeSubDomains";
OAuth2 Proxy Integration
For services without native SSO support:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: grafana
annotations:
nginx.ingress.kubernetes.io/auth-url: "https://oauth2.fawkes.example.com/oauth2/auth"
nginx.ingress.kubernetes.io/auth-signin: "https://oauth2.fawkes.example.com/oauth2/start?rd=$escaped_request_uri"
Monitoring Integration
NGINX Ingress Controller exposes Prometheus metrics:
- Request rate per service
- Request duration percentiles
- Error rates (4xx, 5xx)
- Bytes transferred
- Upstream response time
Grafana dashboards: NGINX Ingress Controller (official dashboard ID: 9614)
Deployment Configuration
NGINX Ingress Controller Helm Values:
controller:
replicaCount: 3
resources:
requests:
cpu: 100m
memory: 128Mi
limits:
cpu: 500m
memory: 512Mi
service:
type: LoadBalancer
annotations:
service.beta.kubernetes.io/aws-load-balancer-type: "nlb"
service.beta.kubernetes.io/aws-load-balancer-cross-zone-load-balancing-enabled: "true"
metrics:
enabled: true
serviceMonitor:
enabled: true
config:
use-forwarded-headers: "true"
compute-full-forwarded-for: "true"
use-proxy-protocol: "false"
enable-real-ip: "true"
proxy-body-size: "50m"
ssl-protocols: "TLSv1.2 TLSv1.3"
ssl-ciphers: "ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384"
podAnnotations:
prometheus.io/scrape: "true"
prometheus.io/port: "10254"
Consequences
Positive
- Cloud Agnostic: NGINX Ingress works identically across AWS, Azure, GCP, and on-premises
- Mature & Proven: Battle-tested in production at massive scale, large community support
- Feature Rich: Comprehensive feature set including rate limiting, WebSocket, authentication
- Observable: Native Prometheus metrics integration
- Flexible Routing: Supports both path-based and subdomain-based routing strategies
- Cost Effective: Open source with no licensing costs
- Well Documented: Extensive documentation, examples, and community resources
- Security Hardened: Regular security updates, CVE tracking, hardening guides available
- GitOps Friendly: Declarative YAML configuration fits ArgoCD workflow
- Learning Friendly: Simple annotation-based configuration good for dojo learners
Negative
- Resource Overhead: NGINX controller pods consume cluster resources (mitigated by proper sizing)
- Single Point of Failure: Requires 3+ replicas for high availability
- Configuration Complexity: Advanced features require learning NGINX-specific annotations
- Path-Based Routing Limitations: Some applications (like Mattermost) work better with subdomain routing
- Certificate Management Dependency: Requires cert-manager for automated TLS (adds complexity)
- Reload on Configuration Change: Configuration changes trigger NGINX reload (brief traffic disruption)
Neutral
- Load Balancer Costs: Cloud load balancers incur costs (AWS NLB ~$16/month + data transfer)
- Monitoring Overhead: Requires Prometheus/Grafana for observability
- DNS Management: Subdomain routing requires wildcard DNS or multiple A records
- Learning Curve: Platform team must understand NGINX configuration paradigms
Alternatives Considered
Alternative 1: Traefik
Pros:
- Native Let's Encrypt integration (no cert-manager needed)
- Dynamic configuration via labels
- Built-in dashboard for traffic visualization
- Smaller resource footprint
- Excellent WebSocket support
- Modern, actively developed
Cons:
- Smaller community compared to NGINX
- Less enterprise adoption
- More limited advanced features (rate limiting, WAF)
- Documentation less comprehensive for complex scenarios
- Fewer third-party integrations
Reason for Rejection: While Traefik is excellent and modern, NGINX Ingress has broader enterprise adoption, more comprehensive documentation for complex scenarios, and better alignment with DORA best practices documentation. The larger community makes it easier for Fawkes learners to find troubleshooting resources.
Alternative 2: Istio/Envoy Service Mesh
Pros:
- Full service mesh capabilities (mTLS, traffic management, observability)
- Advanced traffic routing (A/B testing, canary, circuit breaking)
- Superior observability (distributed tracing, detailed metrics)
- Built-in security (zero-trust networking)
- Envoy is modern, high-performance proxy
Cons:
- Significant complexity overhead (steep learning curve)
- High resource consumption (sidecar for every pod)
- Operational burden (control plane management, upgrades)
- Overkill for ingress-only use case
- Adds 6-8 weeks to MVP timeline
- Too complex for learner environments
Reason for Rejection: Service mesh capabilities are valuable but represent over-engineering for MVP. Fawkes needs ingress management, not full service mesh. Istio/Envoy can be considered post-MVP as "Advanced Networking" module in Brown Belt curriculum.
Alternative 3: Kong Ingress Controller
Pros:
- API gateway features (rate limiting, authentication, transformation)
- Plugin ecosystem for extensibility
- Enterprise version available with support
- Good for API-heavy platforms
- Lua-based customization
- Built-in developer portal
Cons:
- More complex than pure ingress controller
- Requires PostgreSQL for production (additional dependency)
- Licensing considerations (enterprise features)
- Smaller community than NGINX
- API gateway features not needed for internal platform
Reason for Rejection: Kong's strength is API management, which is not a primary Fawkes requirement. The additional complexity and PostgreSQL dependency don't provide sufficient value for our use case. NGINX provides everything we need without API gateway overhead.
Alternative 4: HAProxy Ingress
Pros:
- Extremely high performance and efficiency
- Very low resource consumption
- Battle-tested load balancing capabilities
- Excellent documentation
- Used by major internet properties
Cons:
- Smaller Kubernetes community compared to NGINX
- Less flexible annotation-based configuration
- Fewer examples and tutorials for Kubernetes
- Less integration with cloud-native ecosystem
- Limited WebSocket support compared to NGINX
Reason for Rejection: While HAProxy is excellent for performance-critical scenarios, NGINX Ingress provides better Kubernetes-native integration, more comprehensive documentation for learners, and broader community support. HAProxy's performance advantages are not critical for Fawkes' scale.
Alternative 5: Cloud Provider Ingress (AWS ALB, GCP GCLB, Azure App Gateway)
Pros:
- Native cloud integration
- Managed service (no controller to maintain)
- Tight security integration (IAM, security groups)
- Automatic scaling
- Lower operational overhead
Cons:
- Cloud vendor lock-in (violates Fawkes portability principle)
- Inconsistent behavior across clouds
- Limited customization compared to NGINX
- Annotations differ per cloud
- Cannot run on-premises or in learner laptops
- Makes dojo lab provisioning cloud-specific
Reason for Rejection: Violates core Fawkes principle of cloud portability. Learners need consistent experience across environments. Platform teams should be able to deploy Fawkes anywhere, including on-premises or local Kubernetes clusters. Cloud ingress controllers prevent this flexibility.
Alternative 6: Contour (Envoy-based)
Pros:
- Uses Envoy proxy (modern, high-performance)
- Simpler than full Istio deployment
- Good HTTPProxy CRD for advanced routing
- CNCF project with growing community
- Excellent for multi-tenancy
Cons:
- Smaller community and ecosystem than NGINX
- Less mature documentation
- Fewer examples and tutorials
- Less enterprise adoption
- Not as feature-complete for edge use cases
Reason for Rejection: While Contour is a good middle ground between NGINX and Istio, its smaller community and less mature documentation make it less suitable for a learning-focused platform. NGINX's extensive resources better support dojo learners troubleshooting issues independently.
Implementation Plan
Phase 1: MVP (Week 3 of Sprint 01)
-
Deploy NGINX Ingress Controller [4 hours]
-
Install via Helm chart
- Configure for AWS NLB (or equivalent)
- Verify controller pods running
-
Test basic HTTP routing
-
Deploy cert-manager [2 hours]
-
Install cert-manager via Helm
- Create ClusterIssuer for Let's Encrypt staging
- Test certificate provisioning
-
Create production ClusterIssuer
-
Create Ingress for Backstage [2 hours]
-
Subdomain-based routing (backstage.fawkes.dev)
- TLS certificate from Let's Encrypt
- Force HTTPS redirect
-
Test end-to-end access
-
Document Standard Ingress Pattern [2 hours]
- Create ingress template for new services
- Document annotation patterns
- Create troubleshooting guide
- Add to Dojo Module 2 curriculum
Phase 2: Core Services (Weeks 4-5)
-
Deploy Ingress for Collaboration Services [4 hours]
-
Mattermost with WebSocket support
- Focalboard (integrated with Mattermost)
-
Test real-time features
-
Deploy Ingress for CI/CD Services [4 hours]
-
Jenkins with authentication
- ArgoCD with SSO
-
Harbor with rate limiting
-
Deploy Ingress for Observability Services [3 hours]
- Grafana with OAuth2 proxy
- Prometheus (internal only, IP whitelist)
- OpenSearch dashboards
Phase 3: Security & Monitoring (Week 6)
-
Implement Security Hardening [4 hours]
-
Configure rate limiting
- Add security headers
- IP whitelisting for admin services
-
Test DDoS protection
-
Configure Monitoring [3 hours]
-
Prometheus ServiceMonitor for NGINX metrics
- Grafana dashboard for ingress monitoring
- Alerting rules for high error rates
-
Log aggregation for access logs
-
Create Dojo Lab Automation [4 hours]
- Automated ingress provisioning for learner namespaces
- Dynamic subdomain creation (learner-01.labs.fawkes.dev)
- Wildcard certificate management
- Cleanup automation
Phase 4: Documentation & Training (Week 7)
-
Write Comprehensive Documentation [6 hours]
- Architecture overview with diagrams
- Configuration patterns and best practices
- Troubleshooting guide (common issues)
- Security hardening checklist
-
Create Dojo Module Content [4 hours]
- Yellow Belt Module: "Exposing Services with Ingress"
- Hands-on lab: Create custom ingress
- Assessment questions on TLS, routing, security
- Video walkthrough (15 minutes)
Dojo Integration
Yellow Belt - Module 4: "Exposing Services with Ingress"
Learning Objectives:
- Understand Kubernetes Ingress concepts
- Configure NGINX Ingress Controller
- Implement TLS/SSL with cert-manager
- Apply security best practices (rate limiting, headers)
- Troubleshoot common ingress issues
Hands-On Lab:
- Deploy a sample application to learner namespace
- Create Ingress resource with subdomain routing
- Configure TLS certificate via cert-manager
- Test HTTPS access and forced redirect
- Add rate limiting and security headers
- Monitor ingress metrics in Grafana
Assessment:
- Quiz on Ingress concepts (5 questions)
- Practical: Deploy and expose a new service
- Troubleshoot broken ingress configuration
Time: 90 minutes (30 min theory + 60 min hands-on)
Monitoring & Observability
Key Metrics
NGINX Ingress Controller Metrics:
nginx_ingress_controller_requests- Total requests per servicenginx_ingress_controller_request_duration_seconds- Request latency percentilesnginx_ingress_controller_response_size- Response sizesnginx_ingress_controller_ssl_expire_time_seconds- Certificate expirationnginx_ingress_controller_nginx_process_connections- Active connections
Grafana Dashboard Panels:
- Request Rate (per service, per ingress)
- Request Duration (P50, P95, P99)
- HTTP Status Codes (2xx, 4xx, 5xx rates)
- SSL Certificate Expiration Timeline
- Ingress Controller Resource Usage (CPU, memory)
- Error Rate by Service
Alerting Rules:
groups:
- name: ingress_alerts
rules:
- alert: HighErrorRate
expr: sum(rate(nginx_ingress_controller_requests{status=~"5.."}[5m])) / sum(rate(nginx_ingress_controller_requests[5m])) > 0.05
for: 5m
annotations:
summary: "High 5xx error rate detected"
- alert: CertificateExpiring
expr: (nginx_ingress_controller_ssl_expire_time_seconds - time()) / 86400 < 7
annotations:
summary: "TLS certificate expiring in less than 7 days"
- alert: HighLatency
expr: histogram_quantile(0.95, nginx_ingress_controller_request_duration_seconds_bucket) > 5
for: 10m
annotations:
summary: "P95 latency above 5 seconds"
Security Considerations
TLS/SSL Management
- Certificate Rotation: cert-manager automatically renews certificates 30 days before expiration
- Protocol Enforcement: Only TLSv1.2 and TLSv1.3 allowed
- Cipher Suites: Strong ciphers only, no deprecated algorithms
- HSTS: Strict-Transport-Security header enforced
Rate Limiting
Per-Service Defaults:
- Public services (Backstage): 100 requests/second per IP
- Internal services (Jenkins, ArgoCD): 50 requests/second per IP
- Admin services (Prometheus): 10 requests/second per IP
IP Whitelisting
Sensitive services restricted to:
- Corporate VPN CIDR blocks
- Platform team IP ranges
- CI/CD pipeline source IPs
Web Application Firewall (WAF)
ModSecurity integration (post-MVP):
- OWASP Core Rule Set (CRS)
- SQL injection prevention
- XSS attack blocking
- Request validation
Cost Analysis
AWS Deployment (Production)
Infrastructure:
- Network Load Balancer: $16/month + $0.006/GB data transfer
- EBS volumes (NGINX controller state): $8/month for 80GB
- Data transfer (estimated 1TB/month): $90/month
NGINX Controller Resources:
- 3 replicas × 0.5 CPU × $0.04/hour = $43/month
- 3 replicas × 512MB RAM × $0.005/hour = $5/month
Total Monthly Cost: ~$162/month
Cost Optimization:
- Use AWS ALB for learner/dev environments (cheaper)
- Reduce replica count in non-production
- Implement caching to reduce data transfer
Multi-Environment Cost Breakdown
| Environment | Load Balancer | Replicas | Monthly Cost |
|---|---|---|---|
| Production | NLB | 3 | $162 |
| Staging | ALB | 2 | $40 |
| Development | NodePort | 1 | $0 |
| Learner Labs | ALB (shared) | 2 | $40 |
Documentation Structure
For Platform Teams
-
Architecture Overview
-
Request flow diagrams
- TLS termination architecture
- Certificate management workflow
-
Multi-environment routing strategy
-
Deployment Guide
-
Helm chart installation
- Configuration recommendations
- Cloud-specific considerations
-
Troubleshooting common issues
-
Operations Runbook
- Certificate renewal procedures
- Ingress controller upgrades
- Scaling guidelines
- Incident response procedures
For Dojo Learners
-
Concepts Tutorial
-
What is an Ingress Controller?
- How TLS/SSL works
- Routing strategies comparison
-
Security best practices
-
Hands-On Lab Guide
-
Step-by-step ingress creation
- TLS configuration walkthrough
- Troubleshooting exercises
-
Real-world scenarios
-
Reference Materials
- Annotation cheat sheet
- Common patterns library
- Error message decoder
- kubectl commands reference
Migration Path
From Default Cloud Ingress
If organizations start with cloud-native ingress:
- Week 1: Deploy NGINX Ingress alongside existing ingress
- Week 2: Migrate non-critical services to NGINX
- Week 3: Validate routing, TLS, monitoring
- Week 4: Migrate critical services with rollback plan
- Week 5: Decommission cloud ingress controller
Rollback Strategy: Maintain both controllers for 2 weeks, allow instant DNS cutover
From Path-Based to Subdomain Routing
- Configure subdomain routing for new services
- Maintain path-based routing for existing services
- Gradually migrate services based on traffic patterns
- Update documentation and bookmarks
- Deprecate path-based routing after 6 months
Related Decisions
- ADR-001: Kubernetes for Container Orchestration (ingress is Kubernetes-native)
- ADR-002: Backstage for Developer Portal (primary ingress endpoint)
- ADR-009: External Secrets Operator (integrates with ingress for secrets)
- Future ADR: OAuth2 Proxy for Unified Authentication (auth layer on ingress)
- Future ADR: Service Mesh (potential Istio migration path)
References
- NGINX Ingress Controller Documentation: https://kubernetes.github.io/ingress-nginx/
- cert-manager Documentation: https://cert-manager.io/docs/
- CNCF Ingress Controller Comparison: https://docs.google.com/spreadsheets/d/191WWNpjJ2za6-nbG4ZoUMXMpUK8KlCIosvQB0f-oq3k
- OWASP TLS Cheat Sheet: https://cheatsheetseries.owasp.org/cheatsheets/Transport_Layer_Protection_Cheat_Sheet.html
- Kubernetes Ingress Concepts: https://kubernetes.io/docs/concepts/services-networking/ingress/
Notes
Production Readiness Checklist:
- [ ] 3+ replicas for high availability
- [ ] Cloud load balancer provisioned
- [ ] TLS certificates from trusted CA (Let's Encrypt or corporate)
- [ ] Rate limiting configured
- [ ] Security headers enabled
- [ ] Monitoring dashboards created
- [ ] Alerting rules configured
- [ ] Runbook documented
- [ ] Team trained on ingress operations
Learner Environment Considerations:
- Use path-based routing to minimize DNS complexity
- Provide pre-configured ingress templates
- Automate certificate provisioning
- Create self-service ingress creation workflow
- Include ingress troubleshooting in curriculum
Last Updated
December 7, 2024 - Initial version documenting NGINX Ingress Controller selection