Policy as Code Tiers: Audit vs. Enforce
Context
Security policies are necessary but enforcement is a spectrum, not binary. Enforcing every security best practice from day one would:
- Block legitimate developer workflows
- Create change resistance ("Platform team said NO again")
- Slow down delivery (waiting for exceptions, workarounds)
- Reduce platform adoption (developers route around strict controls)
The solution is tiered policy enforcement: critical policies enforced immediately, important policies audited before enforced, best practices remain advisory.
This document explains Fawkes' Policy-as-Code Tiers framework using Kyverno, and why gradual enforcement is better than "big bang" compliance.
The Problem: All-or-Nothing Enforcement Fails
Scenario 1: The Zero-Tolerance Trap
Platform Team Decision: "All pods must be non-root, no exceptions. Security is non-negotiable."
Week 1: Policy enforced across all namespaces
Week 2: Developers revolt
- Legacy app runs as root (vendor-provided image, can't change)
- Database StatefulSets need elevated privileges
- Debugging containers blocked (troubleshooting tools need root)
Week 3: Platform team creates exception process
- Developers submit tickets requesting policy exemptions
- Platform team reviews each request manually
- Ticket backlog: 40+ exception requests
Week 4: Developers route around platform
- Deploy to "dev" namespace (policies disabled to "not slow down development")
- Promote to prod via kubectl (bypassing GitOps)
- Shadow IT: Run workloads in separate clusters outside platform team control
Outcome: Policy exists on paper, violated in practice. Worse security posture than gradual enforcement.
Scenario 2: The Permissive Paralysis
Platform Team Decision: "We'll make policies advisory-only. Developers know best."
Week 1: Policies in audit mode, violations logged
Month 3: 10,000+ policy violations logged, nobody fixes them
Month 6: Real security incident
- Privileged container exploited
- Attacker pivots to other workloads
- Root cause: Container ran as root (policy violation ignored for 6 months)
Outcome: Audit-only policies without enforcement roadmap = no improvement.
The Solution: Tiered Enforcement Model
Fawkes uses a three-tier policy framework:
| Tier | Enforcement | Scope | Examples |
|---|---|---|---|
| Tier 1: Critical | ✅ Enforce | Immediate security risks | No privileged escalation, no host network, approved registries only |
| Tier 2: Important | ⚠️ Audit → Enforce | Hardening and compliance | Resource limits, non-root users, read-only root filesystem |
| Tier 3: Best Practice | ℹ️ Audit Only | Operational excellence | Recommended labels, deployment strategies, cost tagging |
graph LR
A[New Policy] -->|Security Critical?| B{Tier 1}
B -->|Yes| C[Enforce Immediately]
B -->|No| D{Tier 2}
D -->|Important| E[Audit → Enforce]
D -->|Nice-to-Have| F[Audit Only]
style C fill:#ff6b6b
style E fill:#FFA726
style F fill:#4CAF50
Tier 1: Critical (Enforce Immediately)
Philosophy: Block obvious security holes that no legitimate workload should need.
Examples:
1. No Privileged Containers
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
name: disallow-privileged-containers
annotations:
policies.kyverno.io/tier: critical
policies.kyverno.io/severity: high
spec:
validationFailureAction: Enforce # Block deployment
background: true
rules:
- name: privileged-containers
match:
any:
- resources:
kinds:
- Pod
validate:
message: "Privileged containers are not allowed"
pattern:
spec:
containers:
- =(securityContext):
=(privileged): "false"
Rationale:
- Privileged containers can break out to host
- Virtually no legitimate use case (0.1% of workloads)
- Risk of blocking: Very low
- Risk of allowing: Very high
2. No Host Namespaces
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
name: disallow-host-namespaces
spec:
validationFailureAction: Enforce
rules:
- name: host-namespaces
match:
any:
- resources:
kinds:
- Pod
validate:
message: "Sharing host namespaces is not allowed"
pattern:
spec:
=(hostNetwork): "false"
=(hostIPC): "false"
=(hostPID): "false"
Rationale:
- Host network/PID/IPC access = container breakout vector
- Legitimate use cases: <1% (CNI plugins, monitoring agents)
- Those exceptions: Platform team manages, not developers
3. Images from Approved Registries Only
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
name: restrict-image-registries
spec:
validationFailureAction: Enforce
rules:
- name: approved-registries
match:
any:
- resources:
kinds:
- Pod
validate:
message: "Images must be from approved registries: harbor.fawkes.io, gcr.io/fawkes"
pattern:
spec:
containers:
- image: "harbor.fawkes.io/* | gcr.io/fawkes/*"
Rationale:
- Public Docker Hub images: unvetted, potential supply chain attacks
- Approved registries: scanned for vulnerabilities, signed
- Risk: Developer pulls
random/cryptominer:latest(supply chain attack)
Tier 1 Exception Process: Manual approval by security team, documented justification, time-limited exemption.
Tier 2: Important (Audit → Enforce)
Philosophy: Security hardening that requires migration time for existing workloads.
Rollout Process:
| Phase | Duration | Action | Developer Impact |
|---|---|---|---|
| Phase 1: Audit | Month 1-2 | Log violations, no blocking | None (visibility only) |
| Phase 2: Alert | Month 3 | Weekly reports to teams | Awareness |
| Phase 3: Warn | Month 4 | Deployments succeed with warning | Noise |
| Phase 4: Enforce | Month 5+ | Block non-compliant deployments | Compliance required |
Examples:
1. Require Resource Limits
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
name: require-resource-limits
annotations:
policies.kyverno.io/tier: important
spec:
validationFailureAction: Audit # Will become Enforce after migration period
background: true
rules:
- name: check-resource-limits
match:
any:
- resources:
kinds:
- Pod
validate:
message: "CPU and memory limits are required"
pattern:
spec:
containers:
- resources:
limits:
memory: "?*"
cpu: "?*"
Why Gradual:
- Legacy apps may not have limits set (works fine, but risky)
- Impact analysis needed: What limits are appropriate?
- Migration time: Teams need to test with limits, adjust
Rollout:
- Month 1 (Audit): Platform team identifies 120 pods without limits
- Month 2 (Alert): Teams notified via Slack, given template configs
- Month 3 (Warn): Deployments warn but succeed, countdown timer shown
- Month 4 (Enforce): Policy enforced, non-compliant deployments blocked
2. Non-Root Users
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
name: require-non-root
spec:
validationFailureAction: Audit # Gradual rollout
rules:
- name: non-root-user
match:
any:
- resources:
kinds:
- Pod
validate:
message: "Containers must run as non-root user"
pattern:
spec:
containers:
- securityContext:
runAsNonRoot: true
Why Gradual:
- Many images default to root user
- Developer needs to rebuild image or override in security context
- Migration: Takes time to rebuild images
3. Read-Only Root Filesystem
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
name: require-read-only-root-filesystem
spec:
validationFailureAction: Audit
rules:
- name: read-only-root-fs
match:
any:
- resources:
kinds:
- Pod
validate:
message: "Root filesystem should be read-only"
pattern:
spec:
containers:
- securityContext:
readOnlyRootFilesystem: true
Why Gradual:
- Apps that write to local filesystem must be refactored (write to emptyDir volume instead)
- Requires code changes, testing
Tier 2 Exception Process: Team lead approval, documented technical reason, review every 90 days.
Tier 3: Best Practice (Audit Only)
Philosophy: Encourage good practices but never block. Use visibility to drive cultural change.
Examples:
1. Recommended Labels
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
name: recommended-labels
spec:
validationFailureAction: Audit # Never enforced
rules:
- name: check-labels
match:
any:
- resources:
kinds:
- Deployment
validate:
message: "Recommended labels missing: app.kubernetes.io/name, app.kubernetes.io/version, team"
pattern:
metadata:
labels:
app.kubernetes.io/name: "?*"
app.kubernetes.io/version: "?*"
team: "?*"
Why Audit-Only:
- Labels helpful for observability, cost allocation, ownership
- But lack of labels doesn't create security risk
- Carrot, not stick: Show teams their compliance score, let them improve voluntarily
2. Deployment Strategies
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
name: recommended-deployment-strategy
spec:
validationFailureAction: Audit
rules:
- name: rolling-update
match:
any:
- resources:
kinds:
- Deployment
validate:
message: "Recommended strategy: RollingUpdate (not Recreate)"
pattern:
spec:
strategy:
type: RollingUpdate
Why Audit-Only:
- RollingUpdate preferred (zero-downtime deployments)
- But Recreate valid for stateful apps, development environments
- Education: Show metric "85% of deployments use RollingUpdate", team can improve
3. Cost Allocation Tags
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
name: cost-allocation-tags
spec:
validationFailureAction: Audit
rules:
- name: cost-center-label
match:
any:
- resources:
kinds:
- Deployment
validate:
message: "Recommended label for cost tracking: cost-center"
pattern:
metadata:
labels:
cost-center: "?*"
Why Audit-Only:
- Helps finance team allocate cloud costs
- But not a security requirement
- Gradual adoption: Teams add cost-center labels for accurate chargeback
Policy Lifecycle: From Audit to Enforce
The Migration Path
graph LR
A[New Policy Created] -->|Week 1| B[Audit Mode]
B -->|Week 4| C[Analyze Violations]
C -->|Week 5-8| D[Team Migration]
D -->|Week 9| E{Ready?}
E -->|Yes| F[Enforce Mode]
E -->|No| G[Extend Timeline]
G --> D
style F fill:#4CAF50
Example: "Require Non-Root Users" Policy
Week 1-4: Audit Phase
- Policy deployed in audit mode
- Violations logged to PolicyReport CRD
- Platform team analyzes impact
Report:
apiVersion: wgpolicyk8s.io/v1alpha2
kind: PolicyReport
metadata:
name: require-non-root-violations
summary:
pass: 42
fail: 18 # 18 pods violating policy
warn: 0
error: 0
results:
- policy: require-non-root
rule: non-root-user
result: fail
resources:
- kind: Pod
namespace: api-gateway
name: api-pod-abc123
message: "Container 'api' runs as root user"
Week 5-8: Migration Phase
-
Communication: Email teams with violations
-
"Your
api-gatewayservice runs as root. Here's how to fix..." -
Link to documentation, example configs
-
Office Hours: Weekly drop-in sessions for teams needing help
-
Tracking: Dashboard shows compliance trend
- Week 5: 70% compliant
- Week 6: 80% compliant
- Week 7: 90% compliant
Week 9: Enforcement Decision
- ✅ 90%+ compliant → Switch to enforce mode
- ⚠️ <90% compliant → Extend migration timeline
Week 10+: Enforce Mode
spec:
validationFailureAction: Enforce # Now blocking
Remaining violations: Platform team works 1-on-1 with teams or grants time-limited exemptions.
Governance: Who Decides What Tier?
Policy Review Board
Members:
- Platform Lead
- Security Engineer
- 2x Developer Representatives (rotates quarterly)
Responsibilities:
- Review new policies: Assign tier (Critical, Important, Best Practice)
- Monitor audit-to-enforce transitions: Are teams ready?
- Handle exception requests: Approve/deny Tier 1 exemptions
- Quarterly policy review: Should Tier 2 policies move to Tier 1? Retire unused policies?
Example Decision:
Proposed Policy: "All deployments must use Argo Rollouts for canary deployments"
Discussion:
- Security: Not a security issue (Tier 3 at most)
- Developer: This would block 80% of current deployments, high friction
- Platform: Canary is best practice but not mandatory for low-traffic apps
Decision: Tier 3 (Audit Only)
- Encourage adoption via education
- Show metric: "Teams using canary have 50% lower CFR"
- Don't enforce (too disruptive for marginal benefit)
Exception Request Process
Tier 1 (Critical) Exception:
- Submit ticket with justification
- Security team reviews
- Approval requires:
- Technical necessity (no alternative)
- Compensating controls (extra monitoring, isolated namespace)
- Time limit (expires in 90 days, must re-justify)
- Documented in audit log
Tier 2 (Important) Exception:
- Team lead approves via annotation
metadata: annotations: policy.fawkes.io/exception: "require-non-root" policy.fawkes.io/reason: "Legacy vendor image, migration planned Q2" policy.fawkes.io/expires: "2024-06-30" - Reviewed quarterly
- Exemption expires automatically, must renew
Tier 3 (Best Practice) Exception:
- No exception needed (audit-only)
- Teams can ignore violations without approval
Trade-Offs: Security vs. Velocity
What Tiered Enforcement Gives You
| Benefit | Impact |
|---|---|
| Gradual Adoption | Teams adapt to policies over time, not shocked by "big bang" |
| Developer Trust | Platform team seen as partner, not blocker |
| Higher Compliance | 90%+ adoption vs. 50% with all-or-nothing enforcement |
| Risk Mitigation | Critical risks blocked immediately, lower risks addressed iteratively |
| Continuous Improvement | Policies evolve based on real-world usage, not theory |
What Tiered Enforcement Costs You
| Challenge | Mitigation |
|---|---|
| Complexity | Managing three tiers, transition timelines. Mitigation: Automation (policy reports, dashboards) |
| Incomplete Security | Tier 2/3 violations exist during migration. Mitigation: Acceptable risk for gradual improvement |
| Communication Overhead | Must explain tier decisions, migration plans. Mitigation: Transparent documentation, regular updates |
| Exception Sprawl | Teams request exemptions to avoid work. Mitigation: Time-limited exemptions, quarterly review |
The Bet: 90% compliance with gradual enforcement beats 50% compliance with strict enforcement.
Metrics: Measuring Policy Success
Compliance Dashboard
Fawkes Policy Compliance (November 2024)
Overall Compliance: 87% (↑ 5% vs. last month)
Tier 1 (Critical) - Enforced:
Compliance: 98% ✅
Violations: 3 (all approved exemptions)
Tier 2 (Important) - Audit → Enforce:
Compliance: 82% ⚠️
Violations: 27
On Track to Enforce: March 2025
Tier 3 (Best Practice) - Audit Only:
Compliance: 65% ℹ️
Violations: 105
Trend: ↑ 10% over 6 months (voluntary adoption)
Top Violations (Tier 2):
1. Missing resource limits (18 pods)
2. Non-root users (7 pods)
3. Read-only filesystem (2 pods)
Action Items:
- Office hours scheduled for teams with resource limit violations
- Automated mutation policy proposed to add default limits
Policy Effectiveness Metrics
| Metric | Target | Current |
|---|---|---|
| Tier 1 Compliance | >95% | 98% ✅ |
| Tier 2 Compliance | >85% before enforcement | 82% ⚠️ |
| Tier 2 Enforcement Deadlines Met | >80% on-time | 100% ✅ |
| Exception Requests (Tier 1) | <5 active | 3 ✅ |
| Time to Remediate Violations | <30 days | 22 days ✅ |
Related Reading
- ADR: ADR-017: Kyverno Policy Engine
- Explanation: Zero Trust Security Model
- How-To: Troubleshoot Kyverno Violations
- Reference: Kyverno Policy Library
Conclusion
Security policies are necessary, but enforcement is an art, not a science.
The tiered enforcement model recognizes that:
- Not all policies are equal - Some prevent immediate breaches (enforce now), others harden over time (audit first)
- Developers need migration time - Sudden enforcement creates resistance and shadow IT
- Culture beats compliance - Gradual adoption builds trust, all-or-nothing breeds resentment
- Perfect is the enemy of good - 90% compliance with partnership beats 50% compliance with force
The Framework:
- Tier 1 (Critical): Enforce immediately, exceptions rare
- Tier 2 (Important): Audit → Enforce with migration timeline
- Tier 3 (Best Practice): Audit only, encourage via visibility
The Philosophy: Security is a journey, not a destination. Meet developers where they are, guide them to where they should be.
This is Policy-as-Code with empathy.