Policy as Code Tiers: Audit vs. Enforce

Context

Security policies are necessary, but enforcement is a spectrum, not a binary choice. Enforcing every security best practice from day one would:

Block legitimate developer workflows
Create change resistance ("Platform team said NO again")
Slow down delivery (waiting for exceptions, workarounds)
Reduce platform adoption (developers route around strict controls)

The solution is tiered policy enforcement: critical policies are enforced immediately, important policies are audited before they are enforced, and best practices remain advisory.

This document explains Fawkes' Policy-as-Code Tiers framework using Kyverno, and why gradual enforcement is better than "big bang" compliance.

The Problem: All-or-Nothing Enforcement Fails

Scenario 1: The Zero-Tolerance Trap

Platform Team Decision: "All pods must be non-root, no exceptions. Security is non-negotiable."

Week 1: Policy enforced across all namespaces

Week 2: Developers revolt

Legacy app runs as root (vendor-provided image, can't change)
Database StatefulSets need elevated privileges
Debugging containers blocked (troubleshooting tools need root)

Week 3: Platform team creates exception process

Developers submit tickets requesting policy exemptions
Platform team reviews each request manually
Ticket backlog: 40+ exception requests

Week 4: Developers route around platform

Deploy to "dev" namespace (policies disabled to "not slow down development")
Promote to prod via kubectl (bypassing GitOps)
Shadow IT: Run workloads in separate clusters outside platform team control

Outcome: Policy exists on paper, violated in practice. Worse security posture than gradual enforcement.

Scenario 2: The Permissive Paralysis

Platform Team Decision: "We'll make policies advisory-only. Developers know best."

Week 1: Policies in audit mode, violations logged

Month 3: 10,000+ policy violations logged, nobody fixes them

Month 6: Real security incident

Privileged container exploited
Attacker pivots to other workloads
Root cause: Container ran as root (policy violation ignored for 6 months)

Outcome: Audit-only policies without enforcement roadmap = no improvement.

The Solution: Tiered Enforcement Model

Fawkes uses a three-tier policy framework:

Tier	Enforcement	Scope	Examples
Tier 1: Critical	✅ Enforce	Immediate security risks	No privileged containers, no host namespaces, approved registries only
Tier 2: Important	⚠️ Audit → Enforce	Hardening and compliance	Resource limits, non-root users, read-only root filesystem
Tier 3: Best Practice	ℹ️ Audit Only	Operational excellence	Recommended labels, deployment strategies, cost tagging

graph LR
    A[New Policy] -->|Security Critical?| B{Tier 1}
    B -->|Yes| C[Enforce Immediately]
    B -->|No| D{Tier 2}
    D -->|Important| E[Audit → Enforce]
    D -->|Nice-to-Have| F[Audit Only]

    style C fill:#ff6b6b
    style E fill:#FFA726
    style F fill:#4CAF50
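
The tier table and decision flow above can be read as a small lookup. A minimal sketch in Python, assuming a policy's tier and migration status are known; the `Tier` enum and `failure_action` helper are hypothetical names for illustration and are not part of Fawkes or Kyverno:

```python
from enum import Enum


class Tier(Enum):
    CRITICAL = 1       # Tier 1: enforce immediately
    IMPORTANT = 2      # Tier 2: audit first, enforce after migration
    BEST_PRACTICE = 3  # Tier 3: audit only


def failure_action(tier: Tier, migration_complete: bool = False) -> str:
    """Return the validationFailureAction value a policy in this tier would carry."""
    if tier is Tier.CRITICAL:
        return "Enforce"
    if tier is Tier.IMPORTANT and migration_complete:
        return "Enforce"
    # Tier 2 during migration, and Tier 3 always, stay in audit mode.
    return "Audit"


print(failure_action(Tier.IMPORTANT))                           # Audit
print(failure_action(Tier.IMPORTANT, migration_complete=True))  # Enforce
```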

Tier 1: Critical (Enforce Immediately)

Philosophy: Block obvious security holes that no legitimate workload should need.

Examples:

1. No Privileged Containers

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: disallow-privileged-containers
  annotations:
    policies.kyverno.io/tier: critical
    policies.kyverno.io/severity: high
spec:
  validationFailureAction: Enforce # Block deployment
  background: true
  rules:
    - name: privileged-containers
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "Privileged containers are not allowed"
        pattern:
          spec:
            containers:
              - =(securityContext):
                  =(privileged): "false"

Rationale:

Privileged containers can break out to host
Virtually no legitimate use case (0.1% of workloads)
Risk of blocking: Very low
Risk of allowing: Very high

2. No Host Namespaces

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: disallow-host-namespaces
spec:
  validationFailureAction: Enforce
  rules:
    - name: host-namespaces
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "Sharing host namespaces is not allowed"
        pattern:
          spec:
            =(hostNetwork): "false"
            =(hostIPC): "false"
            =(hostPID): "false"

Rationale:

Host network/PID/IPC access = container breakout vector
Legitimate use cases: <1% (CNI plugins, monitoring agents)
Those exceptions: Platform team manages, not developers

3. Images from Approved Registries Only

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: restrict-image-registries
spec:
  validationFailureAction: Enforce
  rules:
    - name: approved-registries
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "Images must be from approved registries: harbor.fawkes.io, gcr.io/fawkes"
        pattern:
          spec:
            containers:
              - image: "harbor.fawkes.io/* | gcr.io/fawkes/*"

Rationale:

Public Docker Hub images: unvetted, potential supply chain attacks
Approved registries: scanned for vulnerabilities, signed
Risk: Developer pulls random/cryptominer:latest (supply chain attack)
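
To make the registry pattern concrete, the sketch below applies the same two globs to a few image references using Python's standard `fnmatch` module. It approximates the wildcard matching for illustration; `image_is_approved` is a hypothetical helper, and the example image names other than `random/cryptominer:latest` are made up.

```python
from fnmatch import fnmatchcase

# Globs copied from the policy pattern above.
APPROVED_PATTERNS = ("harbor.fawkes.io/*", "gcr.io/fawkes/*")


def image_is_approved(image: str) -> bool:
    """True if the image reference matches any approved registry glob."""
    return any(fnmatchcase(image, pattern) for pattern in APPROVED_PATTERNS)


for image in (
    "harbor.fawkes.io/payments/api:1.4.2",  # hypothetical image name
    "gcr.io/fawkes/worker:2.0",             # hypothetical image name
    "random/cryptominer:latest",
):
    print(image, image_is_approved(image))
```

The first two references are accepted and the third is rejected, because it has no approved registry prefix.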

Tier 1 Exception Process: Manual approval by security team, documented justification, time-limited exemption.

Tier 2: Important (Audit → Enforce)

Philosophy: Security hardening that requires migration time for existing workloads.

Rollout Process:

Phase	Duration	Action	Developer Impact
Phase 1: Audit	Month 1-2	Log violations, no blocking	None (visibility only)
Phase 2: Alert	Month 3	Weekly reports to teams	Awareness
Phase 3: Warn	Month 4	Deployments succeed with warning	Visible warnings, no blocking
Phase 4: Enforce	Month 5+	Block non-compliant deployments	Compliance required
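
The rollout schedule can be expressed as a simple mapping from the month of the rollout to its phase. This is a sketch of the table only; `rollout_phase` is a hypothetical helper, and real timelines may be extended when teams are not ready.

```python
def rollout_phase(month: int) -> str:
    """Map the month of a Tier 2 rollout (1-based) to its phase from the table."""
    if month < 1:
        raise ValueError("month is 1-based")
    if month <= 2:
        return "Audit"
    if month == 3:
        return "Alert"
    if month == 4:
        return "Warn"
    return "Enforce"  # Month 5 and later


print([rollout_phase(m) for m in range(1, 7)])
# ['Audit', 'Audit', 'Alert', 'Warn', 'Enforce', 'Enforce']
```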

Examples:

1. Require Resource Limits

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-resource-limits
  annotations:
    policies.kyverno.io/tier: important
spec:
  validationFailureAction: Audit # Will become Enforce after migration period
  background: true
  rules:
    - name: check-resource-limits
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "CPU and memory limits are required"
        pattern:
          spec:
            containers:
              - resources:
                  limits:
                    memory: "?*"
                    cpu: "?*"

Why Gradual:

Legacy apps may not have limits set (works fine, but risky)
Impact analysis needed: What limits are appropriate?
Migration time: Teams need to test with limits, adjust

Rollout:

Month 1-2 (Audit): Platform team identifies 120 pods without limits
Month 3 (Alert): Teams notified via Slack, given template configs
Month 4 (Warn): Deployments warn but succeed, countdown timer shown
Month 5 (Enforce): Policy enforced, non-compliant deployments blocked
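
During the audit phase, the list of affected pods would normally come from Kyverno's policy reports. To show what the rule checks, the sketch below applies the same condition (a non-empty CPU and memory limit on every container) to a pod dictionary; `containers_missing_limits` is a hypothetical helper, not part of any Fawkes tooling.

```python
def containers_missing_limits(pod: dict) -> list:
    """Return names of containers that lack a CPU or memory limit."""
    missing = []
    for container in pod.get("spec", {}).get("containers", []):
        limits = (container.get("resources") or {}).get("limits") or {}
        # The policy pattern "?*" requires a non-empty value for both keys.
        if not limits.get("cpu") or not limits.get("memory"):
            missing.append(container.get("name", "<unnamed>"))
    return missing


pod = {
    "spec": {
        "containers": [
            {"name": "api", "resources": {"limits": {"cpu": "500m", "memory": "256Mi"}}},
            {"name": "sidecar", "resources": {"limits": {"memory": "64Mi"}}},
        ]
    }
}
print(containers_missing_limits(pod))  # ['sidecar']
```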

2. Non-Root Users

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-non-root
spec:
  validationFailureAction: Audit # Gradual rollout
  rules:
    - name: non-root-user
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "Containers must run as non-root user"
        pattern:
          spec:
            containers:
              - securityContext:
                  runAsNonRoot: true

Why Gradual:

Many images default to root user
Developers must rebuild the image with a non-root user or override the user in the pod's security context
Migration: Takes time to rebuild images

3. Read-Only Root Filesystem

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-read-only-root-filesystem
spec:
  validationFailureAction: Audit
  rules:
    - name: read-only-root-fs
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "Root filesystem should be read-only"
        pattern:
          spec:
            containers:
              - securityContext:
                  readOnlyRootFilesystem: true

Why Gradual:

Apps that write to local filesystem must be refactored (write to emptyDir volume instead)
Requires code changes, testing

Tier 2 Exception Process: Team lead approval, documented technical reason, review every 90 days.

Tier 3: Best Practice (Audit Only)

Philosophy: Encourage good practices but never block. Use visibility to drive cultural change.

Examples:

1. Recommended Labels

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: recommended-labels
spec:
  validationFailureAction: Audit # Never enforced
  rules:
    - name: check-labels
      match:
        any:
          - resources:
              kinds:
                - Deployment
      validate:
        message: "Recommended labels missing: app.kubernetes.io/name, app.kubernetes.io/version, team"
        pattern:
          metadata:
            labels:
              app.kubernetes.io/name: "?*"
              app.kubernetes.io/version: "?*"
              team: "?*"

Why Audit-Only:

Labels helpful for observability, cost allocation, ownership
But lack of labels doesn't create security risk
Carrot, not stick: Show teams their compliance score, let them improve voluntarily

2. Deployment Strategies

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: recommended-deployment-strategy
spec:
  validationFailureAction: Audit
  rules:
    - name: rolling-update
      match:
        any:
          - resources:
              kinds:
                - Deployment
      validate:
        message: "Recommended strategy: RollingUpdate (not Recreate)"
        pattern:
          spec:
            strategy:
              type: RollingUpdate

Why Audit-Only:

RollingUpdate preferred (zero-downtime deployments)
But Recreate valid for stateful apps, development environments
Education: Show metric "85% of deployments use RollingUpdate", team can improve

3. Cost Allocation Tags

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: cost-allocation-tags
spec:
  validationFailureAction: Audit
  rules:
    - name: cost-center-label
      match:
        any:
          - resources:
              kinds:
                - Deployment
      validate:
        message: "Recommended label for cost tracking: cost-center"
        pattern:
          metadata:
            labels:
              cost-center: "?*"

Why Audit-Only:

Helps finance team allocate cloud costs
But not a security requirement
Gradual adoption: Teams add cost-center labels for accurate chargeback

Policy Lifecycle: From Audit to Enforce

The Migration Path

graph LR
    A[New Policy Created] -->|Week 1| B[Audit Mode]
    B -->|Week 4| C[Analyze Violations]
    C -->|Week 5-8| D[Team Migration]
    D -->|Week 9| E{Ready?}
    E -->|Yes| F[Enforce Mode]
    E -->|No| G[Extend Timeline]
    G --> D

    style F fill:#4CAF50

Example: "Require Non-Root Users" Policy

Week 1-4: Audit Phase

Policy deployed in audit mode
Violations logged to PolicyReport CRD
Platform team analyzes impact

Report:

apiVersion: wgpolicyk8s.io/v1alpha2
kind: PolicyReport
metadata:
  name: require-non-root-violations
summary:
  pass: 42
  fail: 18 # 18 pods violating policy
  warn: 0
  error: 0
results:
  - policy: require-non-root
    rule: non-root-user
    result: fail
    resources:
      - kind: Pod
        namespace: api-gateway
        name: api-pod-abc123
    message: "Container 'api' runs as root user"
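
A compliance percentage can be derived from the report's `summary` block. The sketch below assumes compliance is defined as passes over passes plus failures, ignoring `warn` and `error`; that definition is an assumption, and `compliance_rate` is a hypothetical helper.

```python
def compliance_rate(summary: dict) -> float:
    """Percentage of evaluated resources that passed, from a PolicyReport summary."""
    passed = summary.get("pass", 0)
    failed = summary.get("fail", 0)
    evaluated = passed + failed
    if evaluated == 0:
        return 100.0  # nothing evaluated, nothing violating
    return 100.0 * passed / evaluated


summary = {"pass": 42, "fail": 18, "warn": 0, "error": 0}
print(f"{compliance_rate(summary):.0f}% compliant")  # 70% compliant
```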

Week 5-8: Migration Phase

Communication: Email teams with violations
"Your api-gateway service runs as root. Here's how to fix..."
Link to documentation, example configs
Office Hours: Weekly drop-in sessions for teams needing help
Tracking: Dashboard shows compliance trend
Week 5: 70% compliant
Week 6: 80% compliant
Week 7: 90% compliant

Week 9: Enforcement Decision

✅ 90%+ compliant → Switch to enforce mode
⚠️ <90% compliant → Extend migration timeline
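
The Week 9 gate is a threshold check on the latest compliance figure. A minimal sketch, using the compliance trend from the migration phase above; `enforcement_decision` is a hypothetical helper, and in practice the review board may weigh other factors.

```python
ENFORCE_THRESHOLD = 90.0  # percent, from the Week 9 decision rule


def enforcement_decision(compliance_pct: float) -> str:
    """Apply the Week 9 gate: enforce at 90%+ compliance, otherwise extend."""
    if compliance_pct >= ENFORCE_THRESHOLD:
        return "Enforce"
    return "Extend"


weekly_trend = {5: 70.0, 6: 80.0, 7: 90.0}  # week -> compliance, from the example
latest = weekly_trend[max(weekly_trend)]
print(enforcement_decision(latest))  # Enforce
```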

Week 10+: Enforce Mode

spec:
  validationFailureAction: Enforce # Now blocking

Remaining violations: Platform team works 1-on-1 with teams or grants time-limited exemptions.

Governance: Who Decides What Tier?

Policy Review Board

Members:

Platform Lead
Security Engineer
2x Developer Representatives (rotate quarterly)

Responsibilities:

Review new policies: Assign tier (Critical, Important, Best Practice)
Monitor audit-to-enforce transitions: Are teams ready?
Handle exception requests: Approve/deny Tier 1 exemptions
Quarterly policy review: Should Tier 2 policies move to Tier 1? Retire unused policies?

Example Decision:

Proposed Policy: "All deployments must use Argo Rollouts for canary deployments"

Discussion:

Security: Not a security issue (Tier 3 at most)
Developer: This would block 80% of current deployments, high friction
Platform: Canary is best practice but not mandatory for low-traffic apps

Decision: Tier 3 (Audit Only)

Encourage adoption via education
Show metric: "Teams using canary have a 50% lower change failure rate (CFR)"
Don't enforce (too disruptive for marginal benefit)

Exception Request Process

Tier 1 (Critical) Exception:

Submit ticket with justification
Security team reviews
Approval requires:
Technical necessity (no alternative)
Compensating controls (extra monitoring, isolated namespace)
Time limit (expires in 90 days, must re-justify)
Documented in audit log
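
The approval requirements above lend themselves to a completeness check before a request reaches the security team. The sketch below is illustrative only: the `ExceptionRequest` fields and `review_problems` helper are hypothetical, and the check does not replace the human review of technical necessity.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from typing import List, Optional

MAX_EXEMPTION_DAYS = 90  # Tier 1 exemptions expire in 90 days


@dataclass
class ExceptionRequest:
    policy: str
    justification: str  # why no alternative exists
    compensating_controls: List[str] = field(default_factory=list)
    requested_on: date = field(default_factory=date.today)
    expires_on: Optional[date] = None


def review_problems(request: ExceptionRequest) -> List[str]:
    """Return reasons the request is incomplete; an empty list means it can be reviewed."""
    problems = []
    if not request.justification.strip():
        problems.append("missing justification")
    if not request.compensating_controls:
        problems.append("no compensating controls")
    if request.expires_on is None:
        problems.append("no expiry date")
    elif request.expires_on - request.requested_on > timedelta(days=MAX_EXEMPTION_DAYS):
        problems.append("expiry exceeds 90 days")
    return problems
```

A request with a justification, at least one compensating control, and an expiry within 90 days returns an empty list; anything else returns the specific gaps to fix before resubmitting.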

Tier 2 (Important) Exception:

Team lead approves via annotation

metadata:
  annotations:
    policy.fawkes.io/exception: "require-non-root"
    policy.fawkes.io/reason: "Legacy vendor image, migration planned Q2"
    policy.fawkes.io/expires: "2024-06-30"

Reviewed quarterly
Exemption expires automatically, must renew
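
Annotations only record intent; some component has to read `policy.fawkes.io/expires` and stop honoring the exemption after that date. Which component does this in Fawkes is not specified here, so the sketch below shows only the date logic. It assumes the expiry date is inclusive and that a missing or malformed date invalidates the exemption; `exemption_active` is a hypothetical helper.

```python
from datetime import date

EXCEPTION_KEY = "policy.fawkes.io/exception"
EXPIRES_KEY = "policy.fawkes.io/expires"


def exemption_active(annotations: dict, policy: str, today: date) -> bool:
    """True if the annotations exempt `policy` and the exemption has not expired."""
    if annotations.get(EXCEPTION_KEY) != policy:
        return False
    expires = annotations.get(EXPIRES_KEY)
    if not expires:
        return False  # no expiry date: treat as invalid rather than permanent
    try:
        return today <= date.fromisoformat(expires)
    except ValueError:
        return False  # malformed date: do not honor the exemption


annotations = {
    "policy.fawkes.io/exception": "require-non-root",
    "policy.fawkes.io/reason": "Legacy vendor image, migration planned Q2",
    "policy.fawkes.io/expires": "2024-06-30",
}
print(exemption_active(annotations, "require-non-root", date(2024, 6, 30)))  # True
print(exemption_active(annotations, "require-non-root", date(2024, 7, 1)))   # False
```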

Tier 3 (Best Practice) Exception:

No exception needed (audit-only)
Teams can ignore violations without approval

Trade-Offs: Security vs. Velocity

What Tiered Enforcement Gives You

Benefit	Impact
Gradual Adoption	Teams adapt to policies over time, not shocked by "big bang"
Developer Trust	Platform team seen as partner, not blocker
Higher Compliance	Targets 90%+ compliance vs. roughly 50% expected with all-or-nothing enforcement
Risk Mitigation	Critical risks blocked immediately, lower risks addressed iteratively
Continuous Improvement	Policies evolve based on real-world usage, not theory

What Tiered Enforcement Costs You

Challenge	Description	Mitigation
Complexity	Managing three tiers, transition timelines	Automation (policy reports, dashboards)
Incomplete Security	Tier 2/3 violations exist during migration	Acceptable risk for gradual improvement
Communication Overhead	Must explain tier decisions, migration plans	Transparent documentation, regular updates
Exception Sprawl	Teams request exemptions to avoid work	Time-limited exemptions, quarterly review

The Bet: 90% compliance with gradual enforcement beats 50% compliance with strict enforcement.

Metrics: Measuring Policy Success

Compliance Dashboard

Fawkes Policy Compliance (November 2024)

Overall Compliance: 87% (↑ 5% vs. last month)

Tier 1 (Critical) - Enforced:
  Compliance: 98%  ✅
  Violations: 3 (all approved exemptions)

Tier 2 (Important) - Audit → Enforce:
  Compliance: 82%  ⚠️
  Violations: 27
  On Track to Enforce: March 2025

Tier 3 (Best Practice) - Audit Only:
  Compliance: 65%  ℹ️
  Violations: 105
  Trend: ↑ 10% over 6 months (voluntary adoption)

Top Violations (Tier 2):
  1. Missing resource limits (18 pods)
  2. Containers running as root (7 pods)
  3. Writable root filesystem (2 pods)

Action Items:
  - Office hours scheduled for teams with resource limit violations
  - Automated mutation policy proposed to add default limits

Policy Effectiveness Metrics

Metric	Target	Current
Tier 1 Compliance	>95%	98% ✅
Tier 2 Compliance	>85% before enforcement	82% ⚠️
Tier 2 Enforcement Deadlines Met	>80% on-time	100% ✅
Exception Requests (Tier 1)	<5 active	3 ✅
Time to Remediate Violations	<30 days	22 days ✅
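
Checking each metric against its target is mechanical and easy to automate for a dashboard. The sketch below encodes the rows of the table as data and reports which metrics miss their target; `off_target` is a hypothetical helper, and the units in the metric names are added for readability.

```python
import operator

# (metric, comparison, target, current) rows taken from the table above.
METRICS = [
    ("Tier 1 Compliance (%)", operator.gt, 95, 98),
    ("Tier 2 Compliance (%)", operator.gt, 85, 82),
    ("Tier 2 Enforcement Deadlines Met (%)", operator.gt, 80, 100),
    ("Exception Requests (Tier 1)", operator.lt, 5, 3),
    ("Time to Remediate Violations (days)", operator.lt, 30, 22),
]


def off_target(metrics):
    """Return the names of metrics whose current value misses the target."""
    return [name for name, compare, target, current in metrics
            if not compare(current, target)]


print(off_target(METRICS))  # ['Tier 2 Compliance (%)']
```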

Related Reading

ADR: ADR-017: Kyverno Policy Engine
Explanation: Zero Trust Security Model
How-To: Troubleshoot Kyverno Violations
Reference: Kyverno Policy Library

Conclusion

Security policies are necessary, but enforcement is an art, not a science.

The tiered enforcement model recognizes that:

Not all policies are equal - Some prevent immediate breaches (enforce now), others harden over time (audit first)
Developers need migration time - Sudden enforcement creates resistance and shadow IT
Culture beats compliance - Gradual adoption builds trust, all-or-nothing breeds resentment
Perfect is the enemy of good - 90% compliance with partnership beats 50% compliance with force

The Framework:

Tier 1 (Critical): Enforce immediately, exceptions rare
Tier 2 (Important): Audit → Enforce with migration timeline
Tier 3 (Best Practice): Audit only, encourage via visibility

The Philosophy: Security is a journey, not a destination. Meet developers where they are, guide them to where they should be.

This is Policy-as-Code with empathy.
