Known Limitations — Fawkes IDP

Purpose: This file catalogues known limitations, gaps, and degraded-mode behaviours in the Fawkes platform. Agents are instructed not to make these worse. Humans reviewing agent-generated changes should verify that none of these limitations are exacerbated.

Update this file whenever a limitation is discovered, resolved, or worsened. Link to the tracking issue where one exists.

KL-01 — No Terraform Remote Backend

Description: All Terraform state is stored locally (.tfstate files on disk). There is no remote backend (S3 + DynamoDB, Azure Blob, Terraform Cloud, etc.) configured for any module under infra/.

Impact:

State files may be committed to Git accidentally, exposing sensitive resource metadata.
Concurrent terraform apply runs will corrupt state — no state locking is in place.
Disaster recovery of infrastructure state is not possible without the local file.
Collaborative IaC workflows (multiple engineers or CI) are unsafe without a shared backend.

Tracking: GAP-7 — Migrate Terraform state to a remote backend with locking.

KL-02 — Weaviate Vector Database Required for RAG Service (No Local Fallback)

Description: The RAG (Retrieval-Augmented Generation) service depends on a running Weaviate vector database instance. There is no local in-memory fallback or stub implementation available for local development or CI environments that do not have Weaviate deployed.

Impact:

Developers without a Weaviate instance cannot run the RAG service locally.
Integration tests that exercise the RAG path are skipped or fail in environments without Weaviate.
The tests/bdd/ scenarios that cover RAG features have no executable step definitions when Weaviate is absent (see also KL-05).

Tracking: No dedicated issue yet — see KL-05 for related BDD gap.

KL-03 — Focalboard Integration Operates in Degraded Mode

Description: The Value Stream Mapping (VSM) component integrates with Focalboard for project-level card and board data. This integration is optional — if the Focalboard API is unreachable, the VSM falls back to a degraded read-only view with stale or empty board data.

Impact:

Board data displayed in the VSM may be stale or absent when Focalboard is offline.
No alerting or user-visible warning is shown when VSM is operating in degraded mode.
Teams relying on Focalboard cards for DORA change-failure-rate attribution will see incomplete data.

Tracking: No dedicated issue. Alerting on degraded mode is untracked.

KL-04 — Azure Module Duplication (Pending Deprecation)

Description: The infra/azure/ directory contains duplicated Terraform module definitions that overlap with the consolidated modules introduced in infra/terraform/. The duplicated modules have diverged in variable naming conventions and output schemas.

Impact:

Changes to shared networking or IAM logic must be applied in two places.
Risk of configuration drift between the duplicate modules.
New Azure resource additions may be applied to only one module tree, creating inconsistent environments.

Tracking: BUG-8 — Deprecate and remove legacy infra/azure/ duplicate modules.

KL-05 — 45 BDD Features Have No Step Definitions

Description: There are approximately 45 Gherkin feature files under tests/bdd/features/ whose scenarios have no corresponding step-definition implementations. Running behave tests/bdd/features for these scenarios results in NotImplementedError or Undefined step failures.

Impact:

These scenarios cannot be used to gate a PR or deployment — they provide no automated signal.
The BDD suite gives a false sense of coverage completeness.
New engineers may assume these features are tested when they are not.

Tracking: Tracked implicitly by the Sprint 2 BDD implementation backlog. No single consolidated issue exists.

KL-06 — DevLake ArgoCD Plugin Requires Manual Connection Configuration

Description: The DevLake integration with ArgoCD (used for DORA deployment-frequency and lead-time metrics) requires a one-time manual configuration step inside the DevLake admin UI to establish the ArgoCD API connection. Specifically, an engineer must navigate to Settings → Connections → ArgoCD and supply the ArgoCD server URL, bearer token, and TLS verification settings. This step is not automated by Helm values, Kubernetes Jobs, or any GitOps mechanism.

Impact:

After every fresh DevLake install (or namespace wipe), an engineer must manually re-enter the ArgoCD connection details in the DevLake UI.
Automated environment provisioning (e.g., ephemeral preview environments) will not collect DORA metrics until the manual step is completed.
There is no validation in CI that the connection is healthy.

Tracking: No dedicated issue. Add a post-install Helm hook or a scripts/ helper to automate this step.

KL-07 — MTTR Tracking Covers Only Jenkins Pipeline Failures

Description: Mean Time To Recovery (MTTR) is currently measured only for Jenkins pipeline failures — specifically the duration between a pipeline failure event and the next successful run of the same pipeline. Production incidents (PagerDuty alerts, SLO breaches, rollback events) are not tracked.

Impact:

The MTTR metric shown in Grafana dashboards is not a true production MTTR.
Elite/High/Medium/Low tier classification based on MTTR may be misleading.
Post-incident reviews cannot be correlated with MTTR data from the platform.

Tracking: No dedicated issue. Extend MTTR collection to ingest PagerDuty or Alertmanager resolved-alert events.

KL-08 — Rework Rate Detection Uses SHA Heuristic (Weak Signal)

Description: The rework rate metric (docs/METRICS.md, computed by scripts/weekly-metrics.sh) estimates rework by counting commits whose message matches patterns such as fix:, hotfix:, or revert: relative to total commits. This relies on Conventional Commits — a commit message convention where the prefix (e.g., feat:, fix:, chore:) signals the intent of the change. This approach is a SHA-count heuristic — it does not analyse the actual code churn or correlate fixes to specific features or PRs.

Impact:

Rework rate will be underreported if engineers do not use Conventional Commits.
A single large fix: commit touching 500 lines is weighted the same as a one-line typo correction.
The metric cannot distinguish between fixing a new regression and fixing pre-existing technical debt.
Teams may game the metric by using non-conventional commit prefixes for fix commits.

Tracking: No dedicated issue. Consider integrating with GitHub PR labels (e.g., type: bug) or Jira issue types for a stronger rework signal.