
Known Limitations — Fawkes IDP

Purpose: This file catalogues known limitations, gaps, and degraded-mode behaviours in the Fawkes platform. Agents are instructed not to make these worse. Humans reviewing agent-generated changes should verify that none of these limitations are exacerbated.

Update this file whenever a limitation is discovered, resolved, or worsened. Link to the tracking issue where one exists.


KL-01 — No Terraform Remote Backend

Description: All Terraform state is stored locally (.tfstate files on disk). There is no remote backend (S3 + DynamoDB, Azure Blob, Terraform Cloud, etc.) configured for any module under infra/.

Impact:

  • State files may be committed to Git accidentally, exposing sensitive resource metadata.
  • Concurrent terraform apply runs from different machines or CI jobs can corrupt or overwrite state, because no shared state locking is in place.
  • Disaster recovery of infrastructure state is not possible without the local file.
  • Collaborative IaC workflows (multiple engineers or CI) are unsafe without a shared backend.

Tracking: GAP-7 — Migrate Terraform state to a remote backend with locking.
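
Until GAP-7 lands, a small audit can at least surface the two conditions described above: modules with no backend block, and state files sitting in the working tree. The following is a minimal sketch and not part of the Fawkes repository. The helper name audit_terraform_tree is hypothetical, and the sketch assumes every directory containing .tf files is a root module, so it will over-report reusable child modules that legitimately have no backend.

```python
import re
from pathlib import Path

# Matches the opening of a backend block, e.g.:  backend "s3" {
BACKEND_RE = re.compile(r'^\s*backend\s+"[^"]+"\s*\{', re.MULTILINE)


def _in_terraform_cache(path):
    """True for files inside a .terraform/ working directory."""
    return ".terraform" in path.parts


def audit_terraform_tree(root):
    """Return (modules_without_backend, state_files_on_disk) under root."""
    root = Path(root)
    tf_files = [p for p in root.rglob("*.tf") if not _in_terraform_cache(p)]
    module_dirs = sorted({p.parent for p in tf_files})

    # A module counts as configured if any of its .tf files declares a backend.
    missing_backend = [
        module
        for module in module_dirs
        if not any(
            BACKEND_RE.search(tf.read_text(encoding="utf-8"))
            for tf in module.glob("*.tf")
        )
    ]

    # Also catches terraform.tfstate.backup files.
    state_files = sorted(
        p
        for p in root.rglob("*.tfstate*")
        if p.is_file() and not _in_terraform_cache(p)
    )
    return missing_backend, state_files
```

Run against infra/, a non-empty second list indicates state files that should be covered by .gitignore, and the first list gives a rough inventory of modules still to migrate.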


KL-02 — Weaviate Vector Database Required for RAG Service (No Local Fallback)

Description: The RAG (Retrieval-Augmented Generation) service depends on a running Weaviate vector database instance. There is no local in-memory fallback or stub implementation available for local development or CI environments that do not have Weaviate deployed.

Impact:

  • Developers without a Weaviate instance cannot run the RAG service locally.
  • Integration tests that exercise the RAG path are skipped or fail in environments without Weaviate.
  • The tests/bdd/ scenarios that cover RAG features have no executable step definitions when Weaviate is absent (see also KL-05).

Tracking: No dedicated issue yet — see KL-05 for related BDD gap.
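
One possible shape for the missing fallback is an in-memory store that exposes the same narrow interface the RAG service would use against Weaviate. The sketch below is illustrative only: VectorStore, InMemoryVectorStore, and their method names are hypothetical and do not reflect the actual RAG service code or the Weaviate client API. It ranks documents by cosine similarity, which is adequate for local development and CI but makes no attempt to match Weaviate's indexing behaviour.

```python
import math
from typing import Dict, List, Protocol, Tuple


class VectorStore(Protocol):
    """Hypothetical interface the RAG service could depend on."""

    def add(self, doc_id: str, vector: List[float], text: str) -> None: ...

    def search(self, query: List[float], k: int = 3) -> List[Tuple[str, float]]: ...


def _cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class InMemoryVectorStore:
    """Brute-force stand-in for local development and CI (assumption)."""

    def __init__(self) -> None:
        self._docs: Dict[str, Tuple[List[float], str]] = {}

    def add(self, doc_id: str, vector: List[float], text: str) -> None:
        self._docs[doc_id] = (vector, text)

    def search(self, query: List[float], k: int = 3) -> List[Tuple[str, float]]:
        # Score every stored document, then keep the k most similar.
        scored = [
            (doc_id, _cosine(query, vector))
            for doc_id, (vector, _text) in self._docs.items()
        ]
        scored.sort(key=lambda item: item[1], reverse=True)
        return scored[:k]
```

A test fixture could then inject InMemoryVectorStore wherever a Weaviate-backed implementation would otherwise be constructed, which would let the RAG path run in environments without Weaviate.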


KL-03 — Focalboard Integration Operates in Degraded Mode

Description: The Value Stream Mapping (VSM) component integrates with Focalboard for project-level card and board data. This integration is optional — if the Focalboard API is unreachable, the VSM falls back to a degraded read-only view with stale or empty board data.

Impact:

  • Board data displayed in the VSM may be stale or absent when Focalboard is offline.
  • No alerting or user-visible warning is shown when VSM is operating in degraded mode.
  • Teams relying on Focalboard cards for DORA change-failure-rate attribution will see incomplete data.

Tracking: No dedicated issue. Alerting on degraded mode is untracked.
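
The missing warning could be addressed by making degraded mode explicit in the data returned to the VSM view, rather than silently substituting stale or empty data. The following is a minimal sketch under stated assumptions: BoardSnapshot and load_board are hypothetical names, and the real Focalboard call is represented by an injected fetch callable so the sketch stays independent of any HTTP client.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class BoardSnapshot:
    """Board data plus an explicit flag the UI or an alert rule can read."""

    cards: List[dict] = field(default_factory=list)
    degraded: bool = False
    reason: str = ""


def load_board(
    fetch: Callable[[], List[dict]],
    cached: Optional[List[dict]] = None,
) -> BoardSnapshot:
    """Try the live fetch; on a network error fall back to cached data.

    OSError covers urllib's URLError, socket timeouts, and connection
    errors, which are the usual symptoms of an unreachable API.
    """
    try:
        return BoardSnapshot(cards=fetch())
    except OSError as exc:
        return BoardSnapshot(
            cards=list(cached or []),
            degraded=True,
            reason=str(exc),
        )
```

With a flag like degraded available, the view can render a banner and a metrics exporter can count degraded responses, which would give alerting something concrete to key on.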


KL-04 — Azure Module Duplication (Pending Deprecation)

Description: The infra/azure/ directory contains duplicated Terraform module definitions that overlap with the consolidated modules introduced in infra/terraform/. The duplicated modules have diverged in variable naming conventions and output schemas.

Impact:

  • Changes to shared networking or IAM logic must be applied in two places.
  • Risk of configuration drift between the duplicate modules.
  • New Azure resource additions may be applied to only one module tree, creating inconsistent environments.

Tracking: BUG-8 — Deprecate and remove legacy infra/azure/ duplicate modules.
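
While both trees exist, the divergence in variable naming can be made visible with a simple comparison of declared variables. The sketch below is an assumption-laden illustration: declared_variables and variable_drift are hypothetical helpers, and the regex only recognises variable blocks written in the conventional one-declaration-per-line HCL form. It does not parse HCL properly and does not compare outputs or types.

```python
import re
from pathlib import Path

# Matches declarations such as:  variable "vnet_name" {
VARIABLE_RE = re.compile(r'^\s*variable\s+"([^"]+)"', re.MULTILINE)


def declared_variables(module_dir):
    """Collect variable names declared in the .tf files of one module."""
    names = set()
    for tf in Path(module_dir).glob("*.tf"):
        names.update(VARIABLE_RE.findall(tf.read_text(encoding="utf-8")))
    return names


def variable_drift(legacy_dir, consolidated_dir):
    """Report variables present in only one of the two module copies."""
    legacy = declared_variables(legacy_dir)
    consolidated = declared_variables(consolidated_dir)
    return {
        "only_legacy": sorted(legacy - consolidated),
        "only_consolidated": sorted(consolidated - legacy),
    }
```

Pointing the two arguments at a module under infra/azure/ and its counterpart under infra/terraform/ gives a quick list of names a reviewer should reconcile before a change is applied to only one tree.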


KL-05 — 45 BDD Features Have No Step Definitions

Description: There are approximately 45 Gherkin feature files under tests/bdd/features/ whose scenarios have no corresponding step-definition implementations. Running behave tests/bdd/features for these scenarios results in NotImplementedError or Undefined step failures.

Impact:

  • These scenarios cannot be used to gate a PR or deployment — they provide no automated signal.
  • The BDD suite gives a false sense of coverage completeness.
  • New engineers may assume these features are tested when they are not.

Tracking: Tracked implicitly by the Sprint 2 BDD implementation backlog. No single consolidated issue exists.
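
In the absence of a consolidated issue, the affected features can be enumerated from a behave JSON report. The sketch below assumes the report was produced with something like behave --dry-run -f json -o report.json tests/bdd/features, and that the JSON is a list of features whose scenarios appear under "elements", each with "steps" that carry a "result" object with a "status" field. That shape is an assumption about behave's JSON formatter and should be checked against a real report. The function name is hypothetical.

```python
def features_with_undefined_steps(report):
    """Map feature name -> number of undefined steps.

    `report` is the parsed JSON (a list of feature dicts). Features with
    no undefined steps are omitted from the result.
    """
    flagged = {}
    for feature in report:
        undefined = 0
        for element in feature.get("elements", []):
            for step in element.get("steps", []):
                # Steps without a result object are treated as not undefined.
                if step.get("result", {}).get("status") == "undefined":
                    undefined += 1
        if undefined:
            flagged[feature.get("name", "<unnamed feature>")] = undefined
    return flagged
```

The resulting mapping could seed the backlog with one entry per feature, and its length gives an exact count to replace the approximate figure quoted above.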


KL-06 — DevLake ArgoCD Plugin Requires Manual Connection Configuration

Description: The DevLake integration with ArgoCD (used for DORA deployment-frequency and lead-time metrics) requires a one-time manual configuration step inside the DevLake admin UI to establish the ArgoCD API connection. Specifically, an engineer must navigate to Settings → Connections → ArgoCD and supply the ArgoCD server URL, bearer token, and TLS verification settings. This step is not automated by Helm values, Kubernetes Jobs, or any GitOps mechanism.

Impact:

  • After every fresh DevLake install (or namespace wipe), an engineer must manually re-enter the ArgoCD connection details in the DevLake UI.
  • Automated environment provisioning (e.g., ephemeral preview environments) will not collect DORA metrics until the manual step is completed.
  • There is no validation in CI that the connection is healthy.

Tracking: No dedicated issue. Add a post-install Helm hook or a scripts/ helper to automate this step.
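
A scripts/ helper of the kind suggested above might build the connection request programmatically and be invoked from a post-install hook. The following sketch only constructs the HTTP request and does not send it. The endpoint path /plugins/argocd/connections and the payload field names (name, endpoint, token) are assumptions and must be verified against the DevLake API documentation for the deployed version. The function name is hypothetical, and TLS verification settings are deliberately left out because their field name is not known here.

```python
import json
import urllib.request


def build_connection_request(
    devlake_api,
    argocd_url,
    token,
    path="/plugins/argocd/connections",  # assumed path, verify before use
):
    """Build (but do not send) a POST request that registers ArgoCD."""
    payload = {
        "name": "argocd",        # assumed field name
        "endpoint": argocd_url,  # assumed field name
        "token": token,          # assumed field name
    }
    return urllib.request.Request(
        url=devlake_api.rstrip("/") + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

In a real helper the token would come from a Kubernetes Secret or environment variable rather than a literal, and the request would be sent with urllib.request.urlopen once the endpoint and payload have been confirmed. The same call could double as the CI health check that is currently missing.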


KL-07 — MTTR Tracking Covers Only Jenkins Pipeline Failures

Description: Mean Time To Recovery (MTTR) is currently measured only for Jenkins pipeline failures — specifically the duration between a pipeline failure event and the next successful run of the same pipeline. Production incidents (PagerDuty alerts, SLO breaches, rollback events) are not tracked.

Impact:

  • The MTTR metric shown in Grafana dashboards is not a true production MTTR.
  • Elite/High/Medium/Low tier classification based on MTTR may be misleading.
  • Post-incident reviews cannot be correlated with MTTR data from the platform.

Tracking: No dedicated issue. Extend MTTR collection to ingest PagerDuty or Alertmanager resolved-alert events.
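
To make the current definition concrete, the calculation described above can be sketched as follows. This is an illustration, not the platform's actual implementation: mean_time_to_recovery is a hypothetical function, and it assumes that when several failures occur in a row, the outage is measured from the first failure to the next successful run.

```python
from datetime import datetime, timedelta


def mean_time_to_recovery(runs):
    """Mean duration from a pipeline failure to the next successful run.

    `runs` is an iterable of (finished_at, succeeded) tuples for a single
    pipeline. Returns a timedelta, or None if no recovery was observed.
    """
    outage_start = None
    recoveries = []
    for finished_at, succeeded in sorted(runs, key=lambda run: run[0]):
        if not succeeded and outage_start is None:
            # First failure of a new outage starts the clock.
            outage_start = finished_at
        elif succeeded and outage_start is not None:
            # First success after a failure stops the clock.
            recoveries.append(finished_at - outage_start)
            outage_start = None
    if not recoveries:
        return None
    return sum(recoveries, timedelta()) / len(recoveries)


runs = [
    (datetime(2024, 1, 1, 10, 0), False),
    (datetime(2024, 1, 1, 10, 10), False),  # same outage, clock keeps running
    (datetime(2024, 1, 1, 10, 30), True),   # recovery after 30 minutes
    (datetime(2024, 1, 1, 11, 0), False),
    (datetime(2024, 1, 1, 11, 15), True),   # recovery after 15 minutes
]
print(mean_time_to_recovery(runs))  # 0:22:30
```

The example shows why the figure is a pipeline recovery time: nothing in the input describes a production incident, so ingesting resolved-alert events would require a second event source feeding the same kind of calculation.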


KL-08 — Rework Rate Detection Uses SHA Heuristic (Weak Signal)

Description: The rework rate metric (docs/METRICS.md, computed by scripts/weekly-metrics.sh) estimates rework as the share of commits whose message matches patterns such as fix:, hotfix:, or revert:, relative to total commits. This relies on Conventional Commits, a commit message convention in which the prefix (e.g., feat:, fix:, chore:) signals the intent of the change. The approach is a SHA-count heuristic: it counts commits by message prefix and does not analyse the actual code churn or correlate fixes to specific features or PRs.

Impact:

  • Rework rate will be underreported if engineers do not use Conventional Commits.
  • A single large fix: commit touching 500 lines is weighted the same as a one-line typo correction.
  • The metric cannot distinguish between fixing a new regression and fixing pre-existing technical debt.
  • Teams may game the metric by using non-conventional commit prefixes for fix commits.

Tracking: No dedicated issue. Consider integrating with GitHub PR labels (e.g., type: bug) or Jira issue types for a stronger rework signal.
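
The heuristic and its blind spots are easy to see in a small reimplementation. The sketch below is written in Python for illustration and is not the contents of scripts/weekly-metrics.sh; the exact patterns that script uses may differ. REWORK_RE and rework_rate are hypothetical names, and the pattern assumes the Conventional Commits form of a type, an optional scope in parentheses, an optional "!", and a colon.

```python
import re

# fix:, hotfix:, revert:, with optional scope and breaking-change marker,
# e.g. "fix(api)!: handle empty payload"
REWORK_RE = re.compile(r"^(fix|hotfix|revert)(\([^)]*\))?!?:", re.IGNORECASE)


def rework_rate(subjects):
    """Fraction of commit subject lines that look like rework."""
    subjects = list(subjects)
    if not subjects:
        return 0.0
    rework = sum(1 for subject in subjects if REWORK_RE.match(subject.strip()))
    return rework / len(subjects)


subjects = [
    "feat: add VSM board view",
    "fix: correct lead-time rounding",
    "hotfix(api): restore auth header",
    "chore: bump dependencies",
    'Revert "feat: add VSM board view"',  # git's default revert subject
    "fixed typo in dashboard",            # a fix, but not conventional
]
print(rework_rate(subjects))
```

Only two of the six subjects match, although four are arguably rework: git's default revert message and the free-form "fixed typo" commit both slip through. Each matching commit also counts once regardless of how many lines it touches, which is the weighting problem listed under Impact.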