Vector Database (Weaviate)

Overview

This document explains the vector database implementation in Fawkes using Weaviate. The vector database is a core component of the RAG (Retrieval-Augmented Generation) system that powers AI-assisted development features.

What is a Vector Database?

A vector database is a specialized database designed to store and query high-dimensional vectors (embeddings) efficiently. Unlike traditional databases that store structured data, vector databases excel at:

Semantic Search: Finding similar content based on meaning, not just keywords
Similarity Matching: Finding items similar to a given item
Recommendation Systems: Suggesting relevant content based on context
Machine Learning: Supporting ML models with efficient vector operations

Key Concepts

Embeddings: Numerical representations of data (text, images, etc.) as vectors in high-dimensional space. Similar items have vectors that are close to each other.

Vector Search: Finding vectors in the database that are most similar to a query vector using distance metrics (e.g., cosine similarity, Euclidean distance).

HNSW (Hierarchical Navigable Small World): An efficient algorithm for approximate nearest neighbor search in high-dimensional spaces.
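
To make the distance-metric idea concrete, here is a minimal sketch of exact (brute-force) nearest-neighbor search using cosine similarity. The three-dimensional vectors and document names are made up for illustration. HNSW exists to approximate this ranking without scoring every stored vector.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(query, vectors, k=2):
    """Exact top-k search: score every stored vector against the query."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Toy 3-dimensional "embeddings" (hypothetical values); the model used in
# Fawkes, all-MiniLM-L6-v2, produces 384 dimensions.
vectors = {
    "deploy-guide": [0.9, 0.1, 0.0],
    "argocd-howto": [0.8, 0.2, 0.1],
    "lunch-menu": [0.0, 0.1, 0.9],
}

top = nearest([1.0, 0.0, 0.0], vectors)
print([name for name, _ in top])  # ['deploy-guide', 'argocd-howto']
```

Brute-force search costs one similarity computation per stored vector, which is why an approximate index becomes attractive as the collection grows.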

Why Weaviate?

Weaviate was chosen as the vector database for Fawkes for several reasons:

Technical Advantages

Native Vector Search: Built from the ground up for vector operations with the HNSW algorithm
GraphQL API: Flexible, modern API that's easy to use and integrate
Built-in Vectorization: Supports multiple vectorization modules (transformers, OpenAI, etc.)
Kubernetes Native: Designed for cloud-native deployments with Helm charts
Hybrid Search: Combines vector search with traditional keyword search (BM25)
Schema Flexibility: Dynamic schema with strong typing

Operational Advantages

Active Community: Large community with good documentation and examples
Production Ready: Battle-tested in production by many organizations
Scalability: Horizontal scaling support with replication
Monitoring: Prometheus metrics built-in
Backup & Restore: Built-in backup and disaster recovery features

Alternatives Considered

Database	Pros	Cons	Decision
Weaviate	Native vector DB, GraphQL API, K8s native	Learning curve	✅ Selected
Pinecone	Fully managed, easy to use	Cloud-only, vendor lock-in	❌ Not self-hosted
Milvus	High performance, FAISS-based	Complex setup, heavy	❌ Too complex
PostgreSQL pgvector	Simple extension, familiar	Limited scale, slower	❌ Not specialized
ChromaDB	Simple, Python-first	Immature, limited features	❌ Too new

Architecture

System Architecture

┌─────────────────────────────────────────────────────────────────┐
│                     Application Layer                            │
│                                                                   │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐         │
│  │ AI Assistant │  │ RAG Service  │  │ Doc Search   │         │
│  └──────┬───────┘  └──────┬───────┘  └──────┬───────┘         │
│         │                  │                  │                  │
│         └──────────────────┴──────────────────┘                 │
└────────────────────────────┬────────────────────────────────────┘
                             │
                             │ GraphQL/REST API
                             │
┌────────────────────────────▼────────────────────────────────────┐
│                     Weaviate Cluster                             │
│                                                                   │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │ GraphQL API Layer                                         │  │
│  └──────────────────────────────────────────────────────────┘  │
│                             │                                    │
│  ┌──────────────┬───────────┴──────────┬──────────────┐       │
│  │              │                       │              │        │
│  │  Vector Index│   Object Store       │ Inverted Index│       │
│  │  (HNSW)      │                      │ (BM25)       │        │
│  └──────────────┴──────────────────────┴──────────────┘        │
│                             │                                    │
│                    Persistent Storage                            │
│                         (10GB PVC)                               │
└─────────────────────────────┬───────────────────────────────────┘
                              │
┌─────────────────────────────▼───────────────────────────────────┐
│              Embedding Models (text2vec-transformers)            │
│                                                                   │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │ sentence-transformers/all-MiniLM-L6-v2                   │  │
│  │ - 384 dimensions                                          │  │
│  │ - Fast inference                                          │  │
│  │ - Good semantic understanding                             │  │
│  └──────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────┘

Data Model

Weaviate uses a class-based schema model:

{
  "class": "FawkesDocument",
  "description": "Fawkes platform documentation and code",
  "vectorizer": "text2vec-transformers",
  "properties": [
    {
      "name": "title",
      "dataType": ["string"],
      "description": "Document title"
    },
    {
      "name": "content",
      "dataType": ["text"],
      "description": "Document content (vectorized)"
    },
    {
      "name": "filepath",
      "dataType": ["string"],
      "description": "File path in repository"
    },
    {
      "name": "category",
      "dataType": ["string"],
      "description": "Document category"
    }
  ]
}

Deployment

Kubernetes Deployment

Weaviate is deployed via ArgoCD using the official Helm chart:

# Deployed automatically via ArgoCD
kubectl get application -n fawkes weaviate

# Check deployment status
kubectl get pods -n fawkes -l app.kubernetes.io/name=weaviate

Configuration

Key configuration parameters in platform/apps/weaviate-application.yaml:

Resources: 1 CPU, 2Gi memory
Storage: 10GB persistent volume
Replicas: 1 (can scale horizontally)
Modules: text2vec-transformers enabled
Authentication: Anonymous access enabled (MVP)

Accessing Weaviate

Local Development (Port Forward):

kubectl port-forward -n fawkes svc/weaviate 8080:80

Within Cluster:

http://weaviate.fawkes.svc:80

Via Ingress:

http://weaviate.127.0.0.1.nip.io

How to Index New Documents

Using Python Client

import weaviate

# Connect to Weaviate (weaviate.Client is the v3 Python client API)
client = weaviate.Client("http://weaviate.fawkes.svc:80")

# Create or get class
schema = {
    "class": "FawkesDocument",
    "vectorizer": "text2vec-transformers",
    "properties": [
        {
            "name": "title",
            "dataType": ["string"]
        },
        {
            "name": "content",
            "dataType": ["text"]
        }
    ]
}

# Create the class only if it does not already exist
existing = {c["class"] for c in client.schema.get().get("classes", [])}
if schema["class"] not in existing:
    client.schema.create_class(schema)

# Index a document
doc = {
    "title": "Getting Started Guide",
    "content": "This guide will help you get started with Fawkes...",
    "filepath": "docs/getting-started.md",
    "category": "documentation"
}

client.data_object.create(
    data_object=doc,
    class_name="FawkesDocument"
)

# Batch indexing for better performance
with client.batch as batch:
    batch.batch_size = 100

    for doc in documents:
        batch.add_data_object(
            data_object=doc,
            class_name="FawkesDocument"
        )
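
The batch loop above iterates over a `documents` collection that is not defined in this guide. One possible way to build it, sketched under the assumption that the source material is Markdown files on disk, is shown below. The function name `build_documents` and the title heuristic are hypothetical and are not taken from the Fawkes indexing scripts.

```python
from pathlib import Path

def build_documents(root, category="documentation"):
    """Yield one dict per Markdown file under `root`, shaped like FawkesDocument."""
    root = Path(root)
    for path in sorted(root.rglob("*.md")):
        content = path.read_text(encoding="utf-8")
        # Heuristic: use the first line starting with "#" as the title and
        # fall back to the file name. A "#" comment inside a code block
        # would also match, so treat this as a rough default.
        title = next(
            (line.lstrip("# ").strip() for line in content.splitlines() if line.startswith("#")),
            path.stem,
        )
        yield {
            "title": title,
            "content": content,
            "filepath": path.as_posix(),
            "category": category,
        }
```

Because the function is a generator, it can be passed straight to the batch loop (`for doc in build_documents("docs"):`) without loading every file into memory first.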

Indexing Strategies

Full Re-index: Index all documents from scratch (for initial setup)

python services/rag/scripts/index-all-docs.py

Incremental Indexing: Index only changed files (for CI/CD pipeline)

python services/rag/scripts/index-changed-docs.py --since HEAD~1

Scheduled Indexing: Daily re-index via CronJob

apiVersion: batch/v1
kind: CronJob
metadata:
  name: weaviate-indexing
spec:
  schedule: "0 2 * * *" # 2 AM daily
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: indexer
              image: fawkes-rag-indexer:latest
              command: ["python", "index-all-docs.py"]
          restartPolicy: OnFailure # Job pod templates must set Never or OnFailure
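
One thing to watch with repeated or scheduled re-indexing is duplication: if each run creates objects with fresh random IDs, the same file can end up indexed several times. A common mitigation, sketched here as an assumption rather than a description of the Fawkes scripts, is to derive a deterministic UUID from the file path. The namespace string `fawkes/docs` is hypothetical.

```python
import uuid

# Hypothetical fixed namespace; any UUID works as long as it never changes.
FAWKES_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "fawkes/docs")

def document_uuid(filepath):
    """Derive a stable UUID (version 5) from a repository-relative path."""
    return str(uuid.uuid5(FAWKES_NAMESPACE, filepath))

print(document_uuid("docs/getting-started.md") == document_uuid("docs/getting-started.md"))  # True
```

The resulting value can be passed through the `uuid` argument of `batch.add_data_object(...)`, so that re-importing a file targets the same object ID instead of adding a new one.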

How to Query and Retrieve Context

Semantic Search

Find documents by semantic similarity:

# Query by text
result = (
    client.query
    .get("FawkesDocument", ["title", "content", "filepath"])
    .with_near_text({
        "concepts": ["How to deploy applications with ArgoCD"]
    })
    .with_limit(5)
    .with_additional(["certainty", "distance"])
    .do()
)

# Process results
for doc in result["data"]["Get"]["FawkesDocument"]:
    print(f"Title: {doc['title']}")
    print(f"Certainty: {doc['_additional']['certainty']}")
    print(f"Content: {doc['content'][:200]}...")

Hybrid Search

Combine vector search with keyword search:

result = (
    client.query
    .get("FawkesDocument", ["title", "content"])
    .with_hybrid(
        query="ArgoCD deployment",
        alpha=0.5  # 0=keyword only, 1=vector only, 0.5=balanced
    )
    .with_limit(10)
    .do()
)
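
To build intuition for what `alpha` does, the sketch below blends a keyword score and a vector score per document after min-max normalization. This is a simplified illustration of score fusion, not Weaviate's actual fusion algorithm, and the scores are invented.

```python
def normalize(scores):
    """Min-max scale a {doc_id: score} mapping into [0, 1]."""
    if not scores:
        return {}
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - low) / (high - low) for doc_id, s in scores.items()}

def hybrid_rank(keyword_scores, vector_scores, alpha=0.5):
    """Rank documents by alpha * vector + (1 - alpha) * keyword."""
    keyword = normalize(keyword_scores)
    vector = normalize(vector_scores)
    combined = {
        doc_id: alpha * vector.get(doc_id, 0.0) + (1 - alpha) * keyword.get(doc_id, 0.0)
        for doc_id in keyword.keys() | vector.keys()
    }
    return sorted(combined.items(), key=lambda pair: pair[1], reverse=True)

keyword_scores = {"argocd-guide": 10.0, "gitops-intro": 2.0}  # e.g. BM25
vector_scores = {"argocd-guide": 0.2, "gitops-intro": 0.9}    # e.g. similarity

print(hybrid_rank(keyword_scores, vector_scores, alpha=0.0)[0][0])  # argocd-guide
print(hybrid_rank(keyword_scores, vector_scores, alpha=1.0)[0][0])  # gitops-intro
```

Moving `alpha` toward 0 favors exact term matches such as product names, while moving it toward 1 favors paraphrased or conceptual matches.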

Filtered Search

Add filters to narrow results:

result = (
    client.query
    .get("FawkesDocument", ["title", "content"])
    .with_near_text({"concepts": ["security scanning"]})
    .with_where({
        "path": ["category"],
        "operator": "Equal",
        "valueString": "documentation"
    })
    .with_limit(5)
    .do()
)
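
When more than one condition is needed, Weaviate's `where` filter accepts a nested form with an `operator` and a list of `operands`. The helpers below are a hypothetical convenience layer for assembling such dictionaries. They only build data and do not call Weaviate.

```python
def equal_filter(prop, value):
    """Build an equality filter on a string property."""
    return {"path": [prop], "operator": "Equal", "valueString": value}

def all_of(*filters):
    """Combine filters with a logical AND; a single filter is returned as-is."""
    if not filters:
        raise ValueError("at least one filter is required")
    if len(filters) == 1:
        return filters[0]
    return {"operator": "And", "operands": list(filters)}

where = all_of(
    equal_filter("category", "documentation"),
    equal_filter("filepath", "docs/getting-started.md"),
)
print(where["operator"], len(where["operands"]))  # And 2
```

The resulting dictionary can be passed to `.with_where(where)` in the same way as the single-condition filter above.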

GraphQL Queries

Direct GraphQL query:

{
  Get {
    FawkesDocument(
      nearText: { concepts: ["kubernetes deployment guide"] }
      where: { path: ["category"], operator: Equal, valueString: "documentation" }
      limit: 5
    ) {
      title
      filepath
      content
      _additional {
        certainty
        distance
      }
    }
  }
}

RAG Integration Pattern

Integrate with LLM for RAG:

def answer_question(question: str, llm_client) -> str:
    """Answer question using RAG pattern."""

    # 1. Retrieve relevant context from Weaviate
    result = (
        client.query
        .get("FawkesDocument", ["title", "content"])
        .with_near_text({"concepts": [question]})
        .with_limit(5)
        .with_additional(["certainty"])
        .do()
    )

    docs = result["data"]["Get"]["FawkesDocument"]

    # 2. Filter by relevance threshold
    relevant_docs = [
        doc for doc in docs
        if doc["_additional"]["certainty"] > 0.7
    ]

    # 3. Construct context
    context = "\n\n".join([
        f"# {doc['title']}\n{doc['content']}"
        for doc in relevant_docs
    ])

    # 4. Create prompt with context
    prompt = f"""Based on the following context from Fawkes documentation,
answer the question.

Context:
{context}

Question: {question}

Answer:"""

    # 5. Get answer from LLM
    answer = llm_client.generate(prompt)

    return answer
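
Steps 2 and 3 of the function above can be factored into a helper that also caps the context size, since prompts that exceed the LLM's input limit will fail or be truncated. This is a sketch: the name `build_context` and the 4000-character budget are assumptions, and a production version would likely count tokens rather than characters.

```python
def build_context(docs, min_certainty=0.7, max_chars=4000):
    """Join the most relevant documents into one context string within a size budget."""
    ranked = sorted(docs, key=lambda d: d["_additional"]["certainty"], reverse=True)
    selected, used = [], 0
    for doc in ranked:
        if doc["_additional"]["certainty"] <= min_certainty:
            break  # remaining documents are below the relevance threshold
        section = f"# {doc['title']}\n{doc['content']}"
        if used + len(section) > max_chars:
            break  # adding this section would exceed the budget
        selected.append(section)
        used += len(section) + 2  # account for the "\n\n" separator
    return "\n\n".join(selected)

docs = [
    {"title": "A", "content": "aaa", "_additional": {"certainty": 0.9}},
    {"title": "B", "content": "bbb", "_additional": {"certainty": 0.5}},
    {"title": "C", "content": "ccc", "_additional": {"certainty": 0.8}},
]
print(repr(build_context(docs)))  # '# A\naaa\n\n# C\nccc'
```

An empty return value signals that nothing cleared the threshold. In that case the caller may prefer to answer that no relevant documentation was found rather than prompt the LLM with an empty context.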

Performance Tuning

Query Performance

HNSW Configuration: Tune for speed vs. accuracy trade-off

schema = {
    "class": "FawkesDocument",
    "vectorIndexConfig": {
        "efConstruction": 128,  # Higher = better accuracy, slower build
        "maxConnections": 64,   # Higher = better accuracy, more memory
        "ef": 64                # Higher = better accuracy, slower query
    }
}

Batch Size: Optimize batch operations

client.batch.configure(
    batch_size=100,              # Tune based on document size
    num_workers=4,               # Parallel processing
    connection_error_retries=3,  # Retry on transient connection errors
)

Indexing Performance

Parallel Indexing: Use multiple workers

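from concurrent.futures import ThreadPoolExecutor

def index_documents_parallel(documents, num_workers=4):
    # index_document(doc) is the per-document indexing function, defined elsewhere
    with ThreadPoolExecutor(max_workers=num_workers) as executor:
        # Consume the iterator so exceptions raised in workers are re-raised here
        list(executor.map(index_document, documents))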

Incremental Updates: Only index changed documents

# Get changed files since last index
git diff --name-only HEAD~1 docs/ | \
  xargs python services/rag/scripts/index-docs.py

Resource Optimization

Memory: Adjust based on dataset size

resources:
  requests:
    memory: 2Gi # Minimum for small dataset
  limits:
    memory: 4Gi # Scale based on # of vectors

CPU: Scale for query throughput

resources:
  requests:
    cpu: 1
  limits:
    cpu: 2

Storage: Plan for growth

persistence:
  size: 10Gi # Start size
  # Monitor usage and scale as needed
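
For the memory setting in particular, a back-of-the-envelope estimate helps decide when the limits above need to grow. The sketch below assumes float32 vectors (4 bytes per dimension) and a rule-of-thumb overhead factor of 2 for the index and runtime. That factor is an assumption and should be validated against observed usage.

```python
def estimate_memory_bytes(num_vectors, dimensions=384, overhead_factor=2.0):
    """Rule-of-thumb RAM estimate for holding vectors plus index overhead."""
    bytes_per_vector = dimensions * 4  # float32 = 4 bytes per dimension
    return int(num_vectors * bytes_per_vector * overhead_factor)

def to_gib(num_bytes):
    """Convert bytes to GiB (the unit behind Kubernetes 'Gi')."""
    return num_bytes / (1024 ** 3)

estimate = estimate_memory_bytes(1_000_000)
print(estimate)                    # 3072000000
print(round(to_gib(estimate), 2))  # 2.86
```

Under these assumptions, one million 384-dimensional vectors would exceed the 2Gi request but still fit within the 4Gi limit.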

Monitoring Performance

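Key Metrics (names as used in this guide; confirm them against the metrics endpoint of the deployed Weaviate version):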

# Query latency
weaviate_query_duration_seconds

# Object count
weaviate_objects_total

# Vector index size
weaviate_vector_index_size

# Memory usage
weaviate_memory_usage_bytes

Query Performance Dashboard:

# 95th percentile query latency
histogram_quantile(0.95,
  rate(weaviate_query_duration_seconds_bucket[5m])
)

# Queries per second
rate(weaviate_queries_total[1m])

Troubleshooting

Common Issues

Weaviate Not Ready

Symptom: Weaviate returns 503 or connection refused

Solution:

# Check pod status
kubectl get pods -n fawkes -l app.kubernetes.io/name=weaviate

# Check logs
kubectl logs -n fawkes -l app.kubernetes.io/name=weaviate

# Restart if needed
kubectl rollout restart statefulset/weaviate -n fawkes
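
After a restart, clients may see connection errors for a short period while the pod comes back. Rather than failing immediately, a caller can poll for readiness. The helper below is a generic sketch with a hypothetical name: it takes any zero-argument callable, so with the Python client it could be used as `wait_until_ready(client.is_ready)`.

```python
import time

def wait_until_ready(is_ready, timeout=60.0, interval=2.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Poll `is_ready()` until it returns True or `timeout` seconds elapse."""
    deadline = clock() + timeout
    while True:
        try:
            if is_ready():
                return True
        except Exception:
            # Connection errors are expected while the pod restarts; keep polling.
            pass
        if clock() >= deadline:
            return False
        sleep(interval)
```

The `sleep` and `clock` parameters exist only so the function can be tested without real waiting. The defaults are what a normal caller would use.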

Indexing Failures

Symptom: Documents not appearing in search results

Solution:

# Check if class exists
schema = client.schema.get("FawkesDocument")
print(schema)

# Verify object count
result = client.query.aggregate("FawkesDocument").with_meta_count().do()
print(f"Objects: {result['data']['Aggregate']['FawkesDocument'][0]['meta']['count']}")

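# Check batch import errors via the batch callback
def report_batch_errors(results):
    # results holds one entry per imported object; failures carry result.errors
    for item in results or []:
        errors = item.get("result", {}).get("errors")
        if errors:
            print(f"Failed: {errors}")

client.batch.configure(batch_size=10, callback=report_batch_errors)
with client.batch as batch:
    ...  # add objects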

Low Relevance Scores

Symptom: Search results have certainty < 0.7

Possible Causes:

Documents not properly indexed
Query doesn't match document content
Wrong vectorizer model

Solution:

# Try hybrid search instead
result = (
    client.query
    .get("FawkesDocument", ["title"])
    .with_hybrid(query="your query", alpha=0.5)
    .with_limit(5)
    .do()
)

# Verify vectorizer is working
modules = client.get_meta()["modules"]
print(f"Available modules: {modules}")

Out of Memory

Symptom: Weaviate pod OOMKilled

Solution:

# Increase memory limits (a manual edit may be reverted by ArgoCD on the next sync)
kubectl edit statefulset weaviate -n fawkes

# Or scale vertically in ArgoCD app
# resources.limits.memory: 4Gi

# Monitor memory usage
kubectl top pod -n fawkes weaviate-0

Slow Queries

Symptom: Queries taking >1 second

Solution:

# Lower ef to speed up queries (higher ef improves accuracy but slows queries)
client.schema.update_config(
    "FawkesDocument",
    {
        "vectorIndexConfig": {
            "ef": 32  # Decrease from the configured 64
        }
    }
)

# Use filters to reduce search space
result = (
    client.query
    .get("FawkesDocument", ["title"])
    .with_near_text({"concepts": ["query"]})
    .with_where({
        "path": ["category"],
        "operator": "Equal",
        "valueString": "documentation"
    })
    .with_limit(5)
    .do()
)

Debugging Tips

Enable Verbose Logging:

env:
  LOG_LEVEL: "debug"

Check API Endpoints:

# Health check
curl http://weaviate.fawkes.svc:80/v1/.well-known/ready

# Meta information
curl http://weaviate.fawkes.svc:80/v1/meta

# Schema
curl http://weaviate.fawkes.svc:80/v1/schema

# List objects (returns a page of objects, not a count)
curl http://weaviate.fawkes.svc:80/v1/objects

Test Connection:

import weaviate

client = weaviate.Client("http://weaviate.fawkes.svc:80")
print(f"Ready: {client.is_ready()}")
print(f"Live: {client.is_live()}")
print(f"Meta: {client.get_meta()}")

Security Considerations

Authentication (Production)

For production, enable authentication:

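env:
  AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: "false"
  AUTHENTICATION_APIKEY_ENABLED: "true"
  # Example values only; supply real keys from a Kubernetes Secret
  AUTHENTICATION_APIKEY_ALLOWED_KEYS: "admin-key,readonly-key"
  # Each key maps to the user at the same position in this list
  AUTHENTICATION_APIKEY_USERS: "admin,readonly"
  AUTHORIZATION_ADMINLIST_ENABLED: "true"
  AUTHORIZATION_ADMINLIST_USERS: "admin"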

Network Policies

Restrict access to Weaviate:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: weaviate-netpol
  namespace: fawkes
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: weaviate
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: rag-service
      ports:
        - protocol: TCP
          port: 8080

Data Privacy

Sensitive Data: Don't index sensitive information (credentials, PII)

Access Control: Implement RBAC for API access

Encryption: Enable TLS for API communication
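
As one way to act on the first point, document text can be scrubbed before it is sent to the indexer. The sketch below uses a few illustrative regular expressions. The patterns are assumptions and are far from exhaustive, so a dedicated secret scanner is still advisable for real repositories.

```python
import re

# Illustrative patterns only: AWS-style access key IDs, "name: value"
# credentials, and email addresses.
SECRET_PATTERNS = [
    (re.compile(r"AKIA[0-9A-Z]{16}"), "[REDACTED_AWS_KEY]"),
    (re.compile(r"(?i)\b(password|token|secret|api[_-]?key)\b(\s*[:=]\s*)\S+"), r"\1\2[REDACTED]"),
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"), "[REDACTED_EMAIL]"),
]

def redact(text):
    """Replace likely credentials and email addresses with placeholders."""
    for pattern, replacement in SECRET_PATTERNS:
        text = pattern.sub(replacement, text)
    return text

print(redact("password: hunter2"))  # password: [REDACTED]
```

Applying `redact` to the `content` field just before indexing keeps the vector store from becoming a second place where leaked secrets can be found by semantic search.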

Backup and Disaster Recovery

Manual Backup

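# Create backup (requires the backup-filesystem module; the request body must carry a backup ID)
kubectl exec -n fawkes weaviate-0 -- \
  curl -X POST http://localhost:8080/v1/backups/filesystem \
    -H 'Content-Type: application/json' \
    -d '{"id": "<backup-id>"}'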

# List backups
kubectl exec -n fawkes weaviate-0 -- \
  curl http://localhost:8080/v1/backups/filesystem
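
Weaviate's backup API expects each backup request to include an ID in its JSON body, and IDs are restricted to lowercase letters, digits, underscores, and hyphens. The helper below sketches one way to generate a timestamped ID and request body. The `fawkes` prefix and the function name are hypothetical.

```python
import json
import re
from datetime import datetime, timezone

BACKUP_ID_PATTERN = re.compile(r"^[a-z0-9_-]+$")

def backup_payload(prefix="fawkes", now=None, classes=None):
    """Build the JSON body for a backup request with a timestamped ID."""
    now = now or datetime.now(timezone.utc)
    backup_id = f"{prefix}-{now:%Y%m%d-%H%M%S}"
    if not BACKUP_ID_PATTERN.match(backup_id):
        raise ValueError(f"invalid backup id: {backup_id!r}")
    payload = {"id": backup_id}
    if classes:
        # Optionally restrict the backup to specific classes.
        payload["include"] = list(classes)
    return json.dumps(payload)

fixed = datetime(2024, 1, 2, 3, 4, 5, tzinfo=timezone.utc)
print(backup_payload(now=fixed))  # {"id": "fawkes-20240102-030405"}
```

Timestamped IDs keep successive backups from colliding and make it straightforward to pick the most recent one when restoring.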

Automated Backup

Create a CronJob for regular backups:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: weaviate-backup
  namespace: fawkes
spec:
  schedule: "0 1 * * *" # Daily at 1 AM
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: backup
              image: bitnami/kubectl:latest
              command:
                - /bin/sh
                - -c
                - |
                  kubectl exec weaviate-0 -n fawkes -- \
                    curl -X POST http://localhost:8080/v1/backups/filesystem \
                      -H 'Content-Type: application/json' \
                      -d "{\"id\": \"backup-$(date +%Y%m%d-%H%M%S)\"}"
          restartPolicy: OnFailure

Restore from Backup

# List available backups
kubectl exec -n fawkes weaviate-0 -- \
  curl http://localhost:8080/v1/backups/filesystem

# Restore
kubectl exec -n fawkes weaviate-0 -- \
  curl -X POST http://localhost:8080/v1/backups/filesystem/<backup-id>/restore
