ADR-031: Vector Database Selection for RAG System

Status

Accepted

Context

We need a vector database to enable Retrieval-Augmented Generation (RAG) capabilities in Fawkes for AI-assisted development. The vector database will:

  • Store embeddings of internal documentation, code, and platform knowledge
  • Enable semantic search based on meaning rather than keywords
  • Support AI assistants with contextual information retrieval
  • Scale to handle the platform's growing documentation and code base

Requirements

Functional Requirements:

  • Vector similarity search with high precision (>0.7 relevance score)
  • Support for text embeddings (initially)
  • GraphQL or REST API for integration
  • Schema flexibility for different content types
  • Hybrid search (vector + keyword) capabilities
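
The relevance target above can be made concrete with a small sketch. It assumes that "relevance score" means cosine similarity between the query embedding and a document embedding; the ADR does not define the score, so this interpretation, the function names, and the toy vectors are illustrative only.

```python
import math
from typing import Dict, List, Sequence, Tuple

RELEVANCE_THRESHOLD = 0.7  # target taken from the functional requirements


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: a zero vector matches nothing
    return dot / (norm_a * norm_b)


def relevant_hits(
    query: Sequence[float],
    candidates: Dict[str, Sequence[float]],
    threshold: float = RELEVANCE_THRESHOLD,
) -> List[Tuple[str, float]]:
    """Return (doc_id, score) pairs scoring above the threshold, best first."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in candidates.items()]
    kept = [(doc_id, score) for doc_id, score in scored if score > threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)
```

In practice the database computes these scores itself; a helper like this is mainly useful for spot-checking results in a test script.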

Non-Functional Requirements:

  • Kubernetes-native deployment
  • Horizontal scalability
  • Built-in monitoring (Prometheus metrics)
  • Backup and restore capabilities
  • Open-source with active community
  • Production-ready and battle-tested

Integration Requirements:

  • Compatible with transformer models (sentence-transformers)
  • Easy integration with Python applications
  • Support for batch operations
  • Low-latency queries (<100ms at the 95th percentile)
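
As a rough illustration of the batch-operations requirement, the helper below groups documents into fixed-size batches before they are sent to the database. The helper name and the batch size are assumptions; the actual import call depends on the client library and is deliberately left out.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return  # input exhausted
        yield batch


if __name__ == "__main__":
    documents = [f"doc-{i}" for i in range(250)]
    for batch in batched(documents, 100):
        # A real indexer would send each batch to the vector database here.
        print(len(batch))
```

Batching keeps the number of round trips low without holding the full corpus in a single request.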

Decision

We will use Weaviate as our vector database for the following reasons:

Technical Rationale

  1. Native Vector Search

     • Built specifically for vector operations using the HNSW (Hierarchical Navigable Small World) algorithm
     • Provides fast approximate nearest neighbor search
     • Supports multiple distance metrics (cosine, L2, etc.)

  2. GraphQL API

     • Modern, flexible API that's easy to use
     • Strong typing and schema validation
     • Good documentation and tooling support
     • Native Python client library

  3. Built-in Vectorization

     • Supports the text2vec-transformers module out of the box
     • Can use sentence-transformers models directly
     • Extensible to other vectorization methods (OpenAI, Cohere, etc.)
     • Handles vectorization automatically

  4. Hybrid Search Capability

     • Combines vector search with traditional keyword search (BM25)
     • Best of both worlds for different query types
     • Configurable weight between vector and keyword search

  5. Kubernetes Native

     • Official Helm charts maintained by Weaviate
     • Designed for cloud-native deployments
     • Supports StatefulSets for persistence
     • Good resource management
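
To show what a configurable weight between vector and keyword search means, here is a minimal sketch of score blending. It is not Weaviate's internal implementation: it min-max normalizes each score set and mixes them with a weight alpha, where alpha = 1 means pure vector search and alpha = 0 means pure keyword search. The function names and sample scores are hypothetical.

```python
from typing import Dict


def _min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to the range [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}  # all equal: treat as top score
    return {doc: (value - lo) / (hi - lo) for doc, value in scores.items()}


def hybrid_scores(
    vector_scores: Dict[str, float],
    keyword_scores: Dict[str, float],
    alpha: float = 0.5,
) -> Dict[str, float]:
    """Blend normalized vector and keyword scores, highest combined score first."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    vec = _min_max(vector_scores)
    kw = _min_max(keyword_scores)
    combined = {
        doc: alpha * vec.get(doc, 0.0) + (1.0 - alpha) * kw.get(doc, 0.0)
        for doc in set(vec) | set(kw)
    }
    return dict(sorted(combined.items(), key=lambda item: item[1], reverse=True))
```

A document found by only one of the two searches contributes zero from the other, so lowering alpha favors exact term matches such as error codes or identifiers.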

Operational Rationale

  1. Production Ready

     • Used by many organizations in production
     • Proven track record for reliability
     • Good performance characteristics
     • Mature codebase (4+ years old)

  2. Active Community

     • Large and growing community
     • Excellent documentation
     • Active development (frequent releases)
     • Good support channels (Discord, GitHub)

  3. Monitoring and Observability

     • Built-in Prometheus metrics
     • Grafana dashboards available
     • Detailed logging
     • Health check endpoints

  4. Backup and Recovery

     • Built-in backup functionality
     • Restore from stored backups (snapshot-based, not continuous point-in-time recovery)
     • Multiple backup backends supported
     • Well-documented disaster recovery procedures
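
The health check endpoints can be polled from a script or a smoke test. The sketch below calls Weaviate's readiness endpoint, /v1/.well-known/ready, using only the standard library. The base URL in the usage example is a placeholder, not the address of the actual deployment.

```python
import urllib.error
import urllib.request


def readiness_url(base_url: str) -> str:
    """Build the readiness endpoint URL for a Weaviate base URL."""
    return base_url.rstrip("/") + "/v1/.well-known/ready"


def is_ready(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if the readiness endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(readiness_url(base_url), timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or a non-2xx response.
        return False


if __name__ == "__main__":
    # Placeholder address; substitute the in-cluster service URL.
    print(is_ready("http://localhost:8080"))
```

In Kubernetes the same path would normally back the readiness probe, so a script like this is most useful from CI or a developer machine.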

Alternatives Considered

Pinecone

Pros:

  • Fully managed service
  • Very easy to use
  • Good performance
  • Excellent documentation

Cons:

  • Cloud-only (SaaS)
  • Vendor lock-in
  • Not self-hosted
  • Cost increases with scale

Decision: ❌ Rejected - We need a self-hosted solution to maintain control and reduce operational costs.

Milvus

Pros:

  • High performance
  • Builds on established ANN libraries such as FAISS
  • Large feature set
  • Good scalability

Cons:

  • Complex setup and operation
  • Heavy resource requirements
  • Steeper learning curve
  • More infrastructure to manage

Decision: ❌ Rejected - Too complex for our current needs; Weaviate provides sufficient performance with simpler operations.

PostgreSQL with pgvector Extension

Pros:

  • Familiar database
  • Simple extension
  • Easy to get started
  • No new infrastructure

Cons:

  • Not purpose-built for vectors
  • Limited scalability
  • Slower for large datasets
  • Less sophisticated search algorithms

Decision: ❌ Rejected - Not specialized enough; performance degrades at scale.

ChromaDB

Pros:

  • Simple and lightweight
  • Python-first design
  • Easy to embed
  • Good for prototyping

Cons:

  • Relatively new/immature
  • Limited production usage
  • Fewer features
  • Less proven at scale

Decision: ❌ Rejected - Too new and unproven for production use; we prefer a more mature solution.

Qdrant

Pros:

  • Good performance
  • Written in Rust
  • Growing community
  • Modern architecture

Cons:

  • Smaller community than Weaviate
  • Less mature ecosystem
  • Fewer integrations
  • Less documentation

Decision: ❌ Rejected - A viable option, but Weaviate has a better ecosystem and documentation.

Consequences

Positive

  1. Fast Semantic Search

     • HNSW algorithm provides excellent performance
     • Sub-100ms queries for most use cases
     • Scales well with dataset size

  2. Flexible Schema

     • Can easily add new document types
     • Strong typing prevents errors
     • GraphQL makes schema discovery easy

  3. Easy Integration

     • Well-documented Python client
     • Simple API design
     • Good examples and tutorials

  4. Kubernetes Native

     • Fits well with existing platform
     • Uses standard Kubernetes patterns
     • Easy to operate with existing tools

  5. Active Development

     • Regular updates and improvements
     • Security patches
     • New features added frequently

  6. Good Monitoring

     • Integrates with existing Prometheus/Grafana stack
     • Pre-built dashboards available
     • Detailed metrics exposed

Negative

  1. Learning Curve

     • Team needs to learn vector database concepts
     • GraphQL may be new to some developers
     • HNSW tuning requires understanding

  2. Additional Infrastructure

     • New component to maintain
     • Requires persistent storage
     • Adds to infrastructure complexity

  3. Resource Requirements

     • Memory-intensive for large datasets
     • CPU for vectorization
     • Storage for vectors and data

  4. Operational Overhead

     • Need to manage backups
     • Need to monitor performance
     • Need to plan capacity

Mitigation Strategies

  1. Training and Documentation

     • Create comprehensive documentation (done: docs/ai/vector-database.md)
     • Provide examples and tutorials
     • Conduct knowledge sharing sessions

  2. Start Small

     • Begin with 1 replica
     • Use modest resources (2Gi RAM, 1 CPU)
     • Scale up based on actual usage

  3. Monitoring from Day 1

     • Enable Prometheus metrics
     • Create Grafana dashboards
     • Set up alerts for issues

  4. Backup Strategy

     • Implement automated daily backups
     • Test restore procedures
     • Document disaster recovery process
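
For the "Start Small" sizing, a back-of-the-envelope estimate helps decide when to scale up. The sketch assumes float32 vectors and a commonly cited rule of thumb of roughly twice the raw vector size to cover the HNSW graph and runtime overhead. Both the factor and the 384-dimension example (typical of small sentence-transformers models) are assumptions, not measurements from this deployment.

```python
BYTES_PER_FLOAT32 = 4
OVERHEAD_FACTOR = 2.0  # assumption: rule-of-thumb multiplier, not a measured value


def estimated_memory_bytes(
    num_vectors: int, dimensions: int, overhead_factor: float = OVERHEAD_FACTOR
) -> int:
    """Approximate RAM needed to hold the vectors plus index overhead."""
    return int(num_vectors * dimensions * BYTES_PER_FLOAT32 * overhead_factor)


def max_vectors_for_budget(
    memory_bytes: int, dimensions: int, overhead_factor: float = OVERHEAD_FACTOR
) -> int:
    """Approximate number of vectors that fit in a memory budget."""
    per_vector = dimensions * BYTES_PER_FLOAT32 * overhead_factor
    return int(memory_bytes // per_vector)


if __name__ == "__main__":
    budget = 2 * 1024**3  # the 2Gi starting allocation
    print(max_vectors_for_budget(budget, dimensions=384))
    print(estimated_memory_bytes(1_000_000, dimensions=384) / 1024**3, "GiB")
```

The estimate ignores memory used by the process itself and by the vectorizer module, so the real capacity of a 2Gi pod will be lower. Observed memory metrics should drive the actual scaling decision.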

Implementation Plan

  1. Phase 1: Deployment (Done)

     • Deploy Weaviate via ArgoCD ✅
     • Configure persistent storage (10GB) ✅
     • Enable text2vec-transformers module ✅
     • Set up Prometheus monitoring ✅

  2. Phase 2: Testing (In Progress)

     • Create test indexing script ✅
     • Index sample documents ✅
     • Validate search functionality ⏳
     • Verify relevance scores >0.7 ⏳

  3. Phase 3: Production Indexing (Future)

     • Index all platform documentation
     • Index ADRs and runbooks
     • Index code examples
     • Set up incremental indexing

  4. Phase 4: Integration (Future)

     • Integrate with AI assistant
     • Build RAG pipeline
     • Create query interface
     • Add to Backstage portal
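
One possible approach to the incremental indexing step in Phase 3 is to keep a manifest of content hashes and re-index only the documents whose hash has changed. This is a sketch of that idea, not the planned implementation; the manifest format and function names are hypothetical.

```python
import hashlib
from typing import Dict, List, Tuple


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_incremental_index(
    documents: Dict[str, str], manifest: Dict[str, str]
) -> Tuple[List[str], List[str]]:
    """Compare current documents (path -> text) with the manifest (path -> hash).

    Returns (paths to index or re-index, paths to delete from the index).
    """
    to_index = sorted(
        path for path, text in documents.items()
        if manifest.get(path) != content_hash(text)
    )
    to_delete = sorted(path for path in manifest if path not in documents)
    return to_index, to_delete
```

After a successful run, the manifest would be rewritten with the new hashes so that the next run only sees subsequent changes.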

Validation

The decision will be validated by:

  1. Performance Metrics

     • Query latency <100ms at the 95th percentile
     • Relevance scores >0.7 for semantic queries
     • Indexing throughput >100 documents/second

  2. Operational Metrics

     • Uptime >99.9%
     • Daily backups complete successfully
     • Recovery time <30 minutes

  3. User Feedback

     • AI assistant provides relevant context
     • Documentation search returns useful results
     • Development productivity improvements
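
The latency criterion can be checked from recorded query timings. The sketch below uses the nearest-rank definition of a percentile, which is one of several common definitions. The function names and sample timings are illustrative.

```python
import math
from typing import Sequence


def percentile_nearest_rank(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def meets_latency_target(latencies_ms: Sequence[float], target_ms: float = 100.0) -> bool:
    """True if the 95th percentile latency is below the target."""
    return percentile_nearest_rank(latencies_ms, 95) < target_ms


if __name__ == "__main__":
    samples = [12.0, 18.5, 22.1, 35.0, 41.7, 48.3, 55.9, 63.2, 71.4, 96.0]
    print(percentile_nearest_rank(samples, 95), meets_latency_target(samples))
```

In production the same figure would normally come from a Prometheus histogram rather than raw samples, but a helper like this is convenient for the Phase 2 test scripts.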

References

  • ADR-001: Kubernetes Orchestration (infrastructure platform)
  • ADR-003: ArgoCD for GitOps (deployment method)
  • ADR-006: PostgreSQL (relational data storage)

Revision History

  • 2025-12-21: Initial version - Vector database selection for RAG system