DataHub Data Catalog - Overview
What is DataHub?
DataHub is an open-source metadata platform that enables data discovery, observability, and governance across all data assets in the Fawkes platform. It acts as a centralized catalog where teams can find, understand, and trust their data.
Why We Use DataHub
In modern data platforms, data is scattered across many systems: databases, data lakes, data warehouses, and streaming platforms. DataHub addresses several key challenges:
- Data Discovery: Find the right data quickly through powerful search and browse capabilities
- Data Lineage: Understand where data comes from and how it flows through systems
- Data Governance: Manage data ownership, classification, and compliance
- Data Quality: Track data quality metrics and identify issues
- Collaboration: Share knowledge about data through descriptions, tags, and documentation
- Impact Analysis: Understand downstream impacts before making changes
DataHub integrates seamlessly with our existing platform components:
- PostgreSQL databases for application data
- Kafka/event streams for real-time data (future)
- S3/object storage for data lakes (future)
- dbt models for transformation lineage (future)
Architecture Overview
DataHub follows a microservices architecture with the following key components:
┌─────────────────────────────────────────────────────────────────┐
│ Data Sources │
│ ├─ PostgreSQL databases (Backstage, Focalboard, etc.) │
│ ├─ Future: Kafka topics │
│ ├─ Future: S3 buckets │
│ └─ Future: dbt models │
└────────────────────────┬────────────────────────────────────────┘
│ Metadata Ingestion
▼
┌─────────────────────────────────────────────────────────────────┐
│ DataHub Components │
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ GMS (Generalized Metadata Service) │ │
│ │ - Core metadata API │ │
│ │ - GraphQL endpoint │ │
│ │ - REST API │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │ │
│ ┌──────────────────────┴──────────────────────────────────┐ │
│ │ │ │
│ ▼ ▼ │
│ ┌─────────────────────┐ ┌──────────────────┐ │
│ │ PostgreSQL │ │ OpenSearch │ │
│ │ - Metadata storage │ │ - Search index │ │
│ │ - Versioning │ │ - Full-text │ │
│ └─────────────────────┘ └──────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Frontend (React UI) │ │
│ │ - Search & browse │ │
│ │ - Lineage visualization │ │
│ │ - Metadata management │ │
│ └─────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ Users │
│ ├─ Data Engineers │
│ ├─ Data Scientists │
│ ├─ Analytics Engineers │
│ └─ Product Managers │
└─────────────────────────────────────────────────────────────────┘
Component Details
1. GMS (Generalized Metadata Service)
- Purpose: Core backend service that stores and serves metadata
- Port: 8080
- APIs: GraphQL and REST
- Storage: PostgreSQL for metadata, OpenSearch for search indexing
- Features:
- CRUD operations on metadata entities
- Relationship management
- Search and discovery
- Access control (future)
2. DataHub Frontend
- Purpose: Web-based UI for interacting with metadata
- Port: 9002
- Technology: React.js
- Features:
- Visual search and browse
- Interactive lineage graphs
- Metadata editing
- User management
3. PostgreSQL
- Purpose: Primary metadata storage
- Implementation: CloudNativePG cluster (db-datahub-dev)
- Configuration: HA with 3 replicas
- Location:
db-datahub-dev-rw.fawkes.svc.cluster.local:5432
4. OpenSearch
- Purpose: Search indexing for fast metadata discovery
- Implementation: Shared OpenSearch cluster in logging namespace
- Location:
opensearch-cluster-master.logging.svc.cluster.local:9200
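A quick way to sanity-check both backends from inside the cluster (for example, from a utility pod) is a small Python probe. This is a minimal sketch assuming the in-cluster service names above, that the requests library is installed, and that GMS serves its standard /health endpoint:
import requests
# In-cluster endpoints from the component details above
GMS_URL = "http://datahub-datahub-gms.fawkes.svc:8080"
OPENSEARCH_URL = "http://opensearch-cluster-master.logging.svc.cluster.local:9200"
def gms_healthy() -> bool:
    # GMS returns HTTP 200 from /health once it is ready to serve traffic
    return requests.get(f"{GMS_URL}/health", timeout=5).status_code == 200
def opensearch_status() -> str:
    # _cluster/health reports green/yellow/red for the search cluster
    resp = requests.get(f"{OPENSEARCH_URL}/_cluster/health", timeout=5)
    resp.raise_for_status()
    return resp.json()["status"]
if __name__ == "__main__":
    print(f"GMS healthy: {gms_healthy()}")
    print(f"OpenSearch status: {opensearch_status()}")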
How to Access DataHub
Local Development
DataHub UI is accessible at:
http://datahub.127.0.0.1.nip.io
Default Credentials (MVP)
For the initial MVP deployment, authentication is simplified:
- Username: datahub
- Password: datahub
⚠️ Production Note: For production deployments, enable OIDC authentication with GitHub OAuth or use other enterprise SSO solutions.
API Access
GraphQL API
# GraphQL endpoint
curl -X POST http://datahub.127.0.0.1.nip.io/api/graphql \
-H "Content-Type: application/json" \
-d '{"query": "{ search(input: {type: DATASET, query: \"*\", start: 0, count: 10}) { total } }"}'
REST API (via GMS)
# Internal GMS Rest.li API - fetch an entity by URL-encoded URN
curl "http://datahub-datahub-gms.fawkes.svc:8080/entities/<url-encoded-urn>"
How to Search for Data
DataHub provides multiple ways to discover data assets:
1. Search by Keywords
Navigate to the search bar and enter keywords:
- Dataset names (e.g., "backstage_users")
- Column names (e.g., "email")
- Descriptions
- Tags
Example Searches:
- backstage - Find all assets related to Backstage
- user email - Find datasets containing user email data
- tag:PII - Find all assets tagged as PII
2. Browse by Platform
Click on "Browse" and navigate through:
- Platforms: PostgreSQL, Kafka, S3, etc.
- Domains: Analytics, Operations, Product, etc.
- Ownership: Filter by team or owner
3. Advanced Filters
Use the filter panel to refine searches:
- Entity Type: Dataset, Dashboard, Pipeline, etc.
- Platform: PostgreSQL, MySQL, BigQuery, etc.
- Domain: Engineering, Marketing, Finance, etc.
- Tags: PII, Critical, Deprecated, etc.
- Ownership: Team or individual
4. Using the CLI
Install DataHub CLI:
pip install 'acryl-datahub[all]'
Fetch metadata from the command line:
# Fetch a specific dataset by URN
datahub get --urn "urn:li:dataset:(urn:li:dataPlatform:postgres,backstage.users,PROD)"
How to Add Metadata
Method 1: Through the UI
- Navigate to an Asset: Search or browse to find the dataset
- Click "Edit": Located in the top-right corner
- Add Information:
  - Description: Explain what this dataset contains
  - Tags: Add relevant tags (e.g., PII, Critical, Deprecated)
  - Domain: Assign to a business domain
  - Owners: Add technical and business owners
  - Custom Properties: Add key-value pairs for additional context
- Save Changes: Click "Save" to persist metadata
Method 2: Using DataHub CLI
Create an ingestion recipe:
# postgres-recipe.yml
source:
  type: postgres
  config:
    host_port: "db-backstage-dev-rw.fawkes.svc.cluster.local:5432"
    database: "backstage_db"
    username: "${POSTGRES_USER}"
    password: "${POSTGRES_PASSWORD}"
    include_tables: true
    include_views: true
    profiling:
      enabled: true
      profile_table_level_only: false
sink:
  type: datahub-rest
  config:
    server: "http://datahub-datahub-gms.fawkes.svc:8080"
Run ingestion:
# Set credentials
export POSTGRES_USER="backstage_user"
export POSTGRES_PASSWORD="your-password"
# Ingest metadata
datahub ingest -c postgres-recipe.yml
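The same ingestion can also be triggered from Python rather than the CLI, which is convenient for scheduled jobs. A minimal sketch using the SDK's Pipeline runner with the recipe above inlined as a config dict (credentials come from the same environment variables):
import os
from datahub.ingestion.run.pipeline import Pipeline
# The postgres-recipe.yml above, expressed as a config dict
pipeline = Pipeline.create({
    "source": {
        "type": "postgres",
        "config": {
            "host_port": "db-backstage-dev-rw.fawkes.svc.cluster.local:5432",
            "database": "backstage_db",
            "username": os.environ["POSTGRES_USER"],
            "password": os.environ["POSTGRES_PASSWORD"],
            "include_tables": True,
            "include_views": True,
            "profiling": {"enabled": True, "profile_table_level_only": False},
        },
    },
    "sink": {
        "type": "datahub-rest",
        "config": {"server": "http://datahub-datahub-gms.fawkes.svc:8080"},
    },
})
pipeline.run()
pipeline.raise_from_status()  # raise if the run reported failures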
Method 3: Programmatic API
Use Python SDK:
from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import GlobalTagsClass, TagAssociationClass
# Initialize emitter
emitter = DatahubRestEmitter("http://datahub-datahub-gms.fawkes.svc:8080")
# Create dataset URN
dataset_urn = make_dataset_urn(
    platform="postgres",
    name="backstage.users",
    env="PROD",
)
# Add tags
tags = GlobalTagsClass(
    tags=[
        TagAssociationClass(tag="urn:li:tag:PII"),
        TagAssociationClass(tag="urn:li:tag:Critical"),
    ]
)
# Emit metadata (entity type and aspect name are inferred from the aspect)
emitter.emit_mcp(
    MetadataChangeProposalWrapper(entityUrn=dataset_urn, aspect=tags)
)
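Other aspects follow the same pattern. For instance, a sketch that sets a dataset description through the DatasetPropertiesClass aspect, reusing the emitter and dataset_urn defined above (the description text and custom property are illustrative):
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.metadata.schema_classes import DatasetPropertiesClass
# Emitting an aspect upserts it: the new value replaces the old one wholesale
properties = DatasetPropertiesClass(
    description="Backstage user accounts, including email addresses (PII).",
    customProperties={"steward": "platform-team"},  # illustrative key/value
)
emitter.emit_mcp(
    MetadataChangeProposalWrapper(entityUrn=dataset_urn, aspect=properties)
)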
Understanding Lineage Graphs
Data lineage shows how data flows through your systems, helping you understand:
- Upstream dependencies: What data sources feed into this dataset?
- Downstream consumers: What processes or dashboards depend on this dataset?
- Transformation logic: How data changes as it moves through pipelines
Viewing Lineage
- Navigate to a Dataset: Search for the dataset
- Click "Lineage" Tab: View the lineage graph
- Explore the Graph:
- Upstream: Click the left arrow to see data sources
- Downstream: Click the right arrow to see consumers
- Depth: Adjust the depth slider to show more hops
Lineage Graph Elements
┌────────────────┐
│ Source Table 1 │─────┐
└────────────────┘ │
▼
┌────────────────┐ ┌──────────────┐ ┌──────────────┐
│ Source Table 2 │──▶│ Transform │──▶│ Target Table │
└────────────────┘ │ (dbt model) │ └──────────────┘
└──────────────┘ │
▼
┌────────────────┐
│ Dashboard │
└────────────────┘
- Blue boxes: Datasets (tables, views)
- Green boxes: Transformations (dbt models, pipelines)
- Purple boxes: Dashboards and reports
- Arrows: Data flow direction
Adding Lineage
Using Python SDK
from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import UpstreamLineageClass, UpstreamClass
# Initialize emitter
emitter = DatahubRestEmitter("http://datahub-datahub-gms.fawkes.svc:8080")
# Define upstream datasets
upstream = UpstreamLineageClass(
    upstreams=[
        UpstreamClass(
            dataset=make_dataset_urn("postgres", "backstage.services"),
            type="TRANSFORMED",
        ),
        UpstreamClass(
            dataset=make_dataset_urn("postgres", "backstage.catalog"),
            type="TRANSFORMED",
        ),
    ]
)
# Emit lineage
emitter.emit_mcp(
    MetadataChangeProposalWrapper(
        entityUrn=make_dataset_urn("postgres", "analytics.service_stats"),
        aspect=upstream,
    )
)
Using dbt Integration
DataHub automatically extracts lineage from dbt models:
# dbt_project.yml
models:
  fawkes:
    +meta:
      datahub:
        enabled: true
# Generate and ingest dbt metadata
dbt docs generate
datahub ingest -c dbt-recipe.yml
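The dbt-recipe.yml referenced above is not reproduced in this doc; a minimal sketch of an equivalent configuration, run programmatically, might look like this. The artifact paths are illustrative (dbt writes them under target/ by default), and target_platform should name the warehouse the models run against:
from datahub.ingestion.run.pipeline import Pipeline
# Illustrative dbt ingestion config; adjust paths to your dbt project layout
pipeline = Pipeline.create({
    "source": {
        "type": "dbt",
        "config": {
            "manifest_path": "target/manifest.json",
            "catalog_path": "target/catalog.json",
            "target_platform": "postgres",  # platform the models materialize on
        },
    },
    "sink": {
        "type": "datahub-rest",
        "config": {"server": "http://datahub-datahub-gms.fawkes.svc:8080"},
    },
})
pipeline.run()
pipeline.raise_from_status()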
Common Use Cases
Use Case 1: Finding PII Data
Scenario: You need to identify all tables containing personally identifiable information for GDPR compliance.
Steps:
- Search for tag:PII in the search bar (a scripted version is sketched after these steps)
- Review all results
- Verify ownership and access controls
- Document data retention policies
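For a repeatable audit, the search can be scripted against the GraphQL API using a facet filter on tags. A minimal sketch; the filters field and the tags facet name follow DataHub's public GraphQL schema, but verify them against your deployed version:
import requests
GRAPHQL_URL = "http://datahub.127.0.0.1.nip.io/api/graphql"
# List datasets carrying the PII tag via a facet filter
query = """
{
  search(input: {
    type: DATASET, query: "*", start: 0, count: 100,
    filters: [{field: "tags", values: ["urn:li:tag:PII"]}]
  }) {
    total
    searchResults { entity { urn } }
  }
}
"""
resp = requests.post(GRAPHQL_URL, json={"query": query}, timeout=10)
resp.raise_for_status()
for hit in resp.json()["data"]["search"]["searchResults"]:
    print(hit["entity"]["urn"])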
Use Case 2: Impact Analysis Before Schema Change
Scenario: You want to change a column in the users table but need to know what will break.
Steps:
- Navigate to the users dataset
- Click the "Lineage" tab
- Set depth to 3-4 hops
- Identify all downstream consumers (a scripted version is sketched below)
- Coordinate with downstream owners before making changes
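Steps 3 and 4 can also be scripted. A minimal sketch using the GraphQL searchAcrossLineage query to enumerate downstream entities of the users table; the query name and input fields follow DataHub's public schema, and the URN is illustrative:
import requests
GRAPHQL_URL = "http://datahub.127.0.0.1.nip.io/api/graphql"
# Illustrative URN for the users table
URN = "urn:li:dataset:(urn:li:dataPlatform:postgres,backstage.users,PROD)"
query = """
query impact($urn: String!) {
  searchAcrossLineage(input: {
    urn: $urn, direction: DOWNSTREAM, query: "*", start: 0, count: 100
  }) {
    total
    searchResults { entity { urn type } }
  }
}
"""
resp = requests.post(
    GRAPHQL_URL, json={"query": query, "variables": {"urn": URN}}, timeout=10
)
resp.raise_for_status()
data = resp.json()["data"]["searchAcrossLineage"]
print(f"{data['total']} downstream entities")
for hit in data["searchResults"]:
    print(hit["entity"]["type"], hit["entity"]["urn"])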
Use Case 3: Data Quality Monitoring
Scenario: Track data quality metrics for critical datasets.
Steps:
- Integrate Great Expectations with DataHub
- Run validation checks
- View results in DataHub UI under "Validation" tab
- Set up alerts for failures
Use Case 4: New Team Member Onboarding
Scenario: A new data engineer joins and needs to understand data assets.
Steps:
- Browse by domain (e.g., "User Analytics")
- Review dataset descriptions and schemas
- Check lineage to understand data flows
- Contact dataset owners for questions
Troubleshooting Common Issues
Issue 1: DataHub UI Not Loading
Symptoms: Browser shows connection error or blank page
Checks:
# Check if pods are running
kubectl get pods -n fawkes | grep datahub
# Check frontend logs
kubectl logs -n fawkes deployment/datahub-datahub-frontend -f
# Check GMS logs
kubectl logs -n fawkes deployment/datahub-datahub-gms -f
# Check ingress
kubectl get ingress -n fawkes datahub-datahub-frontend
Solutions:
- Verify PostgreSQL is running: kubectl get pods -n fawkes | grep db-datahub
- Verify OpenSearch is accessible: kubectl get pods -n logging | grep opensearch
- Check resource limits aren't exceeded
- Review pod events: kubectl describe pod -n fawkes <pod-name>
Issue 2: Search Not Returning Results
Symptoms: Metadata exists but search returns empty results
Checks:
# Check OpenSearch is healthy
kubectl exec -it -n logging opensearch-cluster-master-0 -- \
curl -XGET "http://localhost:9200/_cluster/health?pretty"
# Check indices exist
kubectl exec -it -n logging opensearch-cluster-master-0 -- \
curl -XGET "http://localhost:9200/_cat/indices?v"
Solutions:
- Rebuild search index:
# Trigger a reindex from the GMS pod
kubectl exec -n fawkes deployment/datahub-datahub-gms -- \
  curl -X POST "http://localhost:8080/gms/operations?action=restoreIndices"
- Verify OpenSearch connectivity from DataHub pods
- Check OpenSearch resource limits
Issue 3: Ingestion Failures
Symptoms: DataHub CLI ingestion fails or doesn't show metadata
Checks:
# Run with debug mode
datahub ingest -c recipe.yml --debug
# Check GMS logs during ingestion
kubectl logs -n fawkes deployment/datahub-datahub-gms -f
Solutions:
- Verify database credentials are correct
- Check network connectivity from ingestion source
- Ensure ingestion recipe format is correct
- Review DataHub version compatibility
Issue 4: Slow Performance
Symptoms: UI is slow, searches take a long time
Checks:
# Check resource usage
kubectl top pods -n fawkes | grep datahub
kubectl top pods -n logging | grep opensearch
# Check database connections
kubectl exec -it -n fawkes db-datahub-dev-1 -- \
psql -U datahub_user -d datahub -c "SELECT count(*) FROM pg_stat_activity;"
Solutions:
- Increase resource limits for DataHub components
- Scale OpenSearch replicas if needed
- Optimize PostgreSQL queries
- Enable query result caching
Issue 5: Authentication Problems
Symptoms: Cannot log in or access is denied
Checks:
# Check frontend environment variables
kubectl get deployment -n fawkes datahub-datahub-frontend -o yaml | grep -A 10 env
# Check GMS auth settings
kubectl get deployment -n fawkes datahub-datahub-gms -o yaml | grep AUTH
Solutions:
- For MVP: Verify JAAS username/password auth is enabled (AUTH_JAAS_ENABLED must not be set to false)
- For production: Configure OIDC with GitHub OAuth
- Check user exists in DataHub user management
- Reset admin password if needed
Best Practices
1. Metadata Governance
- Assign Owners: Every dataset should have a technical and business owner
- Use Tags Consistently: Establish standard tags (PII, Critical, Deprecated, etc.)
- Add Descriptions: Write clear, concise descriptions for all datasets
- Keep Lineage Updated: Ensure lineage reflects actual data flows
2. Search Optimization
- Use Specific Terms: Search for exact table or column names when possible
- Leverage Filters: Combine keywords with filters for better results
- Save Searches: Bookmark frequently used searches
- Browse by Domain: Organize data assets by business domains
3. Data Quality
- Integrate with Great Expectations: Automate data quality checks
- Monitor Critical Datasets: Set up alerts for important datasets
- Document Quality Issues: Use DataHub to track known issues
- Regular Reviews: Schedule periodic data quality reviews
4. Collaboration
- Use Comments: Discuss datasets directly in DataHub
- Share Knowledge: Add helpful documentation and examples
- Ask Questions: Use the UI to ask dataset owners questions
- Stay Informed: Subscribe to notifications for critical datasets
Integration with Fawkes Platform
DataHub integrates with other Fawkes components:
Backstage Integration
- Link to DataHub from Backstage catalog
- Show data lineage for services
- Display data quality metrics in service pages
DORA Metrics
- Track data pipeline deployment frequency
- Measure time to restore data after incidents
- Monitor change failure rate for data pipelines
Observability
- Send DataHub metrics to Prometheus
- Create Grafana dashboards for metadata health
- Alert on ingestion failures or data quality issues
Related Documentation
- DataHub Official Documentation
- Great Expectations Integration
- PostgreSQL Ingestion
- dbt Integration
- Architecture Documentation
- Data Platform Epic
Getting Help
- Internal Support: Post in the #data-platform Mattermost channel
- DataHub Community: DataHub Slack
- GitHub Issues: DataHub GitHub
- Documentation: Official Docs