---
title: Lineage Graph Accelerator
emoji: 🔥
colorFrom: purple
colorTo: blue
sdk: gradio
sdk_version: 6.0.0
app_file: app.py
pinned: true
license: mit
short_description: AI data lineage extraction & export to data catalogs
tags:
  - data-lineage
  - mcp
  - gradio
  - data-governance
  - dbt
  - airflow
  - etl
  - mcp-in-action-track-productivity
  - hackathon
---

# Lineage Graph Accelerator 🔥

**AI-powered data lineage extraction and visualization for modern data platforms**

[Hugging Face Space](https://huggingface.co/spaces/aamanlamba/Lineage-graph-accelerator) | [License: MIT](https://opensource.org/licenses/MIT) | [Built with Gradio](https://gradio.app)

> **Built for the Gradio Agents & MCP Hackathon - Winter 2025**
>
> Celebrating MCP's 1st Birthday! This project demonstrates the power of MCP integration for enterprise data governance.

---

## What is Lineage Graph Accelerator?

Lineage Graph Accelerator is an AI-powered tool that helps data teams:

- **Extract** data lineage from dbt, Airflow, BigQuery, Snowflake, and more
- **Visualize** complex data dependencies with interactive Mermaid diagrams
- **Export** lineage to enterprise data catalogs (Collibra, Microsoft Purview, Alation)
- **Integrate** with MCP servers for enhanced AI-powered processing

### Why Data Lineage Matters

Understanding where your data comes from and where it goes is critical for:

- **Data Quality**: Track data transformations and identify issues
- **Compliance**: Document data flows for GDPR, CCPA, and other regulations
- **Impact Analysis**: Understand downstream effects of schema changes
- **Data Discovery**: Help analysts find and trust data assets

---

## 🎯 Key Features

### Multi-Source Support

| Source | Status | Description |
|--------|--------|-------------|
| dbt Manifest | ✅ | Parse dbt's manifest.json for model dependencies |
| Airflow DAG | ✅ | Extract task dependencies from DAG definitions |
| SQL DDL | ✅ | Parse CREATE statements for table lineage |
| BigQuery | ✅ | Query INFORMATION_SCHEMA for metadata |
| Custom JSON | ✅ | Flexible node/edge format for any source |
| Snowflake | Planned | Coming via MCP integration |

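As a rough illustration of the dbt row above: dbt's `manifest.json` keeps models under a top-level `nodes` map and sources under `sources`, and each model lists its upstream unique IDs in `depends_on.nodes`. The sketch below turns that into a simple node/edge structure. The function name `manifest_to_lineage` and the trimmed inline manifest are hypothetical, and this is not necessarily how the app's own dbt parser works.

```python
import json


def manifest_to_lineage(manifest: dict) -> dict:
    """Convert a dbt manifest dict into {"nodes": [...], "edges": [...]}."""
    nodes, edges = [], []
    # Sources live under "sources"; models, seeds, and snapshots under "nodes".
    entries = {**manifest.get("sources", {}), **manifest.get("nodes", {})}
    for unique_id, entry in entries.items():
        nodes.append({
            "id": unique_id,
            "type": entry.get("resource_type", "model"),
            "name": entry.get("name", unique_id),
        })
        # depends_on.nodes lists the unique IDs this entry reads from.
        for upstream in entry.get("depends_on", {}).get("nodes", []):
            edges.append({"from": upstream, "to": unique_id})
    return {"nodes": nodes, "edges": edges}


# Hypothetical, heavily trimmed manifest for illustration only.
sample_manifest = {
    "sources": {
        "source.shop.raw.orders": {"name": "orders", "resource_type": "source"},
    },
    "nodes": {
        "model.shop.stg_orders": {
            "name": "stg_orders",
            "resource_type": "model",
            "depends_on": {"nodes": ["source.shop.raw.orders"]},
        },
        "model.shop.fct_orders": {
            "name": "fct_orders",
            "resource_type": "model",
            "depends_on": {"nodes": ["model.shop.stg_orders"]},
        },
    },
}

lineage = manifest_to_lineage(sample_manifest)
print(json.dumps(lineage["edges"], indent=2))
```

A real manifest carries many more keys per node (for example macros under `depends_on.macros`); the sketch ignores everything except the model-to-model dependencies.
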
### Export to Data Catalogs

| Catalog | Status | Format |
|---------|--------|--------|
| OpenLineage | ✅ | Universal open standard |
| Collibra | ✅ | Data Intelligence Platform |
| Microsoft Purview | ✅ | Azure Data Governance |
| Alation | ✅ | Data Catalog |
| Apache Atlas | Planned | Coming soon |

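To give a feel for what an OpenLineage export might contain, the sketch below wraps a single lineage edge in an OpenLineage-style run event with `job`, `run`, `inputs`, and `outputs` sections. The helper name, the `demo` namespace, the producer URI, and the pinned `schemaURL` version are assumptions for illustration; check the OpenLineage specification for the version you target. The app's own exporter may structure its output differently.

```python
import json
import uuid
from datetime import datetime, timezone

# Placeholder values; replace with your own producer URI and spec version.
PRODUCER = "https://example.com/lineage-graph-accelerator"
SCHEMA_URL = "https://openlineage.io/spec/2-0-2/OpenLineage.json#/$defs/RunEvent"


def edge_to_run_event(edge: dict, namespace: str = "demo") -> dict:
    """Wrap one {"from", "to"} edge in an OpenLineage-style COMPLETE run event."""
    return {
        "eventType": "COMPLETE",
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "producer": PRODUCER,
        "schemaURL": SCHEMA_URL,
        "run": {"runId": str(uuid.uuid4())},
        "job": {"namespace": namespace, "name": f"{edge['from']}_to_{edge['to']}"},
        "inputs": [{"namespace": namespace, "name": edge["from"]}],
        "outputs": [{"namespace": namespace, "name": edge["to"]}],
    }


edges = [
    {"from": "source_db", "to": "staging"},
    {"from": "staging", "to": "analytics"},
]
events = [edge_to_run_event(edge) for edge in edges]
print(json.dumps(events[0], indent=2))
```

Modelling each edge as its own job is only one possible mapping; grouping several edges under one transformation job is equally reasonable when the source metadata names the job.
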
### Visualization Options

- **Mermaid Diagrams**: Interactive, client-side rendering
- **Subgraph Grouping**: Organize by data layer (raw, staging, marts)
- **Color-Coded Nodes**: Distinguish sources, tables, models, reports
- **Edge Labels**: Show transformation types

---

## Quick Start

### Try Online (HuggingFace Space)

1. Visit [Lineage Graph Accelerator on HuggingFace](https://huggingface.co/spaces/aamanlamba/Lineage-graph-accelerator)
2. Click "Load Sample" to load example data
3. Click "Extract Lineage" to see the visualization
4. Explore the Demo Gallery for more examples

### Run Locally

```bash
# Clone the repository
git clone https://github.com/aamanlamba/lineage-graph-accelerator.git
cd lineage-graph-accelerator

# Create and activate a virtual environment
python3 -m venv .venv
source .venv/bin/activate

# Install dependencies
pip install -r requirements.txt

# Run the app
python app.py
```

Open http://127.0.0.1:7860 in your browser.

---

## Usage Guide

### 1. Text/File Metadata Tab

Paste your metadata directly:

```json
{
  "nodes": [
    {"id": "source_db", "type": "source", "name": "Source Database"},
    {"id": "staging", "type": "table", "name": "Staging Table"},
    {"id": "analytics", "type": "table", "name": "Analytics Table"}
  ],
  "edges": [
    {"from": "source_db", "to": "staging"},
    {"from": "staging", "to": "analytics"}
  ]
}
```

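Before pasting custom JSON, it can help to check that every edge points at a declared node. The following is a minimal sketch of such a check, assuming only the `nodes`/`edges` shape shown above. `find_dangling_edges` is a hypothetical helper and not part of the app.

```python
import json


def find_dangling_edges(lineage: dict) -> list:
    """Return edges whose "from" or "to" is not a declared node id."""
    node_ids = {node["id"] for node in lineage.get("nodes", [])}
    return [
        edge
        for edge in lineage.get("edges", [])
        if edge.get("from") not in node_ids or edge.get("to") not in node_ids
    ]


# Same shape as the example above, but "analytics" is deliberately left undeclared.
metadata = json.loads("""
{
  "nodes": [
    {"id": "source_db", "type": "source", "name": "Source Database"},
    {"id": "staging", "type": "table", "name": "Staging Table"}
  ],
  "edges": [
    {"from": "source_db", "to": "staging"},
    {"from": "staging", "to": "analytics"}
  ]
}
""")

problems = find_dangling_edges(metadata)
print(problems)  # the second edge references "analytics", which is not declared
```
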
### 2. Sample Data

Load pre-built samples to explore different scenarios:

- **Simple JSON**: Basic node/edge lineage
- **dbt Manifest**: Full dbt project with 15+ models
- **Airflow DAG**: ETL pipeline with 15 tasks
- **Data Warehouse**: Snowflake-style multi-layer architecture
- **ETL Pipeline**: Complex multi-source pipeline
- **Complex Demo**: 50+ node e-commerce platform

### 3. Export to Data Catalogs

1. Extract lineage from your metadata
2. Expand "Export to Data Catalog"
3. Select format (OpenLineage, Collibra, Purview, Alation)
4. Click "Generate Export"
5. Copy the JSON for import into your catalog

---

## MCP Integration

Connect to MCP (Model Context Protocol) servers for enhanced processing:

```
┌───────────────┐     ┌───────────────┐     ┌───────────────┐
│ Lineage Graph │────▶│  MCP Server   │────▶│   AI Model    │
│  Accelerator  │     │ (HuggingFace) │     │   (Claude)    │
└───────────────┘     └───────────────┘     └───────────────┘
```

### Configuration

1. Expand "MCP Server Configuration" in the UI
2. Enter your MCP server URL
3. Add API key (if required)
4. Click "Test Connection"

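MCP messages are JSON-RPC 2.0, so a connection test amounts to posting a small JSON request to the server URL. The sketch below only builds such a request with the standard library and does not send it. The `ping` method comes from the MCP specification; the bearer-token header, the helper name, and the idea that the "Test Connection" button does something similar are assumptions. A server that is not a full MCP implementation may expect a different payload.

```python
import json
import urllib.request
from typing import Optional


def build_ping_request(server_url: str, api_key: Optional[str] = None) -> urllib.request.Request:
    """Build (but do not send) a JSON-RPC 2.0 ping request for an MCP endpoint."""
    payload = {"jsonrpc": "2.0", "id": 1, "method": "ping"}
    headers = {
        "Content-Type": "application/json",
        # HTTP-based MCP servers may answer with plain JSON or an event stream.
        "Accept": "application/json, text/event-stream",
    }
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"  # assumed auth scheme
    return urllib.request.Request(
        server_url,
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


# The URL and key are placeholders; substitute your own server details.
request = build_ping_request("http://localhost:9000/mcp", api_key="demo-key")
print(request.get_method(), request.full_url)
print(json.loads(request.data))
# To actually send it: urllib.request.urlopen(request, timeout=5)
```
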
### Run Local MCP Server

```bash
uvicorn mcp_example.server:app --reload --port 9000
```

Then use `http://localhost:9000/mcp` as your server URL.

---

## Architecture

```mermaid
flowchart TD
    A[User Interface - Gradio] --> B[Input Parser]
    B --> C{Source Type}
    C -->|dbt| D[dbt Parser]
    C -->|Airflow| E[Airflow Parser]
    C -->|SQL| F[SQL Parser]
    C -->|JSON| G[JSON Parser]
    D & E & F & G --> H[LineageGraph]
    H --> I[Mermaid Generator]
    H --> J[Export Engine]
    I --> K[Visualization]
    J --> L[OpenLineage]
    J --> M[Collibra]
    J --> N[Purview]
    J --> O[Alation]

    subgraph Optional
        P[MCP Server] --> H
    end
```

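The step from `LineageGraph` to the Mermaid generator in the diagram above can be pictured as a small text generator. The sketch below is one possible version, not the app's actual code. It assumes the node/edge dictionaries from the Usage Guide plus two hypothetical optional keys: `layer` on nodes (used for subgraph grouping) and `label` on edges. Node ids are assumed to be Mermaid-safe (letters, digits, underscores).

```python
def to_mermaid(lineage: dict, direction: str = "TD") -> str:
    """Render {"nodes": [...], "edges": [...]} as Mermaid flowchart text."""
    lines = [f"flowchart {direction}"]

    # Group nodes by their optional "layer" so each layer becomes a subgraph.
    layers = {}
    for node in lineage["nodes"]:
        layers.setdefault(node.get("layer"), []).append(node)

    for layer, nodes in layers.items():
        indent = "    "
        if layer is not None:
            # Prefix the subgraph id so it cannot collide with a node id.
            lines.append(f"    subgraph layer_{layer} [{layer}]")
            indent = "        "
        for node in nodes:
            lines.append(f'{indent}{node["id"]}["{node["name"]}"]')
        if layer is not None:
            lines.append("    end")

    for edge in lineage["edges"]:
        label = f'|{edge["label"]}|' if edge.get("label") else ""
        lines.append(f'    {edge["from"]} -->{label} {edge["to"]}')
    return "\n".join(lines)


lineage = {
    "nodes": [
        {"id": "source_db", "name": "Source Database", "layer": "raw"},
        {"id": "staging", "name": "Staging Table", "layer": "staging"},
        {"id": "analytics", "name": "Analytics Table", "layer": "marts"},
    ],
    "edges": [
        {"from": "source_db", "to": "staging", "label": "load"},
        {"from": "staging", "to": "analytics"},
    ],
}

mermaid_text = to_mermaid(lineage)
print(mermaid_text)
```

The printed text can be pasted into any Mermaid renderer. Color-coding by node type would need additional `classDef` lines, which this sketch leaves out.
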
### Project Structure

```
lineage-graph-accelerator/
├── app.py                  # Main Gradio application
├── exporters/              # Data catalog exporters
│   ├── __init__.py
│   ├── base.py             # Base classes
│   ├── openlineage.py      # OpenLineage format
│   ├── collibra.py         # Collibra format
│   ├── purview.py          # Microsoft Purview format
│   └── alation.py          # Alation format
├── samples/                # Sample data files
│   ├── sample_metadata.json
│   ├── dbt_manifest_sample.json
│   ├── airflow_dag_sample.json
│   ├── sql_ddl_sample.sql
│   ├── warehouse_lineage_sample.json
│   ├── etl_pipeline_sample.json
│   └── complex_lineage_demo.json
├── mcp_example/            # Example MCP server
│   └── server.py
├── tests/                  # Unit tests
│   └── test_app.py
├── memories/               # Agent configuration
├── USER_GUIDE.md           # Comprehensive user guide
├── BUILD_PLAN.md           # Development roadmap
└── requirements.txt
```

---

## 🧪 Testing

```bash
# Activate virtual environment
source .venv/bin/activate

# Run unit tests
python -m unittest tests.test_app -v

# Run setup validation
python test_setup.py
```

---

## Requirements

- Python 3.9+
- Gradio 5.49.1+
- See `requirements.txt` for full dependencies

---

## Competition Submission

**Track**: Track 2 - MCP in Action (Productivity)

**Team Members**:

- [Aaman Lamba](https://aamanlamba.com) | [HuggingFace](https://huggingface.co/aamanlamba) | [GitHub](https://github.com/aamanlamba)

### Judging Criteria Alignment

| Criteria | Implementation |
|----------|----------------|
| **UI/UX Design** | Clean, professional interface with tabs, accordions, and color-coded visualizations |
| **Functionality** | Full MCP integration, multiple input formats, 5 export formats |
| **Creativity** | Novel approach to data lineage visualization with AI-powered parsing |
| **Documentation** | Comprehensive README, USER_GUIDE.md, inline comments |
| **Real-world Impact** | Solves critical enterprise need for data governance and compliance |

### Demo Video

📺 **YouTube**: [Watch the Demo](https://youtu.be/U4Dfc7txa_0)

🎥 **Loom**: [Alternative Link](https://www.loom.com/share/3de27e88e01f4e97bfd13e4f0031f416)

**Highlights**:

- AI Assistant with Google Gemini generating lineage from natural language
- MCP Integration with Local Demo server
- Demo Gallery with 50+ node complex pipelines
- Export to Collibra, Purview, and Apache Atlas
- Interactive Mermaid visualizations with zoom and download

### Social Media Post

📱 **LinkedIn**: [View the announcement post](https://www.linkedin.com/posts/aamanlamba_lineage-graph-accelerator-a-hugging-face-activity-7400658296166297600-n9a6)

---

## Roadmap

- [x] Gradio 6 upgrade for enhanced UI components
- [x] Agentic chatbot for natural language queries (Google Gemini)
- [x] Apache Atlas export support
- [ ] File upload functionality
- [x] Graph export as PNG/SVG
- [ ] Batch processing API
- [ ] Column-level lineage

---

## 🤝 Contributing

Contributions welcome! Please:

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Submit a pull request

See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.

---

## License

MIT License - see [LICENSE](LICENSE) for details.

---

## Acknowledgments

- **Anthropic** - MCP Protocol and Claude
- **Gradio Team** - Amazing UI framework
- **HuggingFace** - Hosting and community
- **dbt Labs** - Inspiration for metadata standards
- **OpenLineage** - Open lineage specification

---

## Support

- **Documentation**: [USER_GUIDE.md](USER_GUIDE.md)
- **Author Website**: [aamanlamba.com](https://aamanlamba.com)
- **Issues**: [GitHub Issues](https://github.com/aamanlamba/lineage-graph-accelerator/issues)
- **Discussion**: [HuggingFace Community](https://huggingface.co/spaces/aamanlamba/Lineage-graph-accelerator/discussions)

---

<p align="center">
  Built with ❤️ by <a href="https://aamanlamba.com"><strong>Aaman Lamba</strong></a> for the <strong>Gradio Agents & MCP Hackathon - Winter 2025</strong>
  <br>
  Celebrating MCP's 1st Birthday!
</p>