-
Notifications
You must be signed in to change notification settings - Fork 44
Open
Description
Context
Code-to-data relationships are among the most commonly needed by AI agents. An agent debugging a data issue needs to know what code produces a table, when it last changed, and what the deploy pipeline looks like.
Scope
New GitHub extractor that captures:
- Repositories — name, description, language, topics, visibility
- Code ownership — CODEOWNERS file parsing, team-to-path mappings
- CI/CD pipelines — GitHub Actions workflows, their triggers, and what they deploy
- Pull requests — recent merges that changed data-related code (dbt models, migrations, pipeline configs)
- Relationships — repo → assets it produces (via convention or config), team → repo ownership
Approach
- Use GitHub API (REST or GraphQL) for metadata extraction
- Support organization-wide scanning or per-repo configuration
- Emit typed relationships linking repos/code to data assets where inferrable
- Rate limiting and pagination handling for large organizations
Why
GitHub is likely the highest-value new extractor because it connects code to data assets — the relationship most commonly asked about by AI agents. "What code produces this table?" and "when did this pipeline last change?" are foundational questions.
References
Reactions are currently unavailable
Metadata
Metadata
Assignees
Labels
No labels