Add GitHub extractor for repositories, code ownership, and CI/CD #502

@ravisuhag

Context

Code-to-data relationships are among those AI agents need most often. An agent debugging a data issue needs to know what code produces a table, when that code last changed, and what the deploy pipeline looks like.

Scope

New GitHub extractor that captures:

  • Repositories — name, description, language, topics, visibility
  • Code ownership — CODEOWNERS file parsing, team-to-path mappings
  • CI/CD pipelines — GitHub Actions workflows, their triggers, and what they deploy
  • Pull requests — recent merges that changed data-related code (dbt models, migrations, pipeline configs)
  • Relationships — repo → assets it produces (via convention or config), team → repo ownership
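
To make the code ownership item concrete, here is a minimal sketch of CODEOWNERS parsing in Python. It is illustrative only and not a proposed implementation: `OwnershipRule`, `parse_codeowners`, and `team_to_paths` are hypothetical names, and the `@acme/...` teams in the usage note are made up. It assumes the documented file format (one pattern per line followed by owners, `#` starting a comment) and does not resolve precedence between overlapping patterns, where the last matching rule wins.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class OwnershipRule:
    """One CODEOWNERS line: a path pattern and the owners listed after it."""
    pattern: str
    owners: Tuple[str, ...]


def parse_codeowners(text: str) -> List[OwnershipRule]:
    """Parse CODEOWNERS content into rules, keeping file order.

    Blank lines and comments are skipped. A pattern with no owners is kept,
    since that form is used to unset ownership for a path.
    """
    rules = []
    for raw in text.splitlines():
        tokens = []
        for token in raw.split():
            if token.startswith("#"):  # full-line or inline comment starts here
                break
            tokens.append(token)
        if not tokens:
            continue
        rules.append(OwnershipRule(tokens[0], tuple(tokens[1:])))
    return rules


def team_to_paths(rules: List[OwnershipRule]) -> Dict[str, List[str]]:
    """Group patterns by team owner (owners of the form @org/team)."""
    mapping: Dict[str, List[str]] = {}
    for rule in rules:
        for owner in rule.owners:
            if owner.startswith("@") and "/" in owner:
                mapping.setdefault(owner, []).append(rule.pattern)
    return mapping
```

For example, a file containing `*.sql @acme/data-eng @alice` and `/dbt/ @acme/analytics` would yield two rules, and `team_to_paths` would map `@acme/data-eng` to `["*.sql"]` and `@acme/analytics` to `["/dbt/"]`. Individual users such as `@alice` are left out of the team mapping on purpose.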

Approach

  • Use GitHub API (REST or GraphQL) for metadata extraction
  • Support organization-wide scanning or per-repo configuration
  • Emit typed relationships linking repos/code to data assets where inferrable
  • Handle rate limiting and pagination for large organizations
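
As a rough illustration of the pagination and rate-limit handling, the sketch below follows the REST API's `Link` header (`rel="next"`) and waits when `X-RateLimit-Remaining` reaches zero, using `X-RateLimit-Reset` as the resume time. The helper names (`next_page_url`, `seconds_until_reset`, `paginate`) and the injected `fetch` callable are assumptions made for illustration, not part of any existing extractor. It also assumes header names arrive in the casing shown; a real HTTP client may normalize them differently.

```python
import re
import time
from typing import Callable, Iterator, Mapping, Optional, Tuple

# Hypothetical fetch signature: URL in, (list of items, response headers) out.
Fetch = Callable[[str], Tuple[list, Mapping[str, str]]]

_NEXT_LINK = re.compile(r'<([^>]+)>\s*;\s*rel="next"')


def next_page_url(link_header: Optional[str]) -> Optional[str]:
    """Return the rel="next" URL from a Link header, or None on the last page."""
    if not link_header:
        return None
    match = _NEXT_LINK.search(link_header)
    return match.group(1) if match else None


def seconds_until_reset(headers: Mapping[str, str], now: float) -> float:
    """Seconds to wait before the next request; 0 if quota remains or is unknown."""
    remaining = headers.get("X-RateLimit-Remaining")
    reset = headers.get("X-RateLimit-Reset")
    if remaining is None or reset is None or int(remaining) > 0:
        return 0.0
    return max(0.0, float(reset) - now)


def paginate(
    first_url: str,
    fetch: Fetch,
    sleep: Callable[[float], None] = time.sleep,
    now: Callable[[], float] = time.time,
) -> Iterator:
    """Yield items from every page, pausing when the rate limit is exhausted."""
    url: Optional[str] = first_url
    while url:
        items, headers = fetch(url)
        yield from items
        url = next_page_url(headers.get("Link"))
        if url:
            wait = seconds_until_reset(headers, now())
            if wait > 0:
                sleep(wait)
```

Passing `fetch`, `sleep`, and `now` as parameters keeps the loop testable without network access. An org-wide scan would likely also need retries on secondary rate limits, which this sketch does not cover.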

Why

GitHub is likely the highest-value new extractor because it connects code to data assets, the relationship AI agents ask about most often. "What code produces this table?" and "When did this pipeline last change?" are foundational questions.
