ASTERIXDB-PR2: Add schema extraction pipeline for NL2SQL++ (SchemaContextBuilder)#47

Open
pineappleBest123 wants to merge 2 commits into apache:master from pineappleBest123:gsoc-pr2-schema-extraction

@pineappleBest123

Summary

Add the schema extraction pipeline for the GSoC 2026 NL2SQL++ project.
This patch builds on the servlet infrastructure introduced in #46.

Changes

  • ColumnInfo: field name, type string, and primary-key flag with
    prompt-ready toDescriptionString() output
  • DatasetSchema: holds all columns, and supports a pruned column subset
    (for ColumnPruner in a later PR) and value hints (for ValueHintsSampler)
  • DatasetSchemaFormatter: recursively converts ADM IAType objects to
    human-readable strings (supports nested records, arrays, multisets,
    and nullable unions, with a depth limit of 4)
  • SchemaContextBuilder: reads Dataset and type metadata from
    MetadataManager, builds a SchemaContext with one description
    string per Dataset, wrapped in a metadata transaction
  • 13 unit tests covering all formatter rules and schema pipeline behavior
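
To illustrate the recursive formatting with a depth limit, here is a minimal, self-contained sketch. It does not use AsterixDB's IAType API: the `TypeFormatSketch` class and its small type model are hypothetical stand-ins. The `?` suffix for nullable types and the `...` placeholder emitted at the depth limit are assumptions, not necessarily what DatasetSchemaFormatter produces.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

/** Hypothetical stand-in for a type tree; NOT AsterixDB's IAType API. */
final class TypeFormatSketch {
    static final int MAX_DEPTH = 4; // depth limit named in the PR description

    interface Type {}

    static final class Primitive implements Type {
        final String name;
        Primitive(String name) { this.name = name; }
    }

    static final class ArrayType implements Type {
        final Type item;
        ArrayType(Type item) { this.item = item; }
    }

    static final class MultisetType implements Type {
        final Type item;
        MultisetType(Type item) { this.item = item; }
    }

    static final class NullableType implements Type {
        final Type inner;
        NullableType(Type inner) { this.inner = inner; }
    }

    static final class RecordType implements Type {
        final Map<String, Type> fields = new LinkedHashMap<>(); // keeps field order
        RecordType field(String name, Type type) { fields.put(name, type); return this; }
    }

    static String format(Type type) {
        return format(type, 0);
    }

    private static String format(Type type, int depth) {
        if (depth >= MAX_DEPTH) {
            return "..."; // assumed placeholder once the depth limit is reached
        }
        if (type instanceof Primitive) {
            return ((Primitive) type).name;
        }
        if (type instanceof ArrayType) {
            return "[" + format(((ArrayType) type).item, depth + 1) + "]";
        }
        if (type instanceof MultisetType) {
            return "{{" + format(((MultisetType) type).item, depth + 1) + "}}";
        }
        if (type instanceof NullableType) {
            // A nullable wrapper does not add a nesting level in this sketch.
            return format(((NullableType) type).inner, depth) + "?";
        }
        if (type instanceof RecordType) {
            StringJoiner joiner = new StringJoiner(", ", "{", "}");
            for (Map.Entry<String, Type> e : ((RecordType) type).fields.entrySet()) {
                joiner.add(e.getKey() + ": " + format(e.getValue(), depth + 1));
            }
            return joiner.toString();
        }
        throw new IllegalArgumentException("Unknown type: " + type);
    }

    public static void main(String[] args) {
        Type location = new RecordType()
                .field("lat", new Primitive("double"))
                .field("tags", new NullableType(new MultisetType(new Primitive("string"))));
        System.out.println(format(location));
    }
}
```

Counting depth only on containers (records, arrays, multisets) keeps a nullable field at the same nesting level as its underlying type, which is one reasonable reading of "depth limit of 4".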

Example output

Dataset TweetMessages (tweetid: int64 [PK], sender-location: any,
send-time: datetime, referred-topics: [string], message-text: string,
author-id: int64)
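
A description string of this shape could be assembled roughly as follows. This is a sketch under stated assumptions: `ColumnInfoSketch`, its constructor, and `describeDataset` are hypothetical names for illustration (only `toDescriptionString()` is named in this PR), and the output is assumed to be a single line.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical illustration; the real ColumnInfo in this PR may differ. */
final class ColumnInfoSketch {
    private final String name;
    private final String typeString;
    private final boolean primaryKey;

    ColumnInfoSketch(String name, String typeString, boolean primaryKey) {
        this.name = name;
        this.typeString = typeString;
        this.primaryKey = primaryKey;
    }

    /** Renders "name: type", with a " [PK]" suffix for primary-key columns. */
    String toDescriptionString() {
        return name + ": " + typeString + (primaryKey ? " [PK]" : "");
    }

    /** Joins the column descriptions into one "Dataset <name> (...)" string. */
    static String describeDataset(String datasetName, List<ColumnInfoSketch> columns) {
        return columns.stream()
                .map(ColumnInfoSketch::toDescriptionString)
                .collect(Collectors.joining(", ", "Dataset " + datasetName + " (", ")"));
    }

    public static void main(String[] args) {
        System.out.println(describeDataset("TweetMessages", Arrays.asList(
                new ColumnInfoSketch("tweetid", "int64", true),
                new ColumnInfoSketch("sender-location", "any", false),
                new ColumnInfoSketch("send-time", "datetime", false),
                new ColumnInfoSketch("referred-topics", "[string]", false),
                new ColumnInfoSketch("message-text", "string", false),
                new ColumnInfoSketch("author-id", "int64", false))));
    }
}
```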

Testing

All unit tests pass: mvn test -pl asterixdb/asterix-spidersilk
