
SpiderMan

A comprehensive, high-quality, human-annotated plain-text dataset for SQL AI tasks across diverse domains and complexity levels.

Why SpiderMan

SpiderMan is an improved version of the Spider 1.0 dataset.

  • The databases are made available in plain-text format instead of a set of SQLite files. This makes it easy for you to load the dataset into any database of your choice.
  • The schema has been standardized, including the correction of table ordering, column data types, and the enforcement of primary and foreign key constraints.
  • Data has been corrected so that it passes schema-based validation.
  • Queries have been fixed so that they execute successfully.
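
The enforced key constraints can be exercised with any engine. A minimal sketch, using Python's built-in sqlite3 as a stand-in engine (the two-table schema is hypothetical, not from the dataset):

```python
import sqlite3

# In-memory database; SQLite enforces foreign keys only when asked.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

# Hypothetical two-table schema mirroring the primary/foreign key
# constraints the dataset now enforces.
conn.execute("CREATE TABLE department (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE employee ("
    "id INTEGER PRIMARY KEY, "
    "dept_id INTEGER REFERENCES department(id))"
)

conn.execute("INSERT INTO department VALUES (1, 'Research')")
conn.execute("INSERT INTO employee VALUES (1, 1)")  # valid reference

rejected = False
try:
    conn.execute("INSERT INTO employee VALUES (2, 99)")  # dangling reference
except sqlite3.IntegrityError:
    rejected = True
print("dangling reference rejected:", rejected)
```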

Dataset

The dataset comprises 157 databases. Each one comes with its respective schema, data, and queries. By default, schema and queries are in MySQL dialect and can be translated to other dialects using the transpiler script. At present, our queries do not extend across multiple databases. Each query within a single database is assigned exclusively to either the training set or the test set, but not to both.
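
Real dialect translation is handled by the repository's transpiler script. As a toy illustration only of what such a translation involves, here is one small rewrite, converting MySQL backtick identifiers to ANSI double quotes:

```python
import re

def backticks_to_ansi(sql: str) -> str:
    """Toy dialect tweak: MySQL `identifier` -> ANSI "identifier".

    Illustration only; use the repository's transpiler script for
    real dialect translation.
    """
    return re.sub(r"`([^`]*)`", r'"\1"', sql)

print(backticks_to_ansi("SELECT `name` FROM `singer`"))
```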

        Queries   Tables   Databases
Train      4663      699         137
Test        682       80          20
Total      5345      779         157
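
As a quick sanity check on the split, roughly 87% of queries land in the training set:

```python
# Query counts from the table above.
train, test = 4663, 682
total = train + test
assert total == 5345

print(f"train share: {train / total:.1%}")  # ~87.2%
print(f"test share:  {test / total:.1%}")   # ~12.8%
```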

Setup

The project uses uv to manage dependencies; install it before running any of the scripts.

Setup Database

We need a database to load the dataset into and run the queries. MySQL was chosen as the default dialect because it is one of the most widely used databases and can be set up quickly using Docker.

docker run --name spiderman-mysql -e MYSQL_ROOT_PASSWORD=PeterParker -p 3306:3306 -d mysql:9.0.0

Load Dataset

uv run scripts/load_dataset.py 'mysql+mysqlconnector://root:PeterParker@localhost:3306'

It creates schemas for all the databases and populates them with data.
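
Because the dataset ships as plain-text SQL, a database can also be loaded into engines other than MySQL. A minimal sketch using Python's built-in sqlite3, with an inline schema-and-data excerpt standing in for the dataset's actual per-database files:

```python
import sqlite3

# Hypothetical excerpt of one database's plain-text schema and data;
# the real files live in the dataset's per-database directories.
schema_and_data = """
CREATE TABLE singer (id INTEGER PRIMARY KEY, name TEXT);
INSERT INTO singer VALUES (1, 'Joe Sharp'), (2, 'Timbaland');
"""

conn = sqlite3.connect(":memory:")
conn.executescript(schema_and_data)
rows = conn.execute("SELECT COUNT(*) FROM singer").fetchone()[0]
print(rows)  # 2
```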

Scripts

All scripts can be run using uv run which automatically manages the virtual environment:

Note: All database URLs follow the SQLAlchemy URL format: dialect+driver://username:password@host:port/database_name. More details are available in the SQLAlchemy documentation.
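
Such a URL can be pulled apart with the standard library; the database name below is hypothetical:

```python
from urllib.parse import urlsplit

# A SQLAlchemy-style URL: dialect+driver://username:password@host:port/database_name
url = urlsplit("mysql+mysqlconnector://root:PeterParker@localhost:3306/concert_singer")

print(url.scheme)            # mysql+mysqlconnector
print(url.username)          # root
print(url.hostname)          # localhost
print(url.port)              # 3306
print(url.path.lstrip("/"))  # concert_singer
```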

Run Benchmark

Pre-computed results and benchmarks are included in the repository. However, you can use the following scripts to generate your own results.

Run SQL AI Tasks

uv run scripts/run_sqlai.py mysql http://127.0.0.1:8000 "<Model details>"

This script generates SQL queries using a SQL AI service and saves the results to a JSON file for test queries of a specific dialect. Run with -h to see all available options.
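
The exact JSON layout is defined by the script. Assuming a list-of-records shape (hypothetical field names), a pass count could be tallied like this:

```python
import json

# Hypothetical results format -- the real layout is defined by run_sqlai.py.
results_json = """
[
  {"database": "concert_singer", "query_id": 1, "success": true},
  {"database": "concert_singer", "query_id": 2, "success": false},
  {"database": "pets_1",         "query_id": 3, "success": true}
]
"""

results = json.loads(results_json)
passed = sum(r["success"] for r in results)
print(f"{passed}/{len(results)} queries succeeded")  # 2/3 queries succeeded
```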

Generate Report

uv run scripts/generate_report.py 'mysql+mysqlconnector://root:PeterParker@localhost:3306'

This script analyzes the SQL results for test queries of a specific dialect and generates a markdown report. Try running it with -h for the full list of arguments.
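
The shape of the generated report is up to the script; a sketch of the general idea, turning hypothetical per-database pass rates into a markdown table:

```python
# Hypothetical per-database pass rates; the real report is produced
# by generate_report.py from the saved JSON results.
stats = {"concert_singer": (18, 20), "pets_1": (9, 10)}

lines = ["| Database | Passed | Total |", "| --- | --- | --- |"]
for db, (passed, total) in stats.items():
    lines.append(f"| {db} | {passed} | {total} |")
report = "\n".join(lines)
print(report)
```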

Dataset Scripts

Scan Dataset

uv run scripts/scan_dataset.py mysql

This script goes through the dataset and aggregates various details such as the number of queries, tables, and databases.
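
The kind of aggregation involved can be sketched with the standard library; the (database, table) pairs below are made up for illustration:

```python
from collections import Counter

# Hypothetical (database, table) pairs as a scan might encounter them.
tables = [
    ("concert_singer", "singer"),
    ("concert_singer", "concert"),
    ("pets_1", "pets"),
]

per_db = Counter(db for db, _ in tables)
print(f"databases: {len(per_db)}, tables: {len(tables)}")
```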

Execute Queries

uv run scripts/execute_queries.py 'mysql+mysqlconnector://root:PeterParker@localhost:3306'

This script executes the queries and checks that they complete successfully. Query results are not verified at this point.
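
Checking for completion without verifying results amounts to catching execution errors. A minimal sketch against sqlite3, with a deliberately broken second query (both queries are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (id INTEGER, name TEXT)")

# Hypothetical query list; only completion is checked, not result contents.
queries = ["SELECT name FROM singer", "SELECT bogus_column FROM singer"]

failures = []
for sql in queries:
    try:
        conn.execute(sql).fetchall()
    except sqlite3.Error as exc:
        failures.append((sql, str(exc)))

print(f"{len(queries) - len(failures)}/{len(queries)} executed successfully")
```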

Validate Queries

uv run scripts/validate_queries.py mysql

This script validates the queries using an LLM and writes the results to a JSON file in the respective dataset directory. Environment variable AZURE_OPENAI_ENDPOINT, AZURE_OPENAI_API_KEY, and AZURE_OPENAI_MODEL must be set.
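
A small helper (hypothetical, not part of the repository) shows how the required variables can be checked up front:

```python
import os

REQUIRED = ("AZURE_OPENAI_ENDPOINT", "AZURE_OPENAI_API_KEY", "AZURE_OPENAI_MODEL")

def missing_vars(env=os.environ):
    """Return the required Azure OpenAI variables absent from `env`."""
    return [name for name in REQUIRED if name not in env]

# With an empty environment, all three are reported missing.
print(missing_vars(env={}))
```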

Citation

If you find this dataset useful, please consider citing:

@inproceedings{SpiderMan,
 title  = {SpiderMan: A Comprehensive Human-Annotated Dataset for SQL AI Tasks Across Diverse Domains and Complexity Levels},
 author = {Sreenath Somarajapuram and Athira},
 year   = 2024
}

@inproceedings{Yu&al.18c,
 title     = {Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task},
 author    = {Tao Yu and Rui Zhang and Kai Yang and Michihiro Yasunaga and Dongxu Wang and Zifan Li and James Ma and Irene Li and Qingning Yao and Shanelle Roman and Zilin Zhang and Dragomir Radev},
 booktitle = {Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing},
 address   = {Brussels, Belgium},
 publisher = {Association for Computational Linguistics},
 year      = 2018
}

License

The dataset is released under CC-BY-SA-4.0 (see LICENSE); the scripts are released under Apache-2.0 (see SCRIPTS_LICENSE).
