
SpiderMan

A comprehensive, high-quality, human-annotated plain-text dataset for SQL AI tasks across diverse domains and complexity levels.

Why SpiderMan

SpiderMan is an improved version of the Spider 1.0 dataset.

  • The databases are made available in plain-text format instead of a set of SQLite files. This makes it easy for you to load the dataset into any database of your choice.
  • The schema has been standardized, including the correction of table ordering, column data types, and the enforcement of primary and foreign key constraints.
  • Data has been corrected so that it passes schema-based validation.
  • Queries have been fixed so that they execute successfully.
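
The enforced key constraints can be exercised with any engine. A minimal sketch, using Python's built-in sqlite3 as a stand-in engine (the two-table schema is hypothetical, not from the dataset):

```python
import sqlite3

# In-memory database; SQLite enforces foreign keys only when asked.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

# Hypothetical two-table schema mirroring the primary/foreign key
# constraints the dataset now enforces.
conn.execute("CREATE TABLE department (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE employee ("
    "id INTEGER PRIMARY KEY, "
    "dept_id INTEGER REFERENCES department(id))"
)

conn.execute("INSERT INTO department VALUES (1, 'Research')")
conn.execute("INSERT INTO employee VALUES (1, 1)")  # valid reference

rejected = False
try:
    conn.execute("INSERT INTO employee VALUES (2, 99)")  # dangling reference
except sqlite3.IntegrityError:
    rejected = True
print("dangling reference rejected:", rejected)
```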

Dataset

The dataset comprises 157 databases. Each one comes with its respective schema, data, and queries. By default, schema and queries are in MySQL dialect and can be translated to other dialects using the transpiler script. At present, our queries do not extend across multiple databases. Each query within a single database is assigned exclusively to either the training set or the test set, but not to both.
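
Real dialect translation is handled by the repository's transpiler script. As a toy illustration only of what such a translation involves, here is one small rewrite, converting MySQL backtick identifiers to ANSI double quotes:

```python
import re

def backticks_to_ansi(sql: str) -> str:
    """Toy dialect tweak: MySQL `identifier` -> ANSI "identifier".

    Illustration only; use the repository's transpiler script for
    real dialect translation.
    """
    return re.sub(r"`([^`]*)`", r'"\1"', sql)

print(backticks_to_ansi("SELECT `name` FROM `singer`"))
```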

        Queries   Tables   Databases
Train      4663      699         137
Test        682       80          20
Total      5345      779         157
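
As a quick sanity check on the split, roughly 87% of queries land in the training set:

```python
# Query counts from the table above.
train, test = 4663, 682
total = train + test
assert total == 5345

print(f"train share: {train / total:.1%}")  # ~87.2%
print(f"test share:  {test / total:.1%}")   # ~12.8%
```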

Setup

The project uses uv to manage dependencies; install it before running any of the scripts.

Setup Database

We need a database to load the dataset into and run the queries. MySQL was chosen as the default dialect because it is one of the most widely used databases and can be set up quickly using Docker.

docker run --name spiderman-mysql -e MYSQL_ROOT_PASSWORD=PeterParker -p 3306:3306 -d mysql:9.0.0

Load Dataset

uv run scripts/load_dataset.py 'mysql+mysqlconnector://root:PeterParker@localhost:3306'

It creates schemas for all the databases and populates them with data.
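
Because the dataset ships as plain-text SQL, a database can also be loaded into engines other than MySQL. A minimal sketch using Python's built-in sqlite3, with an inline schema-and-data excerpt standing in for the dataset's actual per-database files:

```python
import sqlite3

# Hypothetical excerpt of one database's plain-text schema and data;
# the real files live in the dataset's per-database directories.
schema_and_data = """
CREATE TABLE singer (id INTEGER PRIMARY KEY, name TEXT);
INSERT INTO singer VALUES (1, 'Joe Sharp'), (2, 'Timbaland');
"""

conn = sqlite3.connect(":memory:")
conn.executescript(schema_and_data)
rows = conn.execute("SELECT COUNT(*) FROM singer").fetchone()[0]
print(rows)  # 2
```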

Scripts

All scripts can be run using uv run which automatically manages the virtual environment:

Note: All database URLs follow the SQLAlchemy URL format: dialect+driver://username:password@host:port/database_name. More details are available in the SQLAlchemy documentation.
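
Such a URL can be pulled apart with the standard library; the database name below is hypothetical:

```python
from urllib.parse import urlsplit

# A SQLAlchemy-style URL: dialect+driver://username:password@host:port/database_name
url = urlsplit("mysql+mysqlconnector://root:PeterParker@localhost:3306/concert_singer")

print(url.scheme)            # mysql+mysqlconnector
print(url.username)          # root
print(url.hostname)          # localhost
print(url.port)              # 3306
print(url.path.lstrip("/"))  # concert_singer
```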

Run Benchmark

Pre-computed results and benchmarks are included in the repository. However, you can use the following scripts to generate your own results.

Run SQL AI Tasks

uv run scripts/run_sqlai.py mysql http://127.0.0.1:8000 "<Model details>"

This script generates SQL queries using a SQL AI service and saves the results to a JSON file for test queries of a specific dialect. Run with -h to see all available options.
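
The exact JSON layout is defined by the script. Assuming a list-of-records shape (hypothetical field names), a pass count could be tallied like this:

```python
import json

# Hypothetical results format -- the real layout is defined by run_sqlai.py.
results_json = """
[
  {"database": "concert_singer", "query_id": 1, "success": true},
  {"database": "concert_singer", "query_id": 2, "success": false},
  {"database": "pets_1",         "query_id": 3, "success": true}
]
"""

results = json.loads(results_json)
passed = sum(r["success"] for r in results)
print(f"{passed}/{len(results)} queries succeeded")  # 2/3 queries succeeded
```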

Generate Report

uv run scripts/generate_report.py 'mysql+mysqlconnector://root:PeterParker@localhost:3306'

This script analyzes the SQL results for test queries of a specific dialect and generates a markdown report. Try running it with -h for the full list of arguments.
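
The shape of the generated report is up to the script; a sketch of the general idea, turning hypothetical per-database pass rates into a markdown table:

```python
# Hypothetical per-database pass rates; the real report is produced
# by generate_report.py from the saved JSON results.
stats = {"concert_singer": (18, 20), "pets_1": (9, 10)}

lines = ["| Database | Passed | Total |", "| --- | --- | --- |"]
for db, (passed, total) in stats.items():
    lines.append(f"| {db} | {passed} | {total} |")
report = "\n".join(lines)
print(report)
```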

Dataset Scripts

Scan Dataset

uv run scripts/scan_dataset.py mysql

This script goes through the dataset and aggregates various details such as the number of queries, tables, and databases.
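
The kind of aggregation involved can be sketched with the standard library; the (database, table) pairs below are made up for illustration:

```python
from collections import Counter

# Hypothetical (database, table) pairs as a scan might encounter them.
tables = [
    ("concert_singer", "singer"),
    ("concert_singer", "concert"),
    ("pets_1", "pets"),
]

per_db = Counter(db for db, _ in tables)
print(f"databases: {len(per_db)}, tables: {len(tables)}")
```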

Execute Queries

uv run scripts/execute_queries.py 'mysql+mysqlconnector://root:PeterParker@localhost:3306'

This script executes the queries and checks that they complete successfully. Query results are not verified at this point.
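
Checking for completion without verifying results amounts to catching execution errors. A minimal sketch against sqlite3, with a deliberately broken second query (both queries are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (id INTEGER, name TEXT)")

# Hypothetical query list; only completion is checked, not result contents.
queries = ["SELECT name FROM singer", "SELECT bogus_column FROM singer"]

failures = []
for sql in queries:
    try:
        conn.execute(sql).fetchall()
    except sqlite3.Error as exc:
        failures.append((sql, str(exc)))

print(f"{len(queries) - len(failures)}/{len(queries)} executed successfully")
```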

Validate Queries

uv run scripts/validate_queries.py mysql

This script validates the queries using an LLM and writes the results to a JSON file in the respective dataset directory. Environment variable AZURE_OPENAI_ENDPOINT, AZURE_OPENAI_API_KEY, and AZURE_OPENAI_MODEL must be set.
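
A small helper (hypothetical, not part of the repository) shows how the required variables can be checked up front:

```python
import os

REQUIRED = ("AZURE_OPENAI_ENDPOINT", "AZURE_OPENAI_API_KEY", "AZURE_OPENAI_MODEL")

def missing_vars(env=os.environ):
    """Return the required Azure OpenAI variables absent from `env`."""
    return [name for name in REQUIRED if name not in env]

# With an empty environment, all three are reported missing.
print(missing_vars(env={}))
```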

Citation

If you find this dataset useful, please consider citing:

@inproceedings{SpiderMan,
 title  = {SpiderMan: A Comprehensive Human-Annotated Dataset for SQL AI Tasks Across Diverse Domains and Complexity Levels},
 author = {Sreenath Somarajapuram and Athira},
 year   = 2024
}

@inproceedings{Yu&al.18c,
 title     = {Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task},
 author    = {Tao Yu and Rui Zhang and Kai Yang and Michihiro Yasunaga and Dongxu Wang and Zifan Li and James Ma and Irene Li and Qingning Yao and Shanelle Roman and Zilin Zhang and Dragomir Radev},
 booktitle = {Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing},
 address   = {Brussels, Belgium},
 publisher = {Association for Computational Linguistics},
 year      = 2018
}

License

The dataset is released under CC-BY-SA-4.0 (see LICENSE); the scripts are released under Apache-2.0 (see SCRIPTS_LICENSE).
