
Prepared statements in DuckDB's Python client increase execution time almost 10X #300

@FabianBravoA

Description

What happens?

Note: This was tested using DuckDB with several extensions loaded (cache_httpfs, parquet, DuckLake, observefs)

After a change in which we started using prepared statements (we use user input in our queries and wanted to protect our data against SQL injection attacks), we noticed a severe slowdown in our system. Tracing the issue, we found that our DuckDB SQL statements were 10 times slower after the patch. After testing the queries with and without params, our conclusion is that using params is ten times slower than running a plain query.

To Reproduce

            # DuckDB connection creation
            conn = duckdb.connect(database)
            self.connections[thread_id] = conn

            conn.install_extension("parquet")
            conn.install_extension("observefs", repository="community")
            conn.install_extension("cache_httpfs", repository="community")
            conn.install_extension("ducklake")
            conn.install_extension("postgres")
            conn.load_extension("ducklake")
            conn.load_extension("cache_httpfs")
            conn.load_extension("observefs")
            conn.load_extension("postgres")
            conn.execute("SET cache_httpfs_type='on_disk'")
            conn.execute("SET cache_httpfs_cache_directory='/tmp/general_cache_httpfs/'")
            conn.execute("SET cache_httpfs_disk_cache_reader_enable_memory_cache=true")
            conn.execute("SET enable_external_file_cache=true")
            conn.execute("SET enable_http_metadata_cache=true")

            conn.load_extension("parquet")

            conn.execute(
                "CREATE OR REPLACE SECRET secret ("
                "TYPE s3,"
                "PROVIDER config,"
                f"KEY_ID '{os.environ.get('AWS_S3_ACCESS_KEY_ID', '')}',"
                f"SECRET '{os.environ.get('AWS_S3_SECRET_ACCESS_KEY', '')}',"
                f"REGION '{os.environ.get('AWS_REGION', 'us-east-1')}'"
                ")"
            )

            db = settings.DATABASES['default']
            conn.execute(
                "CREATE OR REPLACE SECRET ducklake_postgres ("
                "TYPE postgres,"
                f"HOST '{db['HOST']}',"
                f"DATABASE '{db['NAME']}',"
                f"USER '{db['USER']}',"
                f"PASSWORD '{db['PASSWORD']}',"
                f"PORT {db['PORT']}"
                ")"
            )

            ducklake_folder = os.environ.get("DUCKLAKE_FOLDER", "ducklake")
            conn.execute(
                "CREATE OR REPLACE SECRET columns_ducklake ("
                "TYPE ducklake,"
                f"DATA_PATH 's3://{settings.AWS_STORAGE_BUCKET_NAME}/{ducklake_folder}/',"
                "METADATA_PATH '',"
                "METADATA_PARAMETERS MAP {"
                "'TYPE': 'postgres',"
                "'SECRET': 'ducklake_postgres',"
                "}"
                ")"
            )
            conn.execute("ATTACH 'ducklake:columns_ducklake' AS datalake")
            conn.execute("USE datalake")
            conn.execute("CALL datalake.set_option('delete_older_than', '1d')")
            conn.execute("CALL datalake.set_option('expire_older_than', '1d')")

# The SQL query is a simple filter over a big Parquet file on AWS S3 (200 MB),
# or over a DuckLake table with S3 as backend, partitioned into chunks
# (the same big Parquet file, but uploaded through DuckLake)
filtered_numpy_dict = duckdbConn.sql(sql_query).fetchnumpy()
# This takes 0.9 seconds

# The same SQL query, but now as a prepared statement with a dictionary of
# values (mostly user input)
filtered_numpy_dict = duckdbConn.sql(sql_query, params=values_dict).fetchnumpy()
# This takes 10 seconds

OS:

Linux/AMD64 Debian Trixie

DuckDB Version:

1.4.4

DuckDB Client:

Python

Hardware:

Docker running on AWS ECS, Mac M4

Full Name:

Fabián Alexis Bravo Abarca

Affiliation:

AlicantoLabs

Did you include all relevant configuration (e.g., CPU architecture, Linux distribution) to reproduce the issue?

  • Yes, I have

Did you include all code required to reproduce the issue?

  • Yes, I have

Did you include all relevant data sets for reproducing the issue?

No - I cannot share the data sets because they are confidential
