fix: bendpy register csv column positions followup by KKould · Pull Request #19557 · databendlabs/databend

KKould · 2026-03-17T11:34:38Z

I hereby agree to the terms of the CLA available at: https://docs.databend.com/dev/policies/cla/

Summary

-fixes: #19443
-from: #19444

register_csv() and register_tsv() generate SELECT * in the underlying CREATE VIEW, which fails because CSV/TSV files require explicit column positions ($1, $2, ...).

Fix: call infer_schema() first to detect column names, then generate SELECT $1 AS col1, $2 AS col2, ... instead of SELECT *.

Tests

Unit Test
Logic Test
Benchmark Test
No Test - Explain why

Type of change

Bug Fix (non-breaking change which fixes an issue)
New Feature (non-breaking change which adds functionality)
Breaking Change (fix or feature that could cause existing functionality not to work as expected)
Documentation Update
Refactoring
Performance Improvement
Other (please describe):

This change is

…r_csv/register_tsv Fixes databendlabs#19443

- Extract resolve_file_path() and extract_string_column() as standalone helpers - Replace imperative loop with functional iterator chain - Rename infer_column_names to build_column_select for clarity - Deduplicate mock logic in test_connections.py via _register_delimited()

…elect

…tion The test_bendpy job runs on a bare self-hosted ARM64 runner without clang/build-essential installed. Add a setup step that runs dev_setup.sh to install build dependencies and sets JEMALLOC env vars for Linux development builds, matching what the macOS path already does.

infer_schema only supports Parquet, CSV, and NDJSON formats. Routing TSV through build_column_select would fail at runtime. Fall back to SELECT * for TSV and remove the TSV integration test accordingly.

…nux CI - Remove braces from single-expression match arm to satisfy cargo fmt - Replace dev_setup.sh with direct apt-get install to avoid perl errors on the self-hosted ARM64 runner

The .cargo/config.toml uses -fuse-ld=mold on Linux targets, so mold must be installed on the self-hosted runner alongside clang.

infer_schema requires the stage system (system.stage table) which is not available in bendpy's embedded context. Replace with direct file reading of the CSV header line to extract column names for building $1 AS col1, $2 AS col2, ... select clauses.

The embedded bendpy context does not have the system.stage table, so file-based queries (SELECT FROM 'fs://...') cannot work. Remove the integration test; SQL generation is already covered by mock tests in test_connections.py.

KKould · 2026-03-17T17:29:41Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ad1de80171

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-03-17T17:37:24Z

src/bendpy/src/context.rs

+        let fs_path = fs_path_from_uri(file_path).ok_or_else(|| {
+            PyErr::new::<pyo3::exceptions::PyRuntimeError, _>(format!(
+                "{} column inference only supports local files (fs://), got: {}",
+                file_format.to_ascii_uppercase(),
+                file_path,
+            ))
+        })?;


Allow CSV/TSV registration with non-fs connections

This rejects every register_csv/register_tsv call that uses a connection-backed URI like s3://...: register_table() now always calls build_column_select() for CSV/TSV, and build_column_select() hard-fails unless the path starts with fs://. In practice, passing connection='my_s3' now returns a Python runtime error before SQL execution, so the connection parameter for CSV/TSV is unusable.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-03-17T17:37:24Z

src/bendpy/src/context.rs

+
+fn read_delimited_column_names(path: &str, delimiter: char) -> std::io::Result<Vec<String>> {
+    use std::io::BufRead;
+    let file = std::fs::File::open(path)?;


Infer CSV/TSV columns from file path pattern inputs

Pattern-based registration can now fail during inference because header parsing always opens file_path directly as a single file. For common calls like register_csv("logs", "/data/logs/", pattern="*.csv"), this tries to open a directory (/data/logs/) and raises an I/O error (Is a directory) before the view is created. The new inference path needs to account for pattern-style inputs instead of blindly opening the base path.

Useful? React with 👍 / 👎.

KKould changed the title ~~Fix/bendpy register csv column positions followup~~ fix: bendpy register csv column positions followup Mar 17, 2026

github-actions bot added the pr-bugfix this PR patches a bug in codebase label Mar 17, 2026

KKould force-pushed the fix/bendpy-register-csv-column-positions-followup branch from 676ee1b to 2799f8f Compare March 17, 2026 12:00

bohutang and others added 13 commits March 17, 2026 20:41

fix(bendpy): use infer_schema for CSV/TSV column positions in registe…

a6b28e8

…r_csv/register_tsv Fixes databendlabs#19443

chore: fix code formatting for rustfmt

efe8d61

fix(bendpy): pass pattern parameter to infer_schema in build_column_s…

daeab3d

…elect

ci: add bendpy Python tests to PR CI

9a2d9a5

fix: apply rustfmt formatting for match arm in context.rs

9c799ae

fix(bendpy): remove TSV from infer_schema path since it is unsupported

2c0a216

infer_schema only supports Parquet, CSV, and NDJSON formats. Routing TSV through build_column_select would fail at runtime. Fall back to SELECT * for TSV and remove the TSV integration test accordingly.

fix: rustfmt match arm and use apt-get instead of dev_setup.sh for Li…

e2a63f9

…nux CI - Remove braces from single-expression match arm to satisfy cargo fmt - Replace dev_setup.sh with direct apt-get install to avoid perl errors on the self-hosted ARM64 runner

fix(ci): add mold linker to Linux build dependencies

58f40d4

The .cargo/config.toml uses -fuse-ld=mold on Linux targets, so mold must be installed on the self-hosted runner alongside clang.

fix: support register_tsv column inference

859d205

KKould force-pushed the fix/bendpy-register-csv-column-positions-followup branch from 2799f8f to 859d205 Compare March 17, 2026 12:41

fix: avoid maturin strip conflict in dev builds

ad1de80

chatgpt-codex-connector bot reviewed Mar 17, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: bendpy register csv column positions followup#19557

fix: bendpy register csv column positions followup#19557
KKould wants to merge 14 commits intodatabendlabs:mainfrom
KKould:fix/bendpy-register-csv-column-positions-followup

KKould commented Mar 17, 2026 •

edited by drmingdrmer

Loading

Uh oh!

KKould commented Mar 17, 2026

Uh oh!

chatgpt-codex-connector bot left a comment

Uh oh!

chatgpt-codex-connector bot Mar 17, 2026

Uh oh!

chatgpt-codex-connector bot Mar 17, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

KKould commented Mar 17, 2026 • edited by drmingdrmer Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Tests

Type of change

Uh oh!

KKould commented Mar 17, 2026

Uh oh!

chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector bot Mar 17, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector bot Mar 17, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

KKould commented Mar 17, 2026 •

edited by drmingdrmer

Loading