Skip to content

fix: bendpy register csv column positions followup#19557

Draft
KKould wants to merge 14 commits intodatabendlabs:mainfrom
KKould:fix/bendpy-register-csv-column-positions-followup
Draft

fix: bendpy register csv column positions followup#19557
KKould wants to merge 14 commits intodatabendlabs:mainfrom
KKould:fix/bendpy-register-csv-column-positions-followup

Conversation

@KKould
Copy link
Member

@KKould KKould commented Mar 17, 2026

I hereby agree to the terms of the CLA available at: https://docs.databend.com/dev/policies/cla/

Summary

-fixes: #19443
-from: #19444

register_csv() and register_tsv() generate SELECT * in the underlying CREATE VIEW, which fails because CSV/TSV files require explicit column positions ($1, $2, ...).

Fix: call infer_schema() first to detect column names, then generate SELECT $1 AS col1, $2 AS col2, ... instead of SELECT *.

Tests

  • Unit Test
  • Logic Test
  • Benchmark Test
  • No Test - Explain why

Type of change

  • Bug Fix (non-breaking change which fixes an issue)
  • New Feature (non-breaking change which adds functionality)
  • Breaking Change (fix or feature that could cause existing functionality not to work as expected)
  • Documentation Update
  • Refactoring
  • Performance Improvement
  • Other (please describe):

This change is Reviewable

@KKould KKould changed the title Fix/bendpy register csv column positions followup fix: bendpy register csv column positions followup Mar 17, 2026
@github-actions github-actions bot added the pr-bugfix this PR patches a bug in codebase label Mar 17, 2026
@KKould KKould force-pushed the fix/bendpy-register-csv-column-positions-followup branch from 676ee1b to 2799f8f Compare March 17, 2026 12:00
bohutang and others added 13 commits March 17, 2026 20:41
- Extract resolve_file_path() and extract_string_column() as standalone helpers
- Replace imperative loop with functional iterator chain
- Rename infer_column_names to build_column_select for clarity
- Deduplicate mock logic in test_connections.py via _register_delimited()
…tion

The test_bendpy job runs on a bare self-hosted ARM64 runner without
clang/build-essential installed. Add a setup step that runs dev_setup.sh
to install build dependencies and sets JEMALLOC env vars for Linux
development builds, matching what the macOS path already does.
infer_schema only supports Parquet, CSV, and NDJSON formats. Routing
TSV through build_column_select would fail at runtime. Fall back to
SELECT * for TSV and remove the TSV integration test accordingly.
…nux CI

- Remove braces from single-expression match arm to satisfy cargo fmt
- Replace dev_setup.sh with direct apt-get install to avoid perl errors
  on the self-hosted ARM64 runner
The .cargo/config.toml uses -fuse-ld=mold on Linux targets, so mold
must be installed on the self-hosted runner alongside clang.
infer_schema requires the stage system (system.stage table) which is
not available in bendpy's embedded context. Replace with direct file
reading of the CSV header line to extract column names for building
$1 AS col1, $2 AS col2, ... select clauses.
The embedded bendpy context does not have the system.stage table, so
file-based queries (SELECT FROM 'fs://...') cannot work. Remove the
integration test; SQL generation is already covered by mock tests in
test_connections.py.
@KKould KKould force-pushed the fix/bendpy-register-csv-column-positions-followup branch from 2799f8f to 859d205 Compare March 17, 2026 12:41
@KKould
Copy link
Member Author

KKould commented Mar 17, 2026

@codex review

Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ad1de80171

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +257 to +263
let fs_path = fs_path_from_uri(file_path).ok_or_else(|| {
PyErr::new::<pyo3::exceptions::PyRuntimeError, _>(format!(
"{} column inference only supports local files (fs://), got: {}",
file_format.to_ascii_uppercase(),
file_path,
))
})?;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Allow CSV/TSV registration with non-fs connections

This rejects every register_csv/register_tsv call that uses a connection-backed URI like s3://...: register_table() now always calls build_column_select() for CSV/TSV, and build_column_select() hard-fails unless the path starts with fs://. In practice, passing connection='my_s3' now returns a Python runtime error before SQL execution, so the connection parameter for CSV/TSV is unusable.

Useful? React with 👍 / 👎.


fn read_delimited_column_names(path: &str, delimiter: char) -> std::io::Result<Vec<String>> {
use std::io::BufRead;
let file = std::fs::File::open(path)?;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Infer CSV/TSV columns from file path pattern inputs

Pattern-based registration can now fail during inference because header parsing always opens file_path directly as a single file. For common calls like register_csv("logs", "/data/logs/", pattern="*.csv"), this tries to open a directory (/data/logs/) and raises an I/O error (Is a directory) before the view is created. The new inference path needs to account for pattern-style inputs instead of blindly opening the base path.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-bugfix this PR patches a bug in codebase

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bendpy: register_csv() fails with 'Query from CSV file lacks column positions'

2 participants