fix: bendpy register csv column positions followup#19557
fix: bendpy register csv column positions followup#19557KKould wants to merge 14 commits intodatabendlabs:mainfrom
Conversation
676ee1b to
2799f8f
Compare
…r_csv/register_tsv Fixes databendlabs#19443
- Extract resolve_file_path() and extract_string_column() as standalone helpers - Replace imperative loop with functional iterator chain - Rename infer_column_names to build_column_select for clarity - Deduplicate mock logic in test_connections.py via _register_delimited()
…tion The test_bendpy job runs on a bare self-hosted ARM64 runner without clang/build-essential installed. Add a setup step that runs dev_setup.sh to install build dependencies and sets JEMALLOC env vars for Linux development builds, matching what the macOS path already does.
infer_schema only supports Parquet, CSV, and NDJSON formats. Routing TSV through build_column_select would fail at runtime. Fall back to SELECT * for TSV and remove the TSV integration test accordingly.
…nux CI - Remove braces from single-expression match arm to satisfy cargo fmt - Replace dev_setup.sh with direct apt-get install to avoid perl errors on the self-hosted ARM64 runner
The .cargo/config.toml uses -fuse-ld=mold on Linux targets, so mold must be installed on the self-hosted runner alongside clang.
infer_schema requires the stage system (system.stage table) which is not available in bendpy's embedded context. Replace with direct file reading of the CSV header line to extract column names for building $1 AS col1, $2 AS col2, ... select clauses.
The embedded bendpy context does not have the system.stage table, so file-based queries (SELECT FROM 'fs://...') cannot work. Remove the integration test; SQL generation is already covered by mock tests in test_connections.py.
2799f8f to
859d205
Compare
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ad1de80171
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| let fs_path = fs_path_from_uri(file_path).ok_or_else(|| { | ||
| PyErr::new::<pyo3::exceptions::PyRuntimeError, _>(format!( | ||
| "{} column inference only supports local files (fs://), got: {}", | ||
| file_format.to_ascii_uppercase(), | ||
| file_path, | ||
| )) | ||
| })?; |
There was a problem hiding this comment.
Allow CSV/TSV registration with non-fs connections
This rejects every register_csv/register_tsv call that uses a connection-backed URI like s3://...: register_table() now always calls build_column_select() for CSV/TSV, and build_column_select() hard-fails unless the path starts with fs://. In practice, passing connection='my_s3' now returns a Python runtime error before SQL execution, so the connection parameter for CSV/TSV is unusable.
Useful? React with 👍 / 👎.
|
|
||
| fn read_delimited_column_names(path: &str, delimiter: char) -> std::io::Result<Vec<String>> { | ||
| use std::io::BufRead; | ||
| let file = std::fs::File::open(path)?; |
There was a problem hiding this comment.
Infer CSV/TSV columns from file path pattern inputs
Pattern-based registration can now fail during inference because header parsing always opens file_path directly as a single file. For common calls like register_csv("logs", "/data/logs/", pattern="*.csv"), this tries to open a directory (/data/logs/) and raises an I/O error (Is a directory) before the view is created. The new inference path needs to account for pattern-style inputs instead of blindly opening the base path.
Useful? React with 👍 / 👎.
I hereby agree to the terms of the CLA available at: https://docs.databend.com/dev/policies/cla/
Summary
-fixes: #19443
-from: #19444
register_csv()andregister_tsv()generateSELECT *in the underlyingCREATE VIEW, which fails because CSV/TSV files require explicit column positions ($1, $2, ...).Fix: call
infer_schema()first to detect column names, then generateSELECT $1 AS col1, $2 AS col2, ...instead ofSELECT *.Tests
Type of change
This change is