feat: add manifest namespace extended properties #6215

Open

wojiaodoubao wants to merge 1 commit into lance-format:main from wojiaodoubao:manifest-extended-properties

Conversation

@wojiaodoubao (Contributor)

This is a sub-task of #5741

@github-actions github-actions bot added enhancement New feature or request python java labels Mar 17, 2026
@github-actions (Contributor)

PR Review: feat: add manifest namespace extended properties

Performance: query_manifest reads all columns (P1)

The new unified query_manifest() method removes the .project() call that the old query_manifest_for_table and query_manifest_for_namespace had. Previously, table lookups only read object_id + location, and namespace lookups only read object_id + metadata. Now every query reads all columns including all extended property columns.

As the number of extended properties grows, this becomes increasingly wasteful for simple existence checks and lookups. Consider keeping a projection — at minimum project the basic columns plus the extended columns, or re-add targeted projections for the two call sites.
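A minimal sketch of the projection idea, using plain (name, values) pairs in place of Arrow arrays; the column names mirror the ones above, the representation itself is illustrative only (in arrow-rs terms this is what RecordBatch::project does with column indices):

```rust
/// Toy model of per-call-site projection: each lookup keeps only the columns
/// it actually needs instead of carrying every extended-property column.
/// Columns are (name, values) pairs standing in for Arrow arrays.
fn project<'a>(
    columns: &'a [(&'a str, Vec<&'a str>)],
    wanted: &[&str],
) -> Vec<&'a (&'a str, Vec<&'a str>)> {
    wanted
        .iter()
        .filter_map(|name| columns.iter().find(|(n, _)| n == name))
        .collect()
}
```

A table lookup would then ask only for object_id and location, and a namespace lookup only for object_id and metadata, regardless of how many extended columns the manifest grows.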

Performance: Row-at-a-time slicing in query_manifest (P1)

for row_idx in 0..batch.num_rows() {
    let sliced_columns: Vec<Arc<dyn Array>> = batch
        .columns()
        .iter()
        .map(|col| col.slice(row_idx, 1))
        .collect();
    let row = RecordBatch::try_new(batch.schema(), sliced_columns)?;
    objects.push(parse_manifest_object(&row)?);
}

This creates a new RecordBatch per row. For queries that match many rows (e.g., listing all tables), this is O(rows × columns) allocations. Consider indexing into the batch directly in parse_manifest_object using a row index parameter instead of slicing.
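A stdlib-only sketch of the indexed alternative, with plain slices standing in for the batch's Arrow columns (the struct and field names are illustrative, not the PR's exact types):

```rust
#[derive(Debug, PartialEq)]
struct ManifestRow {
    object_id: String,
    location: String,
}

/// Index-based parsing: row `index` is read straight out of each column,
/// so no per-row one-element batch is ever allocated.
fn parse_row(object_ids: &[&str], locations: &[&str], index: usize) -> ManifestRow {
    ManifestRow {
        object_id: object_ids[index].to_string(),
        location: locations[index].to_string(),
    }
}

/// The caller loops over row indices instead of slicing a batch per row.
fn parse_all(object_ids: &[&str], locations: &[&str]) -> Vec<ManifestRow> {
    (0..object_ids.len())
        .map(|i| parse_row(object_ids, locations, i))
        .collect()
}
```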

Bug: Unused f32, f64 imports (P0 — clippy will fail)

use std::{
    collections::HashMap,
    f32, f64,
    hash::{DefaultHasher, Hash, Hasher},

These module imports (std::f32, std::f64) appear unused. This will likely trigger a clippy warning/error with -D warnings.

Bug: is_multiple_of requires nightly (P0)

if !s.len().is_multiple_of(2) {

usize::is_multiple_of was until recently a nightly-only API (the unsigned_is_multiple_of feature), so depending on the crate's MSRV this may not compile on stable Rust. Use s.len() % 2 != 0 instead.
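A stable-Rust sketch of the suggested replacement:

```rust
/// MSRV-safe parity check, equivalent to the gated
/// `s.len().is_multiple_of(2)` flagged above.
fn has_even_len(s: &str) -> bool {
    s.len() % 2 == 0
}
```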

Design: Duplicated merge logic (P1)

The pattern of matching (extended_record, &request.properties) with 4 arms is copy-pasted across create_namespace_extended, declare_table_extended, and create_table_extended. Consider extracting a helper like fn resolve_extended_batch(...) to reduce duplication and the risk of the branches diverging.
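A stdlib-only sketch of such a helper; the name resolve_extended and the last-write-wins merge policy are assumptions for illustration, not the PR's exact semantics:

```rust
use std::collections::HashMap;

type Props = HashMap<String, String>;

/// One place that handles all four (existing record, requested properties)
/// combinations, so the three call sites cannot diverge.
fn resolve_extended(existing: Option<&Props>, requested: Option<&Props>) -> Option<Props> {
    match (existing, requested) {
        (None, None) => None,
        (Some(e), None) => Some(e.clone()),
        (None, Some(r)) => Some(r.clone()),
        (Some(e), Some(r)) => {
            // Requested values overwrite existing ones on key collision.
            let mut merged = e.clone();
            merged.extend(r.clone());
            Some(merged)
        }
    }
}
```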

Minor

  • build_metadata_json silently drops properties that fail to serialize (.ok()? on serde_json::to_string). HashMap<String, String> serialization cannot fail in practice, but the silent swallow is worth noting.
  • Comment typo: "Collect basic columns to excluded" → "to exclude"
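A stdlib-only illustration of the silent-swallow pattern from the first bullet; to_json is a hypothetical fallible serializer standing in for serde_json::to_string (here it only rejects NUL bytes in keys, whereas the real call effectively cannot fail for HashMap<String, String>):

```rust
/// Hypothetical serializer standing in for `serde_json::to_string`.
fn to_json(pairs: &[(&str, &str)]) -> Result<String, String> {
    let mut out = String::from("{");
    for (i, (k, v)) in pairs.iter().enumerate() {
        if k.contains('\0') {
            return Err(format!("invalid key: {:?}", k));
        }
        if i > 0 {
            out.push(',');
        }
        out.push_str(&format!("{:?}:{:?}", k, v));
    }
    out.push('}');
    Ok(out)
}

/// Silent variant: an error becomes `None`, indistinguishable from
/// "no properties" at the call site.
fn build_metadata_silent(pairs: &[(&str, &str)]) -> Option<String> {
    to_json(pairs).ok()
}

/// Propagating variant: the caller sees why serialization failed.
fn build_metadata_strict(pairs: &[(&str, &str)]) -> Result<String, String> {
    to_json(pairs)
}
```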

🤖 Generated with Claude Code

@wojiaodoubao wojiaodoubao mentioned this pull request Mar 17, 2026
7 tasks

codecov bot commented Mar 17, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.


@wjones127 (Contributor) left a comment

I'm feeling a little uncertain about the efficiency of this code, particularly around the handling of batches. Though I also think the bot is right about projecting many columns.

Comment on lines +415 to +418
return Err(Error::io(format!(
"Extended properties key {} must start with prefix: {}",
name, EXTENDED_PREFIX
)));

This error isn't about IO, but is incorrect input:

Suggested change (Error::io replaced with Error::invalid_input):

    return Err(Error::invalid_input(format!(
        "Extended properties key {} must start with prefix: {}",
        name, EXTENDED_PREFIX
    )));

})
.collect::<Result<Vec<_>>>()?
.into_iter()
.filter(|f| full_schema.column_with_name(f.name()).is_none())

We just throw out names that collide? Why not error? Or at least warn?

Or is this just making the migration idempotent? Even if this is the case, I'm curious if there are other columns that it could collide with that we should disallow.

.filter(|f| full_schema.column_with_name(f.name()).is_none())
.collect();

let schema = Schema::new(fields);

Should we early return if fields is empty?

Comment on lines +1275 to +1276
/// Parse one row of __manifest table into manifest object.
fn parse_manifest_object(batch: &RecordBatch) -> Result<ManifestObject> {

I agree with the bot comment, instead of creating slice just pass in the index you want to extract:

Suggested change (add a row index parameter):

    /// Parse one row of __manifest table into manifest object.
    fn parse_manifest_object(batch: &RecordBatch, index: usize) -> Result<ManifestObject> {

Comment on lines +2464 to +2469
fn build_extended_column_array(
col_name: &str,
data_type: &DataType,
extended_records: &[Option<RecordBatch>],
props_vec: &[Option<HashMap<String, String>>],
) -> Result<ArrayRef> {

I'm a bit confused as to how this is implemented. It seems like sometimes you are using RecordBatch as a row, which is very inefficient.

Perhaps you should write some function like:

fn records_to_batch(
    records: &HashMap<String, String>,
    schema: &Schema
) -> Result<RecordBatch> { ... }

And then you can do the merges with other data in a columnar fashion. I imagine that function might be useful in multiple places.
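A stdlib-only sketch of that columnar conversion, with vectors of Option<String> standing in for nullable Arrow arrays; the name records_to_columns and the signature are illustrative (the suggested records_to_batch would build a real RecordBatch against a schema instead):

```rust
use std::collections::HashMap;

/// Turn a slice of property maps into column vectors following `schema`
/// order, one `Option<String>` column per schema field, so subsequent merges
/// operate column-wise instead of treating each map as a one-row batch.
fn records_to_columns(
    records: &[HashMap<String, String>],
    schema: &[&str],
) -> Vec<Vec<Option<String>>> {
    schema
        .iter()
        .map(|col| records.iter().map(|r| r.get(*col).cloned()).collect())
        .collect()
}
```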

