feat: add manifest namespace extended properties#6215
feat: add manifest namespace extended properties#6215wojiaodoubao wants to merge 1 commit intolance-format:mainfrom
Conversation
PR Review: feat: add manifest namespace extended propertiesPerformance:
|
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
wjones127
left a comment
There was a problem hiding this comment.
I'm feeling a little uncertain about the efficiency of this code, particularly around the handling of batches. Though I also think the bot is right about projecting many columns.
| return Err(Error::io(format!( | ||
| "Extended properties key {} must start with prefix: {}", | ||
| name, EXTENDED_PREFIX | ||
| ))); |
There was a problem hiding this comment.
This error isn't about IO, but is incorrect input:
| return Err(Error::io(format!( | |
| "Extended properties key {} must start with prefix: {}", | |
| name, EXTENDED_PREFIX | |
| ))); | |
| return Err(Error::invalid_input(format!( | |
| "Extended properties key {} must start with prefix: {}", | |
| name, EXTENDED_PREFIX | |
| ))); |
| }) | ||
| .collect::<Result<Vec<_>>>()? | ||
| .into_iter() | ||
| .filter(|f| full_schema.column_with_name(f.name()).is_none()) |
There was a problem hiding this comment.
We just throw out names that collide? Why not error? Or at least warn?
Or is this just making the migration idempotent? Even if this is the case, I'm curious if there are other columns that it could collide with that we should disallow.
| .filter(|f| full_schema.column_with_name(f.name()).is_none()) | ||
| .collect(); | ||
|
|
||
| let schema = Schema::new(fields); |
There was a problem hiding this comment.
Should we early return if fields is empty?
| /// Parse one row of __manifest table into manifest object. | ||
| fn parse_manifest_object(batch: &RecordBatch) -> Result<ManifestObject> { |
There was a problem hiding this comment.
I agree with the bot comment, instead of creating slice just pass in the index you want to extract:
| /// Parse one row of __manifest table into manifest object. | |
| fn parse_manifest_object(batch: &RecordBatch) -> Result<ManifestObject> { | |
| /// Parse one row of __manifest table into manifest object. | |
| fn parse_manifest_object(batch: &RecordBatch, index: usize) -> Result<ManifestObject> { |
| fn build_extended_column_array( | ||
| col_name: &str, | ||
| data_type: &DataType, | ||
| extended_records: &[Option<RecordBatch>], | ||
| props_vec: &[Option<HashMap<String, String>>], | ||
| ) -> Result<ArrayRef> { |
There was a problem hiding this comment.
I'm a bit confused as to how this is implemented. It seems like sometimes you are using RecordBatch as a row, which is very inefficient.
Perhaps you should write some function like:
fn records_to_batch(
records: &HashMap<String, String>,
schema: &Schema
) -> Result<RecordBatch> { ... }And then you can do the merges with other data. in a columnar fashion. I imagine that function might be useful in multiple places.
This is a sub-task of #5741