Background
Commit 23d0e8aa0 introduced Arrow memory-backed tables (ArrowNodeTable) to enable zero-copy data sharing with the Arrow ecosystem. However, the current implementation has performance limitations compared to native Ladybug tables.
Current Implementation Issues
The current ArrowNodeTable implementation (see src/storage/table/arrow_node_table.cpp):
- Converts all Arrow data upfront into std::vector<std::unique_ptr<common::Value>> (see readArrowTableData())
- Returns one row at a time during scan (row-at-a-time processing)
- Does NOT support semi-masks for filtering during scan
- Does NOT utilize selection vectors for efficient batch filtering
- Missing SIP (Sideways Information Passing) optimization for joins
Optimization Tasks
1. Support SIP (Sideways Information Passing) / Semi-Mask Filtering
Problem: Arrow tables currently ignore semi-masks set on TableScanState::semiMask, which means:
- Hash joins cannot push down semi-join filters to Arrow table scans
- Subquery unnesting cannot benefit from semi-mask-based filtering
- Significant join optimization opportunities are missed
Implementation:
- In ArrowNodeTable::scanInternal(), check if scanState.semiMask is enabled
- Apply semi-mask filtering similar to NodeGroup::applySemiMaskFilter() (see src/storage/table/node_group.cpp:186-213)
- Filter rows based on semiMask->range(startOffset, endOffset) before returning data
- Skip rows that don't match the semi-mask to avoid unnecessary data conversion
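The steps above could be sketched roughly as follows. This is a minimal illustration, not the Ladybug code: `RangeMask` is a hypothetical stand-in for the SemiMask interface in src/include/common/mask.h, and the only thing assumed about `range(startOffset, endOffset)` is that it yields the masked offsets inside that half-open range in ascending order. `selectMaskedRows` is likewise a made-up helper name.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the SemiMask interface. Only the shape of
// range(start, end) is assumed: it yields the masked offsets in [start, end).
struct RangeMask {
    std::vector<bool> bits; // bits[i] == true: node offset i passed the semi-join

    std::vector<uint64_t> range(uint64_t start, uint64_t end) const {
        assert(start <= end && end <= bits.size());
        std::vector<uint64_t> result;
        for (uint64_t i = start; i < end; ++i) {
            if (bits[i]) {
                result.push_back(i);
            }
        }
        return result;
    }
};

// Returns the batch-relative positions of the rows in [startOffset, endOffset)
// that survive the mask. A null mask means no semi-mask is enabled, so every
// row in the batch is selected.
std::vector<uint32_t> selectMaskedRows(const RangeMask* mask, uint64_t startOffset,
    uint64_t endOffset) {
    assert(startOffset <= endOffset);
    std::vector<uint32_t> selected;
    if (mask == nullptr) {
        for (uint64_t i = startOffset; i < endOffset; ++i) {
            selected.push_back(static_cast<uint32_t>(i - startOffset));
        }
        return selected;
    }
    for (uint64_t offset : mask->range(startOffset, endOffset)) {
        selected.push_back(static_cast<uint32_t>(offset - startOffset));
    }
    return selected;
}
```

Running the mask before any Arrow-to-Ladybug conversion means only the returned positions need to be converted, which is the point of the last bullet.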
Key files:
src/storage/table/arrow_node_table.cpp
src/include/common/mask.h (SemiMask interface)
src/storage/table/node_group.cpp (reference implementation)
2. Support Selection Vectors
Problem: The current implementation always returns a selection vector with size 1 (setSelSize(1)), ignoring the vectorized processing capabilities of Ladybug.
Implementation:
- Process data in batches (up to DEFAULT_VECTOR_CAPACITY rows) instead of one row at a time
- Populate selection vector with valid row indices after applying filters
- Support predicate pushdown through selection vectors
- Handle both flat and unflat output vectors properly
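A batch scan that fills a selection vector might look like the sketch below. All names here are hypothetical: `SelVec` is a simplified stand-in for the SelectionVector interface in src/include/common/data_chunk/sel_vector.h, `BatchCursor` and `scanBatch` are invented for illustration, and `kBatchCapacity` is only a placeholder for DEFAULT_VECTOR_CAPACITY (its value here is an assumption).

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <functional>
#include <vector>

// Placeholder for DEFAULT_VECTOR_CAPACITY; the value is an assumption.
constexpr uint64_t kBatchCapacity = 2048;

// Hypothetical, simplified selection vector: positions index into the batch.
struct SelVec {
    std::vector<uint32_t> positions;
    uint64_t size() const { return positions.size(); }
};

// Hypothetical scan cursor over a table with numRows rows.
struct BatchCursor {
    uint64_t nextRow;
    uint64_t numRows;
};

// Consumes the next batch of up to kBatchCapacity rows and fills `sel` with
// the batch-relative positions of the rows for which `keep` returns true.
// Returns the number of rows consumed; 0 signals the end of the scan.
uint64_t scanBatch(BatchCursor& cursor, const std::function<bool(uint64_t)>& keep,
    SelVec& sel) {
    assert(cursor.nextRow <= cursor.numRows);
    sel.positions.clear();
    const uint64_t batchStart = cursor.nextRow;
    const uint64_t batchSize = std::min(kBatchCapacity, cursor.numRows - batchStart);
    for (uint64_t i = 0; i < batchSize; ++i) {
        if (keep(batchStart + i)) {
            sel.positions.push_back(static_cast<uint32_t>(i));
        }
    }
    cursor.nextRow += batchSize;
    return batchSize;
}
```

Returning the number of rows consumed, rather than the number selected, lets the caller tell an exhausted scan apart from a batch in which every row was filtered out.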
Key files:
src/include/common/data_chunk/sel_vector.h (SelectionVector interface)
src/processor/operator/scan/scan_node_table.cpp (how scan uses selection vectors)
3. Support for ASP (Asymmetric Semi-join with Predicate) Joins
Problem: Arrow tables cannot participate in asymmetric semi-join optimizations where:
- Build side is much smaller than probe side
- Semi-join predicate can be pushed down to the Arrow table scan
- Early filtering could significantly reduce data transfer
Implementation:
- Ensure Arrow tables properly expose row count statistics for optimizer decisions
- Support predicate pushdown through TableScanState::columnPredicateSets
- Enable early termination when semi-join condition is satisfied
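To make the predicate pushdown bullet concrete, here is a minimal sketch of narrowing an existing selection with a conjunction of column predicates. It does not reproduce the interface in src/include/storage/predicate/column_predicate.h: `Int64Predicate`, `CompareOp`, and `applyPredicates` are hypothetical names, and only int64 columns with "column op constant" predicates are covered.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

enum class CompareOp { EQ, LT, GT };

// Hypothetical single-column predicate of the form "column <op> constant".
struct Int64Predicate {
    CompareOp op;
    int64_t constant;

    bool matches(int64_t v) const {
        switch (op) {
        case CompareOp::EQ:
            return v == constant;
        case CompareOp::LT:
            return v < constant;
        case CompareOp::GT:
            return v > constant;
        }
        return false;
    }
};

// Keeps only the already-selected positions whose column value satisfies
// every predicate (a conjunction). The selection is compacted in place, so
// rows rejected earlier (for example by a semi-mask) are never re-examined.
void applyPredicates(const std::vector<int64_t>& column,
    const std::vector<Int64Predicate>& predicates, std::vector<uint32_t>& selected) {
    std::size_t out = 0;
    for (uint32_t pos : selected) {
        assert(pos < column.size());
        bool keep = true;
        for (const auto& predicate : predicates) {
            if (!predicate.matches(column[pos])) {
                keep = false;
                break;
            }
        }
        if (keep) {
            selected[out++] = pos;
        }
    }
    selected.resize(out);
}
```

Because the function takes the selection as input and output, it composes with a semi-mask step: the mask produces the initial positions and the predicates shrink them further.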
Key files:
src/include/storage/predicate/column_predicate.h
src/include/planner/operator/sip/side_way_info_passing.h (SIPInfo)
src/processor/operator/hash_join/hash_join_probe.h
4. Batch Processing Instead of Row-at-a-Time
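Problem: The current implementation converts all data upfront and then returns one row per scan call, so the whole table is materialized as Value objects before the first row is produced and every row costs a separate scan call.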
Implementation:
- Process Arrow data in columnar batches (up to DEFAULT_VECTOR_CAPACITY)
- Read data directly from Arrow arrays into Ladybug vectors without intermediate Value objects
- Use Arrow's chunked array support for memory-efficient processing
- Implement lazy/partial data reading instead of full table materialization
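As an illustration of reading a batch straight from Arrow buffers, the sketch below copies a slice of an int64 column and its null information without creating per-row objects. `Int64Vector` and `copyInt64Batch` are hypothetical names, not Ladybug types; the only external fact relied on is the Arrow columnar layout, where the validity bitmap is LSB-first and a missing bitmap means all values are valid.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical destination vector holding values plus a null flag per row.
struct Int64Vector {
    std::vector<int64_t> values;
    std::vector<bool> isNull;
};

// Arrow validity bitmaps are LSB-first: row i is bit (i % 8) of byte (i / 8),
// and a set bit means "valid". A null bitmap pointer means no nulls at all.
inline bool arrowIsValid(const uint8_t* validity, int64_t i) {
    return validity == nullptr || ((validity[i / 8] >> (i % 8)) & 1) != 0;
}

// Copies rows [start, start + count) of an int64 column into `out`.
// Values are moved with one memcpy; only the null flags are set per row.
void copyInt64Batch(const int64_t* data, const uint8_t* validity, int64_t start,
    int64_t count, Int64Vector& out) {
    assert(data != nullptr && start >= 0 && count >= 0);
    const auto n = static_cast<std::size_t>(count);
    out.values.resize(n);
    out.isNull.assign(n, false);
    if (n > 0) {
        std::memcpy(out.values.data(), data + start, n * sizeof(int64_t));
    }
    for (int64_t i = 0; i < count; ++i) {
        out.isNull[static_cast<std::size_t>(i)] = !arrowIsValid(validity, start + i);
    }
}
```

Calling this once per batch, with `start` advancing by the batch size, gives the lazy behaviour described in the last bullet: rows that are never scanned are never copied.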
Key files:
src/common/arrow/arrow_converter.h (Arrow-to-Ladybug conversion utilities)
src/include/common/types/value/value.h
5. Zero-Copy Data Access (Bonus/Advanced)
Problem: Converting Arrow arrays to Ladybug Value objects is expensive and loses Arrow's columnar benefits.
Implementation:
- For compatible Arrow types (primitive types), access Arrow buffers directly
- Implement custom ValueVector backed by Arrow array buffers
- Use Arrow's C Data Interface for efficient data sharing
- Avoid Value object creation for simple types
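One possible shape for direct buffer access is a non-owning view over an int64 array exported through the Arrow C Data Interface. In the sketch below the `ArrowArray` struct is redeclared only to keep the example self-contained (a real build would use the project's own Arrow header), and `Int64ArrowView` is a hypothetical class, not an existing Ladybug type or a ValueVector replacement.

```cpp
#include <cassert>
#include <cstdint>

// Mirrors the ArrowArray struct of the Arrow C Data Interface. Redeclared
// here only so the sketch compiles on its own.
struct ArrowArray {
    int64_t length;
    int64_t null_count;
    int64_t offset;
    int64_t n_buffers;
    int64_t n_children;
    const void** buffers;
    ArrowArray** children;
    ArrowArray* dictionary;
    void (*release)(ArrowArray*);
    void* private_data;
};

// Hypothetical non-owning view over a primitive int64 array. For primitive
// types, buffers[0] is the validity bitmap (may be null) and buffers[1] holds
// the values. The view must not outlive the ArrowArray it was built from.
class Int64ArrowView {
public:
    explicit Int64ArrowView(const ArrowArray& array) {
        assert(array.n_buffers == 2 && array.buffers != nullptr);
        validity_ = static_cast<const uint8_t*>(array.buffers[0]);
        data_ = static_cast<const int64_t*>(array.buffers[1]) + array.offset;
        offset_ = array.offset;
        length_ = array.length;
    }

    int64_t size() const { return length_; }

    // Pointer to the first logical value; no bytes are copied.
    const int64_t* data() const { return data_; }

    int64_t value(int64_t i) const {
        assert(i >= 0 && i < length_);
        return data_[i];
    }

    bool isNull(int64_t i) const {
        assert(i >= 0 && i < length_);
        if (validity_ == nullptr) {
            return false;
        }
        const int64_t bit = offset_ + i; // bitmap positions are physical, not logical
        return ((validity_[bit / 8] >> (bit % 8)) & 1) == 0;
    }

private:
    const uint8_t* validity_;
    const int64_t* data_;
    int64_t offset_;
    int64_t length_;
};
```

The view honours the array's `offset` field for both the value pointer and the validity bitmap, which is easy to miss when an Arrow array is a slice of a larger buffer.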
Key files:
src/include/common/arrow/arrow.h
src/common/arrow/arrow_converter.cpp
Acceptance Criteria
- Semi-mask support: Arrow tables should respect semi-masks set by hash joins and skip non-matching rows
- Selection vectors: Arrow scans should populate selection vectors with up to DEFAULT_VECTOR_CAPACITY rows per call
- Performance: Queries with semi-joins on Arrow tables should show measurable performance improvement compared to current row-at-a-time processing
- Correctness: All existing tests in test/api/arrow_node_table_test.cpp should continue to pass
- New tests: Add tests demonstrating:
  - Semi-mask filtering on Arrow tables
  - Batch processing with selection vectors
  - Join queries with SIP optimization enabled
References
- Native semi-mask filtering: src/storage/table/node_group.cpp:186-213
- Selection vector usage: src/include/common/data_chunk/sel_vector.h
- SIP planning: src/include/planner/operator/sip/side_way_info_passing.h
- Current Arrow implementation: src/storage/table/arrow_node_table.cpp
- Semi-masker operator: src/include/processor/operator/semi_masker.h
- Hash join probe: src/include/processor/operator/hash_join/hash_join_probe.h
Priority
- Semi-mask support (SIP) - HIGH
- Selection vector batch processing - HIGH
- Predicate pushdown / ASP joins - MEDIUM
- Zero-copy data access - LOW (advanced optimization)