This MR implements efficient eq, neq, distinct, not distinct, gt, lt,
... for 2 RunArrays with the same DataTypes & length.
The idea is to:
1. Compute all value indices where the comparison must be performed.
This is the union of the run-ends.
For example, given 2 RunArrays with run-end values:
[3, 4, 10]
and [2, 5, 10]
The union of their run-ends is
[2, 3, 4, 5, 10]
The corresponding indices of the values array of each RunArray are:
[0, 0, 1, 2, 2]
and [0, 1, 1, 1, 2]
2. Use apply_op_vectored() to perform the operation on the values arrays
at those indices.
3. Finally take nulls into account.
4. Build a BooleanArray from the result + the null mask.
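The steps above can be sketched with plain slices. This is only an illustration under assumptions, not the code in this MR: `merge_run_ends`, `eq_at` and `expand` are hypothetical names, `Option<i32>` stands in for a nullable values array, and a closure-free `eq` stands in for whatever is passed to apply_op_vectored(). Both run-end slices are assumed to be strictly increasing and to end at the same logical length.

```rust
/// Step 1: merge two run-end sequences (their union) and, for every merged
/// run, record which physical value index of each input covers it.
/// Assumes both inputs are strictly increasing and share the same last value.
fn merge_run_ends(a: &[usize], b: &[usize]) -> (Vec<usize>, Vec<usize>, Vec<usize>) {
    let (mut i, mut j) = (0, 0);
    let (mut ends, mut idx_a, mut idx_b) = (Vec::new(), Vec::new(), Vec::new());
    while i < a.len() && j < b.len() {
        let end = a[i].min(b[j]);
        ends.push(end);
        idx_a.push(i);
        idx_b.push(j);
        // Advance whichever input(s) finish a run at this position.
        if a[i] == end {
            i += 1;
        }
        if b[j] == end {
            j += 1;
        }
    }
    (ends, idx_a, idx_b)
}

/// Steps 2 and 3: compare the values at the paired indices. `None` models a
/// null slot; a null on either side yields a null result (as for `eq`).
fn eq_at(
    va: &[Option<i32>],
    vb: &[Option<i32>],
    idx_a: &[usize],
    idx_b: &[usize],
) -> Vec<Option<bool>> {
    idx_a
        .iter()
        .zip(idx_b)
        .map(|(&x, &y)| match (va[x], vb[y]) {
            (Some(l), Some(r)) => Some(l == r),
            _ => None,
        })
        .collect()
}

/// Step 4: expand the per-run results to one entry per logical index,
/// which is the shape a BooleanArray result would have.
fn expand(ends: &[usize], per_run: &[Option<bool>]) -> Vec<Option<bool>> {
    let mut out = Vec::with_capacity(ends.last().copied().unwrap_or(0));
    let mut start = 0;
    for (&end, &r) in ends.iter().zip(per_run) {
        out.extend(std::iter::repeat(r).take(end - start));
        start = end;
    }
    out
}
```

Calling `merge_run_ends(&[3, 4, 10], &[2, 5, 10])` reproduces the run-ends and index lists from the example in step 1. The comparison runs once per merged run (5 times here) rather than once per logical index (10 times).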
Implementation thoughts:
A. Returning a RunArray instead of a BooleanArray would be interesting.
This can be more efficient: a RunArray (with values being a
BooleanBuffer) would have a length in [1; len(input RunArray) * 2] and
can be efficiently constructed. This would require introducing new pub
functions: distinct_run_array, eq_run_array, etc.
B. The operation is performed on all indices before looking at the
nulls. With sparse (null-heavy) arrays this is wasteful. It might be
worth skipping the computation when either side is null and then
splicing results from non-null and null indices.
C. There's a bit of copy-paste for downcast_primitive_array!() usage. I
could only skip that by introducing a new macro, which didn't seem
desirable.
D. I find the lack of a value type for a fully typed run array annoying.
Array and RunArray<I> are value types, but TypedRunArray<'_, I, V> is a
reference type. This is frustrating. Some type contracts are only
comments, and not enforced by the type system.
This feature is tracked in #3520.
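To make thought A concrete, here is a rough sketch of what a run-encoded result could look like, again with plain vectors rather than a real RunArray. `to_run_encoded` is a hypothetical helper, not part of this MR or of any existing API, and merging neighbouring runs with equal results is an extra choice made here, not something the description above requires.

```rust
/// Collapse per-run comparison results into run-end encoded form,
/// merging neighbouring runs that carry the same result.
/// `ends` are the merged run-ends; `per_run` holds one result per merged run.
fn to_run_encoded(
    ends: &[usize],
    per_run: &[Option<bool>],
) -> (Vec<usize>, Vec<Option<bool>>) {
    let mut out_ends: Vec<usize> = Vec::new();
    let mut out_vals: Vec<Option<bool>> = Vec::new();
    for (&end, &r) in ends.iter().zip(per_run) {
        if out_vals.last() == Some(&r) {
            // Same result as the previous run: extend it instead of adding a new one.
            *out_ends.last_mut().unwrap() = end;
        } else {
            out_ends.push(end);
            out_vals.push(r);
        }
    }
    (out_ends, out_vals)
}
```

With run-ends `[2, 3, 4, 5, 10]` and results `[true, false, null, true, true]`, the last two runs merge, giving run-ends `[2, 3, 4, 10]` and four values. No expansion to the logical length is needed.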