Implement comparisons for RunArray. #9448

Open
brunal wants to merge 6 commits into apache:main from brunal:ree-cmp

brunal (Contributor) commented on Feb 20, 2026

This PR implements efficient eq, neq, distinct, not distinct, gt, lt, ... comparisons for two RunArrays with the same DataType and length.

The idea is to:

  1. Compute every index into the values arrays where the comparison must be performed.

These indices are determined by the union of the two run-ends arrays.

For example, given two RunArrays with run-end values:
[3, 4, 10]
and [2, 5, 10]

The union of their run-ends is
[2, 3, 4, 5, 10]

The corresponding indices of the values array of each RunArray are:
[0, 0, 1, 2, 2]
and [0, 1, 1, 1, 2]

  2. Use apply_op_vectored() to perform the operation on the values arrays at those indices.

  3. Take nulls into account.

  4. Build a BooleanArray from the result + the null mask.
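To make steps 1 and 2 concrete, here is a minimal standalone sketch on plain slices. It is not the code from this PR: the function names (merge_run_ends, compare_runs, expand_to_rows) are hypothetical, run-ends are assumed to be i32, and nulls are ignored. It only illustrates how the union of run-ends yields one pair of values indices per merged run.

```rust
/// Merge two sorted run-end slices into the union of their run-ends and
/// record, for each merged run, the index of the covering run on each side.
/// Assumes both inputs are strictly increasing and end at the same length.
fn merge_run_ends(left: &[i32], right: &[i32]) -> (Vec<i32>, Vec<usize>, Vec<usize>) {
    let mut ends = Vec::new();
    let mut left_idx = Vec::new();
    let mut right_idx = Vec::new();
    let (mut i, mut j) = (0, 0);
    while i < left.len() && j < right.len() {
        // The merged run ends where the earlier of the two current runs ends.
        let end = left[i].min(right[j]);
        ends.push(end);
        left_idx.push(i);
        right_idx.push(j);
        // Advance whichever side (possibly both) finished a run at `end`.
        if left[i] == end {
            i += 1;
        }
        if right[j] == end {
            j += 1;
        }
    }
    (ends, left_idx, right_idx)
}

/// Apply `op` once per merged run, on the values selected by the index pairs.
fn compare_runs<T, F>(
    left_values: &[T],
    right_values: &[T],
    left_idx: &[usize],
    right_idx: &[usize],
    op: F,
) -> Vec<bool>
where
    F: Fn(&T, &T) -> bool,
{
    left_idx
        .iter()
        .zip(right_idx)
        .map(|(&l, &r)| op(&left_values[l], &right_values[r]))
        .collect()
}

/// Expand one result per merged run into one result per logical row.
fn expand_to_rows(ends: &[i32], per_run: &[bool]) -> Vec<bool> {
    let mut out = Vec::new();
    let mut start = 0i32;
    for (&end, &value) in ends.iter().zip(per_run) {
        out.extend(std::iter::repeat(value).take((end - start) as usize));
        start = end;
    }
    out
}
```

With the run-ends from the example above, merge_run_ends returns the merged run-ends [2, 3, 4, 5, 10] together with the two index lists [0, 0, 1, 2, 2] and [0, 1, 1, 1, 2], so the comparison runs 5 times instead of once per row.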

Implementation thoughts:

A. Returning a RunArray instead of a BooleanArray would be interesting. This can be more efficient: a RunArray (with values being a BooleanBuffer) would have a length in [1; len(input RunArray) * 2] and can be efficiently constructed. This would require introducing new pub functions: distinct_run_array, eq_run_array, etc.
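As a rough illustration of idea A, the per-run results can be turned into a run-encoded output by merging adjacent runs that carry the same result. This is a hypothetical sketch on plain vectors (coalesce_runs is not a function in this PR or in arrow-rs); it is generic so that the per-run result could be bool, or Option<bool> if nulls are carried along.

```rust
/// Coalesce adjacent merged runs that have the same comparison result, so the
/// output is itself a compact run-end encoding (run-ends plus values).
fn coalesce_runs<T: PartialEq + Copy>(ends: &[i32], per_run: &[T]) -> (Vec<i32>, Vec<T>) {
    let mut out_ends: Vec<i32> = Vec::new();
    let mut out_values: Vec<T> = Vec::new();
    for (&end, &value) in ends.iter().zip(per_run) {
        if out_values.last() == Some(&value) {
            // Same result as the previous run: extend that run to `end`.
            *out_ends.last_mut().unwrap() = end;
        } else {
            // Result changed: start a new output run.
            out_ends.push(end);
            out_values.push(value);
        }
    }
    (out_ends, out_values)
}
```

For non-empty inputs the output has at least one run, and never more runs than the merged run-ends list, which in turn has at most as many entries as the two inputs' run counts combined.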

B. The operation is performed on all indices before looking at the nulls. With sparse (null-heavy) arrays this is wasteful. It might be worth skipping the computation when either side is null and then splicing results from non-null and null indices.
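A sketch of what idea B could look like, under simplifying assumptions: values are modelled as slices of Option<T> rather than Arrow arrays with a separate null buffer, and compare_skipping_nulls is a hypothetical name. It covers only operators where a null on either side yields a null result (eq, lt and similar); distinct and not distinct treat nulls as comparable values and would need different handling.

```rust
/// Run `op` only on index pairs where both sides are non-null; any pair with
/// a null on either side produces a null (None) result without calling `op`.
fn compare_skipping_nulls<T, F>(
    left_values: &[Option<T>],
    right_values: &[Option<T>],
    left_idx: &[usize],
    right_idx: &[usize],
    op: F,
) -> Vec<Option<bool>>
where
    F: Fn(&T, &T) -> bool,
{
    left_idx
        .iter()
        .zip(right_idx)
        .map(|(&l, &r)| match (&left_values[l], &right_values[r]) {
            (Some(a), Some(b)) => Some(op(a, b)),
            _ => None,
        })
        .collect()
}
```

Whether this beats computing everything and masking afterwards would depend on how expensive the operator is and how many values are null, so it would need benchmarking.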

C. There's a bit of copy-paste for downcast_primitive_array!() usage. I could only skip that by introducing a new macro, which didn't seem desirable.

D. I find the lack of a value type for a fully typed run array annoying. Array and RunArray<I> are value types, but TypedRunArray<'_, I, V> is a reference type. This is frustrating. Some type contracts are only comments, and not enforced by the type system.

This feature is tracked in #3520.

github-actions bot added the arrow (Changes to the arrow crate) label on Feb 20, 2026
brunal marked this pull request as ready for review on February 21, 2026 08:43