
Conversation

@AntoinePrv AntoinePrv commented Oct 29, 2025

Rationale for this change

What changes are included in this PR?

  • Add a new method for building unpacking kernels.
    The constexpr code generation creates a kernel appropriate for a given input/output bit width and SIMD size (a rough sketch of the idea follows below this list).
  • I have included a number of xsimd fallbacks that have been merged upstream.
  • I have run extensive benchmarks and re-dispatched among different sizes on specific architectures where performance was lacking.
  • The biggest win here is SSE4.2, though AVX2 improves too.
  • This is not built/tested for AVX512, though there is no real limitation. Currently, the arch detection across the various AVX512 variants is not consistent and sometimes errors out. I would need to investigate with the upcoming xsimd release.
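
To make the first bullet concrete, here is a rough, scalar-only sketch of the constexpr generation/dispatch idea. The names (UnpackScalar, MakeKernelTable, Unpack32) are hypothetical and not the PR's actual API, and the real kernels additionally use SIMD swizzles and shifts, which this sketch omits.

#include <array>
#include <cstdint>
#include <utility>

// Hypothetical scalar sketch: one unpack routine is instantiated per bit width
// at compile time, and the matching one is looked up at run time.
template <typename Out, int kBitWidth>
void UnpackScalar(const uint8_t* in, Out* out, int num_values) {
  static_assert(kBitWidth >= 1 && kBitWidth <= 32, "sketch covers 1..32 bits");
  uint64_t buffer = 0;  // little-endian bit accumulator
  int bits_in_buffer = 0;
  constexpr uint64_t kMask = (uint64_t{1} << kBitWidth) - 1;
  for (int i = 0; i < num_values; ++i) {
    while (bits_in_buffer < kBitWidth) {
      buffer |= uint64_t{*in++} << bits_in_buffer;
      bits_in_buffer += 8;
    }
    out[i] = static_cast<Out>(buffer & kMask);
    buffer >>= kBitWidth;
    bits_in_buffer -= kBitWidth;
  }
}

template <typename Out>
using UnpackFn = void (*)(const uint8_t*, Out*, int);

// Build a table of kernels for bit widths 1..32 at compile time.
template <typename Out, int... kWidths>
constexpr auto MakeKernelTable(std::integer_sequence<int, kWidths...>) {
  return std::array<UnpackFn<Out>, sizeof...(kWidths)>{
      &UnpackScalar<Out, kWidths + 1>...};
}

void Unpack32(const uint8_t* in, uint32_t* out, int num_values, int bit_width) {
  static constexpr auto kTable =
      MakeKernelTable<uint32_t>(std::make_integer_sequence<int, 32>{});
  kTable[bit_width - 1](in, out, num_values);  // bit_width in [1, 32]
}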

Are these changes tested?

Yes

Are there any user-facing changes?

No

@github-actions

Thanks for opening a pull request!

If this is not a minor PR, could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

See also:

@AntoinePrv AntoinePrv force-pushed the new-bpacking branch 4 times, most recently from d2743d4 to 6e72467 Compare October 30, 2025 15:58
@AntoinePrv AntoinePrv force-pushed the new-bpacking branch 2 times, most recently from a7e4cd9 to 9efa59a Compare November 20, 2025 17:19
@AntoinePrv AntoinePrv force-pushed the new-bpacking branch 2 times, most recently from d01fdba to b28ea9b Compare November 27, 2025 09:57
@AntoinePrv AntoinePrv changed the title unpack with shuffle algorithm GH-48277: [C++][Parquet] unpack with shuffle algorithm Nov 27, 2025
@github-actions

⚠️ GitHub issue #48277 has been automatically assigned in GitHub to the PR creator.

@AntoinePrv AntoinePrv marked this pull request as ready for review November 27, 2025 14:03
@AntoinePrv
Contributor Author

@pitrou apart from R-lint, this is looking pretty good.

pitrou commented Nov 27, 2025

@ursabot please benchmark lang=C++

@voltrondatabot

Benchmark runs are scheduled for commit a4bfe8a. Watch https://buildkite.com/apache-arrow and https://conbench.ursa.dev for updates. A comment will be posted here when the runs are complete.

@conbench-apache-arrow

Thanks for your patience. Conbench analyzed the 4 benchmarking runs that have been run so far on PR commit a4bfe8a.

There were 37 benchmark results indicating a performance regression:

The full Conbench report has more details.

AntoinePrv commented Nov 28, 2025

@pitrou I'm running this locally, and I made an error when fixing the ASAN over-read problem.
These latest benchmarks are not doing well.

pitrou commented Nov 28, 2025

@ursabot please benchmark lang=C++

@voltrondatabot

Benchmark runs are scheduled for commit dd3ec0d. Watch https://buildkite.com/apache-arrow and https://conbench.ursa.dev for updates. A comment will be posted here when the runs are complete.

@conbench-apache-arrow

Thanks for your patience. Conbench analyzed the 4 benchmarking runs that have been run so far on PR commit dd3ec0d.

There were 19 benchmark results indicating a performance regression:

The full Conbench report has more details.

pitrou commented Dec 1, 2025

@ursabot please benchmark lang=C++

@voltrondatabot

Benchmark runs are scheduled for commit 408ef04. Watch https://buildkite.com/apache-arrow and https://conbench.ursa.dev for updates. A comment will be posted here when the runs are complete.

pitrou commented Dec 2, 2025

@ursabot please benchmark lang=C++

@conbench-apache-arrow

Thanks for your patience. Conbench analyzed the 0 benchmarking runs that have been run so far on PR commit 408ef04.

None of the specified runs were found on the Conbench server.

The full Conbench report has more details.

@AntoinePrv AntoinePrv requested a review from pitrou February 10, 2026 15:28
@pitrou pitrou added the CI: Extra: C++ Run extra C++ CI label Feb 10, 2026
@AntoinePrv
Contributor Author

@pitrou ready again

@pitrou pitrou left a comment

Looks very good to me, just a couple minor comments!

template void unpack<uint16_t>(const uint8_t*, uint16_t*, int, int, int);
template void unpack<uint32_t>(const uint8_t*, uint32_t*, int, int, int);
template void unpack<uint64_t>(const uint8_t*, uint64_t*, int, int, int);
template void unpack<bool>(const uint8_t*, bool*, const UnpackOptions&);

I'm curious, why not put all the unpack-related APIs inside arrow::internal::bpacking as well? Does it cause too much code churn, or would it fail for other reasons?

/// |1|3|5|7|
template <typename ToInt, int kOffset, typename Int, typename Arch, Int... kShifts>
constexpr auto select_stride(xsimd::batch_constant<Int, Arch, kShifts...>) {
constexpr auto kStridesArr =

Do we want to assert that kOffset < sizeof(ToInt) / sizeof(Int)?
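
For reference, a minimal self-contained sketch of such a guard, assuming kOffset indexes a group of Int lanes within one ToInt (my reading of the question); ValidateOffset is a hypothetical helper, not code from this PR:

#include <cstdint>

// Hypothetical guard: kOffset must stay below the number of Int values that
// fit in one ToInt.
template <typename ToInt, typename Int, int kOffset>
constexpr bool ValidateOffset() {
  static_assert(kOffset >= 0 && kOffset < static_cast<int>(sizeof(ToInt) / sizeof(Int)),
                "kOffset out of range for the ToInt/Int lane ratio");
  return true;
}

static_assert(ValidateOffset<uint32_t, uint8_t, 3>());  // ok: 4 uint8_t lanes per uint32_t
// ValidateOffset<uint32_t, uint16_t, 3>() would fail to compile: only 2 lanes fit.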

template <typename Arch>
constexpr bool IsSse2 = std::is_base_of_v<xsimd::sse2, Arch>;

/// Whether we are compiling for the AVX2 or above in the SSE family.

I don't think there's anything "above". AVX-512 would have IsAvx2 = false, right?

return array_to_batch_constant<kStridesArr, Arch>();
}

/// Whether we are compiling for the SSE2 or above in the SSE family.

Suggested change
/// Whether we are compiling for the SSE2 or above in the SSE family.
/// Whether we are compiling for the SSE2 or above in the SSE family (but not AVX2 / AVX-512)

///
/// Resulting in the following swizzled register.
///
/// |AAABBBCC|CDDDEEEF|AAABBBCC|CDDDEEEF||AAABBBCC|CDDDEEEF|CDDDEEEF|FFGGGHHH|

I notice the two 32-bit halves are slightly different but I'm not sure why, since we only care about keeping AAA, CCC, BBB and DDD. Why the difference between CDDDEEEF (first) and FFGGGHHH (second), for example?

/// write to memory.
/// Now we can do the same with C and D without having to go to another swizzle (doing
/// multiple shifts per swizzle).
/// The next value to unpack is E, which starts on bit 12 (not byte aligned).

B, C, D were not byte aligned either, so I don't understand what makes E special.

/// account that we have an extra offset of 12 bits (this will change all the constants
/// used for swizzle and shift).
///
/// Note that in this example, we could have swizzled more than two values in each slot.

Hence my question above :)

/// Note that in this example, we could have swizzled more than two values in each slot.
/// In practice, there may be situations where a more complex algorithm could fit more
/// shifts per swizzles, but that tend to not be the case when the SIMD register size
/// increases.

As a follow-up task, it would be interesting to investigate in which (bit width, SIMD size) combinations a better algorithm could spare some swizzles.

/// We then apply different shifts to align the values bits. The low values are
/// right-shifted by the bit offset, and the high values are left-shifted to fill the gap.
/// Note that because of the figure is little endian, right (>>) and left (<<) shift
/// operators actually shifts the bits representation in the oppostite directions.

Suggested change
/// operators actually shifts the bits representation in the oppostite directions.
/// operators actually shift the bits representation in the opposite directions.

/// bitwise OR of both parts:
///
/// |AAAAAAAA|AAAAABBB||BBBBBBBBB|BB00000|
/// & |00000000|AAAAABBB||000BBBBB|BBBBBCCC|

You wrote & but it's |, right?
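
For context, here is a minimal scalar sketch of the combine step the figure describes (CombineLowHigh and its variable names are illustrative only, not from this PR): the low part is right-shifted, the high part is left-shifted to fill the gap, and the two are OR-ed together, which is why the operation in the figure should be | rather than &.

#include <cstdint>

// Illustrative only: extract one kBitWidth-wide value that straddles two
// adjacent 32-bit words, starting at bit `bit_offset` of `lo`.
uint32_t CombineLowHigh(uint32_t lo, uint32_t hi, int bit_offset, int kBitWidth) {
  const uint32_t mask =
      (kBitWidth == 32) ? ~uint32_t{0} : (uint32_t{1} << kBitWidth) - 1;
  // Low part: shift the wanted bits down to position 0.
  const uint32_t low_part = lo >> bit_offset;
  // High part: shift the remaining bits up so they continue where the low part
  // stops, then OR (not AND) the two halves together.
  const uint32_t high_part = (bit_offset == 0) ? 0 : hi << (32 - bit_offset);
  return (low_part | high_part) & mask;
}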
