
[Parquet] Add SIMD-accelerated byte-stream-split decoding #654

Draft

daniel-adam-tfs wants to merge 2 commits into apache:main from daniel-adam-tfs:feature/optimize-bss

Conversation

@daniel-adam-tfs
Contributor

Rationale for this change

The byte-stream-split encoding is commonly used in Parquet for floating-point data, as it improves compression ratios by grouping similar bytes together. However, the existing Go implementation uses a simple scalar loop, which is inefficient for large datasets. By leveraging SIMD instructions (AVX2 on x86 and NEON on ARM), we can significantly accelerate decoding and improve overall Parquet read performance.

What changes are included in this PR?

Optimized implementation of the byte-stream-split decoding algorithm.

Added SIMD-accelerated implementations:

  • AVX2 implementation for amd64, using 256-bit vectors processing 32 values per block
  • NEON implementation for arm64, using 128-bit vectors processing 16 values per block
  • Both use a 2-stage byte-unpacking hierarchy following the same algorithm structure
  • Runtime CPU feature detection with automatic dispatch to the best available implementation (SIMD vs. scalar fallback)
  • Proper build tags and file suffixes for cross-platform compatibility
  • An optimized V2 scalar implementation using unsafe pointer casting as a fallback
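The runtime-dispatch pattern described above can be sketched as a package-level function variable that build-tagged files override. The names below are hypothetical (the PR's actual symbols may differ), and the feature check is only gestured at in comments; the real code would consult CPU flags (e.g. AVX2 support) rather than just GOARCH.

```go
package main

import (
	"fmt"
	"runtime"
)

// decodeFunc is selected once at startup. On amd64/arm64, build-tagged
// files would replace the default with an assembly-backed implementation.
type decodeFunc func(encoded, out []byte, numValues, width int)

var decodeByteStreamSplitBatch decodeFunc = decodeScalar

func init() {
	// Sketch only: real dispatch checks CPU features (e.g. AVX2 via
	// golang.org/x/sys/cpu), not merely the architecture.
	switch runtime.GOARCH {
	case "amd64", "arm64":
		// decodeByteStreamSplitBatch = decodeSIMD // provided by .s files
	}
}

// decodeScalar is the portable fallback: gather byte j of value i from
// encoded stream j.
func decodeScalar(encoded, out []byte, numValues, width int) {
	for i := 0; i < numValues; i++ {
		for j := 0; j < width; j++ {
			out[i*width+j] = encoded[j*numValues+i]
		}
	}
}

func main() {
	enc := []byte{1, 3, 2, 4} // two 2-byte values, byte-stream-split
	out := make([]byte, 4)
	decodeByteStreamSplitBatch(enc, out, 2, 2)
	fmt.Println(out) // prints "[1 2 3 4]"
}
```

Keeping the dispatch behind a single function variable means callers never branch on architecture themselves, which matches how other assembly kernels in the repo are wired up.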

Are these changes tested?

Yes. Various tests were added:

  • Correctness tests covering various input sizes (1, 2, 7, 8, 31, 32, 33, 63, 64, 65, 127, 128, 129, 255, 256, 512, 1024) to validate all implementations (Reference, V2, AVX2, NEON)
  • Edge case tests including exact block boundaries, single values, all-zero data, and all-ones data
  • Benchmark suite with multiple data sizes (8, 64, 512, 4096, 32768, 262144 values) comparing all implementations
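The correctness strategy above (every variant must match the reference across sizes that straddle SIMD block boundaries) can be sketched as follows. This is an illustrative harness, not the PR's test code; decodeCandidate is a hypothetical stand-in for any optimized variant (V2/AVX2/NEON).

```go
package main

import (
	"bytes"
	"fmt"
	"math/rand"
)

// decodeReference is the straightforward scalar decode used as ground truth.
func decodeReference(encoded, out []byte, n, w int) {
	for i := 0; i < n; i++ {
		for j := 0; j < w; j++ {
			out[i*w+j] = encoded[j*n+i]
		}
	}
}

// decodeCandidate stands in for an optimized implementation under test;
// here it simply delegates so the sketch is self-contained.
func decodeCandidate(encoded, out []byte, n, w int) {
	decodeReference(encoded, out, n, w)
}

func main() {
	rng := rand.New(rand.NewSource(42))
	const width = 4
	// Sizes deliberately land on, just below, and just above the 16- and
	// 32-value SIMD block boundaries.
	for _, n := range []int{1, 2, 7, 8, 31, 32, 33, 63, 64, 65, 127, 128, 129, 255, 256, 512, 1024} {
		encoded := make([]byte, n*width)
		rng.Read(encoded)
		want := make([]byte, n*width)
		got := make([]byte, n*width)
		decodeReference(encoded, want, n, width)
		decodeCandidate(encoded, got, n, width)
		if !bytes.Equal(want, got) {
			panic(fmt.Sprintf("mismatch at n=%d", n))
		}
	}
	fmt.Println("all sizes ok")
}
```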

Are there any user-facing changes?

No user-facing API changes. This is a performance optimization that maintains full backward compatibility. Users will automatically benefit from faster Parquet decoding when reading files with byte-stream-split encoded floating-point columns, with no code changes required.

- Move all byte-stream-split decoding routines to new file byte_stream_split_decode.go.
- Add architecture-specific SIMD implementations:
  - AVX2 for amd64 (byte_stream_split_decode_avx2_amd64.go/.s)
  - NEON for arm64 (byte_stream_split_decode_neon_arm64.go/.s)
- Add runtime dispatch for SIMD decoding based on CPU features (AVX2/NEON).
- Ensure correct build tags and file suffixes for cross-platform compatibility.
@daniel-adam-tfs
Contributor Author

Benchmark comparison on my Intel MacBook:

goos: darwin
goarch: amd64
pkg: github.com/apache/arrow-go/v18/parquet/internal/encoding
cpu: Intel(R) Core(TM) i5-1038NG7 CPU @ 2.00GHz
                                                  │ bench/BenchmarkDecodeByteStreamSplitBatchWidth4.main.txt │ bench/BenchmarkDecodeByteStreamSplitBatchWidth4.V2.txt │ bench/BenchmarkDecodeByteStreamSplitBatchWidth4.AVX2.txt │
                                                  │                          sec/op                          │             sec/op              vs base                │              sec/op               vs base                │
DecodeByteStreamSplitBatchWidth4/nValues=8-8                                                    204.65n ± 0%                      12.78n ± 0%  -93.76% (p=0.000 n=10)                        11.50n ± 1%  -94.38% (p=0.000 n=10)
DecodeByteStreamSplitBatchWidth4/nValues=64-8                                                  395.550n ± 1%                     77.810n ± 1%  -80.33% (p=0.000 n=10)                        7.710n ± 1%  -98.05% (p=0.000 n=10)
DecodeByteStreamSplitBatchWidth4/nValues=512-8                                                 1425.50n ± 1%                     516.15n ± 1%  -63.79% (p=0.000 n=10)                        34.59n ± 2%  -97.57% (p=0.000 n=10)
DecodeByteStreamSplitBatchWidth4/nValues=4096-8                                                 9459.5n ± 2%                     4073.0n ± 1%  -56.94% (p=0.000 n=10)                        226.8n ± 2%  -97.60% (p=0.000 n=10)
DecodeByteStreamSplitBatchWidth4/nValues=32768-8                                                73.885µ ± 0%                     32.318µ ± 1%  -56.26% (p=0.000 n=10)                        2.688µ ± 1%  -96.36% (p=0.000 n=10)
DecodeByteStreamSplitBatchWidth4/nValues=262144-8                                               593.79µ ± 1%                     261.89µ ± 1%  -55.89% (p=0.000 n=10)                        37.92µ ± 2%  -93.61% (p=0.000 n=10)
geomean                                                                                          6.026µ                           1.614µ       -73.21%                                       203.4n       -96.62%

                                                  │ bench/BenchmarkDecodeByteStreamSplitBatchWidth4.main.txt │ bench/BenchmarkDecodeByteStreamSplitBatchWidth4.V2.txt │ bench/BenchmarkDecodeByteStreamSplitBatchWidth4.AVX2.txt │
                                                  │                           B/s                            │             B/s               vs base                  │              B/s                vs base                  │
DecodeByteStreamSplitBatchWidth4/nValues=8-8                                                    149.1Mi ± 0%                  2387.9Mi ± 0%  +1501.65% (p=0.000 n=10)                    2653.2Mi ± 1%  +1679.58% (p=0.000 n=10)
DecodeByteStreamSplitBatchWidth4/nValues=64-8                                                   617.2Mi ± 1%                  3137.6Mi ± 1%   +408.36% (p=0.000 n=10)                   31664.9Mi ± 1%  +5030.38% (p=0.000 n=10)
DecodeByteStreamSplitBatchWidth4/nValues=512-8                                                  1.338Gi ± 1%                   3.695Gi ± 1%   +176.19% (p=0.000 n=10)                    55.147Gi ± 2%  +4021.74% (p=0.000 n=10)
DecodeByteStreamSplitBatchWidth4/nValues=4096-8                                                 1.613Gi ± 2%                   3.747Gi ± 1%   +132.26% (p=0.000 n=10)                    67.269Gi ± 2%  +4070.21% (p=0.000 n=10)
DecodeByteStreamSplitBatchWidth4/nValues=32768-8                                                1.652Gi ± 0%                   3.777Gi ± 1%   +128.62% (p=0.000 n=10)                    45.410Gi ± 1%  +2648.52% (p=0.000 n=10)
DecodeByteStreamSplitBatchWidth4/nValues=262144-8                                               1.645Gi ± 1%                   3.729Gi ± 1%   +126.73% (p=0.000 n=10)                    25.752Gi ± 2%  +1465.82% (p=0.000 n=10)
geomean                                                                                         916.7Mi                        3.342Gi        +273.33%                                    26.52Gi       +2862.04%

                                                  │ bench/BenchmarkDecodeByteStreamSplitBatchWidth4.main.txt │ bench/BenchmarkDecodeByteStreamSplitBatchWidth4.V2.txt │ bench/BenchmarkDecodeByteStreamSplitBatchWidth4.AVX2.txt │
                                                  │                           B/op                           │           B/op             vs base                     │            B/op              vs base                     │
DecodeByteStreamSplitBatchWidth4/nValues=8-8                                                      80.00 ± 0%                   0.00 ± 0%  -100.00% (p=0.000 n=10)                         0.00 ± 0%  -100.00% (p=0.000 n=10)
DecodeByteStreamSplitBatchWidth4/nValues=64-8                                                     96.00 ± 0%                   0.00 ± 0%  -100.00% (p=0.000 n=10)                         0.00 ± 0%  -100.00% (p=0.000 n=10)
DecodeByteStreamSplitBatchWidth4/nValues=512-8                                                    112.0 ± 0%                    0.0 ± 0%  -100.00% (p=0.000 n=10)                          0.0 ± 0%  -100.00% (p=0.000 n=10)
DecodeByteStreamSplitBatchWidth4/nValues=4096-8                                                   112.0 ± 0%                    0.0 ± 0%  -100.00% (p=0.000 n=10)                          0.0 ± 0%  -100.00% (p=0.000 n=10)
DecodeByteStreamSplitBatchWidth4/nValues=32768-8                                                  112.0 ± 0%                    0.0 ± 0%  -100.00% (p=0.000 n=10)                          0.0 ± 0%  -100.00% (p=0.000 n=10)
DecodeByteStreamSplitBatchWidth4/nValues=262144-8                                                 112.0 ± 0%                    0.0 ± 0%  -100.00% (p=0.000 n=10)                          0.0 ± 0%  -100.00% (p=0.000 n=10)
geomean                                                                                           103.2                                   ?                       ¹ ²                                ?                       ¹ ²
¹ summaries must be >0 to compute geomean
² ratios must be >0 to compute geomean

                                                  │ bench/BenchmarkDecodeByteStreamSplitBatchWidth4.main.txt │ bench/BenchmarkDecodeByteStreamSplitBatchWidth4.V2.txt │ bench/BenchmarkDecodeByteStreamSplitBatchWidth4.AVX2.txt │
                                                  │                        allocs/op                         │         allocs/op          vs base                     │          allocs/op           vs base                     │
DecodeByteStreamSplitBatchWidth4/nValues=8-8                                                      1.000 ± 0%                  0.000 ± 0%  -100.00% (p=0.000 n=10)                        0.000 ± 0%  -100.00% (p=0.000 n=10)
DecodeByteStreamSplitBatchWidth4/nValues=64-8                                                     3.000 ± 0%                  0.000 ± 0%  -100.00% (p=0.000 n=10)                        0.000 ± 0%  -100.00% (p=0.000 n=10)
DecodeByteStreamSplitBatchWidth4/nValues=512-8                                                    3.000 ± 0%                  0.000 ± 0%  -100.00% (p=0.000 n=10)                        0.000 ± 0%  -100.00% (p=0.000 n=10)
DecodeByteStreamSplitBatchWidth4/nValues=4096-8                                                   3.000 ± 0%                  0.000 ± 0%  -100.00% (p=0.000 n=10)                        0.000 ± 0%  -100.00% (p=0.000 n=10)
DecodeByteStreamSplitBatchWidth4/nValues=32768-8                                                  3.000 ± 0%                  0.000 ± 0%  -100.00% (p=0.000 n=10)                        0.000 ± 0%  -100.00% (p=0.000 n=10)
DecodeByteStreamSplitBatchWidth4/nValues=262144-8                                                 3.000 ± 0%                  0.000 ± 0%  -100.00% (p=0.000 n=10)                        0.000 ± 0%  -100.00% (p=0.000 n=10)
geomean                                                                                           2.498                                   ?                       ¹ ²                                ?                       ¹ ²

@daniel-adam-tfs
Contributor Author

@zeroshade One of our departments has integrated byte-stream-split encoding/decoding into the proprietary format they currently use to store data. We did some comparisons and they were getting faster decoding, so I looked into their code: they were using a SIMD implementation with VPUNPCKLBW in C#. I took their fallback and SIMD implementations and fed them to Claude (I haven't seen much assembler since college myself), and it gave me these implementations, which are really fast.
The fallback is pretty fast, faster than the current implementation, so I'd replace the current implementation with it. I've copied the file names and build tags from the existing assembly in the repo, so it should be OK to add them.

I actually wrote C code first and tried c2goasm with AppleClang and clang 21.0, but I couldn't get it to generate compilable code.
I also tried the new https://pkg.go.dev/simd/archsimd package that we're getting in Go 1.26, but it doesn't have a VPUNPCKLBW wrapper, so I couldn't get it to be as fast as the Claude-generated code.

Anyway, most of our data is float32s or float64s, so the bss decoding function was at the top of the profile. After processing some files with this change, it fell to around 10th place; memmove and LevelDecoder.Decode are the top 2 now. I think I can do something with both. (I see potential improvements to the level decoding for our case, and I should get rid of some memmoves if I figure out the other PR with the buffers at some point. 😆 )

@zeroshade
Member

This is awesome, thanks! I agree that if the new fallback is faster, then we should replace the current one with it. I'm still traveling currently, but I'll take a look later this week.

The lint is failing because you need the apache license header on the files.

@zeroshade
Member

@daniel-adam-tfs any further work needed here to make it ready for review?
