huf_decompress: enable 4-way fast loop on riscv64 #4622

Open
Polaris-911 wants to merge 1 commit into facebook:dev from Polaris-911:pr1

@Polaris-911
Contributor

Summary

  • In lib/decompress/huf_decompress.c, enable HUF_4X2_4WAY for riscv64 (defined(__riscv) && __riscv_xlen == 64).
  • Keep existing logic unchanged for all other architectures.
  • No functional or format changes: only the fast-loop scheduling policy in the hot decode path changes.
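
A minimal sketch of what such a guard could look like, for illustration only. The actual layout of the conditional in lib/decompress/huf_decompress.c may differ; the aarch64 branch is assumed from the statement in this PR that 4-way is already enabled there, and huf_4x2_ways() is a hypothetical helper that exists only to make the selection observable.

```c
/* Sketch only: not copied from huf_decompress.c. */
#if defined(__aarch64__)
#  define HUF_4X2_4WAY 1   /* assumed existing behavior on aarch64 */
#elif defined(__riscv) && (__riscv_xlen == 64)
#  define HUF_4X2_4WAY 1   /* the opt-in proposed by this PR: riscv64 only */
#else
#  define HUF_4X2_4WAY 0   /* rv32 and all other targets keep the 3-way path */
#endif

/* Hypothetical helper: reports how many streams the main loop body decodes. */
int huf_4x2_ways(void)
{
    return HUF_4X2_4WAY ? 4 : 3;
}
```

On a target that defines neither macro, an undefined `__riscv_xlen` evaluates to 0 inside `#if`, so the condition is false and the 3-way default is kept.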

Why this change is reasonable

  • The HUF_4X2_4WAY switch controls whether the fast loop decodes all 4 streams in the main unrolled body, or uses a 3-way fallback to reduce register pressure.
  • This loop is already enabled as 4-way on aarch64, which suggests the 4-way variant is stable and can be beneficial on modern 64-bit RISC-style cores with larger register files.
  • riscv64 has similar 64-bit execution characteristics for this scalar path (64-bit container ops, stream-parallel decode pattern), so selecting 4-way is a reasonable architecture-level default.
  • Scope is intentionally conservative: only riscv64 is opted in; rv32 and all non-RISC-V targets keep existing behavior.
  • Even if performance differs by compiler or micro-architecture, this change is functionally safe because both paths implement the same decode semantics.
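
The scheduling difference described above can be illustrated with a toy model. This is a sketch under stated assumptions, not the zstd implementation: toy_decode() is a hypothetical stand-in for a Huffman table lookup, and placing the fourth stream in a separate "reload phase" in the 3-way case is an assumption about how the fallback relieves register pressure. The point is only that both schedules produce identical output.

```c
#include <stddef.h>
#include <string.h>

#define NSTREAMS 4
#define NSYMBOLS 4

/* Hypothetical stand-in for a decode-table lookup. Real HUF decoding
 * reads variable-length codes from a bit container instead. */
static unsigned char toy_decode(unsigned char code)
{
    return (unsigned char)(code ^ 0x5A);
}

/* Decode n symbols from each of the 4 streams.
 * four_way != 0: all 4 streams are handled in the main body.
 * four_way == 0: 3 streams in the main body, the 4th in the reload phase. */
static void toy_fast_loop(const unsigned char *in[NSTREAMS],
                          unsigned char *out[NSTREAMS],
                          size_t n, int four_way)
{
    int const main_streams = four_way ? 4 : 3;
    size_t i;
    for (i = 0; i < n; ++i) {
        int s;
        /* main body */
        for (s = 0; s < main_streams; ++s)
            out[s][i] = toy_decode(in[s][i]);
        /* reload phase: picks up the last stream in the 3-way schedule */
        if (!four_way)
            out[3][i] = toy_decode(in[3][i]);
    }
}

/* Returns 1 when the 4-way and 3-way schedules yield the same bytes. */
int toy_schedules_agree(void)
{
    static const unsigned char data[NSTREAMS][NSYMBOLS] = {
        {1, 2, 3, 4}, {5, 6, 7, 8}, {9, 10, 11, 12}, {13, 14, 15, 16}
    };
    unsigned char a[NSTREAMS][NSYMBOLS];
    unsigned char b[NSTREAMS][NSYMBOLS];
    const unsigned char *in[NSTREAMS];
    unsigned char *out_a[NSTREAMS];
    unsigned char *out_b[NSTREAMS];
    int s;
    for (s = 0; s < NSTREAMS; ++s) {
        in[s] = data[s];
        out_a[s] = a[s];
        out_b[s] = b[s];
    }
    toy_fast_loop(in, out_a, NSYMBOLS, 1);
    toy_fast_loop(in, out_b, NSYMBOLS, 0);
    return memcmp(a, b, sizeof a) == 0;
}
```

Because each stream is decoded independently, moving one stream to a different point in the iteration changes only instruction scheduling and register usage, not the decoded bytes.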

meta-cla bot added the CLA Signed label on Mar 12, 2026

@terrelln
Contributor

Do you have benchmarks for this?

@Polaris-911
Contributor Author

Benchmark Results (silesia.tar, level 1)

Hi @terrelln, here are the benchmark results comparing baseline zstd against zstd_huf (built with #define HUF_4X2_4WAY 1) across 5 runs:

| Run | zstd Comp (MB/s) | zstd Decomp (MB/s) | zstd_huf Comp (MB/s) | zstd_huf Decomp (MB/s) |
|---|---|---|---|---|
| 1 | 91.3 | 235.3 | 91.8 | 235.8 |
| 2 | 92.1 | 227.3 | 90.6 | 231.6 |
| 3 | 91.2 | 236.4 | 91.4 | 235.5 |
| 4 | 92.2 | 233.0 | 91.7 | 236.8 |
| 5 | 91.5 | 234.7 | 91.5 | 237.6 |
| Average | 91.66 | 233.34 | 91.40 | 235.46 |
| Difference | - | - | -0.28% | +0.91% |

Summary:
After setting #define HUF_4X2_4WAY 1, decompression speed improves by +0.91% on average (zstd_huf is faster in 4 of the 5 runs). The small change in compression speed (-0.28%) is within the normal margin of error.
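
For reference, the averages and percentage differences reported above can be reproduced from the per-run numbers with a few lines of C. This is a small sketch; the function names are hypothetical and the arrays simply restate the posted measurements.

```c
#include <stddef.h>

/* Arithmetic mean of n values. */
static double mean(const double *v, size_t n)
{
    double sum = 0.0;
    size_t i;
    for (i = 0; i < n; ++i)
        sum += v[i];
    return sum / (double)n;
}

/* Relative change of candidate versus baseline, in percent. */
double percent_diff(double baseline, double candidate)
{
    return (candidate - baseline) / baseline * 100.0;
}

/* Decompression: baseline zstd versus zstd_huf, in MB/s per run. */
double decomp_gain_percent(void)
{
    static const double base[5] = {235.3, 227.3, 236.4, 233.0, 234.7};
    static const double huf[5]  = {235.8, 231.6, 235.5, 236.8, 237.6};
    return percent_diff(mean(base, 5), mean(huf, 5));
}

/* Compression: baseline zstd versus zstd_huf, in MB/s per run. */
double comp_gain_percent(void)
{
    static const double base[5] = {91.3, 92.1, 91.2, 92.2, 91.5};
    static const double huf[5]  = {91.8, 90.6, 91.4, 91.7, 91.5};
    return percent_diff(mean(base, 5), mean(huf, 5));
}
```

Rounded to two decimals, decomp_gain_percent() gives about +0.91 and comp_gain_percent() about -0.28, matching the Difference row.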
