
zstd/DecompressSequences_LdsFseCache: Prefer TG size of 32 instead of 64 and some refactoring. #73

Draft

Jonathan-Weinstein-AMD wants to merge 4 commits into microsoft:development from Jonathan-Weinstein-AMD:decompress-sequences-lds-fse-cleanup


Conversation


@Jonathan-Weinstein-AMD Jonathan-Weinstein-AMD commented Feb 19, 2026

Main changes in this PR:

  1. Prefer a TG size of 32 instead of 64 for drivers that advertise both wave32 and wave64 (I think that's just AMD RDNA): An input zst file of ~146MB does a Dispatch(15299, 1, 1); on a 7900 XTX (RDNA3) with SetStablePowerState, its duration changes from 1.811 ms to 1.441 ms. For other zst files I have, a TG size of 32 seems a bit faster or neutral. A nearby #ifdef suggests a threadgroup size of 32 may also generally be better on RDNA2 XBOX. This could be something to revisit later though.
    • EDIT: I think I figured out a reason why a TG size of 32 (wave32) can be a lot better than a TG size of 64 (wave64 on RDNA3 in this case), and they become much more comparable once the if (threadId != 0) return is added. In wave64, that makes the high 32 bits of exec zero, so wave64 VMEM/VALU/LDS instructions will then skip that half.
  2. After the LDS stores are done, have all threads but thread 0 bail, since the rest of the shader should be scalar (see the sketch after this list).
  3. Restructure the loop so its exit condition is in the middle and ZSTGPU_DECODE_SEQ is invoked just once. This is mainly for easier performance analysis; the runtime performance does not seem affected.
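A minimal HLSL sketch of the shape of changes 1 and 2 (TG_SIZE, FSE_TABLE_SIZE, g_FseTableSrc, and gs_FseTable are hypothetical stand-ins for this comment, not identifiers from the repo):

```hlsl
// Illustrative sketch only; all names here are made up for this comment.
#define TG_SIZE 32          // change 1: was 64; prefer 32 where wave32 exists
#define FSE_TABLE_SIZE 512  // hypothetical table size

ByteAddressBuffer g_FseTableSrc : register(t0);
groupshared uint gs_FseTable[FSE_TABLE_SIZE];

[numthreads(TG_SIZE, 1, 1)]
void DecompressSequences_LdsFseCache(uint threadId : SV_GroupIndex)
{
    // All threads cooperate on the LDS stores that cache the FSE table.
    for (uint i = threadId; i < FSE_TABLE_SIZE; i += TG_SIZE)
        gs_FseTable[i] = g_FseTableSrc.Load(i * 4);

    // Unconditional barrier: the compiler can elide it for a single-wave
    // threadgroup, and s_barrier behaves like s_nop there on AMD anyway.
    GroupMemoryBarrierWithGroupSync();

    // Change 2: the rest of the shader is scalar, so retire every lane but
    // thread 0. In wave64 this zeroes the high 32 bits of exec, letting
    // VMEM/VALU/LDS instructions skip that half.
    if (threadId != 0)
        return;

    // ... scalar sequence decoding on thread 0 ...
}
```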

Unrelated to DecompressSequences is a minor touch-up/comment to the original Huffman literal decoding, removing zstdgpu_Backward_BitBuffer_V0_CanRefill_Huffman. I tested this back when it would still compile, and it was faster to have just one loop exit condition.
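As a hedged sketch of that touch-up (only zstdgpu_Backward_BitBuffer_V0_CanRefill_Huffman is from the source; the other names below are made up):

```hlsl
// Illustrative only. Before, the decode loop had two exit paths, roughly:
//   while (symbolsLeft != 0 &&
//          zstdgpu_Backward_BitBuffer_V0_CanRefill_Huffman(bits)) { ... }
// After, the refill happens unconditionally in the body, so the loop keeps
// a single termination condition:
while (symbolsLeft != 0)
{
    Backward_BitBuffer_Refill_Huffman(bits);   // hypothetical name
    symbolsLeft -= DecodeHuffmanSymbols(bits); // hypothetical name
}
```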

Diff is best viewed when ignoring whitespace changes.

An input zst file of ~146MB does a Dispatch(15299, 1, 1). On a 7900 XTX (RDNA3) with SetStablePowerState, the duration changes from 1.811 ms to 1.441 ms.

There could perhaps be more perf testing for this, but a nearby #ifdef suggests a threadgroup size of 32 is also better on RDNA2 XBOX.

Current IHV-compiler output seems lacking in some areas; if that improves, the preferred threadgroup size may change.
…bail after LDS stores. Also remove the condition on the GroupMemoryBarrierWithGroupSync. The IHV compiler should not emit a barrier if it isn't needed and should be capable of removing empty blocks. s_barrier also behaves like an s_nop for single-wave threadgroups on AMD.

The s_nop behavior does not apply to DeviceMemoryBarrierWithGroupSync in single-wave threadgroups.
…ECODE_SEQ twice. The benefit is less compiled code, for easier performance analysis. Performance seems about the same. The ZSTGPU_DECODE_SEQ macro could be removed, but it's kept for any future potential experimentation (and to reduce the diff when ignoring whitespace).
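The restructured loop has roughly this shape (a sketch; numSequences is a stand-in name, and the real body is whatever ZSTGPU_DECODE_SEQ expands to):

```hlsl
// Before: ZSTGPU_DECODE_SEQ was expanded twice, once in the loop prologue
// and once in the body. After: a single expansion, with the exit test in
// the middle of the loop.
for (uint seq = 0; ; )
{
    ZSTGPU_DECODE_SEQ(seq);          // expanded exactly once in the source
    if (++seq == numSequences)
        break;
    // ... carry FSE state forward to the next iteration ...
}
```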
…d/ran this at one point before and it was faster for the loop to only have 1 termination condition. Added a comment near the #if 0 about raw buffers.
@Jonathan-Weinstein-AMD Jonathan-Weinstein-AMD marked this pull request as draft February 19, 2026 20:08