
zstd/DecompressSequences_LdsFseCache: Prefer TG size of 32 instead of 64 and some refactoring. #73

Draft

Jonathan-Weinstein-AMD wants to merge 4 commits into microsoft:development from Jonathan-Weinstein-AMD:decompress-sequences-lds-fse-cleanup


Conversation


@Jonathan-Weinstein-AMD Jonathan-Weinstein-AMD commented Feb 19, 2026

Main changes in this PR:

  1. Prefer a TG size of 32 instead of 64 for drivers that advertise both wave32 and wave64 (I think that's just AMD RDNA): An input zst file of ~146MB does a Dispatch(15299, 1, 1); on a 7900 XTX (RDNA3) with SetStablePowerState, its duration changes from 1.811 ms to 1.441 ms. For other zst files I have, a TG size of 32 seems a bit faster or neutral. A nearby #ifdef suggests a threadgroup size of 32 may also generally be better on RDNA2 XBOX. This could be something to revisit later though.
    • EDIT: I think I figured out a reason why a TG size of 32 (wave32) can be a lot better than a TG size of 64 (wave64 on RDNA3 in this case), and they become much more comparable once the if (threadId != 0) return is added. In wave64, that makes the high 32 bits of exec zero, so wave64 VMEM/VALU/LDS instructions will then skip that half.
  2. After the LDS stores are done, have all threads but thread 0 bail, since the rest of the shader should be scalar (see the sketch after this list).
  3. Restructure the loop so its exit condition is in the middle and ZSTGPU_DECODE_SEQ is invoked just once. This is mainly for easier performance analysis; the runtime performance does not seem affected.
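A minimal HLSL sketch of the shape of changes 1 and 2 (TG_SIZE, FSE_TABLE_SIZE, g_FseTableSrc, and gs_FseTable are hypothetical stand-ins for this comment, not identifiers from the repo):

```hlsl
// Illustrative sketch only; all names here are made up for this comment.
#define TG_SIZE 32          // change 1: was 64; prefer 32 where wave32 exists
#define FSE_TABLE_SIZE 512  // hypothetical table size

ByteAddressBuffer g_FseTableSrc : register(t0);
groupshared uint gs_FseTable[FSE_TABLE_SIZE];

[numthreads(TG_SIZE, 1, 1)]
void DecompressSequences_LdsFseCache(uint threadId : SV_GroupIndex)
{
    // All threads cooperate on the LDS stores that cache the FSE table.
    for (uint i = threadId; i < FSE_TABLE_SIZE; i += TG_SIZE)
        gs_FseTable[i] = g_FseTableSrc.Load(i * 4);

    // Unconditional barrier: the compiler can elide it for a single-wave
    // threadgroup, and s_barrier behaves like s_nop there on AMD anyway.
    GroupMemoryBarrierWithGroupSync();

    // Change 2: the rest of the shader is scalar, so retire every lane but
    // thread 0. In wave64 this zeroes the high 32 bits of exec, letting
    // VMEM/VALU/LDS instructions skip that half.
    if (threadId != 0)
        return;

    // ... scalar sequence decoding on thread 0 ...
}
```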

Unrelated to DecompressSequences is a minor touch-up/comment to the original Huffman literal decoding, removing zstdgpu_Backward_BitBuffer_V0_CanRefill_Huffman. I tested this back when it would still compile, and it was faster to have just one loop exit condition.
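As a hedged sketch of that touch-up (only zstdgpu_Backward_BitBuffer_V0_CanRefill_Huffman is from the source; the other names below are made up):

```hlsl
// Illustrative only. Before, the decode loop had two exit paths, roughly:
//   while (symbolsLeft != 0 &&
//          zstdgpu_Backward_BitBuffer_V0_CanRefill_Huffman(bits)) { ... }
// After, the refill happens unconditionally in the body, so the loop keeps
// a single termination condition:
while (symbolsLeft != 0)
{
    Backward_BitBuffer_Refill_Huffman(bits);   // hypothetical name
    symbolsLeft -= DecodeHuffmanSymbols(bits); // hypothetical name
}
```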

Diff is best viewed when ignoring whitespace changes.

An input zst file of ~146MB does a Dispatch(15299, 1, 1). On a 7900 XTX (RDNA3) with SetStablePowerState, the duration changes from 1.811 ms to 1.441 ms.

There could perhaps be more perf testing for this, but a nearby #ifdef suggests a threadgroup size of 32 is also better on RDNA2 XBOX.

Current IHV-compiler output seems lacking in some areas; if that improves, the preferred threadgroup size may change.
…bail after LDS stores. Also remove the condition on the GroupMemoryBarrierWithGroupSync. The IHV compiler should not emit a barrier if it isn't needed and should be capable of removing empty blocks. s_barrier also behaves like an s_nop for single-wave threadgroups on AMD.

The s_nop behavior does not apply to DeviceMemoryBarrierWithGroupSync in single-wave threadgroups.
…ECODE_SEQ twice. The benefit is less compiled code, for easier performance analysis. Performance seems about the same. The ZSTGPU_DECODE_SEQ macro could be removed, but it's kept for any future potential experimentation (and to reduce the diff when ignoring whitespace).
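The restructured loop has roughly this shape (a sketch; numSequences is a stand-in name, and the real body is whatever ZSTGPU_DECODE_SEQ expands to):

```hlsl
// Before: ZSTGPU_DECODE_SEQ was expanded twice, once in the loop prologue
// and once in the body. After: a single expansion, with the exit test in
// the middle of the loop.
for (uint seq = 0; ; )
{
    ZSTGPU_DECODE_SEQ(seq);          // expanded exactly once in the source
    if (++seq == numSequences)
        break;
    // ... carry FSE state forward to the next iteration ...
}
```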
…d/ran this at one point before and it was faster for the loop to only have 1 termination condition. Added a comment near the #if 0 about raw buffers.
@Jonathan-Weinstein-AMD Jonathan-Weinstein-AMD marked this pull request as draft February 19, 2026 20:08