zstd/DecompressSequences_LdsFseCache: Prefer TG size of 32 instead of 64 and some refactoring.#73
Draft
Jonathan-Weinstein-AMD wants to merge 4 commits intomicrosoft:developmentfrom
Conversation
An input zst file of ~146MB does a Dispatch(15299, 1, 1). On an 7900 XTX (RDNA3) with SetStablePowerState, duration change is 1.811 ms -> 1.441 ms. There could perhaps be more perf testing for this, but A nearby #ifdef suggests a threadgroup size of 32 is also better on RDNA2 XBOX. Current IHV-compiler output seems lacking in some areas, if that is improved the preference of threadgroup size may change.
…bail after LDS stores. Also remove the condition for the GroupMemoryBarrierWithGroupSync. The IHV compiler should not emit a barrier if it isn't needed and be capable or removing empty blocks. s_barrier also behaves like an s_nop for single-wave-threadgroups on AMD. This does not apply to single-wave-threadgroups and DeviceMemoryBarrierWithGroupSync.
…ECODE_SEQ twice. The benefit is less compiled code for easier performance analysis. Performance seems about the same. The ZSTGPU_DECODE_SEQ macro could be removed, but its kept for any future potential experimentation (and to reduce the diff when ignoring whitespace).
…d/ran this at one point before and it was faster for the loop to only have 1 termination condition. Added a comment near the #if 0 about raw buffers.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Main changes in this PR:
if (threadId != 0) returnis added. In wave64, that makes the high 32b part of exec 0, so wave64 VMEM/VALU/LDS instructions will then skip that half.ZSTGPU_DECODE_SEQis invoked just once. This is mainly for easier performance analysis; the runtime performance does not seem affected.Unrelated to DecompressSequences is a minor touch-up/comment to the original Huffman literal decoding to remove
zstdgpu_Backward_BitBuffer_V0_CanRefill_Huffman. I tested that when it would compile and it was faster to just have 1 loop exit condition.Diff is best viewed when ignoring whitespace changes.