Fixes for CPU backend + instructions for targeting AMD GPUs #5
Open
philtomson wants to merge 2 commits into PrismML-Eng:prism from
Conversation
Fix float-truncation bug in ggml_vec_dot_q1_0_g128_q8_0: the Q8_0 scale factor d1 was accumulated into an int, silently truncating it to zero for most blocks and producing garbage output on CPU.

Add AVX2 kernel using the same shuffle/bit-mask expansion as q1_0, processing one 32-element Q8_0 sub-block per iteration (~4x speedup).

Add AVX-512 hybrid kernel using _mm256_mask_blend_epi8 (AVX-512BW+VL) to expand weight bits in a single instruction, with _mm256_dpbusd_epi32 (AVX-512VNNI) for the dot product (~5.7x speedup on 1.7B model).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
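A scalar reference may clarify what the vector kernels compute. This is a sketch, not the repo's code: the ±1 weight mapping for the packed bits and the helper name `dot_q1_q8_subblock` are assumptions, but the structure (integer sub-block accumulator, block scales applied in float) matches the fix described above.

```c
#include <stdint.h>

/* Hypothetical scalar reference for one 32-element sub-block of a
 * 1-bit x Q8_0 dot product. Each packed bit selects a +1/-1 weight
 * (an assumption about the Q1_0 levels); the integer accumulator is
 * scaled by the two block scales in float, which is exactly where the
 * truncation bug lived when the scale landed in an int. */
float dot_q1_q8_subblock(uint32_t bits, const int8_t q8[32], float d1, float d8) {
    int32_t acc = 0;
    for (int i = 0; i < 32; i++) {
        int w = ((bits >> i) & 1) ? 1 : -1;  /* expand one packed weight bit */
        acc += w * q8[i];
    }
    return d1 * d8 * (float)acc;             /* scales stay in float */
}
```

The AVX2 and AVX-512 kernels vectorize the bit-expansion loop (shuffle/bit-mask in AVX2, `_mm256_mask_blend_epi8` in AVX-512) while keeping this same accumulate-then-scale shape.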
Document how to build with GGML_HIP=ON targeting gfx1151 (Radeon 8060S / Ryzen AI MAX+), including the Docker-based workflow for systems where the system ROCm installation is partial. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
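A minimal build sketch for the native (non-Docker) path, assuming a complete local ROCm install; the flag names follow common ggml conventions and the exact invocation in this repo's README may differ:

```shell
# Sketch: HIP build targeting gfx1151 (Radeon 8060S / Ryzen AI MAX+).
# Assumes hipcc/clang from a full ROCm installation is on PATH.
cmake -B build -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1151 -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j
```

On systems where the ROCm installation is partial, the commit's Docker-based workflow runs the same build inside a ROCm development image instead.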
Collaborator
This looks great, thanks. There were a few CPU kernel fixes, and I did not see them until I pushed my changes. For now I removed the buggy x86 kernel and will merge one of the correct AVX ones. Could you run the KL divergence tests described here: #8

The AMD one looks great; I did not realize you could build the CUDA version for AMD. How are the speeds?
Author
I'm getting about 55 tok/sec with the 8B model on my Framework desktop PC (Strix Halo). That's about half of what I've seen someone report with a 4090, which is probably about right given the slower memory bandwidth, since this whole model can easily fit in a 4090.
Author
As for the KL divergence numbers:

KL Divergence Results: Bonsai-1.7B on wikitext-2-raw (100 chunks, ctx=512)
Setup: F16 reference logits compared against the Q1_0_g128 and Q1_0 quantizations; reported metrics are perplexity and KL divergence vs F16.

Numbers seem higher than your NEON numbers.
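For reference, the comparison described above can be sketched from saved logits; this is a minimal NumPy illustration with my own helper name and shapes, not the project's test harness:

```python
import numpy as np

def kl_from_logits(ref_logits, test_logits):
    """Mean KL(ref || test) per token, computed from raw logits.

    Both arguments: shape (n_tokens, vocab_size), e.g. F16 reference
    logits vs. logits from a quantized run.
    """
    def log_softmax(x):
        x = x - x.max(axis=-1, keepdims=True)        # numerical stability
        return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

    log_p = log_softmax(np.asarray(ref_logits, dtype=np.float64))
    log_q = log_softmax(np.asarray(test_logits, dtype=np.float64))
    p = np.exp(log_p)
    # KL(p || q) summed over vocab, averaged over tokens
    return float((p * (log_p - log_q)).sum(axis=-1).mean())
```

Because softmax is shift-invariant, identical or uniformly shifted logits give a KL of zero, which is a handy sanity check on the harness.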
This PR fixes a CPU kernel bug where a float was converted to an int (0.4 -> 0), which broke the CPU backend for the Bonsai 1-bit models. Support was added for AVX2 and AVX-512 kernels. It also adds instructions to the README.md for compiling with ROCm for AMD GPUs.