
Fixes for CPU backend + instructions for targeting AMD GPUs #5

Open
philtomson wants to merge 2 commits into PrismML-Eng:prism from philtomson:prism

Conversation

@philtomson

This PR fixes a CPU kernel bug where a float was truncated to an int (0.4 -> 0), which broke the CPU backend for the Bonsai 1-bit models. Support was added for AVX2 and AVX-512. It also adds instructions to README.md for compiling for ROCm on AMD GPUs.

philtomson and others added 2 commits April 2, 2026 08:08
Fix float-truncation bug in ggml_vec_dot_q1_0_g128_q8_0: the Q8_0
scale factor d1 was accumulated into an int, silently truncating it
to zero for most blocks and producing garbage output on CPU.
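
A minimal, self-contained illustration of the failure mode (the variable names and surrounding code are hypothetical simplifications, not the actual kernel):

```c
/* Hypothetical simplification of the described bug, not the real
 * ggml code: converting the float Q8_0 scale through an int
 * truncates it toward zero, so most blocks contribute nothing. */
#include <stdio.h>

int main(void) {
    float d1   = 0.4f; /* Q8_0 block scale factor                */
    int   sumi = 100;  /* integer dot product over one sub-block */

    int bad = 0;
    bad += (int)d1 * sumi;    /* BUG: (int)0.4f == 0          */

    float good = 0.0f;
    good += d1 * (float)sumi; /* fix: keep the scale in float */

    printf("bad = %d, good = %.1f\n", bad, good); /* bad = 0, good = 40.0 */
    return 0;
}
```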

Add AVX2 kernel using the same shuffle/bit-mask expansion as q1_0,
processing one 32-element Q8_0 sub-block per iteration (~4x speedup).
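
A sketch of the shuffle/bit-mask expansion technique for one 32-element sub-block (illustrative only; the function name, weight layout, and reduction are assumptions, not the merged kernel):

```c
/* Sketch of the technique, not the actual PR code. Expands 32 one-bit
 * weights (packed in a uint32_t) into per-byte +/-1 signs and dots them
 * against 32 signed Q8_0 quants.  Build with: -mavx2                 */
#include <immintrin.h>
#include <stdint.h>

static inline int dot_1bit_q8_avx2(uint32_t wbits, const int8_t *q8) {
    /* Broadcast the 4 packed-bit bytes, then shuffle so byte j holds
     * the source byte containing weight bit j.                       */
    const __m256i packed = _mm256_set1_epi32((int)wbits);
    const __m256i shuf   = _mm256_setr_epi8(
        0,0,0,0,0,0,0,0, 1,1,1,1,1,1,1,1,   /* lane 0: bits  0..15 */
        2,2,2,2,2,2,2,2, 3,3,3,3,3,3,3,3);  /* lane 1: bits 16..31 */
    const __m256i bytes  = _mm256_shuffle_epi8(packed, shuf);

    /* Isolate bit (j % 8) in each byte; compare to get 0xFF / 0x00. */
    const __m256i bitsel = _mm256_setr_epi8(
        1,2,4,8,16,32,64,-128, 1,2,4,8,16,32,64,-128,
        1,2,4,8,16,32,64,-128, 1,2,4,8,16,32,64,-128);
    const __m256i hit = _mm256_cmpeq_epi8(
        _mm256_and_si256(bytes, bitsel), bitsel);

    /* Map {0xFF, 0x00} -> {+1, -1} byte signs. */
    const __m256i ones  = _mm256_set1_epi8(1);
    const __m256i signs = _mm256_sub_epi8(
        _mm256_and_si256(hit, _mm256_set1_epi8(2)), ones);

    /* Apply the signs to the Q8 quants, then reduce bytes to int32. */
    const __m256i y   = _mm256_loadu_si256((const __m256i *)q8);
    const __m256i sy  = _mm256_sign_epi8(y, signs);
    const __m256i s16 = _mm256_maddubs_epi16(ones, sy);
    const __m256i s32 = _mm256_madd_epi16(s16, _mm256_set1_epi16(1));

    /* Horizontal add of the eight int32 partial sums. */
    __m128i s = _mm_add_epi32(_mm256_castsi256_si128(s32),
                              _mm256_extracti128_si256(s32, 1));
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, 0x4E));
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, 0xB1));
    return _mm_cvtsi128_si32(s);
}
```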

Add AVX-512 hybrid kernel using _mm256_mask_blend_epi8 (AVX-512BW+VL)
to expand weight bits in a single instruction, with _mm256_dpbusd_epi32
(AVX-512VNNI) for the dot product (~5.7x speedup on 1.7B model).
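
A sketch of the mask-blend expansion plus VNNI accumulation (again illustrative; the function name and data layout are assumptions):

```c
/* Sketch of the AVX-512BW/VL + VNNI variant, not the merged kernel.
 * The mask blend expands all 32 weight bits to +/-1 bytes in one
 * instruction, and VNNI accumulates bytes straight into int32 lanes.
 * Build with: -mavx512bw -mavx512vl -mavx512vnni                   */
#include <immintrin.h>
#include <stdint.h>

static inline __m256i dot_1bit_q8_avx512(__m256i acc, uint32_t wbits,
                                         const int8_t *q8) {
    /* Bit j of wbits selects +1 (set) or -1 (clear) for byte j. */
    const __m256i signs = _mm256_mask_blend_epi8(
        (__mmask32)wbits, _mm256_set1_epi8(-1), _mm256_set1_epi8(1));

    /* Apply the signs to the Q8 quants...                        */
    const __m256i y  = _mm256_loadu_si256((const __m256i *)q8);
    const __m256i sy = _mm256_sign_epi8(y, signs);

    /* ...and dot against an all-ones unsigned vector: each int32
     * lane accumulates four signed bytes (u8 x s8 -> s32).       */
    return _mm256_dpbusd_epi32(acc, _mm256_set1_epi8(1), sy);
}
```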

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Document how to build with GGML_HIP=ON targeting gfx1151 (Radeon 8060S /
Ryzen AI MAX+), including the Docker-based workflow for machines whose
system ROCm installation is incomplete.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions github-actions bot added the ggml label Apr 2, 2026
@khosravipasha
Collaborator

This looks great, thanks. There were a few CPU kernel fixes and I did not see them until I pushed my changes. For now I removed the buggy x86 kernel; I will merge one of the correct AVX ones.

Could you run the KL divergence tests described here: #8

The AMD one looks great. I did not realize you could build the CUDA version for AMD. How are the speeds?

@philtomson
Author

philtomson commented Apr 3, 2026

I'm getting about 55 tok/sec with the 8B model on my Framework Desktop (Strix Halo). That's about half of what I've seen someone report with a 4090, which seems about right given the slower memory bandwidth, since this whole model can easily fit in a 4090's VRAM.

@philtomson
Author

philtomson commented Apr 3, 2026

As for the KL divergence numbers:

KL Divergence Results: Bonsai-1.7B on wikitext-2-raw (100 chunks, ctx=512)

Setup: F16 reference logits compared against Q1_0_g128 and Q1_0 quantizations.

Perplexity

| Model           | PPL    | PPL ratio vs F16 |
|-----------------|--------|------------------|
| F16 (reference) | 24.203 | 1.000×           |
| Q1_0_g128       | 24.262 | 1.0024×          |
| Q1_0            | 24.220 | 1.0007×          |

KL Divergence vs F16

| Metric              | Q1_0_g128 | Q1_0     |
|---------------------|-----------|----------|
| Mean KLD            | 0.000645  | 0.000844 |
| Median KLD          | 0.000500  | 0.000680 |
| Maximum KLD         | 0.013948  | 0.020355 |
| 99th percentile KLD | 0.003021  | 0.003804 |
| Mean Δp             | 0.002%    | -0.049%  |
| Max Δp              | 7.80%     | 6.62%    |

These KLD numbers seem a bit higher than your NEON numbers.
