
Fixes for CPU backend + instructions for targeting AMD GPUs #5

Open
philtomson wants to merge 2 commits into PrismML-Eng:prism from philtomson:prism

Conversation

@philtomson

This PR fixes a CPU kernel bug where a float was truncated to an int (0.4 -> 0), which broke the CPU backend for the Bonsai 1-bit models. Support was added for AVX2 and AVX-512. It also adds instructions to README.md for compiling for ROCm on AMD GPUs.

philtomson and others added 2 commits April 2, 2026 08:08
Fix float-truncation bug in ggml_vec_dot_q1_0_g128_q8_0: the Q8_0
scale factor d1 was accumulated into an int, silently truncating it
to zero for most blocks and producing garbage output on CPU.
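
A minimal, self-contained illustration of the failure mode (the variable names and surrounding code are hypothetical simplifications, not the actual kernel):

```c
/* Hypothetical simplification of the described bug, not the real
 * ggml code: converting the float Q8_0 scale through an int
 * truncates it toward zero, so most blocks contribute nothing. */
#include <stdio.h>

int main(void) {
    float d1   = 0.4f; /* Q8_0 block scale factor                */
    int   sumi = 100;  /* integer dot product over one sub-block */

    int bad = 0;
    bad += (int)d1 * sumi;    /* BUG: (int)0.4f == 0          */

    float good = 0.0f;
    good += d1 * (float)sumi; /* fix: keep the scale in float */

    printf("bad = %d, good = %.1f\n", bad, good); /* bad = 0, good = 40.0 */
    return 0;
}
```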

Add AVX2 kernel using the same shuffle/bit-mask expansion as q1_0,
processing one 32-element Q8_0 sub-block per iteration (~4x speedup).
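
A sketch of the shuffle/bit-mask expansion technique for one 32-element sub-block (illustrative only; the function name, weight layout, and reduction are assumptions, not the merged kernel):

```c
/* Sketch of the technique, not the actual PR code. Expands 32 one-bit
 * weights (packed in a uint32_t) into per-byte +/-1 signs and dots them
 * against 32 signed Q8_0 quants.  Build with: -mavx2                 */
#include <immintrin.h>
#include <stdint.h>

static inline int dot_1bit_q8_avx2(uint32_t wbits, const int8_t *q8) {
    /* Broadcast the 4 packed-bit bytes, then shuffle so byte j holds
     * the source byte containing weight bit j.                       */
    const __m256i packed = _mm256_set1_epi32((int)wbits);
    const __m256i shuf   = _mm256_setr_epi8(
        0,0,0,0,0,0,0,0, 1,1,1,1,1,1,1,1,   /* lane 0: bits  0..15 */
        2,2,2,2,2,2,2,2, 3,3,3,3,3,3,3,3);  /* lane 1: bits 16..31 */
    const __m256i bytes  = _mm256_shuffle_epi8(packed, shuf);

    /* Isolate bit (j % 8) in each byte; compare to get 0xFF / 0x00. */
    const __m256i bitsel = _mm256_setr_epi8(
        1,2,4,8,16,32,64,-128, 1,2,4,8,16,32,64,-128,
        1,2,4,8,16,32,64,-128, 1,2,4,8,16,32,64,-128);
    const __m256i hit = _mm256_cmpeq_epi8(
        _mm256_and_si256(bytes, bitsel), bitsel);

    /* Map {0xFF, 0x00} -> {+1, -1} byte signs. */
    const __m256i ones  = _mm256_set1_epi8(1);
    const __m256i signs = _mm256_sub_epi8(
        _mm256_and_si256(hit, _mm256_set1_epi8(2)), ones);

    /* Apply the signs to the Q8 quants, then reduce bytes to int32. */
    const __m256i y   = _mm256_loadu_si256((const __m256i *)q8);
    const __m256i sy  = _mm256_sign_epi8(y, signs);
    const __m256i s16 = _mm256_maddubs_epi16(ones, sy);
    const __m256i s32 = _mm256_madd_epi16(s16, _mm256_set1_epi16(1));

    /* Horizontal add of the eight int32 partial sums. */
    __m128i s = _mm_add_epi32(_mm256_castsi256_si128(s32),
                              _mm256_extracti128_si256(s32, 1));
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, 0x4E));
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, 0xB1));
    return _mm_cvtsi128_si32(s);
}
```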

Add AVX-512 hybrid kernel using _mm256_mask_blend_epi8 (AVX-512BW+VL)
to expand weight bits in a single instruction, with _mm256_dpbusd_epi32
(AVX-512VNNI) for the dot product (~5.7x speedup on 1.7B model).
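
A sketch of the mask-blend expansion plus VNNI accumulation (again illustrative; the function name and data layout are assumptions):

```c
/* Sketch of the AVX-512BW/VL + VNNI variant, not the merged kernel.
 * The mask blend expands all 32 weight bits to +/-1 bytes in one
 * instruction, and VNNI accumulates bytes straight into int32 lanes.
 * Build with: -mavx512bw -mavx512vl -mavx512vnni                   */
#include <immintrin.h>
#include <stdint.h>

static inline __m256i dot_1bit_q8_avx512(__m256i acc, uint32_t wbits,
                                         const int8_t *q8) {
    /* Bit j of wbits selects +1 (set) or -1 (clear) for byte j. */
    const __m256i signs = _mm256_mask_blend_epi8(
        (__mmask32)wbits, _mm256_set1_epi8(-1), _mm256_set1_epi8(1));

    /* Apply the signs to the Q8 quants...                        */
    const __m256i y  = _mm256_loadu_si256((const __m256i *)q8);
    const __m256i sy = _mm256_sign_epi8(y, signs);

    /* ...and dot against an all-ones unsigned vector: each int32
     * lane accumulates four signed bytes (u8 x s8 -> s32).       */
    return _mm256_dpbusd_epi32(acc, _mm256_set1_epi8(1), sy);
}
```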

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Document how to build with GGML_HIP=ON targeting gfx1151 (Radeon 8060S /
Ryzen AI MAX+), including the Docker-based workflow for machines whose
system ROCm installation is incomplete.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions github-actions bot added the ggml label Apr 2, 2026
@khosravipasha
Collaborator

This looks great, thanks. There were a few CPU kernel fixes and I did not see them until I pushed my changes. For now I removed the buggy x86 kernel; I will merge one of the correct AVX ones.

Could you run the KL divergence tests described here: #8

The AMD one looks great. I did not realize you could build the CUDA version for AMD. How are the speeds?

@philtomson
Author

philtomson commented Apr 3, 2026

I'm getting about 55 tok/sec with the 8B model on my Framework Desktop (Strix Halo). That's about half of what I've seen someone report with a 4090, which seems about right given the slower memory bandwidth, since this whole model can easily fit in a 4090's VRAM.

@philtomson
Author

philtomson commented Apr 3, 2026

As for the KL divergence numbers:

KL Divergence Results: Bonsai-1.7B on wikitext-2-raw (100 chunks, ctx=512)

Setup: F16 reference logits compared against Q1_0_g128 and Q1_0 quantizations.

Perplexity

| Model           | PPL    | PPL ratio vs F16 |
|-----------------|--------|------------------|
| F16 (reference) | 24.203 | 1.000×           |
| Q1_0_g128       | 24.262 | 1.0024×          |
| Q1_0            | 24.220 | 1.0007×          |

KL Divergence vs F16

| Metric              | Q1_0_g128 | Q1_0     |
|---------------------|-----------|----------|
| Mean KLD            | 0.000645  | 0.000844 |
| Median KLD          | 0.000500  | 0.000680 |
| Maximum KLD         | 0.013948  | 0.020355 |
| 99th percentile KLD | 0.003021  | 0.003804 |
| Mean Δp             | 0.002%    | -0.049%  |
| Max Δp              | 7.80%     | 6.62%    |

These KLD numbers seem a bit higher than your NEON numbers.
