fix: Q1_0_g128 CPU dot product int truncation#4

Open
Marxist-Leninist wants to merge 5 commits into PrismML-Eng:prism from Marxist-Leninist:fix/q1_0_g128-cpu-kernel-int-truncation

Conversation

@Marxist-Leninist

Summary

  • The accumulator `sumi` in `ggml_vec_dot_q1_0_g128_q8_0` was declared as `int`, but it accumulates the `float` product `d1 * sumi_block`, so the fractional part of the result was truncated to an integer on every loop iteration
  • This produced garbage output for Q1_0_g128 quantized models when running on CPU (x86 and generic/portable paths)
  • Fix: change int sumi = 0 to float sumi = 0 in both kernels
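The effect of the accumulator type can be shown with a minimal sketch. The helper names and values below are illustrative, not the actual ggml kernel code; only the `int` vs `float` declaration of `sumi` mirrors the real fix:

```c
#include <assert.h>

/* Buggy shape: each addition of d1 * sumi_block (a float) into an
 * int accumulator truncates the fractional part. */
int dot_int_acc(float d1, const int *sumi_block, int nblocks) {
    int sumi = 0;                      /* the bug */
    for (int i = 0; i < nblocks; i++)
        sumi += d1 * sumi_block[i];    /* float result truncated each time */
    return sumi;
}

/* Fixed shape: a float accumulator keeps the fractional parts. */
float dot_float_acc(float d1, const int *sumi_block, int nblocks) {
    float sumi = 0.0f;                 /* the fix */
    for (int i = 0; i < nblocks; i++)
        sumi += d1 * sumi_block[i];
    return sumi;
}
```

With `d1 = 0.25` and per-block sums `{3, 5, -2, 7}`, the buggy path returns 1 while the correct result is 3.25, which is how the truncation turns model output into garbage.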

Files changed

  • ggml/src/ggml-cpu/arch/x86/quants.c — x86 kernel
  • ggml/src/ggml-cpu/quants.c — generic/portable kernel

Test

Verified locally on Windows (MinGW, CPU-only) with Bonsai-8B (Q1_0_g128):

  • Before fix: garbled/random token output
  • After fix: coherent, correct responses

The accumulator `sumi` in ggml_vec_dot_q1_0_g128_q8_0 was declared
as `int` but accumulates `float d1 * int sumi_block`, causing the
float result to be truncated to integer on each iteration. This
produced garbage output for Q1_0_g128 models on CPU.

Fix: change `int sumi = 0` to `float sumi = 0` in both the x86
and generic (portable) kernels.
@github-actions github-actions bot added the ggml label Apr 2, 2026
noreply added 4 commits April 2, 2026 16:30
The existing scalar fallback runs at ~0.2 t/s on CPUs without AVX-512
(Ryzen, Intel 12th+ gen consumer). This adds an AVX2 path that:
- Sign-extends int8->int16 in two 16-element passes per Q8_0 block
- Expands 1-bit weights to 16-bit masks via broadcast+AND+cmpeq
- Uses blendv to negate activations where weight bit=0
- Accumulates via madd_epi16 -> cvtepi32_ps -> fmadd_ps

AVX2 is supported on virtually all x86-64 CPUs from 2013+.
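A scalar model of the mask-expansion and sign-select steps above may help. The `dot_1bit` helper is hypothetical and processes one lane at a time; the real kernel handles 16 lanes per instruction with broadcast+AND+cmpeq and blendv:

```c
#include <stdint.h>

/* Hypothetical scalar model of the AVX2 v1 idea: expand each 1-bit
 * weight to an all-ones/all-zeros mask (what cmpeq produces), then
 * select +a or -a per lane (what blendv does). */
int16_t dot_1bit(uint8_t wbits, const int8_t *a, int n) {
    int16_t acc = 0;
    for (int i = 0; i < n; i++) {
        /* cmpeq analogue: 0xFFFF where weight bit is set, 0 otherwise */
        uint16_t mask = ((wbits >> i) & 1) ? 0xFFFF : 0x0000;
        int16_t v = (int16_t)a[i];
        /* blendv analogue: keep v where bit=1, negate where bit=0 */
        acc += mask ? v : (int16_t)-v;
    }
    return acc;
}
```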
…sing

Replace two-pass int16 blendv approach with:
- Single-pass byte-level bit expansion (shuffle+AND+cmpeq)
- XOR+SUB negate trick (replaces slow blendv, 2-3 cyc -> 1 cyc each)
- maddubs+madd accumulation (stays in int8 longer)
- Fully unrolled k-loop (eliminates loop overhead + branch)

Benchmark on i7-10510U (AVX2+FMA, 4T):
  Scalar:  0.2 t/s prompt, 0.2 t/s gen
  AVX2 v1: 2.4 t/s prompt, 2.1 t/s gen (two-pass blendv)
  AVX2 v3: 4.7 t/s prompt, 3.1 t/s gen (this commit)

~15x faster than scalar, ~50% faster than v1.
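The XOR+SUB negate trick replacing blendv can be sketched in scalar form (an illustrative helper, not the kernel itself): with a mask of all-ones or all-zeros, `(x ^ mask) - mask` yields `-x` or `x` in two's complement, using two one-cycle ops per lane instead of a blend:

```c
#include <stdint.h>

/* Conditional negate without a blend: mask is 0x00 (keep x) or
 * 0xFF (negate x). XOR with all-ones gives ~x, and subtracting
 * -1 adds 1, so (~x + 1) == -x in two's complement. */
int8_t cond_negate(int8_t x, int8_t mask) {
    return (int8_t)((x ^ mask) - mask);
}
```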
Apply cache-blocking and prefetch optimizations from the COM6 matrix
multiplication library (github.com/Marxist-Leninist/COM6):

- Increase weight row block size from 16 to 64 for Q1_0_g128
  (1-bit rows are ~576 bytes at K=4096, 64 rows = 36KB fits in L1d)
- Add software prefetch of weight rows 4 iterations ahead,
  mirroring COM6 distributed prefetch strategy
- Enlarge tmp accumulator buffer to match larger block size

Benchmark on i7-10510U (4T, Bonsai-8B Q1_0_g128):
  Before: 3.14 t/s generation
  After:  3.43 t/s generation (+9%)
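The distributed-prefetch pattern can be sketched as follows. This is a minimal sketch assuming GCC/Clang's `__builtin_prefetch`; the real kernel prefetches quantized weight rows, and the row layout and `PF_DIST` value here are only illustrative:

```c
#define PF_DIST 4  /* prefetch this many rows ahead (illustrative) */

/* While processing row i, issue a prefetch for row i + PF_DIST so
 * its cache line arrives before the loop reaches it. The third
 * argument (locality hint 1) suggests low temporal reuse. */
float sum_rows(const float *rows, int nrow, int ncol) {
    float total = 0.0f;
    for (int i = 0; i < nrow; i++) {
        if (i + PF_DIST < nrow)
            __builtin_prefetch(rows + (i + PF_DIST) * ncol, 0, 1);
        for (int j = 0; j < ncol; j++)
            total += rows[i * ncol + j];
    }
    return total;
}
```

The prefetch is a hint only; correctness is unchanged if the compiler or CPU ignores it, which is why the pattern is safe to add on top of the cache-blocking change.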
@khosravipasha
Collaborator

This looks great, thanks. There were a few CPU kernel fixes and I did not see them until I pushed my changes. For now I have removed the buggy x86 kernel and will merge one of the correct AVX ones.

Could you run the KL divergence tests described here: #8

