fix: Q1_0_g128 CPU dot product int truncation#4

Open
Marxist-Leninist wants to merge 5 commits into PrismML-Eng:prism from Marxist-Leninist:fix/q1_0_g128-cpu-kernel-int-truncation

Conversation

@Marxist-Leninist

Summary

  • The accumulator `sumi` in `ggml_vec_dot_q1_0_g128_q8_0` was declared as `int`, but it accumulates the `float` product `d1 * sumi_block`, so the fractional part of the result was truncated to an integer on every loop iteration
  • This produced garbage output for Q1_0_g128 quantized models when running on CPU (x86 and generic/portable paths)
  • Fix: change int sumi = 0 to float sumi = 0 in both kernels
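The effect of the accumulator type can be shown with a minimal sketch. The helper names and values below are illustrative, not the actual ggml kernel code; only the `int` vs `float` declaration of `sumi` mirrors the real fix:

```c
#include <assert.h>

/* Buggy shape: each addition of d1 * sumi_block (a float) into an
 * int accumulator truncates the fractional part. */
int dot_int_acc(float d1, const int *sumi_block, int nblocks) {
    int sumi = 0;                      /* the bug */
    for (int i = 0; i < nblocks; i++)
        sumi += d1 * sumi_block[i];    /* float result truncated each time */
    return sumi;
}

/* Fixed shape: a float accumulator keeps the fractional parts. */
float dot_float_acc(float d1, const int *sumi_block, int nblocks) {
    float sumi = 0.0f;                 /* the fix */
    for (int i = 0; i < nblocks; i++)
        sumi += d1 * sumi_block[i];
    return sumi;
}
```

With `d1 = 0.25` and per-block sums `{3, 5, -2, 7}`, the buggy path returns 1 while the correct result is 3.25, which is how the truncation turns model output into garbage.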

Files changed

  • ggml/src/ggml-cpu/arch/x86/quants.c — x86 kernel
  • ggml/src/ggml-cpu/quants.c — generic/portable kernel

Test

Verified locally on Windows (MinGW, CPU-only) with Bonsai-8B (Q1_0_g128):

  • Before fix: garbled/random token output
  • After fix: coherent, correct responses

The accumulator `sumi` in ggml_vec_dot_q1_0_g128_q8_0 was declared
as `int` but accumulates `float d1 * int sumi_block`, causing the
float result to be truncated to integer on each iteration. This
produced garbage output for Q1_0_g128 models on CPU.

Fix: change `int sumi = 0` to `float sumi = 0` in both the x86
and generic (portable) kernels.
@github-actions github-actions bot added the ggml label Apr 2, 2026
noreply added 4 commits April 2, 2026 16:30
The existing scalar fallback runs at ~0.2 t/s on CPUs without AVX-512
(Ryzen, Intel 12th+ gen consumer). This adds an AVX2 path that:
- Sign-extends int8->int16 in two 16-element passes per Q8_0 block
- Expands 1-bit weights to 16-bit masks via broadcast+AND+cmpeq
- Uses blendv to negate activations where weight bit=0
- Accumulates via madd_epi16 -> cvtepi32_ps -> fmadd_ps

AVX2 is supported on virtually all x86-64 CPUs from 2013+.
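A scalar model of the mask-expansion and sign-select steps above may help. The `dot_1bit` helper is hypothetical and processes one lane at a time; the real kernel handles 16 lanes per instruction with broadcast+AND+cmpeq and blendv:

```c
#include <stdint.h>

/* Hypothetical scalar model of the AVX2 v1 idea: expand each 1-bit
 * weight to an all-ones/all-zeros mask (what cmpeq produces), then
 * select +a or -a per lane (what blendv does). */
int16_t dot_1bit(uint8_t wbits, const int8_t *a, int n) {
    int16_t acc = 0;
    for (int i = 0; i < n; i++) {
        /* cmpeq analogue: 0xFFFF where weight bit is set, 0 otherwise */
        uint16_t mask = ((wbits >> i) & 1) ? 0xFFFF : 0x0000;
        int16_t v = (int16_t)a[i];
        /* blendv analogue: keep v where bit=1, negate where bit=0 */
        acc += mask ? v : (int16_t)-v;
    }
    return acc;
}
```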
…sing

Replace two-pass int16 blendv approach with:
- Single-pass byte-level bit expansion (shuffle+AND+cmpeq)
- XOR+SUB negate trick (replaces slow blendv, 2-3 cyc -> 1 cyc each)
- maddubs+madd accumulation (stays in int8 longer)
- Fully unrolled k-loop (eliminates loop overhead + branch)

Benchmark on i7-10510U (AVX2+FMA, 4T):
  Scalar:  0.2 t/s prompt, 0.2 t/s gen
  AVX2 v1: 2.4 t/s prompt, 2.1 t/s gen (two-pass blendv)
  AVX2 v3: 4.7 t/s prompt, 3.1 t/s gen (this commit)

~15x faster than scalar, ~50% faster than v1.
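The XOR+SUB negate trick replacing blendv can be sketched in scalar form (an illustrative helper, not the kernel itself): with a mask of all-ones or all-zeros, `(x ^ mask) - mask` yields `-x` or `x` in two's complement, using two one-cycle ops per lane instead of a blend:

```c
#include <stdint.h>

/* Conditional negate without a blend: mask is 0x00 (keep x) or
 * 0xFF (negate x). XOR with all-ones gives ~x, and subtracting
 * -1 adds 1, so (~x + 1) == -x in two's complement. */
int8_t cond_negate(int8_t x, int8_t mask) {
    return (int8_t)((x ^ mask) - mask);
}
```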
Apply cache-blocking and prefetch optimizations from the COM6 matrix
multiplication library (github.com/Marxist-Leninist/COM6):

- Increase weight row block size from 16 to 64 for Q1_0_g128
  (1-bit rows are ~576 bytes at K=4096, 64 rows = 36KB fits in L1d)
- Add software prefetch of weight rows 4 iterations ahead,
  mirroring COM6 distributed prefetch strategy
- Enlarge tmp accumulator buffer to match larger block size

Benchmark on i7-10510U (4T, Bonsai-8B Q1_0_g128):
  Before: 3.14 t/s generation
  After:  3.43 t/s generation (+9%)
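The distributed-prefetch pattern can be sketched as follows. This is a minimal sketch assuming GCC/Clang's `__builtin_prefetch`; the real kernel prefetches quantized weight rows, and the row layout and `PF_DIST` value here are only illustrative:

```c
#define PF_DIST 4  /* prefetch this many rows ahead (illustrative) */

/* While processing row i, issue a prefetch for row i + PF_DIST so
 * its cache line arrives before the loop reaches it. The third
 * argument (locality hint 1) suggests low temporal reuse. */
float sum_rows(const float *rows, int nrow, int ncol) {
    float total = 0.0f;
    for (int i = 0; i < nrow; i++) {
        if (i + PF_DIST < nrow)
            __builtin_prefetch(rows + (i + PF_DIST) * ncol, 0, 1);
        for (int j = 0; j < ncol; j++)
            total += rows[i * ncol + j];
    }
    return total;
}
```

The prefetch is a hint only; correctness is unchanged if the compiler or CPU ignores it, which is why the pattern is safe to add on top of the cache-blocking change.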
@khosravipasha
Collaborator

This looks great, thanks. There were a few CPU kernel fixes and I did not see them until I pushed my changes. For now I have removed the buggy x86 kernel and will merge one of the correct AVX ones.

Could you run the KL divergence tests described here: #8

