fix: Q1_0_g128 CPU dot product int truncation #4
Open
Marxist-Leninist wants to merge 5 commits into PrismML-Eng:prism from
Conversation
The accumulator `sumi` in ggml_vec_dot_q1_0_g128_q8_0 was declared as `int` but accumulates `float d1 * int sumi_block`, causing the float result to be truncated to integer on each iteration. This produced garbage output for Q1_0_g128 models on CPU. Fix: change `int sumi = 0` to `float sumi = 0` in both the x86 and generic (portable) kernels.
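A minimal scalar sketch of the bug (names follow the PR text, values are illustrative): accumulating the float product `d1 * sumi_block` into an `int` discards the fractional part on every loop pass.

```c
#include <stdio.h>

// Buggy version: the float product d1 * sumi_block is truncated to int
// on every iteration of the accumulation loop.
static float dot_int_accum(float d1, const int *sumi_block, int n) {
    int sumi = 0;                      // BUG: int accumulator
    for (int b = 0; b < n; b++)
        sumi += d1 * sumi_block[b];    // fraction discarded each pass
    return (float)sumi;
}

// Fixed version from this PR: a float accumulator keeps the fraction.
static float dot_float_accum(float d1, const int *sumi_block, int n) {
    float sumi = 0.0f;
    for (int b = 0; b < n; b++)
        sumi += d1 * sumi_block[b];
    return sumi;
}
```

With a small scale like `d1 = 0.01` and block sums `{37, -12, 90, 5}`, the buggy version returns 0 while the fixed one returns 1.20: every sub-unit contribution is lost, which is why the output was garbage rather than merely imprecise.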
The existing scalar fallback runs at ~0.2 t/s on CPUs without AVX-512 (Ryzen, Intel 12th+ gen consumer). This adds an AVX2 path that:
- Sign-extends int8 -> int16 in two 16-element passes per Q8_0 block
- Expands 1-bit weights to 16-bit masks via broadcast+AND+cmpeq
- Uses blendv to negate activations where weight bit = 0
- Accumulates via madd_epi16 -> cvtepi32_ps -> fmadd_ps

AVX2 is supported on virtually all x86-64 CPUs from 2013 onward.
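A sketch of the v1 idea for one 16-element half (illustrative, not the PR's exact code; the function name is hypothetical): 16 weight bits become 16-bit lane masks via broadcast+AND+cmpeq, and `blendv` selects between `+a` and `-a`.

```c
#include <immintrin.h>
#include <stdint.h>

// Two-pass int16 approach, one half shown. Requires AVX2 at runtime.
__attribute__((target("avx2")))
int dot16_1bit_blendv(uint16_t wbits, const int16_t a[16]) {
    // Per-lane bit selector: lane i tests bit i of the broadcast pattern.
    const __m256i bitsel = _mm256_setr_epi16(
        1, 2, 4, 8, 16, 32, 64, 128,
        256, 512, 1024, 2048, 4096, 8192, 16384, (short)0x8000);
    __m256i w    = _mm256_set1_epi16((short)wbits);        // broadcast bits
    __m256i mask = _mm256_cmpeq_epi16(_mm256_and_si256(w, bitsel), bitsel);
    __m256i x    = _mm256_loadu_si256((const __m256i *)a); // int16 activations
    __m256i negx = _mm256_sub_epi16(_mm256_setzero_si256(), x);
    // cmpeq_epi16 yields 0x0000/0xFFFF per lane, so both bytes of a lane
    // agree and the byte-wise blendv is safe: bit=1 -> x, bit=0 -> -x.
    __m256i sel  = _mm256_blendv_epi8(negx, x, mask);
    // Horizontal reduce: madd with ones -> 8 x int32, then fold to scalar.
    __m256i s32  = _mm256_madd_epi16(sel, _mm256_set1_epi16(1));
    __m128i s    = _mm_add_epi32(_mm256_castsi256_si128(s32),
                                 _mm256_extracti128_si256(s32, 1));
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(1, 0, 3, 2)));
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(2, 3, 0, 1)));
    return _mm_cvtsi128_si32(s);
}
```

The `blendv` here is the part the later commit replaces: it costs several uops per call, and it forces the data up to int16 before any accumulation can happen.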
…sing

Replace two-pass int16 blendv approach with:
- Single-pass byte-level bit expansion (shuffle+AND+cmpeq)
- XOR+SUB negate trick (replaces slow blendv, 2-3 cyc -> 1 cyc each)
- maddubs+madd accumulation (stays in int8 longer)
- Fully unrolled k-loop (eliminates loop overhead + branch)

Benchmark on i7-10510U (AVX2+FMA, 4T):
- Scalar: 0.2 t/s prompt, 0.2 t/s gen
- AVX2 v1: 2.4 t/s prompt, 2.1 t/s gen (two-pass blendv)
- AVX2 v3: 4.7 t/s prompt, 3.1 t/s gen (this commit)

~15x faster than scalar, ~50% faster than v1.
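The XOR+SUB negate trick rests on a two's-complement identity. A scalar model of what the vector code does per byte (the AVX2 version would apply the same identity with `_mm256_xor_si256` and `_mm256_sub_epi8`; the function name here is illustrative):

```c
#include <stdint.h>

// Conditional negate for a mask byte m that is 0x00 or 0xFF:
//   m == 0x00: (x ^ 0)  - 0    == x
//   m == 0xFF: (x ^ -1) - (-1) == ~x + 1 == -x   (two's complement)
// One XOR plus one SUB, both single-cycle ops, versus the multi-uop blendv.
static inline int8_t cond_negate(int8_t x, int8_t m) {
    return (int8_t)((x ^ m) - m);
}
```

Because the mask is already the 0x00/0xFF pattern that cmpeq produces, the sign can be applied in the int8 domain, which is what lets the kernel stay in int8 long enough to use `maddubs`.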
Apply cache-blocking and prefetch optimizations from the COM6 matrix multiplication library (github.com/Marxist-Leninist/COM6):
- Increase weight row block size from 16 to 64 for Q1_0_g128 (1-bit rows are ~576 bytes at K=4096; 64 rows = 36KB fits in L1d)
- Add software prefetch of weight rows 4 iterations ahead, mirroring the COM6 distributed prefetch strategy
- Enlarge tmp accumulator buffer to match the larger block size

Benchmark on i7-10510U (4T, Bonsai-8B Q1_0_g128):
- Before: 3.14 t/s generation
- After: 3.43 t/s generation (+9%)
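A hypothetical sketch of the blocking + prefetch pattern described above (names and the per-row work are illustrative stand-ins, not the PR's code): walk weight rows in blocks of 64, and while processing row `i`, prefetch the row `PREFETCH_AHEAD` iterations ahead so it is already in L1d when its turn comes.

```c
#include <immintrin.h>   // _mm_prefetch
#include <stddef.h>
#include <stdint.h>

enum { ROW_BLOCK = 64, PREFETCH_AHEAD = 4 };

// Stand-in per-row work: a byte sum instead of the real quantized dot.
void sum_rows_blocked(const uint8_t *w, size_t row_bytes, int nrows,
                      uint32_t *out) {
    for (int i0 = 0; i0 < nrows; i0 += ROW_BLOCK) {
        int iend = (i0 + ROW_BLOCK < nrows) ? i0 + ROW_BLOCK : nrows;
        for (int i = i0; i < iend; i++) {
            // Issue the prefetch a fixed distance ahead; by the time row
            // i+4 is processed, its bytes should already be in L1d.
            if (i + PREFETCH_AHEAD < nrows)
                _mm_prefetch(
                    (const char *)&w[(size_t)(i + PREFETCH_AHEAD) * row_bytes],
                    _MM_HINT_T0);
            uint32_t s = 0;
            for (size_t k = 0; k < row_bytes; k++)
                s += w[(size_t)i * row_bytes + k];
            out[i] = s;
        }
    }
}
```

The block size of 64 matters because the whole block of 1-bit rows then fits in L1d, so each activation column streamed against the block hits warm cache lines.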
Collaborator
This looks great, thanks. There were a few CPU kernel fixes I did not see until I pushed my changes. For now I removed the buggy x86 kernel and will merge one of the correct AVX ones. Could you run the KL divergence tests described here: #8
Summary
`sumi` in `ggml_vec_dot_q1_0_g128_q8_0` was declared as `int` but accumulates `float d1 * int sumi_block`, causing the float result to be truncated to integer on each loop iteration. Fix: change `int sumi = 0` to `float sumi = 0` in both kernels.

Files changed

- `ggml/src/ggml-cpu/arch/x86/quants.c` — x86 kernel
- `ggml/src/ggml-cpu/quants.c` — generic/portable kernel

Test
Verified locally on Windows (MinGW, CPU-only) with Bonsai-8B (Q1_0_g128):