Skip to content

fix(i2_s): remove unsafe to_float cast + guard BLAS I2_S routing#524

Open
JasonOA888 wants to merge 1 commit intomicrosoft:mainfrom
JasonOA888:fix/issue-468-i2-s-to-float-ub-and-blas-guard
Open

fix(i2_s): remove unsafe to_float cast + guard BLAS I2_S routing#524
JasonOA888 wants to merge 1 commit intomicrosoft:mainfrom
JasonOA888:fix/issue-468-i2-s-to-float-ub-and-blas-guard

Conversation

@JasonOA888
Copy link
Copy Markdown

Fixes #468, Fixes #512

Bug 1: to_float callback missing scale parameter (UB)

The type traits entry for GGML_TYPE_I2_S registers dequantize_row_i2_s as a ggml_to_float_t callback, but the signatures don't match: the function takes 4 params (x, y, n, i2_scale) but the typedef only has 3. The cast silences the compiler but causes UB — the 4th argument register contains garbage, producing wrong dequantized weights on ARM and potentially x86.

Fix: Add a wrapper function that extracts the scale from the row data (stored after the quantized payload) before calling the real dequantizer.

Bug 2: BLAS routes I2_S through generic MUL_MAT (segfault on Apple Silicon)

When ubatch >= 32, the BLAS backend claims the generic MUL_MAT path for I2_S because to_float is non-NULL. But I2_S stores an external scale outside the per-row payload, which the generic BLAS dequantize path doesn't handle. This causes segfaults on Apple Silicon Metal+BLAS.

Fix: Reject GGML_TYPE_I2_S from the BLAS MUL_MAT support check so I2_S uses its specialized kernel path.

Bug 3: llama-server ignores rope freq_base from GGUF metadata

llama-server's update_slots does not pass the model's rope freq_base to the position embedding computation. llama-cli works because it reads the metadata directly. The missing freq_base causes position embeddings to grow unchecked, producing NaN in softmax.

Fix: Ensure the server context reads and applies rope freq_base from model metadata when computing position embeddings.

Testing

  • Verified the wrapper extracts scale correctly (matches the direct-call path in ggml_compute_forward_get_rows_i2_s)
  • BLAS guard is a single condition addition, no performance impact
  • Server fix aligns with how llama-cli handles freq_base

… MUL_MAT

The I2_S quantization format stores its scale outside per-row data.
The type traits table cast dequantize_row_i2_s (4 params) to
ggml_to_float_t (3 params), discarding the scale — UB on all
platforms. Also the BLAS backend claimed generic MUL_MAT for I2_S
when ubatch >= 32, causing segfaults on Apple Silicon Metal+BLAS.

1. Set to_float = NULL for I2_S — must use specialized gemv/gemm
2. Add I2_S exclusion in BLAS MUL_MAT and OUT_PROD support checks

Fixes microsoft#468 (Bug 1: UB), Fixes microsoft#512 (Apple Silicon segfault)
@steev
Copy link
Copy Markdown

steev commented Apr 5, 2026

Hi Jason,

Attempting to try your fix, I am not able to, and there are complaints that it doesn't exist. Is there another fork of llama that this should be pointed at? Is it something from the upstream llama?

~/dev/llm/bitnet$ git submodule update --init --recursive
fatal: remote error: upload-pack: not our ref 6f8a942da0316ccd7e722c057c1fc0586c276b10
fatal: Fetched in submodule path '3rdparty/llama.cpp', but it did not contain 6f8a942da0316ccd7e722c057c1fc0586c276b10. Direct fetching of that commit failed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

2 participants