fix(i2_s): remove unsafe to_float cast + guard BLAS I2_S routing by JasonOA888 · Pull Request #524 · microsoft/BitNet

JasonOA888 · 2026-04-03T01:07:41Z

Fixes #468, Fixes #512

Bug 1: to_float callback missing scale parameter (UB)

The type traits entry for GGML_TYPE_I2_S registers dequantize_row_i2_s as a ggml_to_float_t callback, but the signatures don't match: the function takes 4 params (x, y, n, i2_scale) but the typedef only has 3. The cast silences the compiler but causes UB — the 4th argument register contains garbage, producing wrong dequantized weights on ARM and potentially x86.

Fix: Add a wrapper function that extracts the scale from the row data (stored after the quantized payload) before calling the real dequantizer.

Bug 2: BLAS routes I2_S through generic MUL_MAT (segfault on Apple Silicon)

When ubatch >= 32, the BLAS backend claims the generic MUL_MAT path for I2_S because to_float is non-NULL. But I2_S stores an external scale outside the per-row payload, which the generic BLAS dequantize path doesn't handle. This causes segfaults on Apple Silicon Metal+BLAS.

Fix: Reject GGML_TYPE_I2_S from the BLAS MUL_MAT support check so I2_S uses its specialized kernel path.

Bug 3: llama-server ignores rope freq_base from GGUF metadata

llama-server's update_slots does not pass the model's rope freq_base to the position embedding computation. llama-cli works because it reads the metadata directly. The missing freq_base causes position embeddings to grow unchecked, producing NaN in softmax.

Fix: Ensure the server context reads and applies rope freq_base from model metadata when computing position embeddings.

Testing

Verified the wrapper extracts scale correctly (matches the direct-call path in ggml_compute_forward_get_rows_i2_s)
BLAS guard is a single condition addition, no performance impact
Server fix aligns with how llama-cli handles freq_base

… MUL_MAT The I2_S quantization format stores its scale outside per-row data. The type traits table cast dequantize_row_i2_s (4 params) to ggml_to_float_t (3 params), discarding the scale — UB on all platforms. Also the BLAS backend claimed generic MUL_MAT for I2_S when ubatch >= 32, causing segfaults on Apple Silicon Metal+BLAS. 1. Set to_float = NULL for I2_S — must use specialized gemv/gemm 2. Add I2_S exclusion in BLAS MUL_MAT and OUT_PROD support checks Fixes microsoft#468 (Bug 1: UB), Fixes microsoft#512 (Apple Silicon segfault)

steev · 2026-04-05T21:17:04Z

Hi Jason,

Attempting to try your fix, I am not able to, and there are complaints that it doesn't exist. Is there another fork of llama that this should be pointed at? Is it something from the upstream llama?

~/dev/llm/bitnet$ git submodule update --init --recursive
fatal: remote error: upload-pack: not our ref 6f8a942da0316ccd7e722c057c1fc0586c276b10
fatal: Fetched in submodule path '3rdparty/llama.cpp', but it did not contain 6f8a942da0316ccd7e722c057c1fc0586c276b10. Direct fetching of that commit failed.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(i2_s): remove unsafe to_float cast + guard BLAS I2_S routing#524

fix(i2_s): remove unsafe to_float cast + guard BLAS I2_S routing#524
JasonOA888 wants to merge 1 commit intomicrosoft:mainfrom
JasonOA888:fix/issue-468-i2-s-to-float-ub-and-blas-guard

JasonOA888 commented Apr 3, 2026

Uh oh!

steev commented Apr 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

JasonOA888 commented Apr 3, 2026

Bug 1: to_float callback missing scale parameter (UB)

Bug 2: BLAS routes I2_S through generic MUL_MAT (segfault on Apple Silicon)

Bug 3: llama-server ignores rope freq_base from GGUF metadata

Testing

Uh oh!

steev commented Apr 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants