
feat: port TQ3_0 KV cache from llama-turboquant#2

Open
carlosfundora wants to merge 1 commit into PrismML-Eng:prism from carlosfundora:feature/tq3_0-kv-cache

Conversation

@carlosfundora

This PR ports the TQ3_0 implementation from llama-turboquant, combining TurboQuant 3-bit (3.5 bpw) KV cache compression with PrismML's Q1_0 GPU inference. Tested on ROCm gfx1030.

TurboQuant 3-bit (3.5 bpw) KV cache compression:
- Per-block WHT rotation with 4-centroid MSE codebook
- QJL residual signs for error correction
- GPU kernels: vec_dot, MMVQ, convert, set-rows, cpy
- CPU: quantize/dequantize with WHT butterfly transform
- Flash attention auto-disabled for TQ3_0 K cache

Combined with PrismML's Q1_0 GPU inference, this enables
1-bit weights + 3-bit KV cache on a single build.
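To make the bullet list above concrete, here is a minimal sketch of the quantize/dequantize path it describes: a per-block Walsh-Hadamard rotation, a 4-centroid codebook (2 bits), and a residual sign bit (1 bit), giving 3 bits per value plus per-block scale overhead (~3.5 bpw). This is an illustrative reconstruction, not the actual llama-turboquant code; the block size, centroid values, and residual magnitude below are assumptions.

```python
BLOCK = 32  # assumed block size, not taken from the PR

def wht_butterfly(x):
    """In-place fast Walsh-Hadamard transform, orthonormal scaling.

    Because it is orthonormal, applying it twice recovers the input,
    so the same routine serves as its own inverse.
    """
    n = len(x)
    h = 1
    while h < n:
        for i in range(0, n, h * 2):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    scale = n ** -0.5
    return [v * scale for v in x]

CENTROIDS = (-1.5, -0.5, 0.5, 1.5)  # hypothetical 4-entry MSE codebook

def quantize_block(block):
    """Rotate, pick the nearest centroid (2 bits), keep a residual sign (1 bit)."""
    rot = wht_butterfly(list(block))
    amax = max(abs(v) for v in rot) or 1.0
    d = amax / 1.5  # per-block scale so centroids span the rotated range
    idxs, signs = [], []
    for v in rot:
        q = v / d
        i = min(range(4), key=lambda k: (q - CENTROIDS[k]) ** 2)
        idxs.append(i)
        signs.append(1 if q - CENTROIDS[i] >= 0 else 0)
    return d, idxs, signs

def dequantize_block(d, idxs, signs, eps=0.25):
    """Centroid plus sign-corrected residual, then the inverse rotation."""
    rot = [(CENTROIDS[i] + (eps if s else -eps)) * d
           for i, s in zip(idxs, signs)]
    return wht_butterfly(rot)  # orthonormal WHT is self-inverse
```

Since the rotation is orthonormal, the per-coefficient quantization error (at most 0.25·d in this sketch) carries over unchanged in L2 norm after the inverse transform, which is why rotating before quantizing is attractive for a coarse codebook.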
@khosravipasha
Collaborator

Thanks, this is pretty cool. How does it work, and how good is it?

Our main focus right now is getting our changes into llama.cpp, so we might not have time to look into the details yet, but we'd love to see the speed/quality output if you've tried it. What's the VRAM usage with long context after this change?

@carlosfundora
Author

[image attachment: b2a77edf-03cb-4a58-b569-9a148a6ee24b.jpg]

It works great. I have SGLang nearly wired up for 1-bit support and TurboQuant as well.

@carlosfundora
Author

VRAM usage was reduced by roughly 35%.
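The ~35% figure is for total VRAM, which also includes weights and activations; the KV cache itself shrinks much more at 3.5 bpw. A back-of-envelope sketch (the model shape below is an assumption for illustration, not the tested configuration) shows the arithmetic:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, k_bpw, v_bpw):
    """KV cache size: one K and one V vector per layer per token,
    at the given bits-per-weight for each."""
    per_token = n_kv_heads * head_dim * (k_bpw + v_bpw) / 8
    return int(n_layers * n_ctx * per_token)

# Hypothetical shape: 32 layers, 8 KV heads of dim 128, 32k context.
fp16_cache = kv_cache_bytes(32, 8, 128, 32768, 16, 16)    # 4.0 GiB
tq3_cache = kv_cache_bytes(32, 8, 128, 32768, 3.5, 3.5)   # ~0.875 GiB
```

So the cache alone drops by 3.5/16 ≈ 78%; the overall 35% saving is consistent with the cache being one component of total VRAM alongside the (already 1-bit) weights.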

@khosravipasha
Collaborator

Oh, how does it work with SGLang for 1-bit? Was it easy to add support there?

@carlosfundora
Author

carlosfundora commented Apr 4, 2026

So-so; patience and preparation were key. I also crafted a few agents to run methodical research and debugging at a very detailed, slow pace during smoke tests. It ran well for chat in SGLang; I benchmarked it and moved on to implementing P-EAGLE, so I've been training heads for the models all day. I haven't yet tried it on any coding tasks, but I'm excited to see how they do.

If you guys would be kind enough to release a 1/2 B or .3-.6 range 1-bit quant, that would be amazing and make it much easier for me to rapidly create PRs for advanced speculative decoding architectures. 🧠🤌

@khosravipasha
Collaborator

@carlosfundora Sounds exciting, yeah, good ideas. Let's chat more on the Discord server next week (I think you were there, right?)
