feat: integrate KVPress for KV cache compression (#366) by kschwethelm · Pull Request #623 · PrunaAI/pruna

kschwethelm · 2026-04-10T14:58:20Z

Description

Integrate KVPress into Pruna, making 20 KV cache compression strategies available for causal language models. KVPress compresses the key-value cache during the prefill phase, reducing memory usage for long-context inference.
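As a rough back-of-the-envelope illustration of the memory saving, the sketch below estimates the KV cache size for one sequence before and after compression. The model dimensions are hypothetical and chosen only for illustration, and the helper names are not part of Pruna or KVPress. It assumes `compression_ratio` is the fraction of cached key-value pairs that is removed.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Size of the KV cache for one sequence: keys + values for every layer."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


def compressed_seq_len(seq_len, compression_ratio):
    """Tokens kept after pruning a fraction `compression_ratio` of KV pairs."""
    if not 0.0 <= compression_ratio < 1.0:
        raise ValueError("compression_ratio must be in [0, 1)")
    return int(seq_len * (1 - compression_ratio))


# Hypothetical model dimensions (fp16 values), for illustration only.
dims = dict(n_layers=32, n_kv_heads=8, head_dim=128)
full = kv_cache_bytes(32_768, **dims)
kept = kv_cache_bytes(compressed_seq_len(32_768, 0.5), **dims)
print(f"full: {full / 2**30:.1f} GiB, compressed: {kept / 2**30:.1f} GiB")
# prints "full: 4.0 GiB, compressed: 2.0 GiB"
```

Under these assumptions the cache grows linearly with context length, so the saving scales directly with the compression ratio.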

Key implementation details:

New kvpress algorithm module following the PrunaAlgorithmBase pattern
Supports 20 scorer presses (ExpectedAttention, SnapKV, StreamingLLM, TOVA, KVzip, etc.)
Configurable compression_ratio and press_kwargs for press-specific parameters
New KV_CACHER algorithm tag for the cache compression category (renamed to KV_COMPRESSOR in a later commit)
Compatibility is declared with quantization algorithms (applied before kvpress) and torch_compile (applied after kvpress)
Uses the reapply save strategy: the press is re-applied on model load
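The combination of a press type, a `compression_ratio`, and `press_kwargs` can be pictured as a name-to-class lookup followed by keyword forwarding. The following is a minimal sketch under assumptions, not the code in this PR: the press classes are stand-in dataclasses (the real ones come from the kvpress package), and the `PRESS_TYPES` layout and the `build_press` helper are hypothetical.

```python
from dataclasses import dataclass


# Stand-in press classes; the real ones are provided by the kvpress package.
@dataclass
class SnapKVPress:
    compression_ratio: float = 0.0
    window_size: int = 64


@dataclass
class StreamingLLMPress:
    compression_ratio: float = 0.0
    n_sink: int = 4


# Hypothetical registry mapping a press name to its class.
PRESS_TYPES = {"snapkv": SnapKVPress, "streaming_llm": StreamingLLMPress}


def build_press(press_type, compression_ratio, press_kwargs=None):
    """Look up the press class and forward press-specific kwargs to it."""
    if press_type not in PRESS_TYPES:
        raise ValueError(
            f"Unknown press type {press_type!r}; expected one of {sorted(PRESS_TYPES)}"
        )
    press_cls = PRESS_TYPES[press_type]
    return press_cls(compression_ratio=compression_ratio, **(press_kwargs or {}))


press = build_press("snapkv", 0.5, {"window_size": 32})
print(press)  # SnapKVPress(compression_ratio=0.5, window_size=32)
```

Keeping press-specific settings in a single dictionary means the algorithm only has to expose two generic hyperparameters, while each press still validates its own arguments in its constructor.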

Excluded press types: Wrapper presses (ChunkPress, AdaKVPress, PerLayerCompressionPress, DMSPress, etc.) are not included in this initial integration. These require a nested ScorerPress instance as a constructor argument, which doesn't fit the current single-class design. Similarly, ThinKPress is excluded as it compresses along the channel dimension with a different parameter interface. These could be added in a follow-up if needed.
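To illustrate why wrapper presses do not fit a single-class factory, the sketch below uses two stand-in dataclasses (hypothetical, not the kvpress classes). A factory that only passes `compression_ratio` can build a scorer press, but fails for a wrapper whose constructor expects a nested press instance.

```python
from dataclasses import dataclass


@dataclass
class ScorerPress:  # stand-in for a scorer press
    compression_ratio: float = 0.0


@dataclass
class WrapperPress:  # stand-in for a wrapper press
    press: ScorerPress  # required nested press instance


def build(press_cls, compression_ratio, **kwargs):
    """Single-class factory: assumes the class accepts compression_ratio."""
    return press_cls(compression_ratio=compression_ratio, **kwargs)


scorer = build(ScorerPress, 0.5)  # works

error = None
try:
    build(WrapperPress, 0.5)  # wrapper has no compression_ratio argument
except TypeError as exc:
    error = exc

# A wrapper has to be composed explicitly from an inner press.
wrapped = WrapperPress(press=ScorerPress(compression_ratio=0.5))
print(type(error).__name__, wrapped)
```

Supporting wrappers would therefore need a second configuration level (which inner press to build, and with which arguments), which is the design question deferred to a follow-up.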

Some downstream evaluation results are available in repo kschwethelm/pruna-kvpress-eval.

Related Issue

Fixes #366

Type of Change

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Refactor (no functional change)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Documentation update

Testing

I added or updated tests covering my changes
Existing tests pass locally (uv run pytest -m "cpu and not slow")

Unit tests were added in tests/algorithms/test_kvpress.py, with a dedicated tester in tests/algorithms/testers/kvpress.py. Integration was evaluated in a separate repo (see the evaluation report).

Checklist

My code follows the style guidelines of this project
I have performed a self-review of my code, especially for agent-assisted changes
I updated the documentation where necessary

- Replace tags.QUANTIZER with explicit LLM algorithm names to avoid false symmetry matches with diffuser algorithms
- Fix SmashConfig.add() dict flattening: only flatten when key is a registered algorithm name, not for dict-valued hyperparameters
- Remove wrapper/special presses from PRESS_TYPES (CriticalKVPress and others that don't accept compression_ratio directly)
- Add unit tests for press type validation and kwargs forwarding
- Add SnapKV integration test with press_kwargs
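The flattening fix can be sketched as follows. This is an illustration of the rule described above, not Pruna's actual `SmashConfig.add()` implementation: the `REGISTERED_ALGORITHMS` set, the `flatten_config` helper, and the `algorithm_hyperparameter` key naming are all assumptions made for the example.

```python
# Hypothetical set of registered algorithm names.
REGISTERED_ALGORITHMS = {"kvpress", "torch_compile"}


def flatten_config(request):
    """Flatten {algorithm: {hyperparameter: value}} one level deep.

    A dict is only expanded when its key is a registered algorithm name, so
    dict-valued hyperparameters such as press_kwargs are passed through as-is.
    """
    flat = {}
    for key, value in request.items():
        if key in REGISTERED_ALGORITHMS and isinstance(value, dict):
            for name, hp_value in value.items():
                flat[f"{key}_{name}"] = hp_value
        else:
            flat[key] = value
    return flat


nested = {"kvpress": {"compression_ratio": 0.5, "press_kwargs": {"window_size": 32}}}
print(flatten_config(nested))
# {'kvpress_compression_ratio': 0.5, 'kvpress_press_kwargs': {'window_size': 32}}
```

Without the registered-name check, a dict-valued hyperparameter would be expanded into keys that no algorithm recognises.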

codacy-production · 2026-04-10T14:59:32Z

Not up to standards ⛔

🔴 Issues 9 high · 5 minor

Alerts:
⚠ 14 issues (≤ 0 issues of at least minor severity)

Results:
14 new issues

Category Results

Documentation 5 minor

Security 9 high

🟢 Metrics 15 complexity · 0 duplication

Metric Results

Complexity 15

Duplication 0


kschwethelm added 7 commits April 10, 2026 13:58

feat: integrate KVPress for KV cache compression

6a861a2

Add NVIDIA KVPress as an optional dependency, enabling 31 KV cache compression strategies for causal language models. Includes algorithm class, test tester, and compatibility updates across existing LLM algorithms.

feat: bump kvpress to >=0.5.2, add FastKVzipPress

a045203

kvpress 0.5.2 relaxes the datasets<3 constraint and reverts to transformers>=4.56, resolving the dependency conflict. uv sync --extra kvpress now works without workarounds.

feat: add press_kwargs for press-specific parameters

94dadb5

Allow passing additional keyword arguments to the press constructor via the press_kwargs hyperparameter, enabling fine-grained control over press-specific settings like window_size, n_sink, etc.

feat: add KV_CACHER tag, replace explicit kvpress references

e5f1c8f

Add a new KV_CACHER algorithm tag for KV cache compression algorithms, separate from CACHER (used by diffuser cachers). Use the tag in all LLM algorithm compatibility lists instead of explicit "kvpress" strings.
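A tag-based compatibility check of this kind might look like the sketch below. The `Tag` enum, the `TAGS` table, the placeholder algorithm names, and the `is_listed` helper are hypothetical and only illustrate why a dedicated tag keeps LLM cache compression apart from diffuser cachers.

```python
from enum import Enum


class Tag(Enum):
    QUANTIZER = "quantizer"
    CACHER = "cacher"        # diffuser cachers
    KV_CACHER = "kv_cacher"  # KV cache compression for LLMs


# Hypothetical algorithm -> tag table (algorithm names are placeholders).
TAGS = {
    "kvpress": Tag.KV_CACHER,
    "some_diffuser_cacher": Tag.CACHER,
    "some_llm_quantizer": Tag.QUANTIZER,
}


def is_listed(compat_list, algorithm):
    """True if `algorithm` matches an entry by explicit name or by tag."""
    return algorithm in compat_list or TAGS.get(algorithm) in compat_list


# An LLM quantizer declaring that KV cache compressors may follow it.
llm_quantizer_compatible_after = [Tag.KV_CACHER]
print(is_listed(llm_quantizer_compatible_after, "kvpress"))               # True
print(is_listed(llm_quantizer_compatible_after, "some_diffuser_cacher"))  # False
```

Listing the tag rather than the string "kvpress" means a future KV cache compression algorithm would be picked up by the same compatibility entries.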

refactor: rename KV_CACHER tag to KV_COMPRESSOR, improve docstrings

8e818bc

docs: document excluded wrapper presses in kvpress docstring

7f8b282

kschwethelm wants to merge 7 commits into PrunaAI:main from kschwethelm:feat/kvpress
