feat: integrate KVPress for KV cache compression (#366) #623
Open
kschwethelm wants to merge 7 commits into PrunaAI:main
Add NVIDIA KVPress as an optional dependency, enabling 31 KV cache compression strategies for causal language models. Includes algorithm class, test tester, and compatibility updates across existing LLM algorithms.
kvpress 0.5.2 relaxes the datasets<3 constraint and reverts to transformers>=4.56, resolving the dependency conflict. uv sync --extra kvpress now works without workarounds.
Allow passing additional keyword arguments to the press constructor via the press_kwargs hyperparameter, enabling fine-grained control over press-specific settings like window_size, n_sink, etc.
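A minimal sketch of the kwargs forwarding described in this commit. The two press classes below are hypothetical stand-ins (the real ones come from kvpress), and `PRESS_TYPES` and `build_press` are illustrative names, not necessarily the ones used in this PR.

```python
from dataclasses import dataclass


# Hypothetical stand-ins for press classes; field defaults are illustrative only.
@dataclass
class SnapKVPress:
    compression_ratio: float = 0.0
    window_size: int = 64


@dataclass
class StreamingLLMPress:
    compression_ratio: float = 0.0
    n_sink: int = 4


# Hypothetical registry mapping a press type name to its class.
PRESS_TYPES = {cls.__name__: cls for cls in (SnapKVPress, StreamingLLMPress)}


def build_press(press_type, compression_ratio, press_kwargs=None):
    """Instantiate a press and forward extra keyword arguments to its constructor."""
    if press_type not in PRESS_TYPES:
        raise ValueError(f"Unknown press type: {press_type!r}")
    kwargs = dict(press_kwargs or {})
    if "compression_ratio" in kwargs:
        # Keep a single source of truth for the ratio.
        raise ValueError("Set compression_ratio via its own hyperparameter")
    return PRESS_TYPES[press_type](compression_ratio=compression_ratio, **kwargs)


press = build_press("SnapKVPress", 0.5, {"window_size": 32})
```

An unsupported keyword (for example `n_sink` on the `SnapKVPress` stand-in) surfaces as a `TypeError` from the constructor, so a typo in `press_kwargs` would fail loudly rather than being silently ignored.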
- Replace tags.QUANTIZER with explicit LLM algorithm names to avoid false symmetry matches with diffuser algorithms
- Fix SmashConfig.add() dict flattening: only flatten when key is a registered algorithm name, not for dict-valued hyperparameters
- Remove wrapper/special presses from PRESS_TYPES (CriticalKVPress and others that don't accept compression_ratio directly)
- Add unit tests for press type validation and kwargs forwarding
- Add SnapKV integration test with press_kwargs
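The flattening rule from the second bullet can be sketched roughly as follows. This is not the actual `SmashConfig.add()` code: `REGISTERED_ALGORITHMS`, `flatten_request`, and the `<algorithm>_<hyperparameter>` key format are assumptions made for illustration.

```python
# Hypothetical set of registered algorithm names.
REGISTERED_ALGORITHMS = {"kvpress"}


def flatten_request(request, registered=REGISTERED_ALGORITHMS):
    """Flatten {algorithm: {hyperparameter: value}} into prefixed keys.

    A dict is only flattened when its key names a registered algorithm.
    Dict-valued hyperparameters such as press_kwargs are kept intact.
    """
    flat = {}
    for key, value in request.items():
        if isinstance(value, dict) and key in registered:
            for name, hyperparameter in value.items():
                flat[f"{key}_{name}"] = hyperparameter
        else:
            flat[key] = value
    return flat


flat = flatten_request(
    {"kvpress": {"press_type": "SnapKVPress", "press_kwargs": {"window_size": 32}}}
)
```

Here `press_kwargs` survives as a single dict value instead of being split into one key per entry, which is the behavior the fix is aiming for.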
Add a new KV_CACHER algorithm tag for KV cache compression algorithms, separate from CACHER (used by diffuser cachers). Use the tag in all LLM algorithm compatibility lists instead of explicit "kvpress" strings.
Not up to standards ⛔

🔴 Issues

| Category | Results |
|---|---|
| Documentation | 5 minor |
| Security | 9 high |

🟢 Metrics: 15 complexity · 0 duplication
Description
Integrate KVPress into Pruna, making 20 KV cache compression strategies available for causal language models. KVPress compresses the key-value cache during the prefill phase, reducing memory usage for long-context inference.
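To give a rough sense of the memory effect, the sketch below estimates KV cache size before and after compression. It assumes `compression_ratio` is the fraction of cached tokens removed, a batch size of 1, and illustrative model dimensions (32 layers, 8 KV heads, head dimension 128, fp16); none of these numbers come from this PR.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Approximate KV cache size for one sequence."""
    # The leading factor 2 accounts for storing both keys and values.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


def kept_tokens(seq_len, compression_ratio):
    """Tokens remaining if compression_ratio is the fraction removed (assumption)."""
    return int(seq_len * (1 - compression_ratio))


# Illustrative dimensions only, not measured on any model in this PR.
dims = dict(num_layers=32, num_kv_heads=8, head_dim=128)
full = kv_cache_bytes(32_768, **dims)
compressed = kv_cache_bytes(kept_tokens(32_768, 0.5), **dims)

print(f"full: {full / 2**30:.1f} GiB, compressed: {compressed / 2**30:.1f} GiB")
```

Under these assumptions a 32k-token context drops from 4 GiB to 2 GiB of cache at a ratio of 0.5; actual savings depend on the model and the press used.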
Key implementation details:
- `kvpress` algorithm module following the `PrunaAlgorithmBase` pattern
- `compression_ratio` and `press_kwargs` for press-specific parameters
- `KV_CACHER` algorithm tag for the cache compression category
- `reapply` save strategy: press is re-applied on model load

Excluded press types: Wrapper presses (ChunkPress, AdaKVPress, PerLayerCompressionPress, DMSPress, etc.) are not included in this initial integration. These require a nested `ScorerPress` instance as a constructor argument, which doesn't fit the current single-class design. Similarly, ThinKPress is excluded as it compresses along the channel dimension with a different parameter interface. These could be added in a follow-up if needed.

Some downstream evaluation results are available in repo kschwethelm/pruna-kvpress-eval.
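The reason wrapper presses were excluded can be illustrated with a small sketch. Both classes are hypothetical stand-ins rather than the kvpress implementations, and `build_flat` is an assumed name for a factory that only knows how to pass `compression_ratio` plus flat keyword arguments.

```python
from dataclasses import dataclass


@dataclass
class ScorerPress:
    """Hypothetical stand-in for a press that takes compression_ratio directly."""
    compression_ratio: float = 0.0


@dataclass
class WrapperPress:
    """Hypothetical stand-in for a wrapper press that needs a nested press."""
    press: ScorerPress
    chunk_length: int = 1024  # illustrative field


def build_flat(cls, compression_ratio, **kwargs):
    """Single-class factory: one class, one ratio, flat keyword arguments."""
    return cls(compression_ratio=compression_ratio, **kwargs)


inner = build_flat(ScorerPress, 0.5)

# The flat factory cannot build the wrapper: it has no compression_ratio
# parameter and instead expects an already constructed press instance.
try:
    build_flat(WrapperPress, 0.5)
    flat_build_worked = True
except TypeError:
    flat_build_worked = False

# Building the wrapper requires two steps and a second press type choice.
wrapped = WrapperPress(press=inner)
```

Supporting wrappers would likely mean adding a second hyperparameter for the inner press type, which is one possible shape for the follow-up mentioned above.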
Related Issue
Fixes #366
Type of Change
Testing
`uv run pytest -m "cpu and not slow"`

Unit tests added in `tests/algorithms/test_kvpress.py` with a dedicated tester in `tests/algorithms/testers/kvpress.py`. Integration evaluated in a separate repo; see evaluation report.

Checklist