
feat: integrate Llama.cpp and enhance engine stability for cross-platform usage #616

Open

krishjp wants to merge 11 commits into PrunaAI:main from krishjp:feat/llama-cpp

Conversation


@krishjp krishjp commented Apr 6, 2026

Description

This PR integrates the Llama.cpp quantizer engine into Pruna, enabling GGUF-based quantization. In addition to the new feature, this PR addresses critical compatibility issues for Python 3.13 and improves cross-platform robustness on Windows.

Key Changes:

  • Engine Support: Integrated llama-cpp-python as a new quantizer backend, supporting various GGUF quantization methods (e.g., q4_k_m).
  • Python 3.13 Compatibility: Fixed a KeyError in SAVE_FUNCTIONS and LOAD_FUNCTIONS by explicitly wrapping callable members with enum.member() (with a backward-compatible fallback for older Python versions).
  • Stability: Implemented safer cache-directory cleanup in SmashConfig to prevent an AttributeError during interpreter shutdown on Windows.
  • Consistency: Added a save() alias to PrunaModel to match save_pretrained() and ensure consistent attribute delegation for non-torch backends.
  • Dependencies: Added the llamacpp optional dependency group and updated the full extra in pyproject.toml.
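For context on the Python 3.13 fix: starting with 3.13, functools.partial gained descriptor behavior, so a bare partial in an Enum body is treated as a method and silently dropped from __members__, which is what produced the KeyError. A minimal sketch of the pattern (names here are illustrative, not Pruna's actual definitions; enum.member() requires Python 3.11+):

```python
import enum
import sys
from functools import partial

def _save(model, path, fmt):
    # stand-in for a real save routine
    return f"saved {path} as {fmt}"

class SaveFunctions(enum.Enum):
    # Without enum.member(), Python 3.13 treats the partial as a method
    # descriptor and omits it from SaveFunctions.__members__; wrapping it
    # forces registration as a real member.
    if sys.version_info >= (3, 11):
        llama_cpp = enum.member(partial(_save, fmt="gguf"))
    else:
        # pre-3.11 fallback: partial was not a descriptor, so plain
        # assignment already produced a member
        llama_cpp = partial(_save, fmt="gguf")

print(SaveFunctions.llama_cpp.value(None, "model.gguf"))
```

On 3.12 and earlier the lookup worked by accident; the explicit wrapper makes the membership intent version-independent.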

Related Issue

Fixes #377

Related PRs

#583 - takes a more general look at the enum modification

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

  • Integration Tests: Verified that the TestLlamaCpp suite in llama_cpp.py passes on Windows under Python 3.12 and 3.13.
  • Diagnostic Scripts: Confirmed correct Enum member registration for engine save/load functions.
  • Local Benchmarking Script: Successfully smash'ed SmolLM2-135M-Instruct using llama.cpp q4_k_m quantization.
    • Compression: 4.88x reduction in model size.
    • Speedup: 4.14x faster inference results (tokens/sec) on CPU.
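The reported ratios are simple quotients of the before/after measurements. The raw numbers below are hypothetical placeholders (not from the PR), chosen only to show how such figures are derived:

```python
# Hypothetical raw measurements illustrating how the reported ratios
# are computed; the actual benchmark values are not in this PR text.
orig_size_mb, quant_size_mb = 269.0, 55.1   # model size before/after q4_k_m
orig_tps, quant_tps = 21.0, 87.0            # CPU tokens/sec before/after

compression = orig_size_mb / quant_size_mb  # size reduction factor
speedup = quant_tps / orig_tps              # inference throughput factor
print(f"{compression:.2f}x compression, {speedup:.2f}x speedup")
```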

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Additional Notes

The TypeError occasionally observed during llama-cpp-python shutdown is a known upstream issue in its __del__ implementation during interpreter termination; it does not affect the performance or correctness of the Smash/Save operations.
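One way to sidestep that teardown noise is to release the handle explicitly before the interpreter exits, so __del__ has nothing left to do. This is a defensive sketch, not Pruna's actual code; whether the wrapped object exposes a close() method depends on the llama-cpp-python version:

```python
import contextlib

def release_llama_handle(model):
    # Free the native handle eagerly; at interpreter shutdown, module
    # globals may already be torn down and __del__ can raise TypeError.
    with contextlib.suppress(Exception):
        close = getattr(model, "close", None)
        if callable(close):
            close()

# Minimal demonstration with a dummy object standing in for llama_cpp.Llama.
class _Dummy:
    def __init__(self):
        self.closed = False
    def close(self):
        self.closed = True

m = _Dummy()
release_llama_handle(m)
```

Objects without a close() method are simply skipped, so the helper is safe to call unconditionally.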

krishjp and others added 7 commits April 6, 2026 10:34
…device checks for llama-cpp models due to a lack of model.parameters() support
…on 3.13

- addressed functools.partial object compatibility with Python 3.13
- integrated enum.member() in SAVE_FUNCTIONS and LOAD_FUNCTIONS
- updated the LlamaCpp algorithm implementation to utilize the standardized
  naming convention.
- cleaned up redundant commented-out logic in the save_pruna_model function.

Verified through restoration of LlamaCpp integration tests and diagnostic
scripts confirming Enum member registration.
…form usage

- standardized LlamaCpp implementation and naming conventions within the engine
- implemented cache directory cleanup to prevent shutdown errors on Windows
- added a save() alias to the base model wrapper for improved API consistency
- updated project configuration with Llama.cpp and dependency group
- benchmarked using SmolLM2-135M-Instruct with q4_k_m quantization
- added Int class for integer-based configuration.
- updated get_device and model_checks for llama_cpp.
- implemented secure conversion script caching.
- enabled TestLlamaCpp and removed manual test overrides.
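The Windows shutdown fix mentioned in the commits above amounts to guarding cache-directory cleanup against partially torn-down interpreter state. A rough sketch under assumed names (SmashConfig's real implementation may differ):

```python
import os
import shutil
import tempfile

class CacheCleanupSketch:
    """Illustrative stand-in for SmashConfig's cache handling."""

    def __init__(self):
        self.cache_dir = tempfile.mkdtemp(prefix="pruna_cache_")

    def cleanup_cache_dir(self):
        # During interpreter shutdown on Windows, attributes or module
        # globals may already be gone; look everything up defensively
        # instead of assuming it is still there.
        cache_dir = getattr(self, "cache_dir", None)
        if not cache_dir:
            return
        try:
            shutil.rmtree(cache_dir, ignore_errors=True)
        except (AttributeError, TypeError):
            pass  # shutil itself may be unusable during teardown
        self.cache_dir = None

cfg = CacheCleanupSketch()
path = cfg.cache_dir
cfg.cleanup_cache_dir()
```

Setting cache_dir back to None makes the cleanup idempotent, so a second call during teardown is a no-op rather than an AttributeError.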

codacy-production bot commented Apr 6, 2026

Up to standards ✅

🟢 Issues: 0 new issues
🟢 Metrics: complexity 49 · duplication 0


@krishjp krishjp changed the title from "Feat/llama cpp" to "feat: integrate Llama.cpp and enhance engine stability for cross-platform usage" on Apr 6, 2026

krishjp commented Apr 6, 2026

Hi @llcnt and @gsprochette! Here is an updated draft PR to replace #584.
I'm looking at the last few Codacy issues that were brought up, but the main codebase changes should be in place. ruff check also surfaced some fixes for older commits, so they are included here as well.


krishjp commented Apr 7, 2026

@cursor review


@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


Reviewed by Cursor Bugbot for commit 93fad34.

@krishjp krishjp marked this pull request as ready for review April 7, 2026 15:48
Collaborator

@llcnt llcnt left a comment


Thank you for the improved version of the PR!
We are definitely very close to the final step :)

processor_required: bool = False
dataset_required: bool = False
runs_on: list[str] = ["cpu", "cuda", "mps"]
compatible_before: list[str] = []

I think the reduce_noe algo is compatible before!

"n_gpu_layers",
sequence=[0, 1, 4, 8, 16, 32, 999],
default_value=0,
meta={"desc": "Number of layers to offload to GPU. Use 999 for all layers."},

Why use '999' here and not '-1' as in llama.cpp? I guess you can define the Int to accept such a negative value, no?
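For context on this thread: llama.cpp's own convention is n_gpu_layers=-1 to offload every layer, and a large sentinel like 999 only works because values above the layer count are clamped. A hypothetical normalization helper (not from the PR) showing the -1 convention:

```python
def normalize_n_gpu_layers(n_gpu_layers: int, total_layers: int) -> int:
    # Map llama.cpp's -1 sentinel ("offload everything") to the actual
    # layer count, and clamp oversized sentinels like 999 down to it.
    if n_gpu_layers < 0:
        return total_layers
    return min(n_gpu_layers, total_layers)
```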

def _load_quantized_model(self, llama_cpp: Any, quant_gguf_path: Path, smash_config: Any, temp_dir: Path) -> Any:
pruna_logger.info(f"Loading quantized model from {quant_gguf_path}")
n_gpu_layers = smash_config["n_gpu_layers"]
if n_gpu_layers == 999:

same comment as above ;)

"""Set the model to evaluation mode."""
set_to_eval(self.model)

def save(self, model_path: str) -> None:

why do we need this alias ?

raise FileNotFoundError(f"GGUF file not found at {model_path}")

model = llama_cpp.Llama(model_path=str(model_path), **filter_load_kwargs(llama_cpp.Llama.__init__, kwargs))
model.model_path = str(model_path)

same question as in llama_cpp.py file :)

n_gpu_layers=n_gpu_layers,
main_gpu=smash_config["main_gpu"],
)
quantized_model.model_path = str(quant_gguf_path)

why do we need this ?


# if save-before-move was the last operation, we simply move the already saved files, we have delt with them before
elif smash_config.save_fns[-1] == SAVE_FUNCTIONS.save_before_apply.name:
elif len(smash_config.save_fns) > 0 and smash_config.save_fns[-1] == get_fn_name(SAVE_FUNCTIONS.save_before_apply):

could we keep the comment just above for reference ?


Development

Successfully merging this pull request may close these issues.

[FEATURE] Integrate llama.cpp as a Quantizer

2 participants