
Releases: vlora-dev/vlora

v0.3.0 — NF4 Quantization & QLoRA Support

31 Mar 03:12


Highlights

First-class QLoRA support — vLoRA now integrates with QLoRA workflows for maximum compression. QLoRA compresses the base model (FP16 → NF4), vLoRA compresses the adapter space — these stack multiplicatively.
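
As a back-of-the-envelope illustration of the multiplicative stacking (all sizes and ratios below are hypothetical, chosen only to show the arithmetic, not measured numbers from this release):

```python
# Hypothetical sizes for a 7B-parameter setup -- illustration only.
base_fp16_gb = 14.0       # base model weights at FP16
adapters_fp16_mb = 640.0  # a collection of LoRA adapters at FP16

# QLoRA compresses the base: FP16 (16 bits) -> NF4 (4 bits) = 4x.
base_nf4_gb = base_fp16_gb * 4 / 16

# vLoRA compresses the adapter space; assume an 8x subspace ratio here.
adapters_vlora_mb = adapters_fp16_mb / 8

print(f"base:     {base_fp16_gb:.1f} GB -> {base_nf4_gb:.1f} GB")
print(f"adapters: {adapters_fp16_mb:.0f} MB -> {adapters_vlora_mb:.0f} MB")
```

Because the two techniques act on disjoint parts of the checkpoint, each ratio applies independently to its part.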

New Features

  • NF4 quantization — subspace.quantize(method="nf4") uses QLoRA's 4-bit NormalFloat data type with per-block absmax scaling
  • Double quantization — quantize the NF4 block scales to FP8 via double_quant=True
  • NF4 packed storage — save_quantized() packs to uint8 for ~7× disk savings; load() auto-detects format
  • QLoRA-aware VLoRAModel — compute_dtype for mixed precision, qlora_info for base-model introspection
  • full_stack_compression() — combined base model + adapter compression reporting
  • Layer shapes stored in metadata, __repr__ on core objects, adaptive_k preserved through absorb
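
To make the NF4 features above concrete, here is a self-contained NumPy sketch of the ingredients: blockwise absmax scaling, nearest-neighbor lookup against the NF4 codebook (code values approximated from the QLoRA paper), and 4-bit packing into uint8. The function names are illustrative, not the library's API — vLoRA's entry points are subspace.quantize(method="nf4") and save_quantized().

```python
import numpy as np

# Approximate 4-bit NormalFloat code values from the QLoRA paper.
NF4 = np.array([-1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848,
                -0.0911, 0.0, 0.0796, 0.1609, 0.2461, 0.3379,
                0.4407, 0.5626, 0.7230, 1.0])

def quantize_nf4(x, block_size=64):
    """Blockwise absmax NF4 quantization (NumPy stand-in sketch)."""
    x = x.reshape(-1, block_size)
    scales = np.abs(x).max(axis=1, keepdims=True)  # per-block absmax
    normed = x / scales                            # now in [-1, 1]
    # Nearest code value: bucket by midpoints between adjacent codes
    # (np.digitize here plays the role of torch.bucketize).
    midpoints = (NF4[1:] + NF4[:-1]) / 2
    codes = np.digitize(normed, midpoints).astype(np.uint8)  # 0..15
    return codes, scales

def pack_uint8(codes):
    """Pack two 4-bit codes per byte for compact storage."""
    flat = codes.reshape(-1)
    return (flat[0::2] << 4) | flat[1::2]

def dequantize_nf4(codes, scales, block_size=64):
    """Look up code values and rescale per block."""
    return (NF4[codes.reshape(-1, block_size)] * scales).reshape(-1)
```

Packing halves the already-4x-smaller code array, which (after the per-block scale overhead) is where savings of roughly the quoted order come from.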

Bug Fixes

  • absorb_incremental re-projection — existing tasks now properly re-projected when basis rotates
  • VLoRACallback was a no-op — now uses differentiable hooks + steps optimizer
  • TIES merge normalization — fixed over-scaling when elements are trimmed
  • 7 additional correctness and robustness fixes (see CHANGELOG.md)
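
For context on the TIES normalization fix, a minimal NumPy sketch of TIES merging's three steps (trim, sign election, disjoint mean); the per-coordinate normalization in the last step is where trimming interacts with scaling. This is the textbook formulation, not vLoRA's internal code.

```python
import numpy as np

def ties_merge(task_vectors, trim_frac=0.2):
    """TIES-merging sketch: trim -> elect sign -> disjoint mean."""
    T = np.stack(task_vectors)                      # (num_tasks, dim)
    # Trim: keep only the top trim_frac entries of each task by magnitude.
    k = max(1, int(trim_frac * T.shape[1]))
    thresh = np.sort(np.abs(T), axis=1)[:, -k][:, None]
    T = np.where(np.abs(T) >= thresh, T, 0.0)
    # Elect a sign per coordinate from the total trimmed mass.
    sign = np.sign(T.sum(axis=0))
    # Disjoint mean: average only entries that survived trimming AND
    # agree with the elected sign, normalizing by the number of
    # contributing tasks per coordinate (not the total task count).
    agree = (np.sign(T) == sign) & (T != 0)
    count = np.maximum(agree.sum(axis=0), 1)
    return (T * agree).sum(axis=0) / count
```

Normalizing by the full task count when many entries were trimmed away is the kind of mismatch that produces over-scaled merges.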

Performance

  • gram_schmidt → QR factorization
  • Module handle caching in VLoRAModel
  • NF4 uses torch.bucketize (O(N) memory vs O(N×16))
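
The Gram-Schmidt replacement follows a standard pattern: a reduced QR factorization produces the same orthonormal basis as classical Gram-Schmidt in exact arithmetic, but with better numerical stability and a single BLAS-backed call. A NumPy sketch (vLoRA itself runs on torch, where torch.linalg.qr is the analogue):

```python
import numpy as np

def orthonormal_basis(A):
    """Orthonormal basis for the column space of A via reduced QR."""
    Q, _ = np.linalg.qr(A, mode="reduced")
    return Q

rng = np.random.default_rng(0)
A = rng.standard_normal((128, 8))   # 8 basis candidates in 128-dim space
Q = orthonormal_basis(A)            # columns satisfy Q.T @ Q == I
```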

Test suite: 196 passed, 5 skipped without transformers installed

Full changelog: CHANGELOG.md

pip install vlora-dev==0.3.0