feat: add HQQ data-free weight compression algorithm (#3981)
abhayuvi wants to merge 2 commits into openvinotoolkit:develop
Conversation
@abhayuvi Thanks for your contribution! Do you have any results (especially accuracy) comparing this method with existing ones to demonstrate why it should be added?
Thank you @MaximProshin for the question! I went through the original paper (Badri & Shaji, Mobius Labs, 2023) and wanted to share the concrete results.

**Perplexity on WikiText-2 (Llama-2 family)**

HQQ matches GPTQ and AWQ accuracy at INT4 without any calibration data.

**Quantization speed (Llama-2-70B, A100 SXM4)**

HQQ is >50x faster than GPTQ on the largest model.

**Why it works without data**

Unlike GPTQ/AWQ, which minimize layer output error using calibration samples, HQQ directly minimizes weight reconstruction error under a sparsity-promoting lp norm (p < 1), solved with a closed-form half-quadratic solver: no gradients and no calibration dataset needed. I believe this makes a strong case for including it in NNCF as a data-free alternative that delivers calibration-quality results. Happy to discuss further if needed!
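To make the data-free objective concrete, here is a toy numpy sketch of the alternating scheme (my own illustration, not NNCF or the paper's code: the lp-norm proximal step of the real half-quadratic solver is replaced by a plain least-squares update of the float zero point, which is enough to show why no calibration data is needed):

```python
import numpy as np

def hqq_like_quantize(w, bits=4, num_iterations=20):
    # Toy data-free quantizer: alternate between quantizing with the
    # current (s, z) and a closed-form least-squares update of the
    # float zero point z that minimizes ||w - s * (q - z)||^2.
    qmax = 2 ** bits - 1
    s = (w.max() - w.min()) / qmax          # min-max scale init
    z = -w.min() / s                        # float zero point init
    q = np.clip(np.round(w / s + z), 0, qmax)
    for _ in range(num_iterations):
        z = np.mean(q - w / s)              # least-squares z for fixed s, q
        q = np.clip(np.round(w / s + z), 0, qmax)
    return s * (q - z)                      # dequantized weights

rng = np.random.default_rng(0)
w = rng.standard_normal(256).astype(np.float32)
err = float(np.mean((w - hqq_like_quantize(w)) ** 2))
```

The only inputs are the weights themselves, so the whole loop runs without a single calibration sample.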
@abhayuvi, please collect and share results from your PR, not from the paper! Please also collect them for LLMs from different families, not just Llama.
Sure @MaximProshin, I collected the benchmark results from my PR! Here are perplexity results from the actual NNCF implementation comparing RTN vs. HQQ on WikiText-2.

**Benchmark setup**
**Results**
**Observations**

The gains at the 1–2B scale are modest but consistent with expectations: the HQQ paper reports progressively larger improvements at 7B and beyond, where the quantization error is more significant relative to model capacity. Unfortunately, I was unable to benchmark on 7B+ models (e.g. Llama-3-7B, Mistral-7B) due to hardware constraints on my development machine; exporting and running inference on a 7B model requires substantially more RAM and VRAM than is available here.

**Bug fix included**

During benchmarking I discovered and fixed a correctness bug in the HQQ implementation. It was causing catastrophic quantization error in practice. The fix rounds and clips the quantized values. Happy to discuss further if needed!
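For illustration, this is the kind of round-and-clip the fix describes (function and parameter names are hypothetical, not the PR's actual code):

```python
import numpy as np

def to_int4_codes(w, s, z):
    # Illustration of the fix: codes must be rounded to the integer
    # grid and clipped to [0, 15] before the integer cast; casting
    # unrounded or out-of-range float codes to uint8 silently
    # truncates/overflows, which is exactly the kind of behavior
    # that produces catastrophic quantization error.
    q = w / s + z                # float codes, possibly out of range
    q = np.round(q)              # round to the nearest integer
    q = np.clip(q, 0, 15)        # clip to the INT4 code range
    return q.astype(np.uint8)

codes = to_int4_codes(np.array([-1.0, 0.0, 2.5, 9.9]), s=0.5, z=2.0)
```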
Hi, just wanted to check in on this PR. Could you let me know if the implementation looks good or if there are any changes needed? Happy to address any feedback right away.
Changes
Added HQQ (Half-Quadratic Quantization) as a new data-free weight compression algorithm for the OpenVINO backend.
- `HQQ` class in `weight_compression/hqq.py`: iterative alternating least-squares that minimizes `||W - s*(Q - z)||²` without calibration data. Key detail: zero points remain float-valued (not rounded to integer), which is the main source of error reduction over standard min-max init.
- `AdvancedHQQParameters(num_iterations=20)` dataclass added to `advanced_parameters.py` and exported from `nncf.__init__`.
- `compress_weights(..., hqq=True, advanced_parameters=AdvancedCompressionParameters(hqq_params=...))`: new public API parameter, OpenVINO only (raises `nncf.UnsupportedBackendError` for Torch/TorchFX/ONNX).
- Raises `nncf.ValidationError` if both are enabled.

Reviewers should focus on
`hqq.py`, specifically the OLS scale/zero-point update step and the group-wise reshape logic.

Reason for changes
Implements the feature requested in issue #3347. HQQ is data-free, making it practical when a calibration dataset is unavailable or expensive to obtain, while still improving over naive round-to-nearest quantization via float zero points and iterative refinement.
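Since the group-wise reshape is called out as a review focus, here is a small numpy sketch of the usual per-group layout (function name and group size are illustrative; they are not the PR's actual helpers):

```python
import numpy as np

def groupwise_scales(w, group_size=32):
    # View a [out_ch, in_ch] weight as [out_ch, n_groups, group_size]
    # and compute one (scale, float zero point) pair per group, so
    # quantization error is bounded within each group rather than
    # across the whole channel.
    out_ch, in_ch = w.shape
    assert in_ch % group_size == 0, "in_ch must be divisible by group_size"
    groups = w.reshape(out_ch, in_ch // group_size, group_size)
    s = (groups.max(axis=-1) - groups.min(axis=-1)) / 15.0  # INT4 scale
    z = -groups.min(axis=-1) / s                            # float zero point
    return groups, s, z

w = np.random.default_rng(1).standard_normal((8, 64)).astype(np.float32)
groups, s, z = groupwise_scales(w)
```

The key check when reviewing is that the reshape groups contiguous elements along the input-channel axis, so each `(scale, zero point)` pair covers exactly one group.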
Related tickets
#3347
Tests
Added 12 unit tests in `tests/openvino/native/quantization/test_hqq.py`:

- `num_iterations` parameter is respected (more iterations → lower or equal error)
- `AdvancedHQQParameters` is exported from the public API
- `ValidationError`
- `ValidationError`

All 12 tests pass:
`pytest tests/openvino/native/quantization/test_hqq.py -v`