feat: add HQQ data-free weight compression algorithm by abhayuvi · Pull Request #3981 · openvinotoolkit/nncf

abhayuvi · 2026-03-13T07:23:31Z

Changes

Added HQQ (Half-Quadratic Quantization) as a new data-free weight compression algorithm for the OpenVINO backend.

New HQQ class in weight_compression/hqq.py — iterative alternating least-squares that minimizes ||W - s*(Q - z)||² without calibration data. Key detail: zero points remain float-valued (not rounded to integer), which is the main source of error reduction over standard min-max init.
AdvancedHQQParameters(num_iterations=20) dataclass added to advanced_parameters.py and exported from nncf.__init__.
compress_weights(..., hqq=True, advanced_parameters=AdvancedCompressionParameters(hqq_params=...)) — new public API parameter, OpenVINO only (raises nncf.UnsupportedBackendError for Torch/TorchFX/ONNX).
HQQ and GPTQ are mutually exclusive — raises nncf.ValidationError if both are enabled.
HQQ can be combined with AWQ and Scale Estimation (runs before scale estimation in the pipeline).

Reviewers should focus on hqq.py — specifically the OLS scale/zero-point update step and the group-wise reshape logic.

Reason for changes

Implements the feature requested in issue #3347. HQQ is data-free, making it practical when a calibration dataset is unavailable or expensive to obtain, while still improving over naive round-to-nearest quantization via float zero points and iterative refinement.

Related tickets

#3347

Tests

Added 12 unit tests in tests/openvino/native/quantization/test_hqq.py:

Quantization error reduction vs. round-to-nearest baseline (INT4/INT8, symmetric/asymmetric)
Float zero points are preserved (not rounded) in asymmetric mode
No zero point is produced in symmetric mode
num_iterations parameter is respected (more iterations → lower or equal error)
AdvancedHQQParameters is exported from the public API
HQQ+GPTQ mutual exclusion raises ValidationError
HQQ with INT8 raises ValidationError

All 12 tests pass: pytest tests/openvino/native/quantization/test_hqq.py -v

MaximProshin · 2026-03-13T07:30:23Z

@abhayuvi Thanks for your contribution! Do you have any results (esp accuracy) comparing this method with existing ones to demonstrate why it should be actually added?
@andreyanufr @AlexanderDokuchaev @l-bat FYI

abhayuvi · 2026-03-13T07:57:36Z

@abhayuvi Thanks for your contribution! Do you have any results (esp accuracy) comparing this method with existing ones to demonstrate why it should be actually added? @andreyanufr @AlexanderDokuchaev @l-bat FYI

Thank you @MaximProshin for the question! I went through the original paper (Badri & Shaji, Mobius 2023) and wanted to share the concrete results.

Perplexity on WikiText-2 (Llama-2 family)

Method	Bits	Group size	Data-free	Llama-2-7B PPL ↓	Llama-2-13B PPL ↓	Llama-2-70B PPL ↓
FP16	16	—	✅	5.18	4.63	OOM
GPTQ	4	g128	❌	5.41	4.74	3.24
AWQ	4	g64	❌	5.28	4.70	3.20
HQQ	4	g128	✅	5.35	4.74	3.21
HQQ	4	g64	✅	5.30	4.70	3.19

HQQ matches GPTQ and AWQ accuracy at INT4 without any calibration data.

Quantization Speed (Llama-2-70B, A100 SXM4)

Method	Time
GPTQ	215 min
AWQ	105 min
HQQ	4 min

HQQ is >50x faster than GPTQ on the largest model.

Why it works without data

Unlike GPTQ/AWQ which minimize layer output error using calibration samples, HQQ directly minimizes weight reconstruction error using a sparsity-promoting lp<1 norm solved via a closed-form Half-Quadratic solver , no gradients, no calibration dataset needed.

I believe this makes a strong case for including it in NNCF as a data-free alternative that delivers calibration-quality results. Happy to discuss further if needed!

MaximProshin · 2026-03-13T08:31:19Z

@abhayuvi , please collect and share results from your PR, not from the paper! Please also collect it for LLMs from different families, not just Llama.

abhayuvi · 2026-03-13T11:53:44Z

@abhayuvi , please collect and share results from your PR, not from the paper! Please also collect it for LLMs from different families, not just Llama.

Sure @MaximProshin , i collected the benchmark results from my PR! Here are perplexity results from the actual NNCF implementation comparing RTN vs HQQ on WikiText-2.

Benchmark Setup

Mode: INT4_ASYM, group_size=64
Metric: Perplexity on WikiText-2 test split (lower is better)
Samples: 30 × 512 tokens
Backend: OpenVINO (OVModelForCausalLM, exported with use_cache=False)

Results

Model	Architecture	RTN PPL ↓	HQQ PPL ↓	Δ (RTN − HQQ)
TinyLlama-1.1B-Chat	Llama	10.6545	10.6523	+0.0022 ✅
Qwen2-1.5B	Qwen	11.9888	11.9911	≈ tie
Phi-1.5	Phi (Microsoft)	33.3156	33.3022	+0.0134 ✅

Observations

The gains at the 1–2B scale are modest but consistent with expectations , the HQQ paper reports progressively larger improvements at 7B and beyond, where the quantization error is more significant relative to model capacity.

Unfortunately, I was unable to benchmark on 7B+ models (e.g. Llama-3-7B, Mistral-7B) due to hardware constraints on my development machine , exporting and running inference on a 7B model requires substantially more RAM and VRAM than is available here.
If the reviewers are able to run the benchmark on larger models, the improvements should be more pronounced.

Bug Fix Included

During benchmarking I discovered and fixed a correctness bug in the HQQ implementation:
the float zero_point optimized during HQQ iterations was not being rounded before returning. This caused a mismatch between:

the float zp used during quantization, and
the integer zp stored by the OV backend (uint4)

This was causing catastrophic quantization error in practice. The fix rounds and clips zero_point to the valid integer range at the end of _calculate_hqq_params via a new _round_zero_point() method. The test suite has been updated to reflect and verify this corrected behavior.

Happy to discuss further if needed!

abhayuvi · 2026-04-06T06:56:58Z

Hi, just wanted to check in on this PR. Could you let me know if the implementation looks good or if there are any changes needed? Happy to address any feedback right away.

feat: add HQQ data-free weight compression algorithm

de9e807

abhayuvi requested a review from a team as a code owner March 13, 2026 07:23

github-actions Bot added the NNCF OpenVINO Pull requests that updates NNCF OpenVINO label Mar 13, 2026

fix: round HQQ zero_point to integer for consistent quant/dequant

c40b11a

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: add HQQ data-free weight compression algorithm#3981

feat: add HQQ data-free weight compression algorithm#3981
abhayuvi wants to merge 2 commits intoopenvinotoolkit:developfrom
abhayuvi:feature/hqq-weight-compression

abhayuvi commented Mar 13, 2026

Uh oh!

MaximProshin commented Mar 13, 2026 •

edited

Loading

Uh oh!

abhayuvi commented Mar 13, 2026

Uh oh!

MaximProshin commented Mar 13, 2026 •

edited

Loading

Uh oh!

abhayuvi commented Mar 13, 2026 •

edited

Loading

Uh oh!

abhayuvi commented Apr 6, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

abhayuvi commented Mar 13, 2026

Changes

Reason for changes

Related tickets

Tests

Uh oh!

MaximProshin commented Mar 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

abhayuvi commented Mar 13, 2026

Perplexity on WikiText-2 (Llama-2 family)

Quantization Speed (Llama-2-70B, A100 SXM4)

Why it works without data

Uh oh!

MaximProshin commented Mar 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

abhayuvi commented Mar 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Benchmark Setup

Results

Observations

Bug Fix Included

Uh oh!

abhayuvi commented Apr 6, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

MaximProshin commented Mar 13, 2026 •

edited

Loading

MaximProshin commented Mar 13, 2026 •

edited

Loading

abhayuvi commented Mar 13, 2026 •

edited

Loading