Skip to content

feat: add HQQ data-free weight compression algorithm#3981

Open
abhayuvi wants to merge 2 commits intoopenvinotoolkit:developfrom
abhayuvi:feature/hqq-weight-compression
Open

feat: add HQQ data-free weight compression algorithm#3981
abhayuvi wants to merge 2 commits intoopenvinotoolkit:developfrom
abhayuvi:feature/hqq-weight-compression

Conversation

@abhayuvi
Copy link
Copy Markdown

Changes

Added HQQ (Half-Quadratic Quantization) as a new data-free weight compression algorithm for the OpenVINO backend.

  • New HQQ class in weight_compression/hqq.py — iterative alternating least-squares that minimizes ||W - s*(Q - z)||² without calibration data. Key detail: zero points remain float-valued (not rounded to integer), which is the main source of error reduction over standard min-max init.
  • AdvancedHQQParameters(num_iterations=20) dataclass added to advanced_parameters.py and exported from nncf.__init__.
  • compress_weights(..., hqq=True, advanced_parameters=AdvancedCompressionParameters(hqq_params=...)) — new public API parameter, OpenVINO only (raises nncf.UnsupportedBackendError for Torch/TorchFX/ONNX).
  • HQQ and GPTQ are mutually exclusive — raises nncf.ValidationError if both are enabled.
  • HQQ can be combined with AWQ and Scale Estimation (runs before scale estimation in the pipeline).

Reviewers should focus on hqq.py — specifically the OLS scale/zero-point update step and the group-wise reshape logic.

Reason for changes

Implements the feature requested in issue #3347. HQQ is data-free, making it practical when a calibration dataset is unavailable or expensive to obtain, while still improving over naive round-to-nearest quantization via float zero points and iterative refinement.

Related tickets

#3347

Tests

Added 12 unit tests in tests/openvino/native/quantization/test_hqq.py:

  • Quantization error reduction vs. round-to-nearest baseline (INT4/INT8, symmetric/asymmetric)
  • Float zero points are preserved (not rounded) in asymmetric mode
  • No zero point is produced in symmetric mode
  • num_iterations parameter is respected (more iterations → lower or equal error)
  • AdvancedHQQParameters is exported from the public API
  • HQQ+GPTQ mutual exclusion raises ValidationError
  • HQQ with INT8 raises ValidationError

All 12 tests pass: pytest tests/openvino/native/quantization/test_hqq.py -v

@abhayuvi abhayuvi requested a review from a team as a code owner March 13, 2026 07:23
@github-actions github-actions Bot added the NNCF OpenVINO Pull requests that updates NNCF OpenVINO label Mar 13, 2026
@MaximProshin
Copy link
Copy Markdown
Collaborator

MaximProshin commented Mar 13, 2026

@abhayuvi Thanks for your contribution! Do you have any results (esp accuracy) comparing this method with existing ones to demonstrate why it should be actually added?
@andreyanufr @AlexanderDokuchaev @l-bat FYI

@abhayuvi
Copy link
Copy Markdown
Author

@abhayuvi Thanks for your contribution! Do you have any results (esp accuracy) comparing this method with existing ones to demonstrate why it should be actually added? @andreyanufr @AlexanderDokuchaev @l-bat FYI

Thank you @MaximProshin for the question! I went through the original paper (Badri & Shaji, Mobius 2023) and wanted to share the concrete results.

Perplexity on WikiText-2 (Llama-2 family)

Method Bits Group size Data-free Llama-2-7B PPL ↓ Llama-2-13B PPL ↓ Llama-2-70B PPL ↓
FP16 16 5.18 4.63 OOM
GPTQ 4 g128 5.41 4.74 3.24
AWQ 4 g64 5.28 4.70 3.20
HQQ 4 g128 5.35 4.74 3.21
HQQ 4 g64 5.30 4.70 3.19

HQQ matches GPTQ and AWQ accuracy at INT4 without any calibration data.

Quantization Speed (Llama-2-70B, A100 SXM4)

Method Time
GPTQ 215 min
AWQ 105 min
HQQ 4 min

HQQ is >50x faster than GPTQ on the largest model.

Why it works without data

Unlike GPTQ/AWQ which minimize layer output error using calibration samples, HQQ directly minimizes weight reconstruction error using a sparsity-promoting lp<1 norm solved via a closed-form Half-Quadratic solver , no gradients, no calibration dataset needed.

I believe this makes a strong case for including it in NNCF as a data-free alternative that delivers calibration-quality results. Happy to discuss further if needed!

@MaximProshin
Copy link
Copy Markdown
Collaborator

MaximProshin commented Mar 13, 2026

@abhayuvi , please collect and share results from your PR, not from the paper! Please also collect it for LLMs from different families, not just Llama.

@abhayuvi
Copy link
Copy Markdown
Author

abhayuvi commented Mar 13, 2026

@abhayuvi , please collect and share results from your PR, not from the paper! Please also collect it for LLMs from different families, not just Llama.

Sure @MaximProshin , i collected the benchmark results from my PR! Here are perplexity results from the actual NNCF implementation comparing RTN vs HQQ on WikiText-2.

Benchmark Setup

  • Mode: INT4_ASYM, group_size=64
  • Metric: Perplexity on WikiText-2 test split (lower is better)
  • Samples: 30 × 512 tokens
  • Backend: OpenVINO (OVModelForCausalLM, exported with use_cache=False)

Results

Model Architecture RTN PPL ↓ HQQ PPL ↓ Δ (RTN − HQQ)
TinyLlama-1.1B-Chat Llama 10.6545 10.6523 +0.0022
Qwen2-1.5B Qwen 11.9888 11.9911 ≈ tie
Phi-1.5 Phi (Microsoft) 33.3156 33.3022 +0.0134

Observations

The gains at the 1–2B scale are modest but consistent with expectations , the HQQ paper reports progressively larger improvements at 7B and beyond, where the quantization error is more significant relative to model capacity.

Unfortunately, I was unable to benchmark on 7B+ models (e.g. Llama-3-7B, Mistral-7B) due to hardware constraints on my development machine , exporting and running inference on a 7B model requires substantially more RAM and VRAM than is available here.
If the reviewers are able to run the benchmark on larger models, the improvements should be more pronounced.

Bug Fix Included

During benchmarking I discovered and fixed a correctness bug in the HQQ implementation:
the float zero_point optimized during HQQ iterations was not being rounded before returning. This caused a mismatch between:

  • the float zp used during quantization, and
  • the integer zp stored by the OV backend (uint4)

This was causing catastrophic quantization error in practice. The fix rounds and clips zero_point to the valid integer range at the end of _calculate_hqq_params via a new _round_zero_point() method. The test suite has been updated to reflect and verify this corrected behavior.

Happy to discuss further if needed!

@abhayuvi
Copy link
Copy Markdown
Author

abhayuvi commented Apr 6, 2026

Hi, just wanted to check in on this PR. Could you let me know if the implementation looks good or if there are any changes needed? Happy to address any feedback right away.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

NNCF OpenVINO Pull requests that updates NNCF OpenVINO

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants