Enable CPU Optimizer Support for bitsandbytes #1901

Open
jiqing-feng wants to merge 14 commits into bitsandbytes-foundation:main from jiqing-feng:cpu
Conversation


@jiqing-feng jiqing-feng commented Mar 18, 2026

Summary

This PR enables all bitsandbytes optimizers (32-bit and 8-bit blockwise) to run on CPU. Previously optimizers were restricted to CUDA/XPU only.

Motivation

Users on CPU-only machines had to fall back to vanilla PyTorch optimizers, losing the benefits of 8-bit state compression. This PR removes that limitation.

Changes

Python CPU Kernels (bitsandbytes/backends/cpu/ops.py)

  • Implemented _optimizer_update_32bit_cpu (Adam, AdEMAMix, LAMB/LARS, Lion, SGD, RMSprop)
  • Implemented _optimizer_update_8bit_blockwise_cpu with blockwise quantization/dequantization
  • Fixed AdEMAMix m1/m2 interleaved state layout
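To make the 32-bit CPU path concrete, here is a minimal sketch of what an Adam-style update on plain CPU tensors looks like. The function name `adam_update_32bit_cpu` and its signature are illustrative assumptions for this example, not the PR's actual kernel API.

```python
# Illustrative sketch only: the real kernel in bitsandbytes/backends/cpu/ops.py
# covers many optimizers; this shows the Adam case on float32 CPU tensors.
import torch

def adam_update_32bit_cpu(p, grad, m, v, step, lr=1e-3,
                          beta1=0.9, beta2=0.999, eps=1e-8):
    """In-place, bias-corrected Adam step on CPU tensors."""
    m.mul_(beta1).add_(grad, alpha=1 - beta1)            # first moment
    v.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)  # second moment
    m_hat = m / (1 - beta1 ** step)                      # bias correction
    v_hat = v / (1 - beta2 ** step)
    p.sub_(lr * m_hat / (v_hat.sqrt() + eps))

p = torch.zeros(4)
g = torch.ones(4)
m, v = torch.zeros(4), torch.zeros(4)
adam_update_32bit_cpu(p, g, m, v, step=1)
```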

Optimizer Framework (bitsandbytes/optim/optimizer.py, bitsandbytes/functional.py)

  • get_state_buffer: CPU uses regular tensors; paged optimizers fall back to non-paged with warning
  • to_gpu: skips CPU parameters
  • is_on_gpu: accepts all-CPU tensor sets, rejects mixed CPU/GPU
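The device-consistency rule above (accept an all-CPU set, reject a mixed set) can be sketched as follows. The helper name `check_same_device` is an assumption for illustration; it is not the actual `is_on_gpu` implementation.

```python
# Sketch of the mixed-device check described above (illustrative names).
import torch

def check_same_device(tensors):
    """Allow an all-CPU or all-GPU tensor set; reject mixed CPU/GPU sets."""
    devices = {t.device.type for t in tensors if t is not None}
    if len(devices) > 1:
        raise RuntimeError(f"Tensors span multiple device types: {devices}")
    return devices == {"cuda"}  # True only for an all-GPU set

a = torch.zeros(2)
b = torch.ones(2)
on_gpu = check_same_device([a, b])  # all-CPU set is accepted, returns False
```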

Native C++ Kernels (csrc/cpu_ops.cpp, csrc/cpu_ops.h, csrc/pythonInterface.cpp)

  • Replaced per-element binary search with LUT-based quantization (4-slot cached LUT with content fingerprinting)
  • Added quantize_cpu_bf16 / quantize_cpu_fp16 for direct half-precision quantization
  • Fixed 8-bit blockwise kernel for bf16/fp16 inputs
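The LUT idea replaces a per-element binary search over the 256-entry codebook with a one-time precomputed table: discretize the normalized input range into uniform buckets and store the nearest code index for each bucket, so quantizing a value becomes a single table load. This Python sketch shows the principle under stated assumptions (a sorted codebook over [-1, 1], an arbitrary bucket count); the actual C++ kernel, its 4-slot LUT cache, and its fingerprinting are not reproduced here.

```python
# Sketch of LUT-based blockwise quantization; bucket count and names are
# illustrative, and the codebook below is a stand-in for the real one.
import numpy as np

def build_lut(code, buckets=4096):
    """Precompute the nearest-code index for each uniform bucket in [-1, 1]."""
    centers = -1.0 + (np.arange(buckets) + 0.5) * (2.0 / buckets)
    idx = np.clip(np.searchsorted(code, centers), 1, len(code) - 1)
    left, right = code[idx - 1], code[idx]
    # pick whichever neighbor is closer to the bucket center
    return np.where(centers - left <= right - centers, idx - 1, idx).astype(np.uint8)

def quantize_block(x, lut):
    """Quantize one block: normalize by absmax, then one LUT load per value."""
    absmax = np.abs(x).max() or 1.0
    normed = np.clip(x / absmax, -1.0, 1.0)
    buckets = len(lut)
    b = np.minimum(((normed + 1.0) * 0.5 * buckets).astype(np.int64), buckets - 1)
    return lut[b], absmax

code = np.linspace(-1.0, 1.0, 256)  # stand-in for the dynamic quantization map
lut = build_lut(code)
q, absmax = quantize_block(np.array([0.5, -0.25, 1.0]), lut)
```

Building the LUT costs one pass over the buckets, but every subsequent element is quantized in O(1) instead of O(log n), which is where the speedup over per-element binary search comes from.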

Tests (tests/test_optim.py)

  • Removed no_cpu=True filters — all optimizer tests now run on CPU
  • Paged optimizer variants auto-skip on CPU

Example (examples/cpu/cpu_training.py)

  • End-to-end CPU training with JackFram/llama-68m + Alpaca, supporting multiple optimizers, --compare mode, and HF Trainer integration

Supported Optimizers on CPU

| Optimizer       | 32-bit | 8-bit Blockwise |
|-----------------|--------|-----------------|
| Adam / AdamW    | ✓      | ✓               |
| SGD (Momentum)  | ✓      | ✓               |
| Lion            | ✓      | ✓               |
| RMSprop         | ✓      | ✓               |
| LARS            | ✓      | —               |
| LAMB            | ✓      | —               |
| AdEMAMix        | ✓      | ✓               |

Note: Paged variants (e.g., PagedAdamW) fall back to non-paged on CPU. LARS/LAMB 8-bit blockwise is not available upstream.
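The paged-to-non-paged fallback in the note above could look roughly like this. The signature of `get_state_buffer` here is a guess for illustration; the real function in `bitsandbytes/functional.py` may differ.

```python
# Hedged sketch of the paged-on-CPU fallback (illustrative signature).
import warnings
import torch

def get_state_buffer(shape, dtype, device, paged=False):
    if paged and device.type == "cpu":
        warnings.warn("Paged optimizers are not supported on CPU; "
                      "falling back to a regular (non-paged) state buffer.")
        paged = False
    if paged:
        # On CUDA this would allocate paged/unified memory (not shown here).
        raise NotImplementedError("paged allocation sketch omitted")
    return torch.zeros(shape, dtype=dtype, device=device)

buf = get_state_buffer((8,), torch.float32, torch.device("cpu"), paged=True)
```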

How to Test

pytest tests/test_optim.py -x -v -k "cpu"
python examples/cpu/cpu_training.py --optimizer adamw8bit --steps 20

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
@jiqing-feng jiqing-feng marked this pull request as draft March 18, 2026 02:25
@jiqing-feng jiqing-feng marked this pull request as ready for review March 18, 2026 02:41
@jiqing-feng
Contributor Author

Hi @matthewdouglas
Please review this PR. Thanks!

@matthewdouglas matthewdouglas added this to the v0.50.0 milestone Mar 18, 2026
@matthewdouglas matthewdouglas added the Intel, x64 CPU, and Optimizers (issues or feature requests relating to optimizers) labels Mar 18, 2026
@github-actions

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@jiqing-feng
Contributor Author

Hi @matthewdouglas. The failed tests seem to be because the CI node does not have AVX-512. Please rerun the CI to see if I fixed this issue.
