webgpu: Increase MatMulNBits K-parallelism with tile_size_k_vec=32#27834

Merged
guschmue merged 3 commits into main from webgpu-matmulnbits-step1-correct
Mar 30, 2026
Conversation

Contributor

@qjia7 qjia7 commented Mar 25, 2026

Use tile_size_k_vec=32 (instead of 16) for MatMulNBits default kernel, doubling the number of threads working on K-dimension reduction per output row. This improves token generation throughput by ~3% on NVIDIA GPUs by better utilizing memory bandwidth.

Intel devices retain tile_size_k_vec=16 due to different subgroup and cache characteristics.

Changes:

  • matmul_nbits.h: Add tile_size_k_vec parameter (default 16) to MatMulNBitsProgram constructor.
  • matmul_nbits.cc: Select tile_size_k_vec=32 for non-Intel vendors, pass to program constructor.

@qjia7 qjia7 marked this pull request as ready for review March 25, 2026 07:51
@qjia7 qjia7 requested a review from Copilot March 25, 2026 07:52
Contributor

Copilot AI left a comment


Pull request overview

Updates the WebGPU MatMulNBits default shader variant to increase K-reduction parallelism on non-Intel GPUs by making tile_size_k_vec configurable and selecting a larger default for better throughput.

Changes:

  • Add a tile_size_k_vec parameter (default 16) to MatMulNBitsProgram so K-parallelism can be tuned per device.
  • Use tile_size_k_vec = 32 for non-Intel adapters and keep 16 for Intel adapters when constructing the default MatMulNBits program.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

Files changed:
  • onnxruntime/contrib_ops/webgpu/quantization/matmul_nbits.h — Extends MatMulNBitsProgram to store a configurable tile_size_k_vec_ used during shader generation.
  • onnxruntime/contrib_ops/webgpu/quantization/matmul_nbits.cc — Plumbs tile_size_k_vec_ into the WGSL template parameters and selects 16 vs 32 based on adapter vendor.


@qjia7 qjia7 force-pushed the webgpu-matmulnbits-step1-correct branch from 183b6e3 to 304383d Compare March 25, 2026 08:00
guschmue
guschmue previously approved these changes Mar 25, 2026
@guschmue guschmue added the ep:WebGPU ort-web webgpu provider label Mar 25, 2026
@guschmue
Contributor

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).

@qjia7 qjia7 force-pushed the webgpu-matmulnbits-step1-correct branch from 952a4de to aada144 Compare March 30, 2026 01:27
tile_size_k_vec varies by GPU vendor (32 for NVIDIA, 16 for Intel).
Without it in CacheHint, a cached shader compiled with one value
could be incorrectly reused when tile_size_k_vec changes, producing
wrong results or suboptimal performance.
@guschmue guschmue merged commit 358628a into main Mar 30, 2026
105 of 109 checks passed
@guschmue guschmue deleted the webgpu-matmulnbits-step1-correct branch March 30, 2026 22:30
