webgpu: Increase MatMulNBits K-parallelism with tile_size_k_vec=32#27834

Merged
guschmue merged 3 commits into main from webgpu-matmulnbits-step1-correct
Mar 30, 2026
Conversation

Contributor

@qjia7 qjia7 commented Mar 25, 2026

Use tile_size_k_vec=32 (instead of 16) for MatMulNBits default kernel, doubling the number of threads working on K-dimension reduction per output row. This improves token generation throughput by ~3% on NVIDIA GPUs by better utilizing memory bandwidth.

Intel devices retain tile_size_k_vec=16 due to different subgroup and cache characteristics.

Changes:

  • matmul_nbits.h: Add tile_size_k_vec parameter (default 16) to MatMulNBitsProgram constructor.
  • matmul_nbits.cc: Select tile_size_k_vec=32 for non-Intel vendors, pass to program constructor.

@qjia7 qjia7 marked this pull request as ready for review March 25, 2026 07:51
@qjia7 qjia7 requested a review from Copilot March 25, 2026 07:52
Contributor

Copilot AI left a comment


Pull request overview

Updates the WebGPU MatMulNBits default shader variant to increase K-reduction parallelism on non-Intel GPUs by making tile_size_k_vec configurable and selecting a larger default for better throughput.

Changes:

  • Add a tile_size_k_vec parameter (default 16) to MatMulNBitsProgram so K-parallelism can be tuned per device.
  • Use tile_size_k_vec = 32 for non-Intel adapters and keep 16 for Intel adapters when constructing the default MatMulNBits program.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

Files changed:
  • onnxruntime/contrib_ops/webgpu/quantization/matmul_nbits.h — Extends MatMulNBitsProgram to store a configurable tile_size_k_vec_ used during shader generation.
  • onnxruntime/contrib_ops/webgpu/quantization/matmul_nbits.cc — Plumbs tile_size_k_vec_ into the WGSL template parameters and selects 16 vs 32 based on adapter vendor.


@qjia7 qjia7 force-pushed the webgpu-matmulnbits-step1-correct branch from 183b6e3 to 304383d Compare March 25, 2026 08:00
guschmue
guschmue previously approved these changes Mar 25, 2026
@guschmue guschmue added the ep:WebGPU ort-web webgpu provider label Mar 25, 2026
@guschmue
Contributor

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).

@qjia7 qjia7 force-pushed the webgpu-matmulnbits-step1-correct branch from 952a4de to aada144 Compare March 30, 2026 01:27
tile_size_k_vec varies by GPU vendor (32 for NVIDIA, 16 for Intel).
Without it in CacheHint, a cached shader compiled with one value
could be incorrectly reused when tile_size_k_vec changes, producing
wrong results or suboptimal performance.
@guschmue guschmue merged commit 358628a into main Mar 30, 2026
105 of 109 checks passed
@guschmue guschmue deleted the webgpu-matmulnbits-step1-correct branch March 30, 2026 22:30
