[MLAS] Add depthwise with multiplier conv special kernel for NCHW data layout on Avx512#27874
Conversation
Pull request overview
Adds a specialized MLAS convolution path targeting an exact depthwise-with-multiplier grouped Conv2D shape family (MobileCLIP projection) and wires it up to an AVX512F kernel for NCHW/CHW slices, along with benchmark and unit-test coverage for the target shapes.
Changes:
- Introduces a new AVX512F kernel and dispatch path for depthwise multiplier=2, 7x7, stride=2, pad=3 grouped conv in NCHW.
- Adds MobileCLIP-shaped test registrations and benchmark cases.
- Adds new MLAS algorithm enum value and selection/threading logic in
convolve.cpp.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| onnxruntime/test/mlas/unittest/test_conv2d_fixture.h | Adds exact MobileCLIP depthwise-multiplier-2 grouped conv shapes to the short-execute test matrix. |
| onnxruntime/test/mlas/bench/bench_sconv.cpp | Adds a new “MobileClip” benchmark configuration for the 7x7 grouped depthwise-multiplier-2 cases. |
| onnxruntime/core/mlas/lib/sconv_nchwc_kernel_neon.cpp | Removes the NEON NCHWc kernel implementation source file. |
| onnxruntime/core/mlas/lib/sconv_nchw_depthwise_multiplier_greater_than_1.cpp | Adds a narrow entrypoint for the depth-multiplier>1 (currently multiplier=2) MobileCLIP shape family on AMD64 AVX512F. |
| onnxruntime/core/mlas/lib/sconv_nchw_depthwise_multiplier_1.cpp | Adds a depthwise-multiplier-1 (3x3, stride 1, pad≤1) CHW kernel entrypoint. |
| onnxruntime/core/mlas/lib/mlasi.h | Adds prototypes for the new depthwise-with-multiplier entrypoint and AVX512F kernel. |
| onnxruntime/core/mlas/lib/intrinsics/avx512/sconv_nchw_depthwise_multiplier_greater_than_1_avx512f.cpp | Implements the AVX512F kernel for the exact MobileCLIP multiplier=2 grouped projection case. |
| onnxruntime/core/mlas/lib/convolve.cpp | Adds algorithm selection, dispatch, and threaded execution for the new depthwise multiplier>1 path. |
| onnxruntime/core/mlas/inc/mlas.h | Extends MLAS_CONV_ALGORITHM with MlasConvAlgorithmDepthwiseMultiplierGreaterThan1. |
| cmake/onnxruntime_mlas.cmake | Adds new source files to builds (including AVX512 and ARM64 lists). |
Summary
The implementation is nicely scoped: dispatch is constrained to the exact MobileClip depthwise-multiplier-2 shape family, the AVX512 kernel keeps the hot path allocation-free, and the PR adds both MLAS and provider-level coverage for the target shapes. I did not find a correctness blocker in the new dispatch or kernel logic. The main remaining gap is test coverage for the new kernel's beta/post-conv path.
Review
1. Specialized MLAS dispatch and AVX512 kernel
| # | Severity | Component | Issue |
|---|---|---|---|
| 1 | Suggestion | Regression coverage | The new specialized kernel's beta/post-conv path is implemented but not exercised by the added tests. |
Verdict
COMMENT — the implementation itself looks sound, but I would like one targeted regression test for the new beta/activation path before calling coverage complete.
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
tianleiwu left a comment:
APPROVE — I did not find a blocking implementation issue; the only remaining gap is making AVX512-path coverage deterministic in CI.
AVX512 coverage is not deterministic: the new tests validate the MobileClip shapes, but they still rely on whatever runtime kernel MlasConvPrepare() selects. On runners without AVX512 support, these tests will pass via the generic convolution path, which means the new MlasConvAlgorithmDepthwiseMultiplierGreaterThan1 dispatch and MlasConvDepthwiseMultiplier2CHWKernel7x7S2Avx512F() code can merge without actually executing in CI. Consider adding an AVX512-gated unit that explicitly checks the selected algorithm or directly exercises the specialized kernel when GetMlasPlatform().ConvNchwFloatKernel == MlasConvNchwFloatKernelAvx512F.
if (GetMlasPlatform().ConvNchwFloatKernel == MlasConvNchwFloatKernelAvx512F) {
MLAS_CONV_PARAMETERS parameters;
size_t working_buffer_size = 0;
MlasConvPrepare(&parameters, /* ... exact MobileClip shape ... */);
ASSERT_EQ(parameters.Algorithm, MlasConvAlgorithmDepthwiseMultiplierGreaterThan1);
}
Thanks for this. I added a dispatch check for AVX512.
Description
Adds a special AVX512 kernel for depthwise conv with a channel multiplier of 2. This improves the performance of 3 costly conv operations (7x7 kernels) in the MobileClip model by approximately 2.4x (MLAS benchmark comparison below).
These are 3 ops with 7x7 kernels, stride 2, padding 3, and 2 output channels per group (depthwise multiplier = 2).
These Conv operations cannot be dispatched to the NCHWc path because the Cout per group is smaller than the block size: on AVX512 the block size is 16, while the Cout per group is only 2. The NCHWc suite does include a special depthwise kernel, but it can only handle Cout per group = 1.
MLAS Benchmark Before and After comparison:
Motivation and Context
Just by optimizing these 3 conv operations, MobileClip is about 700us-850us faster, and the entire model runs in under 14ms on an AVX512 machine.