
[MLAS] Add depthwise with multiplier conv special kernel for NCHW data layout on Avx512 #27874

Merged
hariharans29 merged 22 commits into main from hari/nchw_depthwise_with_multiplier
Apr 1, 2026
Conversation

@hariharans29
Member

@hariharans29 hariharans29 commented Mar 27, 2026

Description

Adds a special AVX512 kernel for depthwise conv with multiplier = 2. This improves the performance of 3 costly conv operations (7x7 kernels) in the MobileClip model by up to approx 2.4x (MLAS benchmark numbers below).

The 3 ops are:

  1. Cin=64, Cout=128, group=64, H=64, W=64, kH=7, kW=7
  2. Cin=128, Cout=256, group=128, H=32, W=32, kH=7, kW=7
  3. Cin=256, Cout=512, group=256, H=16, W=16, kH=7, kW=7

These Conv operations cannot be dispatched to NCHWc because the Cout per group is below the block size: on AVX512 the block size is 16, while the Cout per group here is only 2. The NCHWc suite does have a special depthwise kernel, but it can only handle Cout per group = 1.

MLAS Benchmark Before and After comparison:

| Benchmark | BEFORE mean (ns) | AFTER mean (ns) | Speedup |
| --- | --- | --- | --- |
| SCONV_NCHW G64 | 3,151,190 | 1,391,419 | 2.26x |
| SCONV_NCHW G128 | 1,646,040 | 824,654 | 2.00x |
| SCONV_NCHW G256 | 978,843 | 533,375 | 1.84x |
| SCONV_NCHW_THREADED G64 | 873,283 | 367,722 | 2.37x |
| SCONV_NCHW_THREADED G128 | 445,786 | 226,777 | 1.97x |
| SCONV_NCHW_THREADED G256 | 264,473 | 147,997 | 1.79x |

Motivation and Context

Just by optimizing these 3 conv operations, MobileClip becomes about 700us-850us faster, and the entire model runs in <14ms on an AVX512 machine.

Contributor

Copilot AI left a comment


Pull request overview

Adds a specialized MLAS convolution path targeting an exact depthwise-with-multiplier grouped Conv2D shape family (MobileCLIP projection) and wires it up to an AVX512F kernel for NCHW/CHW slices, along with benchmark and unit-test coverage for the target shapes.

Changes:

  • Introduces a new AVX512F kernel and dispatch path for depthwise multiplier=2, 7x7, stride=2, pad=3 grouped conv in NCHW.
  • Adds MobileCLIP-shaped test registrations and benchmark cases.
  • Adds new MLAS algorithm enum value and selection/threading logic in convolve.cpp.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| onnxruntime/test/mlas/unittest/test_conv2d_fixture.h | Adds exact MobileCLIP depthwise-multiplier-2 grouped conv shapes to the short-execute test matrix. |
| onnxruntime/test/mlas/bench/bench_sconv.cpp | Adds a new "MobileClip" benchmark configuration for the 7x7 grouped depthwise-multiplier-2 cases. |
| onnxruntime/core/mlas/lib/sconv_nchwc_kernel_neon.cpp | Removes the NEON NCHWc kernel implementation source file. |
| onnxruntime/core/mlas/lib/sconv_nchw_depthwise_multiplier_greater_than_1.cpp | Adds a narrow entrypoint for the depth-multiplier>1 (currently multiplier=2) MobileCLIP shape family on AMD64 AVX512F. |
| onnxruntime/core/mlas/lib/sconv_nchw_depthwise_multiplier_1.cpp | Adds a depthwise-multiplier-1 (3x3, stride 1, pad≤1) CHW kernel entrypoint. |
| onnxruntime/core/mlas/lib/mlasi.h | Adds prototypes for the new depthwise-with-multiplier entrypoint and AVX512F kernel. |
| onnxruntime/core/mlas/lib/intrinsics/avx512/sconv_nchw_depthwise_multiplier_greater_than_1_avx512f.cpp | Implements the AVX512F kernel for the exact MobileCLIP multiplier=2 grouped projection case. |
| onnxruntime/core/mlas/lib/convolve.cpp | Adds algorithm selection, dispatch, and threaded execution for the new depthwise multiplier>1 path. |
| onnxruntime/core/mlas/inc/mlas.h | Extends MLAS_CONV_ALGORITHM with MlasConvAlgorithmDepthwiseMultiplierGreaterThan1. |
| cmake/onnxruntime_mlas.cmake | Adds new source files to builds (including AVX512 and ARM64 lists). |


Comment thread cmake/onnxruntime_mlas.cmake
Comment thread cmake/onnxruntime_mlas.cmake
Comment thread onnxruntime/core/mlas/lib/convolve.cpp
Comment thread onnxruntime/test/mlas/unittest/test_conv2d_fixture.h Outdated
Comment thread onnxruntime/test/mlas/bench/bench_sconv.cpp
hariharans29 and others added 4 commits March 26, 2026 19:30
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.



Comment thread onnxruntime/test/mlas/bench/bench_sconv.cpp
Comment thread onnxruntime/core/mlas/lib/convolve.cpp
@hariharans29 hariharans29 requested a review from tianleiwu March 27, 2026 23:51
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 1 comment.



Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.



Comment thread onnxruntime/core/mlas/inc/mlas.h
@tianleiwu
Contributor

Summary

The implementation is nicely scoped: dispatch is constrained to the exact MobileClip depthwise-multiplier-2 shape family, the AVX512 kernel keeps the hot path allocation-free, and the PR adds both MLAS and provider-level coverage for the target shapes. I did not find a correctness blocker in the new dispatch or kernel logic. The main remaining gap is test coverage for the new kernel's custom beta/post-conv path, which is implemented here but not exercised by the added tests.


Review

1. Specialized MLAS dispatch and AVX512 kernel (onnxruntime/core/mlas/lib/convolve.cpp)

Positive:

  • The dispatch is intentionally narrow and well-defended. Restricting selection to FilterCount == 2, InputChannels == 1, 7x7, stride 2, pad 3, dilation 1, and the AVX512 NCHW kernel avoids accidentally routing unrelated grouped convolutions through an under-generalized fast path.
  • The threaded execution model is also straightforward: partitioning over BatchCount * GroupCount matches the memory layout cleanly, and each worker operates on disjoint input/output slices with shared read-only filter and bias storage.

2. Regression coverage for the new custom path (onnxruntime/test/mlas/unittest/test_conv2d_fixture.h)

Positive:

  • The PR does add the three MobileClip target shapes to the MLAS short-execute coverage and adds provider-level Conv tests, so the main shape family is covered end-to-end.

Concern:

  • ⚠️ Missing beta/activation regression: the new AVX512 kernel and its scalar border helper both implement custom beta accumulation logic, and the dispatch path still applies MlasActivation afterwards. However, the new tests only validate plain Conv outputs with fresh output buffers, so a regression in the beta != 0 path, or in its interaction with post-conv activation, would likely slip through even though that logic is newly introduced in convolve.cpp, sconv_nchw_depthwise_multiplier_greater_than_1.cpp, and sconv_nchw_depthwise_multiplier_greater_than_1_avx512f.cpp.
    // Suggested follow-up: add an MLAS-level regression that pre-populates Output,
    // runs this path with beta = 1.0f, and verifies both interior and border cells.

Summary of Concerns

| # | Severity | Component | Issue |
| --- | --- | --- | --- |
| 1 | Suggestion | Regression coverage | The new specialized kernel's beta/post-conv path is implemented but not exercised by the added tests. |

Verdict

COMMENT — the implementation itself looks sound, but I would like one targeted regression test for the new beta/activation path before calling coverage complete.

Comment thread onnxruntime/core/mlas/lib/sconv_nchw_depthwise_multiplier_greater_than_1.cpp Outdated
Comment thread onnxruntime/test/mlas/unittest/test_conv2d_fixture.h
Contributor

@github-actions github-actions bot left a comment


You can commit the suggested changes from lintrunner.

Comment thread onnxruntime/test/mlas/unittest/test_conv2d.h Outdated
Comment thread onnxruntime/test/mlas/unittest/test_conv2d.h Outdated
Comment thread onnxruntime/test/mlas/unittest/test_conv2d.h Outdated
Comment thread onnxruntime/test/mlas/unittest/test_conv2d.h Outdated
Comment thread onnxruntime/test/mlas/unittest/test_conv2d.h Outdated
Comment thread onnxruntime/test/mlas/unittest/test_conv2d.h Outdated
Comment thread onnxruntime/test/mlas/unittest/test_conv2d.h
hariharans29 and others added 6 commits March 30, 2026 19:34
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 1 comment.



Comment thread onnxruntime/test/mlas/unittest/test_conv2d.h
tianleiwu
tianleiwu previously approved these changes Mar 31, 2026
Contributor

@tianleiwu tianleiwu left a comment


APPROVE — I did not find a blocking implementation issue; the only remaining gap is making AVX512-path coverage deterministic in CI.

AVX512 coverage is not deterministic: the new tests validate the MobileClip shapes, but they still rely on whatever runtime kernel MlasConvPrepare() selects. On runners without AVX512 support, these tests will pass via the generic convolution path, which means the new MlasConvAlgorithmDepthwiseMultiplierGreaterThan1 dispatch and MlasConvDepthwiseMultiplier2CHWKernel7x7S2Avx512F() code can merge without actually executing in CI. Consider adding an AVX512-gated unit that explicitly checks the selected algorithm or directly exercises the specialized kernel when GetMlasPlatform().ConvNchwFloatKernel == MlasConvNchwFloatKernelAvx512F.

```cpp
if (GetMlasPlatform().ConvNchwFloatKernel == MlasConvNchwFloatKernelAvx512F) {
  MLAS_CONV_PARAMETERS parameters;
  size_t working_buffer_size = 0;
  MlasConvPrepare(&parameters, /* ... exact MobileClip shape ... */);
  ASSERT_EQ(parameters.Algorithm, MlasConvAlgorithmDepthwiseMultiplierGreaterThan1);
}
```

@hariharans29
Member Author

hariharans29 commented Apr 1, 2026

> APPROVE — I did not find a blocking implementation issue; the only remaining gap is making AVX512-path coverage deterministic in CI.
>
> AVX512 coverage is not deterministic: the new tests validate the MobileClip shapes, but they still rely on whatever runtime kernel MlasConvPrepare() selects. On runners without AVX512 support, these tests will pass via the generic convolution path, which means the new MlasConvAlgorithmDepthwiseMultiplierGreaterThan1 dispatch and MlasConvDepthwiseMultiplier2CHWKernel7x7S2Avx512F() code can merge without actually executing in CI. Consider adding an AVX512-gated unit that explicitly checks the selected algorithm or directly exercises the specialized kernel when GetMlasPlatform().ConvNchwFloatKernel == MlasConvNchwFloatKernelAvx512F.
>
> ```cpp
> if (GetMlasPlatform().ConvNchwFloatKernel == MlasConvNchwFloatKernelAvx512F) {
>   MLAS_CONV_PARAMETERS parameters;
>   size_t working_buffer_size = 0;
>   MlasConvPrepare(&parameters, /* ... exact MobileClip shape ... */);
>   ASSERT_EQ(parameters.Algorithm, MlasConvAlgorithmDepthwiseMultiplierGreaterThan1);
> }
> ```

Thanks for this. I added a dispatch check for AVX512.

@hariharans29 hariharans29 enabled auto-merge (squash) April 1, 2026 05:14
@hariharans29 hariharans29 merged commit 57b265e into main Apr 1, 2026
103 of 105 checks passed
@hariharans29 hariharans29 deleted the hari/nchw_depthwise_with_multiplier branch April 1, 2026 07:01
