[MLAS] Add depthwise with multiplier conv special kernel for NCHW data layout on Avx512#27874
Conversation
Pull request overview
Adds a specialized MLAS convolution path targeting an exact depthwise-with-multiplier grouped Conv2D shape family (MobileCLIP projection) and wires it up to an AVX512F kernel for NCHW/CHW slices, along with benchmark and unit-test coverage for the target shapes.
Changes:
- Introduces a new AVX512F kernel and dispatch path for depthwise multiplier=2, 7x7, stride=2, pad=3 grouped conv in NCHW.
- Adds MobileCLIP-shaped test registrations and benchmark cases.
- Adds new MLAS algorithm enum value and selection/threading logic in
convolve.cpp.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| onnxruntime/test/mlas/unittest/test_conv2d_fixture.h | Adds exact MobileCLIP depthwise-multiplier-2 grouped conv shapes to the short-execute test matrix. |
| onnxruntime/test/mlas/bench/bench_sconv.cpp | Adds a new “MobileClip” benchmark configuration for the 7x7 grouped depthwise-multiplier-2 cases. |
| onnxruntime/core/mlas/lib/sconv_nchwc_kernel_neon.cpp | Removes the NEON NCHWc kernel implementation source file. |
| onnxruntime/core/mlas/lib/sconv_nchw_depthwise_multiplier_greater_than_1.cpp | Adds a narrow entrypoint for the depth-multiplier>1 (currently multiplier=2) MobileCLIP shape family on AMD64 AVX512F. |
| onnxruntime/core/mlas/lib/sconv_nchw_depthwise_multiplier_1.cpp | Adds a depthwise-multiplier-1 (3x3, stride 1, pad≤1) CHW kernel entrypoint. |
| onnxruntime/core/mlas/lib/mlasi.h | Adds prototypes for the new depthwise-with-multiplier entrypoint and AVX512F kernel. |
| onnxruntime/core/mlas/lib/intrinsics/avx512/sconv_nchw_depthwise_multiplier_greater_than_1_avx512f.cpp | Implements the AVX512F kernel for the exact MobileCLIP multiplier=2 grouped projection case. |
| onnxruntime/core/mlas/lib/convolve.cpp | Adds algorithm selection, dispatch, and threaded execution for the new depthwise multiplier>1 path. |
| onnxruntime/core/mlas/inc/mlas.h | Extends MLAS_CONV_ALGORITHM with MlasConvAlgorithmDepthwiseMultiplierGreaterThan1. |
| cmake/onnxruntime_mlas.cmake | Adds new source files to builds (including AVX512 and ARM64 lists). |
Summary
The implementation is nicely scoped: dispatch is constrained to the exact MobileClip depthwise-multiplier-2 shape family, the AVX512 kernel keeps the hot path allocation-free, and the PR adds both MLAS and provider-level coverage for the target shapes. I did not find a correctness blocker in the new dispatch or kernel logic. The main remaining gap is test coverage for the new kernel's beta/post-conv path.
Review
1. Specialized MLAS dispatch and AVX512 kernel
| # | Severity | Component | Issue |
|---|---|---|---|
| 1 | Suggestion | Regression coverage | The new specialized kernel's beta/post-conv path is implemented but not exercised by the added tests. |
Verdict
COMMENT — the implementation itself looks sound, but I would like one targeted regression test for the new beta/activation path before calling coverage complete.
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
tianleiwu left a comment:
APPROVE — I did not find a blocking implementation issue; the only remaining gap is making AVX512-path coverage deterministic in CI.
AVX512 coverage is not deterministic: the new tests validate the MobileClip shapes, but they still rely on whatever runtime kernel MlasConvPrepare() selects. On runners without AVX512 support, these tests will pass via the generic convolution path, which means the new MlasConvAlgorithmDepthwiseMultiplierGreaterThan1 dispatch and MlasConvDepthwiseMultiplier2CHWKernel7x7S2Avx512F() code can merge without actually executing in CI. Consider adding an AVX512-gated unit that explicitly checks the selected algorithm or directly exercises the specialized kernel when GetMlasPlatform().ConvNchwFloatKernel == MlasConvNchwFloatKernelAvx512F.
if (GetMlasPlatform().ConvNchwFloatKernel == MlasConvNchwFloatKernelAvx512F) {
MLAS_CONV_PARAMETERS parameters;
size_t working_buffer_size = 0;
MlasConvPrepare(&parameters, /* ... exact MobileClip shape ... */);
ASSERT_EQ(parameters.Algorithm, MlasConvAlgorithmDepthwiseMultiplierGreaterThan1);
}
Thanks for this. I added a dispatch check for AVX512.
Description
Adds a special AVX512 kernel for depthwise conv with a channel multiplier of 2. This improves the performance of 3 costly conv operations (7x7 kernels) in the MobileClip model by approximately 2.4x (MLAS benchmark comparison below).
These are 3 ops with 7x7 kernels, stride 2, padding 3, and 2 output channels per group (depthwise multiplier = 2).
These Conv operations cannot be dispatched to the NCHWc path because the Cout per group is smaller than the block size: on AVX512 the block size is 16, while the Cout per group is only 2. The NCHWc suite does include a special depthwise kernel, but it can only handle Cout per group = 1.
MLAS Benchmark Before and After comparison:
Motivation and Context
Just by optimizing these 3 conv operations, MobileClip is about 700us-850us faster, and the entire model runs in under 14ms on an AVX512 machine.