From bf49c43c7a38265ab518244f8415d5df2c6b26a7 Mon Sep 17 00:00:00 2001
From: RJ Ascani
Date: Fri, 20 Mar 2026 09:46:19 -0700
Subject: [PATCH 1/2] Cortex-M: Add backend documentation to docs site

Adds the Cortex-M backend overview page to the ExecuTorch documentation
website, making it discoverable alongside other embedded backends. The page
covers target support, CMSIS-NN operator table, quantization, and a tutorial
walking through export, quantization, edge lowering, and cross-compilation.

Co-authored-by: Claude
---
 docs/source/backends-overview.md                   |   2 +
 .../arm-cortex-m/arm-cortex-m-overview.md          | 157 ++++++++++++++++++
 docs/source/embedded-arm-cortex-m.md               |   1 +
 docs/source/embedded-backends.md                   |   5 +
 4 files changed, 165 insertions(+)
 create mode 100644 docs/source/backends/arm-cortex-m/arm-cortex-m-overview.md
 create mode 100644 docs/source/embedded-arm-cortex-m.md

diff --git a/docs/source/backends-overview.md b/docs/source/backends-overview.md
index fc8ab1a0166..d1c48eb4032 100644
--- a/docs/source/backends-overview.md
+++ b/docs/source/backends-overview.md
@@ -28,6 +28,7 @@ Backends are the bridge between your exported model and the hardware it runs on.
 | [Qualcomm](backends-qualcomm) | Android | NPU | Qualcomm SoCs |
 | [MediaTek](backends-mediatek) | Android | NPU | MediaTek SoCs |
 | [Arm Ethos-U](/backends/arm-ethos-u/arm-ethos-u-overview.md) | Embedded | NPU | Arm MCUs |
+| [Arm Cortex-M](/backends/arm-cortex-m/arm-cortex-m-overview.md) | Embedded | CPU | Arm Cortex-M MCUs |
 | [Arm VGF](/backends/arm-vgf/arm-vgf-overview.md) | Android | GPU | Arm platforms |
 | [OpenVINO](build-run-openvino) | Embedded | CPU/GPU/NPU | Intel SoCs |
 | [NXP](backends/nxp/nxp-overview.md) | Embedded | NPU | NXP SoCs |
@@ -59,6 +60,7 @@ backends/vulkan/vulkan-overview
 backends-qualcomm
 backends-mediatek
 backends/arm-ethos-u/arm-ethos-u-overview
+backends/arm-cortex-m/arm-cortex-m-overview
 backends/arm-vgf/arm-vgf-overview
 build-run-openvino
 backends/nxp/nxp-overview
diff --git a/docs/source/backends/arm-cortex-m/arm-cortex-m-overview.md b/docs/source/backends/arm-cortex-m/arm-cortex-m-overview.md
new file mode 100644
index 00000000000..39790db9ed0
--- /dev/null
+++ b/docs/source/backends/arm-cortex-m/arm-cortex-m-overview.md
@@ -0,0 +1,157 @@
+# Arm Cortex-M Backend
+
+The Arm® Cortex®-M backend accelerates quantized model execution on Arm Cortex-M CPUs using [CMSIS-NN](https://arm-software.github.io/CMSIS-NN/latest/) optimized kernels. Unlike delegate-based backends, it operates as an operator library: quantized subgraphs are replaced with CMSIS-NN accelerated kernels during the pass-lowering stage, while unsupported operators fall back to portable fp32 kernels.
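As background for the quantized execution the paragraph above describes, the symmetric INT8 scheme the backend uses can be sketched in plain Python. This is an illustrative sketch only, not ExecuTorch or CMSIS-NN API; the helper names and the example scale are invented for the illustration, and real kernels requantize with fixed-point multipliers rather than float scales:

```python
# Symmetric INT8 (8w8a) quantization sketch: zero_point is 0, so a real
# value maps to an INT8 code by dividing by the scale and clamping.
INT8_MIN, INT8_MAX = -128, 127

def quantize(x: float, scale: float) -> int:
    """Map a real value to an INT8 code (symmetric, zero_point = 0)."""
    q = round(x / scale)
    return max(INT8_MIN, min(INT8_MAX, q))

def dequantize(q: int, scale: float) -> float:
    """Recover the approximate real value from an INT8 code."""
    return q * scale

# A scale covering roughly [-1, 1]: values outside saturate to -128/127.
scale = 2.0 / 255
q = quantize(0.5, scale)
print(q, dequantize(q, scale))
```

Per-channel quantization (used for convolution weights, as described below) applies one such scale per output channel instead of one per tensor.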
+
+## Target Support
+
+The backend targets Arm Cortex-M CPUs via CMSIS-NN, which provides optimized kernel implementations for three instruction set variants:
+
+| Variant | Description | Example CPUs |
+|---------|-------------|--------------|
+| MVE (Helium) | Vector extensions for Arm-M | Cortex-M55, Cortex-M85 |
+| DSP | DSP extension instructions | Cortex-M4, Cortex-M7, Cortex-M33 |
+| Pure C | Reference C implementation | Any Cortex-M |
+
+Testing has only been done with MVE targets (Cortex-M55, Cortex-M85). DSP and pure C CMSIS-NN kernels might work as well since the same CMSIS-NN API is used across all variants, but is unverified at this point.
+
+## CMSIS-NN Supported Operators
+
+| Operator | 8w8a | 8w16a | 4w8a |
+|---|---|---|---|
+| Conv2D | ✅ | ⬜ | ⬜ |
+| DepthwiseConv2D | ✅ | ⬜ | ⬜ |
+| TransposeConv2D | ✅ | ⬜ | ⬜ |
+| Fully Connected | ✅ | ⬜ | ⬜ |
+| Batch Matmul | ✅ | ⬜ | ⬜ |
+| Add | ✅ | ⬜ | N/A |
+| Mul | ✅ | ⬜ | N/A |
+| MaxPooling | ✅ | ⬜ | N/A |
+| AvgPooling | ✅ | ⬜ | N/A |
+| Softmax | ✅ | ⬜ | N/A |
+| Pad | ✅ | ⬜ | N/A |
+| LSTM | ⬜ | ⬜ | ⬜ |
+| SVDF | ⬜ | ⬜ | ⬜ |
+
+## Quantization Support
+
+The Cortex-M backend currently implements **symmetric INT8 (8w8a)** quantization:
+- **Per-channel** quantization for convolution operators.
+- **Per-tensor** quantization for all other supported operators.
+- **Shared quantization parameters** for data-movement operators (e.g. reshape, permute) to avoid unnecessary requantization.
+
+CMSIS-NN also supports INT4 weights with INT8 activations (4w8a) and INT8 weights with INT16 activations (8w16a), but the corresponding quantizer configuration and operator implementations are not yet integrated.
+
+## Tutorial
+
+### Prerequisites
+
+Install the ExecuTorch pip package:
+```bash
+./install_executorch.sh
+```
+
+For cross-compilation and running on simulated hardware:
+- [Arm GNU Toolchain](https://developer.arm.com/Tools%20and%20Software/GNU%20Toolchain) for cross compilation.
+- [Arm® Corstone™ SSE-300 FVP](https://developer.arm.com/documentation/100966/1128/Arm--Corstone-SSE-300-FVP) or [SSE-320 FVP](https://developer.arm.com/documentation/109760/0000/SSE-320-FVP) for simulation.
+
+:::{tip}
+All cross-compilation tools can be downloaded and added to the path:
+```bash
+examples/arm/setup.sh --i-agree-to-the-contained-eula
+source examples/arm/arm-scratch/setup_path.sh
+```
+:::
+
+### 1. Export and quantize
+
+Export the model, then quantize using `CortexMQuantizer` with the PT2E quantization flow:
+
+```python
+import torch
+from torchvision.models import mobilenet_v2, MobileNet_V2_Weights
+from executorch.backends.cortex_m.quantizer.quantizer import CortexMQuantizer
+from torchao.quantization.pt2e.quantize_pt2e import convert_pt2e, prepare_pt2e
+
+model = mobilenet_v2(weights=MobileNet_V2_Weights.DEFAULT).eval()
+
+example_input = torch.randn(1, 3, 224, 224).to(memory_format=torch.channels_last)
+exported_program = torch.export.export(model, (example_input,))
+graph_module = exported_program.module()
+
+quantizer = CortexMQuantizer()
+prepared = prepare_pt2e(graph_module, quantizer)
+
+# Calibrate with representative data
+for calibration_input in calibration_data:
+    prepared(calibration_input)
+
+quantized = convert_pt2e(prepared)
+quantized_exported_program = torch.export.export(quantized, (example_input,))
+```
+
+### 2. Lower to edge and apply Cortex-M passes
+
+Lower to the edge dialect with a custom `EdgeCompileConfig`, then run the `CortexMPassManager` to replace quantized subgraphs with CMSIS-NN operator implementations:
+
+```python
+from executorch.exir import EdgeCompileConfig, ExecutorchBackendConfig, to_edge
+from executorch.backends.cortex_m.passes.cortex_m_pass_manager import CortexMPassManager
+
+config = EdgeCompileConfig(
+    preserve_ops=[
+        torch.ops.aten.linear.default,
+        torch.ops.aten.hardsigmoid.default,
+        torch.ops.aten.hardsigmoid_.default,
+        torch.ops.aten.hardswish.default,
+        torch.ops.aten.hardswish_.default,
+    ],
+    _check_ir_validity=False,
+    _core_aten_ops_exception_list=[torch.ops.aten.max_pool2d.default],
+)
+
+edge_program_manager = to_edge(quantized_exported_program, compile_config=config)
+
+pass_manager = CortexMPassManager(edge_program_manager.exported_program())
+edge_program_manager._edge_programs["forward"] = pass_manager.transform()
+```
+
+### 3. Serialize to .pte
+
+```python
+executorch_program = edge_program_manager.to_executorch(
+    config=ExecutorchBackendConfig(extract_delegate_segments=False)
+)
+
+with open("model.pte", "wb") as f:
+    f.write(executorch_program.buffer)
+```
+
+### 4. Cross-compile and run
+
+Cross-compile the ExecuTorch runtime, Cortex-M kernels, and the example runner application. The first cmake invocation builds the ExecuTorch libraries for Arm baremetal. The second builds the [arm_executor_runner](https://github.com/pytorch/executorch/blob/main/examples/arm/executor_runner/) and links it against those libraries with the `.pte` model baked in.
+ +```bash +# Build ExecuTorch libraries for Arm baremetal +cmake --preset arm-baremetal \ + -DCMAKE_BUILD_TYPE=Release \ + -DEXECUTORCH_BUILD_DEVTOOLS=ON \ + -Bcmake-out-arm +cmake --build cmake-out-arm --target install -j$(nproc) + +# Build the executor runner, linking the .pte into the binary +cmake -DCMAKE_TOOLCHAIN_FILE=$(pwd)/examples/arm/ethos-u-setup/arm-none-eabi-gcc.cmake \ + -DCMAKE_BUILD_TYPE=Release \ + -DET_PTE_FILE_PATH=$(pwd)/model.pte \ + -DTARGET_CPU=cortex-m55 \ + -Bbuild \ + examples/arm/executor_runner +cmake --build build -j$(nproc) -- arm_executor_runner +``` + +Run on a simulated Cortex-M target: + +```bash +backends/arm/scripts/run_fvp.sh --elf=build/arm_executor_runner --target=ethos-u55-128 +``` + +For a complete end-to-end walkthrough including dataset setup, calibration, and result validation, see the [Cortex-M MobileNetV2 notebook](https://github.com/pytorch/executorch/blob/main/examples/arm/cortex_m_mv2_example.ipynb). diff --git a/docs/source/embedded-arm-cortex-m.md b/docs/source/embedded-arm-cortex-m.md new file mode 100644 index 00000000000..5791e068cef --- /dev/null +++ b/docs/source/embedded-arm-cortex-m.md @@ -0,0 +1 @@ +```{include} backends/arm-cortex-m/arm-cortex-m-overview.md diff --git a/docs/source/embedded-backends.md b/docs/source/embedded-backends.md index 4ed7962ef42..147f6cfc151 100644 --- a/docs/source/embedded-backends.md +++ b/docs/source/embedded-backends.md @@ -7,6 +7,10 @@ Available hardware acceleration backends for embedded systems. - {doc}`embedded-cadence` — Cadence Xtensa DSP processors +## CPU Acceleration + +- {doc}`embedded-arm-cortex-m` — Arm Cortex-M CMSIS-NN acceleration + ## NPU Acceleration - {doc}`embedded-arm-ethos-u` — ARM Ethos-U NPU acceleration @@ -15,6 +19,7 @@ Available hardware acceleration backends for embedded systems. 
 ```{toctree}
 :hidden:
 
+embedded-arm-cortex-m
 embedded-cadence
 embedded-arm-ethos-u
 embedded-nxp

From f0b671b56ca6a3042a6858a0b15ed23bf4beef95 Mon Sep 17 00:00:00 2001
From: RJ Ascani
Date: Fri, 20 Mar 2026 10:58:18 -0700
Subject: [PATCH 2/2] Cortex-M: Improve operator table with ATen op and CMSIS-NN kernel columns

Add ATen op to CMSIS-NN kernel mapping, Supported column for target variants,
link to CMSIS-NN API docs, add missing operators (minimum, maximum,
permute_copy), use plain .pte instead of .bpte, and align table columns for
readability.

Co-authored-by: Claude
---
 .../arm-cortex-m/arm-cortex-m-overview.md          | 47 ++++++++++---------
 1 file changed, 26 insertions(+), 21 deletions(-)

diff --git a/docs/source/backends/arm-cortex-m/arm-cortex-m-overview.md b/docs/source/backends/arm-cortex-m/arm-cortex-m-overview.md
index 39790db9ed0..7e2cdf00f15 100644
--- a/docs/source/backends/arm-cortex-m/arm-cortex-m-overview.md
+++ b/docs/source/backends/arm-cortex-m/arm-cortex-m-overview.md
@@ -6,31 +6,36 @@ The Arm® Cortex®-M backend accelerates quantized model execution on Arm
 
 The backend targets Arm Cortex-M CPUs via CMSIS-NN, which provides optimized kernel implementations for three instruction set variants:
 
-| Variant | Description | Example CPUs |
-|---------|-------------|--------------|
-| MVE (Helium) | Vector extensions for Arm-M | Cortex-M55, Cortex-M85 |
-| DSP | DSP extension instructions | Cortex-M4, Cortex-M7, Cortex-M33 |
-| Pure C | Reference C implementation | Any Cortex-M |
+| Variant      | Description                 | Example CPUs       | Supported |
+|--------------|-----------------------------|--------------------|-----------|
+| MVE (Helium) | M-profile Vector extensions | Cortex-M55, M85    | ✅        |
+| DSP          | DSP extension instructions  | Cortex-M4, M7, M33 | ⬜        |
+| Pure C       | Reference C implementation  | Any Cortex-M       | ⬜        |
 
-Testing has only been done with MVE targets (Cortex-M55, Cortex-M85). DSP and pure C CMSIS-NN kernels might work as well since the same CMSIS-NN API is used across all variants, but is unverified at this point.
+DSP and pure C variants use the same CMSIS-NN API and may work, but have not been tested.
 
 ## CMSIS-NN Supported Operators
 
-| Operator | 8w8a | 8w16a | 4w8a |
-|---|---|---|---|
-| Conv2D | ✅ | ⬜ | ⬜ |
-| DepthwiseConv2D | ✅ | ⬜ | ⬜ |
-| TransposeConv2D | ✅ | ⬜ | ⬜ |
-| Fully Connected | ✅ | ⬜ | ⬜ |
-| Batch Matmul | ✅ | ⬜ | ⬜ |
-| Add | ✅ | ⬜ | N/A |
-| Mul | ✅ | ⬜ | N/A |
-| MaxPooling | ✅ | ⬜ | N/A |
-| AvgPooling | ✅ | ⬜ | N/A |
-| Softmax | ✅ | ⬜ | N/A |
-| Pad | ✅ | ⬜ | N/A |
-| LSTM | ⬜ | ⬜ | ⬜ |
-| SVDF | ⬜ | ⬜ | ⬜ |
+The backend pass pipeline replaces quantized ATen operators with [CMSIS-NN](https://arm-software.github.io/CMSIS-NN/latest/) kernel calls. See the [CMSIS-NN API documentation](https://arm-software.github.io/CMSIS-NN/latest/modules.html) for the full list of available kernels.
+
+| ATen Op                        | CMSIS-NN Kernel        | 8w8a | 8w16a | 4w8a |
+|--------------------------------|------------------------|------|-------|------|
+| `aten.convolution`             | `arm_convolve`         | ✅   | ⬜    | ⬜   |
+| `aten.convolution` (depthwise) | `arm_depthwise_conv`   | ✅   | ⬜    | ⬜   |
+| `aten.convolution` (transposed)| `arm_transpose_conv`   | ✅   | ⬜    | ⬜   |
+| `aten.linear`                  | `arm_fully_connected`  | ✅   | ⬜    | ⬜   |
+| `aten.bmm`                     | `arm_batch_matmul`     | ✅   | ⬜    | ⬜   |
+| `aten.add`                     | `arm_elementwise_add`  | ✅   | ⬜    | N/A  |
+| `aten.mul`                     | `arm_elementwise_mul`  | ✅   | ⬜    | N/A  |
+| `aten.max_pool2d`              | `arm_max_pool`         | ✅   | ⬜    | N/A  |
+| `aten.avg_pool2d`              | `arm_avgpool`          | ✅   | ⬜    | N/A  |
+| `aten._softmax`                | `arm_softmax`          | ✅   | ⬜    | N/A  |
+| `aten.minimum`                 | `arm_minimum`          | ✅   | ⬜    | N/A  |
+| `aten.maximum`                 | `arm_maximum`          | ✅   | ⬜    | N/A  |
+| `aten.permute_copy`            | `arm_transpose`        | ✅   | ⬜    | N/A  |
+| `aten.constant_pad_nd`         | `arm_pad`              | ✅   | ⬜    | N/A  |
+| —                              | LSTM                   | ⬜   | ⬜    | ⬜   |
+| —                              | SVDF                   | ⬜   | ⬜    | ⬜   |
 
 ## Quantization Support