From bf49c43c7a38265ab518244f8415d5df2c6b26a7 Mon Sep 17 00:00:00 2001
From: RJ Ascani
Date: Fri, 20 Mar 2026 09:46:19 -0700
Subject: [PATCH 1/2] Cortex-M: Add backend documentation to docs site

Adds the Cortex-M backend overview page to the ExecuTorch documentation
website, making it discoverable alongside other embedded backends. The page
covers target support, CMSIS-NN operator table, quantization, and a tutorial
walking through export, quantization, edge lowering, and cross-compilation.

Co-authored-by: Claude
---
 docs/source/backends-overview.md                   |   2 +
 .../arm-cortex-m/arm-cortex-m-overview.md          | 157 ++++++++++++++++++
 docs/source/embedded-arm-cortex-m.md               |   1 +
 docs/source/embedded-backends.md                   |   5 +
 4 files changed, 165 insertions(+)
 create mode 100644 docs/source/backends/arm-cortex-m/arm-cortex-m-overview.md
 create mode 100644 docs/source/embedded-arm-cortex-m.md

diff --git a/docs/source/backends-overview.md b/docs/source/backends-overview.md
index fc8ab1a0166..d1c48eb4032 100644
--- a/docs/source/backends-overview.md
+++ b/docs/source/backends-overview.md
@@ -28,6 +28,7 @@ Backends are the bridge between your exported model and the hardware it runs on.
 | [Qualcomm](backends-qualcomm) | Android | NPU | Qualcomm SoCs |
 | [MediaTek](backends-mediatek) | Android | NPU | MediaTek SoCs |
 | [Arm Ethos-U](/backends/arm-ethos-u/arm-ethos-u-overview.md) | Embedded | NPU | Arm MCUs |
+| [Arm Cortex-M](/backends/arm-cortex-m/arm-cortex-m-overview.md) | Embedded | CPU | Arm Cortex-M MCUs |
 | [Arm VGF](/backends/arm-vgf/arm-vgf-overview.md) | Android | GPU | Arm platforms |
 | [OpenVINO](build-run-openvino) | Embedded | CPU/GPU/NPU | Intel SoCs |
 | [NXP](backends/nxp/nxp-overview.md) | Embedded | NPU | NXP SoCs |
@@ -59,6 +60,7 @@ backends/vulkan/vulkan-overview
 backends-qualcomm
 backends-mediatek
 backends/arm-ethos-u/arm-ethos-u-overview
+backends/arm-cortex-m/arm-cortex-m-overview
 backends/arm-vgf/arm-vgf-overview
 build-run-openvino
 backends/nxp/nxp-overview
diff --git a/docs/source/backends/arm-cortex-m/arm-cortex-m-overview.md b/docs/source/backends/arm-cortex-m/arm-cortex-m-overview.md
new file mode 100644
index 00000000000..39790db9ed0
--- /dev/null
+++ b/docs/source/backends/arm-cortex-m/arm-cortex-m-overview.md
@@ -0,0 +1,157 @@
+# Arm Cortex-M Backend
+
+The Arm® Cortex®-M backend accelerates quantized model execution on Arm Cortex-M CPUs using [CMSIS-NN](https://arm-software.github.io/CMSIS-NN/latest/) optimized kernels. Unlike delegate-based backends, it operates as an operator library: quantized subgraphs are replaced with CMSIS-NN accelerated kernels during the pass-lowering stage, while unsupported operators fall back to portable fp32 kernels.
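As background for the quantized execution the paragraph above describes, the symmetric INT8 scheme the backend uses can be sketched in plain Python. This is an illustrative sketch only, not ExecuTorch or CMSIS-NN API; the helper names and the example scale are invented for the illustration, and real kernels requantize with fixed-point multipliers rather than float scales:

```python
# Symmetric INT8 (8w8a) quantization sketch: zero_point is 0, so a real
# value maps to an INT8 code by dividing by the scale and clamping.
INT8_MIN, INT8_MAX = -128, 127

def quantize(x: float, scale: float) -> int:
    """Map a real value to an INT8 code (symmetric, zero_point = 0)."""
    q = round(x / scale)
    return max(INT8_MIN, min(INT8_MAX, q))

def dequantize(q: int, scale: float) -> float:
    """Recover the approximate real value from an INT8 code."""
    return q * scale

# A scale covering roughly [-1, 1]: values outside saturate to -128/127.
scale = 2.0 / 255
q = quantize(0.5, scale)
print(q, dequantize(q, scale))
```

Per-channel quantization (used for convolution weights, as described below) applies one such scale per output channel instead of one per tensor.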
+
+## Target Support
+
+The backend targets Arm Cortex-M CPUs via CMSIS-NN, which provides optimized kernel implementations for three instruction set variants:
+
+| Variant | Description | Example CPUs |
+|---------|-------------|--------------|
+| MVE (Helium) | Vector extensions for Arm-M | Cortex-M55, Cortex-M85 |
+| DSP | DSP extension instructions | Cortex-M4, Cortex-M7, Cortex-M33 |
+| Pure C | Reference C implementation | Any Cortex-M |
+
+Testing has only been done with MVE targets (Cortex-M55, Cortex-M85). DSP and pure C CMSIS-NN kernels might work as well since the same CMSIS-NN API is used across all variants, but is unverified at this point.
+
+## CMSIS-NN Supported Operators
+
+| Operator | 8w8a | 8w16a | 4w8a |
+|---|---|---|---|
+| Conv2D | ✅ | ⬜ | ⬜ |
+| DepthwiseConv2D | ✅ | ⬜ | ⬜ |
+| TransposeConv2D | ✅ | ⬜ | ⬜ |
+| Fully Connected | ✅ | ⬜ | ⬜ |
+| Batch Matmul | ✅ | ⬜ | ⬜ |
+| Add | ✅ | ⬜ | N/A |
+| Mul | ✅ | ⬜ | N/A |
+| MaxPooling | ✅ | ⬜ | N/A |
+| AvgPooling | ✅ | ⬜ | N/A |
+| Softmax | ✅ | ⬜ | N/A |
+| Pad | ✅ | ⬜ | N/A |
+| LSTM | ⬜ | ⬜ | ⬜ |
+| SVDF | ⬜ | ⬜ | ⬜ |
+
+## Quantization Support
+
+The Cortex-M backend currently implements **symmetric INT8 (8w8a)** quantization:
+- **Per-channel** quantization for convolution operators.
+- **Per-tensor** quantization for all other supported operators.
+- **Shared quantization parameters** for data-movement operators (e.g. reshape, permute) to avoid unnecessary requantization.
+
+CMSIS-NN also supports INT4 weights with INT8 activations (4w8a) and INT8 weights with INT16 activations (8w16a), but the corresponding quantizer configuration and operator implementations are not yet integrated.
+
+## Tutorial
+
+### Prerequisites
+
+Install the ExecuTorch pip package:
+```bash
+./install_executorch.sh
+```
+
+For cross-compilation and running on simulated hardware:
+- [Arm GNU Toolchain](https://developer.arm.com/Tools%20and%20Software/GNU%20Toolchain) for cross compilation.
+- [Arm® Corstone™ SSE-300 FVP](https://developer.arm.com/documentation/100966/1128/Arm--Corstone-SSE-300-FVP) or [SSE-320 FVP](https://developer.arm.com/documentation/109760/0000/SSE-320-FVP) for simulation.
+
+:::{tip}
+All cross-compilation tools can be downloaded and added to the path:
+```bash
+examples/arm/setup.sh --i-agree-to-the-contained-eula
+source examples/arm/arm-scratch/setup_path.sh
+```
+:::
+
+### 1. Export and quantize
+
+Export the model, then quantize using `CortexMQuantizer` with the PT2E quantization flow:
+
+```python
+import torch
+from torchvision.models import mobilenet_v2, MobileNet_V2_Weights
+from executorch.backends.cortex_m.quantizer.quantizer import CortexMQuantizer
+from torchao.quantization.pt2e.quantize_pt2e import convert_pt2e, prepare_pt2e
+
+model = mobilenet_v2(weights=MobileNet_V2_Weights.DEFAULT).eval()
+
+example_input = torch.randn(1, 3, 224, 224).to(memory_format=torch.channels_last)
+exported_program = torch.export.export(model, (example_input,))
+graph_module = exported_program.module()
+
+quantizer = CortexMQuantizer()
+prepared = prepare_pt2e(graph_module, quantizer)
+
+# Calibrate with representative data
+for calibration_input in calibration_data:
+    prepared(calibration_input)
+
+quantized = convert_pt2e(prepared)
+quantized_exported_program = torch.export.export(quantized, (example_input,))
+```
+
+### 2. Lower to edge and apply Cortex-M passes
+
+Lower to the edge dialect with a custom `EdgeCompileConfig`, then run the `CortexMPassManager` to replace quantized subgraphs with CMSIS-NN operator implementations:
+
+```python
+from executorch.exir import EdgeCompileConfig, ExecutorchBackendConfig, to_edge
+from executorch.backends.cortex_m.passes.cortex_m_pass_manager import CortexMPassManager
+
+config = EdgeCompileConfig(
+    preserve_ops=[
+        torch.ops.aten.linear.default,
+        torch.ops.aten.hardsigmoid.default,
+        torch.ops.aten.hardsigmoid_.default,
+        torch.ops.aten.hardswish.default,
+        torch.ops.aten.hardswish_.default,
+    ],
+    _check_ir_validity=False,
+    _core_aten_ops_exception_list=[torch.ops.aten.max_pool2d.default],
+)
+
+edge_program_manager = to_edge(quantized_exported_program, compile_config=config)
+
+pass_manager = CortexMPassManager(edge_program_manager.exported_program())
+edge_program_manager._edge_programs["forward"] = pass_manager.transform()
+```
+
+### 3. Serialize to .pte
+
+```python
+executorch_program = edge_program_manager.to_executorch(
+    config=ExecutorchBackendConfig(extract_delegate_segments=False)
+)
+
+with open("model.pte", "wb") as f:
+    f.write(executorch_program.buffer)
+```
+
+### 4. Cross-compile and run
+
+Cross-compile the ExecuTorch runtime, Cortex-M kernels, and the example runner application. The first cmake invocation builds the ExecuTorch libraries for Arm baremetal. The second builds the [arm_executor_runner](https://github.com/pytorch/executorch/blob/main/examples/arm/executor_runner/) and links it against those libraries with the `.pte` model baked in.
+ +```bash +# Build ExecuTorch libraries for Arm baremetal +cmake --preset arm-baremetal \ + -DCMAKE_BUILD_TYPE=Release \ + -DEXECUTORCH_BUILD_DEVTOOLS=ON \ + -Bcmake-out-arm +cmake --build cmake-out-arm --target install -j$(nproc) + +# Build the executor runner, linking the .pte into the binary +cmake -DCMAKE_TOOLCHAIN_FILE=$(pwd)/examples/arm/ethos-u-setup/arm-none-eabi-gcc.cmake \ + -DCMAKE_BUILD_TYPE=Release \ + -DET_PTE_FILE_PATH=$(pwd)/model.pte \ + -DTARGET_CPU=cortex-m55 \ + -Bbuild \ + examples/arm/executor_runner +cmake --build build -j$(nproc) -- arm_executor_runner +``` + +Run on a simulated Cortex-M target: + +```bash +backends/arm/scripts/run_fvp.sh --elf=build/arm_executor_runner --target=ethos-u55-128 +``` + +For a complete end-to-end walkthrough including dataset setup, calibration, and result validation, see the [Cortex-M MobileNetV2 notebook](https://github.com/pytorch/executorch/blob/main/examples/arm/cortex_m_mv2_example.ipynb). diff --git a/docs/source/embedded-arm-cortex-m.md b/docs/source/embedded-arm-cortex-m.md new file mode 100644 index 00000000000..5791e068cef --- /dev/null +++ b/docs/source/embedded-arm-cortex-m.md @@ -0,0 +1 @@ +```{include} backends/arm-cortex-m/arm-cortex-m-overview.md diff --git a/docs/source/embedded-backends.md b/docs/source/embedded-backends.md index 4ed7962ef42..147f6cfc151 100644 --- a/docs/source/embedded-backends.md +++ b/docs/source/embedded-backends.md @@ -7,6 +7,10 @@ Available hardware acceleration backends for embedded systems. - {doc}`embedded-cadence` — Cadence Xtensa DSP processors +## CPU Acceleration + +- {doc}`embedded-arm-cortex-m` — Arm Cortex-M CMSIS-NN acceleration + ## NPU Acceleration - {doc}`embedded-arm-ethos-u` — ARM Ethos-U NPU acceleration @@ -15,6 +19,7 @@ Available hardware acceleration backends for embedded systems. 
 ```{toctree}
 :hidden:
 
+embedded-arm-cortex-m
 embedded-cadence
 embedded-arm-ethos-u
 embedded-nxp

From f0b671b56ca6a3042a6858a0b15ed23bf4beef95 Mon Sep 17 00:00:00 2001
From: RJ Ascani
Date: Fri, 20 Mar 2026 10:58:18 -0700
Subject: [PATCH 2/2] Cortex-M: Improve operator table with ATen op and CMSIS-NN kernel columns

Add ATen op to CMSIS-NN kernel mapping, Supported column for target variants,
link to CMSIS-NN API docs, add missing operators (minimum, maximum,
permute_copy), use plain .pte instead of .bpte, and align table columns for
readability.

Co-authored-by: Claude
---
 .../arm-cortex-m/arm-cortex-m-overview.md          | 47 ++++++++++---------
 1 file changed, 26 insertions(+), 21 deletions(-)

diff --git a/docs/source/backends/arm-cortex-m/arm-cortex-m-overview.md b/docs/source/backends/arm-cortex-m/arm-cortex-m-overview.md
index 39790db9ed0..7e2cdf00f15 100644
--- a/docs/source/backends/arm-cortex-m/arm-cortex-m-overview.md
+++ b/docs/source/backends/arm-cortex-m/arm-cortex-m-overview.md
@@ -6,31 +6,36 @@ The Arm® Cortex®-M backend accelerates quantized model execution on Arm
 
 The backend targets Arm Cortex-M CPUs via CMSIS-NN, which provides optimized kernel implementations for three instruction set variants:
 
-| Variant | Description | Example CPUs |
-|---------|-------------|--------------|
-| MVE (Helium) | Vector extensions for Arm-M | Cortex-M55, Cortex-M85 |
-| DSP | DSP extension instructions | Cortex-M4, Cortex-M7, Cortex-M33 |
-| Pure C | Reference C implementation | Any Cortex-M |
+| Variant      | Description                 | Example CPUs       | Supported |
+|--------------|-----------------------------|--------------------|-----------|
+| MVE (Helium) | M-profile Vector extensions | Cortex-M55, M85    | ✅        |
+| DSP          | DSP extension instructions  | Cortex-M4, M7, M33 | ⬜        |
+| Pure C       | Reference C implementation  | Any Cortex-M       | ⬜        |
 
-Testing has only been done with MVE targets (Cortex-M55, Cortex-M85). DSP and pure C CMSIS-NN kernels might work as well since the same CMSIS-NN API is used across all variants, but is unverified at this point.
+DSP and pure C variants use the same CMSIS-NN API and may work, but have not been tested.
 
 ## CMSIS-NN Supported Operators
 
-| Operator | 8w8a | 8w16a | 4w8a |
-|---|---|---|---|
-| Conv2D | ✅ | ⬜ | ⬜ |
-| DepthwiseConv2D | ✅ | ⬜ | ⬜ |
-| TransposeConv2D | ✅ | ⬜ | ⬜ |
-| Fully Connected | ✅ | ⬜ | ⬜ |
-| Batch Matmul | ✅ | ⬜ | ⬜ |
-| Add | ✅ | ⬜ | N/A |
-| Mul | ✅ | ⬜ | N/A |
-| MaxPooling | ✅ | ⬜ | N/A |
-| AvgPooling | ✅ | ⬜ | N/A |
-| Softmax | ✅ | ⬜ | N/A |
-| Pad | ✅ | ⬜ | N/A |
-| LSTM | ⬜ | ⬜ | ⬜ |
-| SVDF | ⬜ | ⬜ | ⬜ |
+The backend pass pipeline replaces quantized ATen operators with [CMSIS-NN](https://arm-software.github.io/CMSIS-NN/latest/) kernel calls. See the [CMSIS-NN API documentation](https://arm-software.github.io/CMSIS-NN/latest/modules.html) for the full list of available kernels.
+
+| ATen Op                        | CMSIS-NN Kernel        | 8w8a | 8w16a | 4w8a |
+|--------------------------------|------------------------|------|-------|------|
+| `aten.convolution`             | `arm_convolve`         | ✅   | ⬜    | ⬜   |
+| `aten.convolution` (depthwise) | `arm_depthwise_conv`   | ✅   | ⬜    | ⬜   |
+| `aten.convolution` (transposed)| `arm_transpose_conv`   | ✅   | ⬜    | ⬜   |
+| `aten.linear`                  | `arm_fully_connected`  | ✅   | ⬜    | ⬜   |
+| `aten.bmm`                     | `arm_batch_matmul`     | ✅   | ⬜    | ⬜   |
+| `aten.add`                     | `arm_elementwise_add`  | ✅   | ⬜    | N/A  |
+| `aten.mul`                     | `arm_elementwise_mul`  | ✅   | ⬜    | N/A  |
+| `aten.max_pool2d`              | `arm_max_pool`         | ✅   | ⬜    | N/A  |
+| `aten.avg_pool2d`              | `arm_avgpool`          | ✅   | ⬜    | N/A  |
+| `aten._softmax`                | `arm_softmax`          | ✅   | ⬜    | N/A  |
+| `aten.minimum`                 | `arm_minimum`          | ✅   | ⬜    | N/A  |
+| `aten.maximum`                 | `arm_maximum`          | ✅   | ⬜    | N/A  |
+| `aten.permute_copy`            | `arm_transpose`        | ✅   | ⬜    | N/A  |
+| `aten.constant_pad_nd`         | `arm_pad`              | ✅   | ⬜    | N/A  |
+| —                              | LSTM                   | ⬜   | ⬜    | ⬜   |
+| —                              | SVDF                   | ⬜   | ⬜    | ⬜   |
 
 ## Quantization Support