Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 2 additions & 0 deletions docs/source/backends-overview.md
Original file line number Diff line number Diff line change
Expand Up @@ -28,6 +28,7 @@ Backends are the bridge between your exported model and the hardware it runs on.
| [Qualcomm](backends-qualcomm) | Android | NPU | Qualcomm SoCs |
| [MediaTek](backends-mediatek) | Android | NPU | MediaTek SoCs |
| [Arm Ethos-U](/backends/arm-ethos-u/arm-ethos-u-overview.md) | Embedded | NPU | Arm MCUs |
| [Arm Cortex-M](/backends/arm-cortex-m/arm-cortex-m-overview.md) | Embedded | CPU | Arm Cortex-M MCUs |
| [Arm VGF](/backends/arm-vgf/arm-vgf-overview.md) | Android | GPU | Arm platforms |
| [OpenVINO](build-run-openvino) | Embedded | CPU/GPU/NPU | Intel SoCs |
| [NXP](backends/nxp/nxp-overview.md) | Embedded | NPU | NXP SoCs |
Expand Down Expand Up @@ -59,6 +60,7 @@ backends/vulkan/vulkan-overview
backends-qualcomm
backends-mediatek
backends/arm-ethos-u/arm-ethos-u-overview
backends/arm-cortex-m/arm-cortex-m-overview
backends/arm-vgf/arm-vgf-overview
build-run-openvino
backends/nxp/nxp-overview
Expand Down
162 changes: 162 additions & 0 deletions docs/source/backends/arm-cortex-m/arm-cortex-m-overview.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,162 @@
# Arm Cortex-M Backend

The Arm® Cortex®-M backend accelerates quantized model execution on Arm Cortex-M CPUs using [CMSIS-NN](https://arm-software.github.io/CMSIS-NN/latest/) optimized kernels. Unlike delegate-based backends, it operates as an operator library: quantized subgraphs are replaced with CMSIS-NN accelerated kernels during the pass-lowering stage, while unsupported operators fall back to portable fp32 kernels.

## Target Support

The backend targets Arm Cortex-M CPUs via CMSIS-NN, which provides optimized kernel implementations for three instruction set variants:

| Variant | Description | Example CPUs | Supported |
|--------------|-----------------------------|--------------------|-----------|
| MVE (Helium) | M-profile Vector extensions | Cortex-M55, M85 | ✅ |
| DSP | DSP extension instructions | Cortex-M4, M7, M33 | ⬜ |
| Pure C | Reference C implementation | Any Cortex-M | ⬜ |

DSP and pure C variants use the same CMSIS-NN API and may work, but have not been tested.

## CMSIS-NN Supported Operators

The backend pass pipeline replaces quantized ATen operators with [CMSIS-NN](https://arm-software.github.io/CMSIS-NN/latest/) kernel calls. See the [CMSIS-NN API documentation](https://arm-software.github.io/CMSIS-NN/latest/modules.html) for the full list of available kernels.

| ATen Op | CMSIS-NN Kernel | 8w8a | 8w16a | 4w8a |
|--------------------------------|------------------------|------|-------|------|
| `aten.convolution` | `arm_convolve` | ✅ | ⬜ | ⬜ |
| `aten.convolution` (depthwise) | `arm_depthwise_conv` | ✅ | ⬜ | ⬜ |
| `aten.convolution` (transposed)| `arm_transpose_conv` | ✅ | ⬜ | ⬜ |
| `aten.linear` | `arm_fully_connected` | ✅ | ⬜ | ⬜ |
| `aten.bmm` | `arm_batch_matmul` | ✅ | ⬜ | ⬜ |
| `aten.add` | `arm_elementwise_add` | ✅ | ⬜ | N/A |
| `aten.mul` | `arm_elementwise_mul` | ✅ | ⬜ | N/A |
| `aten.max_pool2d` | `arm_max_pool` | ✅ | ⬜ | N/A |
| `aten.avg_pool2d` | `arm_avgpool` | ✅ | ⬜ | N/A |
| `aten._softmax` | `arm_softmax` | ✅ | ⬜ | N/A |
| `aten.minimum` | `arm_minimum` | ✅ | ⬜ | N/A |
| `aten.maximum` | `arm_maximum` | ✅ | ⬜ | N/A |
| `aten.permute_copy` | `arm_transpose` | ✅ | ⬜ | N/A |
| `aten.constant_pad_nd` | `arm_pad` | ✅ | ⬜ | N/A |
| — | LSTM | ⬜ | ⬜ | ⬜ |
| — | SVDF | ⬜ | ⬜ | ⬜ |

## Quantization Support

The Cortex-M backend currently implements **symmetric INT8 (8w8a)** quantization:
- **Per-channel** quantization for convolution operators.
- **Per-tensor** quantization for all other supported operators.
- **Shared quantization parameters** for data-movement operators (e.g. reshape, permute) to avoid unnecessary requantization.

CMSIS-NN also supports INT4 weights with INT8 activations (4w8a) and INT8 weights with INT16 activations (8w16a), but the corresponding quantizer configuration and operator implementations are not yet integrated.

## Tutorial

### Prerequisites

Install the ExecuTorch pip package:
```bash
./install_executorch.sh
```

For cross-compilation and running on simulated hardware:
- [Arm GNU Toolchain](https://developer.arm.com/Tools%20and%20Software/GNU%20Toolchain) for cross compilation.
- [Arm® Corstone™ SSE-300 FVP](https://developer.arm.com/documentation/100966/1128/Arm--Corstone-SSE-300-FVP) or [SSE-320 FVP](https://developer.arm.com/documentation/109760/0000/SSE-320-FVP) for simulation.

:::{tip}
All cross-compilation tools can be downloaded and added to the path:
```bash
examples/arm/setup.sh --i-agree-to-the-contained-eula
source examples/arm/arm-scratch/setup_path.sh
```
:::

### 1. Export and quantize

Export the model, then quantize using `CortexMQuantizer` with the PT2E quantization flow:

```python
import torch
from torchvision.models import mobilenet_v2, MobileNet_V2_Weights
from executorch.backends.cortex_m.quantizer.quantizer import CortexMQuantizer
from torchao.quantization.pt2e.quantize_pt2e import convert_pt2e, prepare_pt2e

model = mobilenet_v2(weights=MobileNet_V2_Weights.DEFAULT).eval()

example_input = torch.randn(1, 3, 224, 224).to(memory_format=torch.channels_last)
exported_program = torch.export.export(model, (example_input,))
graph_module = exported_program.module()

quantizer = CortexMQuantizer()
prepared = prepare_pt2e(graph_module, quantizer)

# Calibrate with representative data
for calibration_input in calibration_data:
prepared(calibration_input)

quantized = convert_pt2e(prepared)
quantized_exported_program = torch.export.export(quantized, (example_input,))
```

### 2. Lower to edge and apply Cortex-M passes

Lower to the edge dialect with a custom `EdgeCompileConfig`, then run the `CortexMPassManager` to replace quantized subgraphs with CMSIS-NN operator implementations:

```python
from executorch.exir import EdgeCompileConfig, ExecutorchBackendConfig, to_edge
from executorch.backends.cortex_m.passes.cortex_m_pass_manager import CortexMPassManager

config = EdgeCompileConfig(
preserve_ops=[
torch.ops.aten.linear.default,
torch.ops.aten.hardsigmoid.default,
torch.ops.aten.hardsigmoid_.default,
torch.ops.aten.hardswish.default,
torch.ops.aten.hardswish_.default,
],
_check_ir_validity=False,
_core_aten_ops_exception_list=[torch.ops.aten.max_pool2d.default],
)

edge_program_manager = to_edge(quantized_exported_program, compile_config=config)

pass_manager = CortexMPassManager(edge_program_manager.exported_program())
edge_program_manager._edge_programs["forward"] = pass_manager.transform()
```

### 3. Serialize to .pte

```python
executorch_program = edge_program_manager.to_executorch(
config=ExecutorchBackendConfig(extract_delegate_segments=False)
)

with open("model.pte", "wb") as f:
f.write(executorch_program.buffer)
```

### 4. Cross-compile and run

Cross-compile the ExecuTorch runtime, Cortex-M kernels, and the example runner application. The first cmake invocation builds the ExecuTorch libraries for Arm baremetal. The second builds the [arm_executor_runner](https://github.com/pytorch/executorch/blob/main/examples/arm/executor_runner/) and links it against those libraries with the `.pte` model baked in.

```bash
# Build ExecuTorch libraries for Arm baremetal
cmake --preset arm-baremetal \
-DCMAKE_BUILD_TYPE=Release \
-DEXECUTORCH_BUILD_DEVTOOLS=ON \
-Bcmake-out-arm
cmake --build cmake-out-arm --target install -j$(nproc)

# Build the executor runner, linking the .pte into the binary
cmake -DCMAKE_TOOLCHAIN_FILE=$(pwd)/examples/arm/ethos-u-setup/arm-none-eabi-gcc.cmake \
-DCMAKE_BUILD_TYPE=Release \
-DET_PTE_FILE_PATH=$(pwd)/model.pte \
-DTARGET_CPU=cortex-m55 \
-Bbuild \
examples/arm/executor_runner
cmake --build build -j$(nproc) -- arm_executor_runner
```

Run on a simulated Cortex-M target:

```bash
backends/arm/scripts/run_fvp.sh --elf=build/arm_executor_runner --target=ethos-u55-128
```

For a complete end-to-end walkthrough including dataset setup, calibration, and result validation, see the [Cortex-M MobileNetV2 notebook](https://github.com/pytorch/executorch/blob/main/examples/arm/cortex_m_mv2_example.ipynb).
1 change: 1 addition & 0 deletions docs/source/embedded-arm-cortex-m.md
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
```{include} backends/arm-cortex-m/arm-cortex-m-overview.md
5 changes: 5 additions & 0 deletions docs/source/embedded-backends.md
Original file line number Diff line number Diff line change
Expand Up @@ -7,6 +7,10 @@ Available hardware acceleration backends for embedded systems.

- {doc}`embedded-cadence` — Cadence Xtensa DSP processors

## CPU Acceleration

- {doc}`embedded-arm-cortex-m` — Arm Cortex-M CMSIS-NN acceleration

## NPU Acceleration

- {doc}`embedded-arm-ethos-u` — ARM Ethos-U NPU acceleration
Expand All @@ -15,6 +19,7 @@ Available hardware acceleration backends for embedded systems.

```{toctree}
:hidden:
embedded-arm-cortex-m
embedded-cadence
embedded-arm-ethos-u
embedded-nxp
Loading