diff --git a/docs/source/backends-overview.md b/docs/source/backends-overview.md index fc8ab1a0166..d1c48eb4032 100644 --- a/docs/source/backends-overview.md +++ b/docs/source/backends-overview.md @@ -28,6 +28,7 @@ Backends are the bridge between your exported model and the hardware it runs on. | [Qualcomm](backends-qualcomm) | Android | NPU | Qualcomm SoCs | | [MediaTek](backends-mediatek) | Android | NPU | MediaTek SoCs | | [Arm Ethos-U](/backends/arm-ethos-u/arm-ethos-u-overview.md) | Embedded | NPU | Arm MCUs | +| [Arm Cortex-M](/backends/arm-cortex-m/arm-cortex-m-overview.md) | Embedded | CPU | Arm Cortex-M MCUs | | [Arm VGF](/backends/arm-vgf/arm-vgf-overview.md) | Android | GPU | Arm platforms | | [OpenVINO](build-run-openvino) | Embedded | CPU/GPU/NPU | Intel SoCs | | [NXP](backends/nxp/nxp-overview.md) | Embedded | NPU | NXP SoCs | @@ -59,6 +60,7 @@ backends/vulkan/vulkan-overview backends-qualcomm backends-mediatek backends/arm-ethos-u/arm-ethos-u-overview +backends/arm-cortex-m/arm-cortex-m-overview backends/arm-vgf/arm-vgf-overview build-run-openvino backends/nxp/nxp-overview diff --git a/docs/source/backends/arm-cortex-m/arm-cortex-m-overview.md b/docs/source/backends/arm-cortex-m/arm-cortex-m-overview.md new file mode 100644 index 00000000000..7e2cdf00f15 --- /dev/null +++ b/docs/source/backends/arm-cortex-m/arm-cortex-m-overview.md @@ -0,0 +1,162 @@ +# Arm Cortex-M Backend + +The Arm® Cortex®-M backend accelerates quantized model execution on Arm Cortex-M CPUs using [CMSIS-NN](https://arm-software.github.io/CMSIS-NN/latest/) optimized kernels. Unlike delegate-based backends, it operates as an operator library: quantized subgraphs are replaced with CMSIS-NN accelerated kernels during the pass-lowering stage, while unsupported operators fall back to portable fp32 kernels. 
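The CMSIS-NN kernels consume affine-quantized integer tensors rather than fp32 values (see Quantization Support below). As a rough, self-contained illustration of the symmetric per-tensor INT8 scheme, here is a minimal Python sketch; the helper names are hypothetical and this is not the backend's actual code:

```python
def quantize_symmetric(values, num_bits=8):
    """Symmetric per-tensor quantization: the zero-point is fixed at 0 and a
    single scale maps the largest magnitude onto the signed integer range."""
    qmax = 2 ** (num_bits - 1) - 1                      # 127 for INT8
    max_abs = max(abs(v) for v in values) or 1.0
    scale = max_abs / qmax
    quantized = [max(-qmax - 1, min(qmax, round(v / scale))) for v in values]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate real values; the error per element is bounded by the scale."""
    return [q * scale for q in quantized]

weights = [0.4, -1.0, 0.25, 0.9]
q, scale = quantize_symmetric(weights)                  # INT8-range values plus one fp32 scale
```

Per-channel quantization, used for convolution weights, applies the same idea with one scale per output channel, which is why the Quantization Support section distinguishes the two.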
+ +## Target Support + +The backend targets Arm Cortex-M CPUs via CMSIS-NN, which provides optimized kernel implementations for three instruction set variants: + +| Variant | Description | Example CPUs | Supported | +|--------------|-----------------------------|--------------------|-----------| +| MVE (Helium) | M-profile Vector extensions | Cortex-M55, M85 | ✅ | +| DSP | DSP extension instructions | Cortex-M4, M7, M33 | ⬜ | +| Pure C | Reference C implementation | Any Cortex-M | ⬜ | + +DSP and pure C variants use the same CMSIS-NN API and may work, but have not been tested. + +## CMSIS-NN Supported Operators + +The backend pass pipeline replaces quantized ATen operators with [CMSIS-NN](https://arm-software.github.io/CMSIS-NN/latest/) kernel calls. See the [CMSIS-NN API documentation](https://arm-software.github.io/CMSIS-NN/latest/modules.html) for the full list of available kernels. + +| ATen Op | CMSIS-NN Kernel | 8w8a | 8w16a | 4w8a | +|--------------------------------|------------------------|------|-------|------| +| `aten.convolution` | `arm_convolve` | ✅ | ⬜ | ⬜ | +| `aten.convolution` (depthwise) | `arm_depthwise_conv` | ✅ | ⬜ | ⬜ | +| `aten.convolution` (transposed)| `arm_transpose_conv` | ✅ | ⬜ | ⬜ | +| `aten.linear` | `arm_fully_connected` | ✅ | ⬜ | ⬜ | +| `aten.bmm` | `arm_batch_matmul` | ✅ | ⬜ | ⬜ | +| `aten.add` | `arm_elementwise_add` | ✅ | ⬜ | N/A | +| `aten.mul` | `arm_elementwise_mul` | ✅ | ⬜ | N/A | +| `aten.max_pool2d` | `arm_max_pool` | ✅ | ⬜ | N/A | +| `aten.avg_pool2d` | `arm_avgpool` | ✅ | ⬜ | N/A | +| `aten._softmax` | `arm_softmax` | ✅ | ⬜ | N/A | +| `aten.minimum` | `arm_minimum` | ✅ | ⬜ | N/A | +| `aten.maximum` | `arm_maximum` | ✅ | ⬜ | N/A | +| `aten.permute_copy` | `arm_transpose` | ✅ | ⬜ | N/A | +| `aten.constant_pad_nd` | `arm_pad` | ✅ | ⬜ | N/A | +| — | LSTM | ⬜ | ⬜ | ⬜ | +| — | SVDF | ⬜ | ⬜ | ⬜ | + +## Quantization Support + +The Cortex-M backend currently implements **symmetric INT8 (8w8a)** quantization: +- **Per-channel** 
quantization for convolution operators. +- **Per-tensor** quantization for all other supported operators. +- **Shared quantization parameters** for data-movement operators (e.g. reshape, permute) to avoid unnecessary requantization. + +CMSIS-NN also supports INT4 weights with INT8 activations (4w8a) and INT8 weights with INT16 activations (8w16a), but the corresponding quantizer configuration and operator implementations are not yet integrated. + +## Tutorial + +### Prerequisites + +Install the ExecuTorch pip package: +```bash +./install_executorch.sh +``` + +For cross-compilation and running on simulated hardware: +- [Arm GNU Toolchain](https://developer.arm.com/Tools%20and%20Software/GNU%20Toolchain) for cross compilation. +- [Arm® Corstone™ SSE-300 FVP](https://developer.arm.com/documentation/100966/1128/Arm--Corstone-SSE-300-FVP) or [SSE-320 FVP](https://developer.arm.com/documentation/109760/0000/SSE-320-FVP) for simulation. + +:::{tip} +All cross-compilation tools can be downloaded and added to the path: +```bash +examples/arm/setup.sh --i-agree-to-the-contained-eula +source examples/arm/arm-scratch/setup_path.sh +``` +::: + +### 1. 
Export and quantize + +Export the model, then quantize using `CortexMQuantizer` with the PT2E quantization flow: + +```python +import torch +from torchvision.models import mobilenet_v2, MobileNet_V2_Weights +from executorch.backends.cortex_m.quantizer.quantizer import CortexMQuantizer +from torchao.quantization.pt2e.quantize_pt2e import convert_pt2e, prepare_pt2e + +model = mobilenet_v2(weights=MobileNet_V2_Weights.DEFAULT).eval() + +example_input = torch.randn(1, 3, 224, 224).to(memory_format=torch.channels_last) +exported_program = torch.export.export(model, (example_input,)) +graph_module = exported_program.module() + +quantizer = CortexMQuantizer() +prepared = prepare_pt2e(graph_module, quantizer) + +# Calibrate with representative data +for calibration_input in calibration_data: + prepared(calibration_input) + +quantized = convert_pt2e(prepared) +quantized_exported_program = torch.export.export(quantized, (example_input,)) +``` + +### 2. Lower to edge and apply Cortex-M passes + +Lower to the edge dialect with a custom `EdgeCompileConfig`, then run the `CortexMPassManager` to replace quantized subgraphs with CMSIS-NN operator implementations: + +```python +from executorch.exir import EdgeCompileConfig, ExecutorchBackendConfig, to_edge +from executorch.backends.cortex_m.passes.cortex_m_pass_manager import CortexMPassManager + +config = EdgeCompileConfig( + preserve_ops=[ + torch.ops.aten.linear.default, + torch.ops.aten.hardsigmoid.default, + torch.ops.aten.hardsigmoid_.default, + torch.ops.aten.hardswish.default, + torch.ops.aten.hardswish_.default, + ], + _check_ir_validity=False, + _core_aten_ops_exception_list=[torch.ops.aten.max_pool2d.default], +) + +edge_program_manager = to_edge(quantized_exported_program, compile_config=config) + +pass_manager = CortexMPassManager(edge_program_manager.exported_program()) +edge_program_manager._edge_programs["forward"] = pass_manager.transform() +``` + +### 3. 
Serialize to .pte + +```python +executorch_program = edge_program_manager.to_executorch( + config=ExecutorchBackendConfig(extract_delegate_segments=False) +) + +with open("model.pte", "wb") as f: + f.write(executorch_program.buffer) +``` + +### 4. Cross-compile and run + +Cross-compile the ExecuTorch runtime, Cortex-M kernels, and the example runner application. The first cmake invocation builds the ExecuTorch libraries for Arm baremetal. The second builds the [arm_executor_runner](https://github.com/pytorch/executorch/blob/main/examples/arm/executor_runner/) and links it against those libraries with the `.pte` model baked in. + +```bash +# Build ExecuTorch libraries for Arm baremetal +cmake --preset arm-baremetal \ + -DCMAKE_BUILD_TYPE=Release \ + -DEXECUTORCH_BUILD_DEVTOOLS=ON \ + -Bcmake-out-arm +cmake --build cmake-out-arm --target install -j$(nproc) + +# Build the executor runner, linking the .pte into the binary +cmake -DCMAKE_TOOLCHAIN_FILE=$(pwd)/examples/arm/ethos-u-setup/arm-none-eabi-gcc.cmake \ + -DCMAKE_BUILD_TYPE=Release \ + -DET_PTE_FILE_PATH=$(pwd)/model.pte \ + -DTARGET_CPU=cortex-m55 \ + -Bbuild \ + examples/arm/executor_runner +cmake --build build -j$(nproc) -- arm_executor_runner +``` + +Run on a simulated Cortex-M target: + +```bash +backends/arm/scripts/run_fvp.sh --elf=build/arm_executor_runner --target=ethos-u55-128 +``` + +For a complete end-to-end walkthrough including dataset setup, calibration, and result validation, see the [Cortex-M MobileNetV2 notebook](https://github.com/pytorch/executorch/blob/main/examples/arm/cortex_m_mv2_example.ipynb). 
diff --git a/docs/source/embedded-arm-cortex-m.md b/docs/source/embedded-arm-cortex-m.md new file mode 100644 index 00000000000..5791e068cef --- /dev/null +++ b/docs/source/embedded-arm-cortex-m.md @@ -0,0 +1,2 @@ +```{include} backends/arm-cortex-m/arm-cortex-m-overview.md +``` diff --git a/docs/source/embedded-backends.md b/docs/source/embedded-backends.md index 4ed7962ef42..147f6cfc151 100644 --- a/docs/source/embedded-backends.md +++ b/docs/source/embedded-backends.md @@ -7,6 +7,10 @@ Available hardware acceleration backends for embedded systems. - {doc}`embedded-cadence` — Cadence Xtensa DSP processors +## CPU Acceleration + +- {doc}`embedded-arm-cortex-m` — Arm Cortex-M CMSIS-NN acceleration + ## NPU Acceleration - {doc}`embedded-arm-ethos-u` — ARM Ethos-U NPU acceleration @@ -15,6 +19,7 @@ ```{toctree} :hidden: +embedded-arm-cortex-m embedded-cadence embedded-arm-ethos-u embedded-nxp