📤 Add export task (coreml and tflite) #174
ramonhollands wants to merge 17 commits into MultimediaTechLab:main from …
Conversation
… loop, undo export param, use FastModelLoader in InferenceModel
Hi, can you check whether it is still able to run with this modification? (Henry Tsui)
`# Conflicts: # yolo/model/yolo.py`
Hi Henry,
I did try this PR. I think there is an error in: … which should be: … Also, did you manage to get good performance using …?
Yeah, you are right about the change. I'm still looking into the slowness. When skipping the export layers in `if self.export_mode:` it uses the ANE and is super fast. If you can help me debug that, it would be great. E.g., what code do HF or Ultralytics use for those decoding layers?
Thanks, I confirmed after switching to … I wish to help, but I'm afraid I'm not skilled enough to know how it works. I guess this is the final layer doing something similar to NMS, and if it is done outside of the ML model (post-inference), then the whole pipeline would be slow as well? I only found this repo and issue that might be helpful (including the comments and the related pocketpixels/yolov5 repo linked there). Not sure, though, whether the architecture differs much between yolov5 and yolov9.
I asked Gemini 2.5 Pro about this. I cannot verify it, but it suggested that the graph is not static and proposed precomputing the anchors outside the model and passing them in as an input. Full LLM markdown output here: …
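For reference, a minimal sketch of that suggestion, assuming the `generate_anchors` helper referenced in the yolo.py snippet further down; the wrapper class and its signature are illustrative, not part of this PR:

```python
import torch

class ExportWrapper(torch.nn.Module):
    """Hypothetical export wrapper: anchor_grid and scaler are computed once on the
    host and passed in as plain tensor inputs, so the exported graph stays static."""

    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, image, anchor_grid, scaler):
        preds_cls, preds_anc, preds_box = self.model(image)
        pred_ltrb = preds_box * scaler.view(1, -1, 1)
        lt, rb = pred_ltrb.chunk(2, dim=-1)
        boxes = torch.cat([anchor_grid - lt, anchor_grid + rb], dim=-1)
        return preds_cls, boxes

# Computed once per input size, outside the exported graph:
# anchor_grid, scaler = model.generate_anchors([input_width, input_height], strides)
```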
EDIT: My bad, this is irrelevant. I benchmarked the wrong model; the change below does not fix it, it is still slow.

I did another test, commenting out some lines in yolo.py:

```python
preds_cls = torch.concat(preds_cls, dim=1).to(x[0][0].device)
preds_anc = torch.concat(preds_anc, dim=1).to(x[0][0].device)
preds_box = torch.concat(preds_box, dim=1).to(x[0][0].device)

strides = self.get_strides(output["Main"], input_width)
# anchor_grid, scaler = self.generate_anchors([input_width, input_height], strides)
# anchor_grid = anchor_grid.to(x[0][0].device)
# scaler = scaler.to(x[0][0].device)
# pred_LTRB = preds_box * scaler.view(1, -1, 1)
# lt, rb = pred_LTRB.chunk(2, dim=-1)
# preds_box = torch.cat([anchor_grid - lt, anchor_grid + rb], dim=-1)
return preds_cls, preds_anc, preds_box
```

This still runs very fast, so Gemini is probably right that the bottleneck code is: …
I did try exporting with different combinations as well, and even made things 'static' by using … instead of … Next to that, pred_anc is not used anywhere in the post-processing code, so we can skip it. No luck yet, but I'm currently busy with other projects, so I will have another look in a couple of weeks.
Some good news: I managed to get the small model running in +/- 10 ms. However, the bounding boxes are not yet precise enough due to numerical instability; see the difference in box positions between fp16 and fp32.
I managed to get these results by clamping the pred_boxes like … Any other ideas are very welcome! Code and exported models can be viewed at: …
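For illustration only, a guess at what such a clamp could look like; the exact place in the decode and the bounds are assumptions, not the code from the linked branch:

```python
import torch

def clamp_boxes(preds_box: torch.Tensor, input_width: int, input_height: int) -> torch.Tensor:
    """Hypothetical clamp: keep xyxy coordinates inside the image so fp16 rounding
    on large intermediate values cannot push boxes out of range."""
    limits = torch.tensor([input_width, input_height, input_width, input_height],
                          dtype=preds_box.dtype, device=preds_box.device)
    return preds_box.clamp(min=0).minimum(limits)
```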
What kind of speeds were you able to achieve with fp32 and int8?
This pull request adds a new export task, including the option to export to CoreML and TFLite format.
Use:
Next to this, it adds the option to use the FastModelLoader again during inference.
TFLite export depends on ai_edge_torch, which requires Python 3.10.
Next steps would be to add quantization and auto-install missing modules.
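For readers unfamiliar with the two toolchains, a minimal conversion sketch using the public coremltools and ai_edge_torch APIs; the model, input size, and file names are placeholders, and the export task in this PR may well do it differently:

```python
import torch
import coremltools as ct
import ai_edge_torch  # TFLite path; requires Python 3.10

# model: the torch.nn.Module to export, loaded elsewhere (placeholder)
model.eval()
example = torch.rand(1, 3, 640, 640)

# CoreML: trace the model, convert, and save an .mlpackage
traced = torch.jit.trace(model, example)
mlmodel = ct.convert(traced, inputs=[ct.TensorType(name="image", shape=example.shape)])
mlmodel.save("yolo.mlpackage")

# TFLite: convert the torch module directly and export a .tflite file
edge_model = ai_edge_torch.convert(model, (example,))
edge_model.export("yolo.tflite")
```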