📤 Add export task (coreml and tflite) #174
ramonhollands wants to merge 17 commits into MultimediaTechLab:main from …
Conversation
… loop, undo export param, use FastModelLoader in InferenceModel
Hi, can you check whether it is still able to run with this modification? (Henry Tsui)
`# Conflicts: # yolo/model/yolo.py`
Hi Henry,
I did try this PR. I think there is an error in: … which should be: … Also, did you manage to get good performance using …?
Yeah, you are right about the change. I'm still looking into the slowness. When skipping the export layers in `if self.export_mode:` it uses the ANE and is super fast. If you can help me debug that, it would be great. E.g., what code do HF or Ultralytics use for those decoding layers?
Thanks, I confirmed after switching to … I wish to help, but I'm afraid I'm not skilled enough to know how it works. I guess this is the final layer doing something similar to NMS, and if it is done outside of the ML model (post-inference), then the whole pipeline would be slow as well? I only found this repo and issue that might be helpful (including the comments and the related pocketpixels/yolov5 repo linked there). Not sure, though, whether the architecture differs much between yolov5 and yolov9.
I asked Gemini 2.5 Pro about this. I cannot verify it, but it suggested that the graph is not static and proposed precomputing the anchors outside the model and passing them in as an input. Full LLM markdown output here: …
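For reference, a minimal sketch of that suggestion, assuming the `generate_anchors` helper referenced in the yolo.py snippet further down; the wrapper class and its signature are illustrative, not part of this PR:

```python
import torch

class ExportWrapper(torch.nn.Module):
    """Hypothetical export wrapper: anchor_grid and scaler are computed once on the
    host and passed in as plain tensor inputs, so the exported graph stays static."""

    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, image, anchor_grid, scaler):
        preds_cls, preds_anc, preds_box = self.model(image)
        pred_ltrb = preds_box * scaler.view(1, -1, 1)
        lt, rb = pred_ltrb.chunk(2, dim=-1)
        boxes = torch.cat([anchor_grid - lt, anchor_grid + rb], dim=-1)
        return preds_cls, boxes

# Computed once per input size, outside the exported graph:
# anchor_grid, scaler = model.generate_anchors([input_width, input_height], strides)
```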
EDIT: My bad, this is irrelevant. I benchmarked the wrong model; the change below does not fix it, it is still slow.

I did another test, commenting out some lines in yolo.py:

```python
preds_cls = torch.concat(preds_cls, dim=1).to(x[0][0].device)
preds_anc = torch.concat(preds_anc, dim=1).to(x[0][0].device)
preds_box = torch.concat(preds_box, dim=1).to(x[0][0].device)

strides = self.get_strides(output["Main"], input_width)
# anchor_grid, scaler = self.generate_anchors([input_width, input_height], strides)
# anchor_grid = anchor_grid.to(x[0][0].device)
# scaler = scaler.to(x[0][0].device)
# pred_LTRB = preds_box * scaler.view(1, -1, 1)
# lt, rb = pred_LTRB.chunk(2, dim=-1)
# preds_box = torch.cat([anchor_grid - lt, anchor_grid + rb], dim=-1)
return preds_cls, preds_anc, preds_box
```

This still runs very fast, so Gemini is probably right that the bottleneck code is: …
I did try exporting with different combinations as well, and even made things 'static' by using … instead of … Next to that, pred_anc is not used anywhere in the post-processing code, so we can skip it. No luck yet, but I'm currently busy with other projects, so I will have another look in a couple of weeks.
Some good news: I managed to get the small model running in +/- 10 ms. However, the bounding boxes are not yet precise enough due to numerical instability; see the difference in box positions between fp16 and fp32.
I managed to get these results by clamping the pred_boxes like … Any other ideas are very welcome! Code and exported models can be viewed at: …
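For illustration only, a guess at what such a clamp could look like; the exact place in the decode and the bounds are assumptions, not the code from the linked branch:

```python
import torch

def clamp_boxes(preds_box: torch.Tensor, input_width: int, input_height: int) -> torch.Tensor:
    """Hypothetical clamp: keep xyxy coordinates inside the image so fp16 rounding
    on large intermediate values cannot push boxes out of range."""
    limits = torch.tensor([input_width, input_height, input_width, input_height],
                          dtype=preds_box.dtype, device=preds_box.device)
    return preds_box.clamp(min=0).minimum(limits)
```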
What kind of speeds were you able to achieve with fp32 and int8?
This pull request adds a new export task, including the option to export to CoreML and TFLite format.
Use:
Next to this, it adds the option to use the FastModelLoader again during inference.
TFLite export depends on ai_edge_torch, which requires Python 3.10.
Next steps would be to add quantization and auto-install missing modules.
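For readers unfamiliar with the two toolchains, a minimal conversion sketch using the public coremltools and ai_edge_torch APIs; the model, input size, and file names are placeholders, and the export task in this PR may well do it differently:

```python
import torch
import coremltools as ct
import ai_edge_torch  # TFLite path; requires Python 3.10

# model: the torch.nn.Module to export, loaded elsewhere (placeholder)
model.eval()
example = torch.rand(1, 3, 640, 640)

# CoreML: trace the model, convert, and save an .mlpackage
traced = torch.jit.trace(model, example)
mlmodel = ct.convert(traced, inputs=[ct.TensorType(name="image", shape=example.shape)])
mlmodel.save("yolo.mlpackage")

# TFLite: convert the torch module directly and export a .tflite file
edge_model = ai_edge_torch.convert(model, (example,))
edge_model.export("yolo.tflite")
```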