Description
I trained a semantic segmentation model based on DINOv3 ViT backbone using the Lightly_train framework. The model uses pretrained weights from ViT-B16 and ViT-L16 and is fine-tuned for a segmentation task.
However, I encountered inconsistent inference results when converting the model to TensorRT with different TensorRT versions.
Case 1: ViT-B16 backbone
When converting the model to FP16 TensorRT engines:
- TensorRT 8.6.1
  - Engine builds successfully
  - FP16 inference results are correct
- TensorRT 10.3
  - Engine also builds successfully
  - But inference results are incorrect: the predicted segmentation map contains almost a single class value across the entire image
I verified that:
- The ONNX model produces correct results
- The inference code is identical between TensorRT versions
- CNN-based segmentation models run correctly on TensorRT 10.3 using the same pipeline
Therefore, the issue seems specific to Transformer-based models (ViT / DINOv3).
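For anyone trying to reproduce this, the ONNX-vs-TensorRT check described above can be automated with NVIDIA's Polygraphy tool (assuming the exported model is `model.onnx`; tolerances are placeholders to tune):

```shell
# Build a TensorRT FP16 engine from the ONNX model and compare its
# outputs against ONNX Runtime (FP32 reference) on generated inputs.
polygraphy run model.onnx --trt --fp16 --onnxrt \
    --atol 1e-2 --rtol 1e-2
```

A large reported mismatch here would localize the problem to the TensorRT build rather than the inference pipeline.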
After inspecting the TensorRT engine behavior, I suspect there may be numerical instability or FP16 overflow in TensorRT 10.3, possibly related to attention or LayerNorm operations.
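To make the overflow suspicion concrete, here is a small NumPy sketch (purely illustrative, not taken from the actual model) of how FP16 can saturate in exactly these spots — unscaled attention logits (Q·Kᵀ before softmax) or the sum-of-squares term in LayerNorm's variance:

```python
import numpy as np

# FP16 has a maximum finite value of 65504; intermediate values in
# attention or normalization can exceed it and become inf.
print(np.finfo(np.float16).max)  # 65504.0

# Unscaled attention logits from moderately large activations overflow:
q = np.full((1, 64), 32.0, dtype=np.float16)
k = np.full((1, 64), 32.0, dtype=np.float16)
logits = q @ k.T                 # 64 * 32 * 32 = 65536 > 65504
print(logits)                    # [[inf]] -- saturated in FP16

# The same dot product computed in FP32 is fine:
print(q.astype(np.float32) @ k.astype(np.float32).T)  # [[65536.]]
```

Whether TensorRT 10.3 actually hits this depends on which layers it keeps in FP16 after its (different) fusion decisions, which is why the 8.6.1-vs-10.3 kernel difference matters.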
Case 2: ViT-L16 backbone
For a larger model using ViT-L16 weights, the situation is worse.
When converting to FP16 TensorRT engines:
- TensorRT 8.6.1
- TensorRT 10.3
- TensorRT 10.15
All versions build successfully, but inference results are incorrect.
From preliminary analysis, this may be caused by precision overflow or numerical instability in FP16, since ViT-L16 has a deeper Transformer architecture.
Observations
During conversion, I also noticed a difference in attention handling:
- In TensorRT 8.6.1, multi-head attention seems to be fused into optimized kernels
- In TensorRT 10.3, attention operations appear to be fully decomposed into MatMul / Transpose / Softmax layers
This difference might be related to the observed inference errors.
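One way to verify this fusion difference is to dump the layer information of both engines and diff them (Polygraphy again; engine file names are placeholders, and the exact flags may vary slightly between Polygraphy versions):

```shell
# Dump per-layer information from each serialized engine and compare.
polygraphy inspect model engine_trt86.plan  --show layers > layers_86.txt
polygraphy inspect model engine_trt103.plan --show layers > layers_103.txt
diff layers_86.txt layers_103.txt
```

If 8.6.1 shows fused multi-head-attention kernels while 10.3 shows separate MatMul/Transpose/Softmax layers, that would confirm the observation above.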
Questions
1. Why does TensorRT 8.6.1 produce correct results while TensorRT 10.3 produces incorrect results for the same ViT-B16 model?
2. Could this be related to:
   - numerical instability in FP16
   - attention decomposition in TensorRT 10.x
   - LayerNorm precision issues?
3. For deeper Transformer models such as ViT-L16, what is the recommended way to build stable FP16 TensorRT engines?
For example:
- Should certain layers (e.g., LayerNorm or Softmax) be forced to FP32?
- Are there recommended TensorRT build flags or plugins for Transformer models?
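In case it helps frame the question, here is a hedged sketch (untested, based on my reading of the TensorRT 8.6+/10.x Python API) of forcing Softmax and Normalization layers to FP32 while leaving the rest eligible for FP16:

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
with open("model.onnx", "rb") as f:  # path is a placeholder
    parser.parse(f.read())

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)
# Make TensorRT honor the per-layer precisions set below instead of
# treating them as hints.
config.set_flag(trt.BuilderFlag.OBEY_PRECISION_CONSTRAINTS)

for i in range(network.num_layers):
    layer = network.get_layer(i)
    if layer.type in (trt.LayerType.SOFTMAX, trt.LayerType.NORMALIZATION):
        layer.precision = trt.float32
        for j in range(layer.num_outputs):
            layer.set_output_type(j, trt.float32)

engine_bytes = builder.build_serialized_network(network, config)
```

Is this the intended approach for ViT models on TensorRT 10.x, or is there a better-supported mechanism (e.g. strongly typed networks or specific tactic sources)?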
Any guidance would be greatly appreciated.