
TensorRT 10.3 FP16 inference produces incorrect segmentation results for DINOv3 ViT models while TensorRT 8.6.1 works correctly #4723

@dingliangxiansheng

Description


I trained a semantic segmentation model with a DINOv3 ViT backbone using the Lightly_train framework, fine-tuning pretrained ViT-B16 and ViT-L16 weights for the segmentation task.
However, after converting the model to TensorRT, inference results are inconsistent across TensorRT versions.

Case 1: ViT-B16 backbone

When converting the model to FP16 TensorRT engines:

  • TensorRT 8.6.1

    • Engine builds successfully
    • FP16 inference results are correct
  • TensorRT 10.3

    • Engine also builds successfully
    • But inference results are incorrect
  • The predicted segmentation map collapses to almost a single class value across the entire image

I verified that:

  • The ONNX model produces correct results
  • The inference code is identical between TensorRT versions
  • CNN-based segmentation models run correctly on TensorRT 10.3 using the same pipeline

Therefore, the issue seems specific to Transformer-based models (ViT / DINOv3).
After inspecting the TensorRT engine behavior, I suspect there may be numerical instability or FP16 overflow in TensorRT 10.3, possibly related to attention or LayerNorm operations.
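The suspected overflow can be illustrated numerically. The magnitudes below are hypothetical stand-ins for the kind of large activations that deep ViT blocks can produce; this is a sketch of the failure mode in plain numpy, not of TensorRT's internal kernels.

```python
import numpy as np

FP16_MAX = float(np.finfo(np.float16).max)  # 65504.0

# Attention logits: a 768-dim dot product with entries of magnitude ~10
# already exceeds the FP16 range (768 * 100 = 76800 > 65504).
q = np.full(768, 10.0, dtype=np.float32)
k = np.full(768, 10.0, dtype=np.float32)
logit = float(q @ k)            # 76800.0, representable in FP32
logit_fp16 = np.float16(logit)  # overflows to +inf in FP16

# A single inf logit poisons the whole softmax row (inf - inf = nan),
# which is consistent with a degenerate single-class prediction:
row = np.array([logit_fp16, 0.0], dtype=np.float16)
probs = np.exp(row - row.max()) / np.exp(row - row.max()).sum()

# LayerNorm variance has the same hazard: squared deviations of
# activations with magnitude ~300 overflow before they are averaged.
act = np.array([300.0, -300.0], dtype=np.float16)
sq = act * act                  # 90000 > 65504 -> inf in FP16
```

If intermediate accumulations stay in FP32 (as fused kernels typically arrange), neither overflow occurs, which would explain why the same ONNX graph behaves differently across builder versions.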

Case 2: ViT-L16 backbone

For a larger model using ViT-L16 weights, the situation is worse.

When converting to FP16 TensorRT engines:

  • TensorRT 8.6.1
  • TensorRT 10.3
  • TensorRT 10.15

All versions build successfully, but inference results are incorrect.

From preliminary analysis, this may be caused by precision overflow or numerical instability in FP16, since ViT-L16 has a deeper Transformer architecture.

Observations

During conversion, I also noticed a difference in attention handling:

  1. In TensorRT 8.6.1, multi-head attention seems to be fused into optimized kernels
  2. In TensorRT 10.3, attention operations appear to be fully decomposed into MatMul / Transpose / Softmax layers

This difference might be related to the observed inference errors.
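Why decomposition could matter can be sketched in numpy: a Softmax that exponentiates in FP16 overflows for logits above roughly 11.1 (exp(11.1) ≈ 66000 > 65504), whereas the max-subtracted form stays finite. Fused attention kernels commonly keep these intermediates in FP32, which hides the problem; the values below are illustrative only.

```python
import numpy as np

logits = np.array([12.0, 10.0, 8.0], dtype=np.float16)

# Naive decomposed softmax in FP16: exp(12) overflows to inf,
# so the normalized row contains nan (inf / inf).
naive = np.exp(logits)
naive = naive / naive.sum()

# Max-subtracted softmax: the largest exponent is exp(0) = 1,
# so every intermediate stays well inside the FP16 range.
shifted = np.exp(logits - logits.max())
stable = shifted / shifted.sum()
```

Whether TensorRT's decomposed graph actually takes the naive path is an assumption here; comparing per-layer outputs (e.g. with Polygraphy) would confirm it.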

Questions

1. Why does TensorRT 8.6.1 produce correct results while TensorRT 10.3 produces incorrect results for the same ViT-B16 model?

2. Could this be related to:

  • numerical instability in FP16
  • attention decomposition in TensorRT 10.x
  • LayerNorm precision issues?

3. For deeper Transformer models such as ViT-L16, what is the recommended way to build stable FP16 TensorRT engines?

For example:

  • Should certain layers (e.g., LayerNorm or Softmax) be forced to FP32?
  • Are there recommended TensorRT build flags or plugins for Transformer models?
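As a sketch of the layer-pinning idea rather than a confirmed fix: trtexec can build an FP16 engine while pinning selected layers to FP32 via precision constraints. The flag names below are from recent trtexec builds (check `trtexec --help` for your version), and the wildcard layer-name patterns are hypothetical; they must be matched against the actual layer names in your network (visible with `--verbose` or Polygraphy inspect).

```shell
# Sketch: FP16 engine with Softmax/LayerNorm-related layers kept in FP32.
# The "*Softmax*" / "*LayerNorm*" patterns are placeholders; adapt them to
# the layer names TensorRT reports for your ONNX graph.
trtexec --onnx=model.onnx \
        --fp16 \
        --precisionConstraints=obey \
        --layerPrecisions="*Softmax*":fp32,"*LayerNorm*":fp32 \
        --saveEngine=model_fp16_mixed.engine
```

In the Python builder API the equivalent is setting `layer.precision = trt.float32` on the chosen layers and enabling `trt.BuilderFlag.OBEY_PRECISION_CONSTRAINTS` on the builder config.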

Any guidance would be greatly appreciated.

Metadata


    Labels: Module:Accuracy (output mismatch between TensorRT and other frameworks)
