
TensorRT 10.3 FP16 inference produces incorrect segmentation results for DINOv3 ViT models while TensorRT 8.6.1 works correctly #4723

@dingliangxiansheng

Description


I trained a semantic segmentation model with a DINOv3 ViT backbone using the Lightly_train framework, fine-tuning pretrained ViT-B16 and ViT-L16 weights for the segmentation task.
However, after converting the model to TensorRT, inference results are inconsistent across TensorRT versions.

Case 1: ViT-B16 backbone

When converting the model to FP16 TensorRT engines:

  • TensorRT 8.6.1

    • Engine builds successfully
    • FP16 inference results are correct
  • TensorRT 10.3

    • Engine also builds successfully
    • But inference results are incorrect
  • The predicted segmentation map collapses to almost a single class value across the entire image

I verified that:

  • The ONNX model produces correct results
  • The inference code is identical between TensorRT versions
  • CNN-based segmentation models run correctly on TensorRT 10.3 using the same pipeline

Therefore, the issue seems specific to Transformer-based models (ViT / DINOv3).
After inspecting the TensorRT engine behavior, I suspect there may be numerical instability or FP16 overflow in TensorRT 10.3, possibly related to attention or LayerNorm operations.
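The suspected overflow can be illustrated numerically. The magnitudes below are hypothetical stand-ins for the kind of large activations that deep ViT blocks can produce; this is a sketch of the failure mode in plain numpy, not of TensorRT's internal kernels.

```python
import numpy as np

FP16_MAX = float(np.finfo(np.float16).max)  # 65504.0

# Attention logits: a 768-dim dot product with entries of magnitude ~10
# already exceeds the FP16 range (768 * 100 = 76800 > 65504).
q = np.full(768, 10.0, dtype=np.float32)
k = np.full(768, 10.0, dtype=np.float32)
logit = float(q @ k)            # 76800.0, representable in FP32
logit_fp16 = np.float16(logit)  # overflows to +inf in FP16

# A single inf logit poisons the whole softmax row (inf - inf = nan),
# which is consistent with a degenerate single-class prediction:
row = np.array([logit_fp16, 0.0], dtype=np.float16)
probs = np.exp(row - row.max()) / np.exp(row - row.max()).sum()

# LayerNorm variance has the same hazard: squared deviations of
# activations with magnitude ~300 overflow before they are averaged.
act = np.array([300.0, -300.0], dtype=np.float16)
sq = act * act                  # 90000 > 65504 -> inf in FP16
```

If intermediate accumulations stay in FP32 (as fused kernels typically arrange), neither overflow occurs, which would explain why the same ONNX graph behaves differently across builder versions.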

Case 2: ViT-L16 backbone

For a larger model using ViT-L16 weights, the situation is worse.

When converting to FP16 TensorRT engines:

  • TensorRT 8.6.1
  • TensorRT 10.3
  • TensorRT 10.15

All versions build successfully, but inference results are incorrect.

From preliminary analysis, this may be caused by precision overflow or numerical instability in FP16, since ViT-L16 has a deeper Transformer architecture.

Observations

During conversion, I also noticed a difference in attention handling:

  1. In TensorRT 8.6.1, multi-head attention seems to be fused into optimized kernels
  2. In TensorRT 10.3, attention operations appear to be fully decomposed into MatMul / Transpose / Softmax layers

This difference might be related to the observed inference errors.
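Why decomposition could matter can be sketched in numpy: a Softmax that exponentiates in FP16 overflows for logits above roughly 11.1 (exp(11.1) ≈ 66000 > 65504), whereas the max-subtracted form stays finite. Fused attention kernels commonly keep these intermediates in FP32, which hides the problem; the values below are illustrative only.

```python
import numpy as np

logits = np.array([12.0, 10.0, 8.0], dtype=np.float16)

# Naive decomposed softmax in FP16: exp(12) overflows to inf,
# so the normalized row contains nan (inf / inf).
naive = np.exp(logits)
naive = naive / naive.sum()

# Max-subtracted softmax: the largest exponent is exp(0) = 1,
# so every intermediate stays well inside the FP16 range.
shifted = np.exp(logits - logits.max())
stable = shifted / shifted.sum()
```

Whether TensorRT's decomposed graph actually takes the naive path is an assumption here; comparing per-layer outputs (e.g. with Polygraphy) would confirm it.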

Questions

1. Why does TensorRT 8.6.1 produce correct results while TensorRT 10.3 produces incorrect results for the same ViT-B16 model?

2. Could this be related to:

  • numerical instability in FP16
  • attention decomposition in TensorRT 10.x
  • LayerNorm precision issues?

3. For deeper Transformer models such as ViT-L16, what is the recommended way to build stable FP16 TensorRT engines?

For example:

  • Should certain layers (e.g., LayerNorm or Softmax) be forced to FP32?
  • Are there recommended TensorRT build flags or plugins for Transformer models?
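As a sketch of the layer-pinning idea rather than a confirmed fix: trtexec can build an FP16 engine while pinning selected layers to FP32 via precision constraints. The flag names below are from recent trtexec builds (check `trtexec --help` for your version), and the wildcard layer-name patterns are hypothetical; they must be matched against the actual layer names in your network (visible with `--verbose` or Polygraphy inspect).

```shell
# Sketch: FP16 engine with Softmax/LayerNorm-related layers kept in FP32.
# The "*Softmax*" / "*LayerNorm*" patterns are placeholders; adapt them to
# the layer names TensorRT reports for your ONNX graph.
trtexec --onnx=model.onnx \
        --fp16 \
        --precisionConstraints=obey \
        --layerPrecisions="*Softmax*":fp32,"*LayerNorm*":fp32 \
        --saveEngine=model_fp16_mixed.engine
```

In the Python builder API the equivalent is setting `layer.precision = trt.float32` on the chosen layers and enabling `trt.BuilderFlag.OBEY_PRECISION_CONSTRAINTS` on the builder config.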

Any guidance would be greatly appreciated.

Metadata


    Labels: Module:Accuracy (output mismatch between TensorRT and other frameworks)
