Before submitting an issue, please make sure it hasn't been already addressed by searching through the existing and past issues.
Describe the bug
- I used TensorRT 8.6.1 to convert an existing ONNX model with Q/DQ nodes (exported with the Model Optimizer Toolkit) into an engine to run on an Orin-equipped platform. However, the inference results of the engine were not as good as expected: they showed a large discrepancy compared with the original Q/DQ ONNX model. For convenience, I also tried TRT 8.6.1 and TRT 10.11 on my local x86-64 workstation to convert the ONNX model into an engine, to verify whether I could reproduce the same phenomenon. Unfortunately, yes. I calculated the cosine similarity between the outputs of the Q/DQ ONNX model and the engines generated with TRT 8.6 and TRT 10.11, respectively.
As depicted above, we can indeed see a large discrepancy.
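For reference, this is roughly how I compute the cosine similarity between the two outputs (a minimal sketch; `cosine_similarity` is my own helper, applied to the flattened output tensors of the ONNX model and the engine):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Flatten both output tensors and compute their cosine similarity."""
    a = a.reshape(-1).astype(np.float64)
    b = b.reshape(-1).astype(np.float64)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Sanity check: identical outputs give similarity ~1.0
x = np.array([1.0, 2.0, 3.0])
print(cosine_similarity(x, x))
```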
Steps/Code to reproduce bug
- convert command:
trtexec --onnx=quant0206.onnx --saveEngine=quant_0206.engine --dumpProfile=true --best --verbose=true
Expected behavior
Who can help?
- I would really appreciate it if anyone could help solve this problem, or offer some instructive advice on how to approach it. My guess is that the discrepancy is introduced during the ONNX-to-TRT conversion, where precision is lost in some operators.
System information
- Container used (if applicable): nvcr.io/nvidia/tensorrt-llm/release:1.0.0
- OS (e.g., Ubuntu 22.04, CentOS 7, Windows 10): Ubuntu 20.04
- CPU architecture (x86_64, aarch64): x86_64
- GPU name (e.g. H100, A100, L40S): RTX 3060
- GPU memory size: 12 GB
- Number of GPUs: 1
- Library versions (if applicable):
- Python: 3.9
- ModelOpt version or commit hash: 0.40.0
- CUDA: 11.4 (nvcc release 11.4, V11.4.48, build cuda_11.4.r11.4/compiler.30033411_0)
- PyTorch: 2.0
- Transformers: not used
- TensorRT-LLM: not used
- ONNXRuntime: 1.19.2-gpu
- TensorRT: 8.6 and 10.11 (both tested)
- Any other details that may help: