Many transformer encoder-type models see considerable performance gains when converted to ONNX. Some model families (BERT, RoBERTa, etc.) can be further converted to ONNX-FP16 for 2-3X performance gains with no accuracy penalty. This repo contains scripts to convert models, validate their accuracy, and benchmark them.
ONNX with CUDA requires a working PyTorch installation with CUDA support, as well as transformers, optimum, pandas and tqdm. These can be installed with:
pip install transformers optimum[onnxruntime-gpu] pandas tqdm --extra-index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/onnxruntime-cuda-12/pypi/simple/
Alternatively, a conda environment named `bench` with all the requirements can be created with:
conda env create -f environment.yml
conda activate bench
A collection of ready-to-use ONNX-FP16 encoder models can be found here: https://huggingface.co/collections/joaopn/onnx-fp16
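As a minimal usage sketch (the local path below is illustrative; any ONNX-FP16 export from the collection or from export_onnx.py is loaded the same way), Optimum's ORTModelForSequenceClassification runs the exported graph through ONNX Runtime's CUDA execution provider:

```python
# Minimal sketch: run an ONNX-FP16 encoder model with Optimum + ONNX Runtime on GPU.
# The local directory below is an illustrative placeholder (e.g. an export_onnx.py output).
from transformers import AutoTokenizer, pipeline
from optimum.onnxruntime import ORTModelForSequenceClassification

onnx_dir = "./roberta-base-go_emotions-onnx-fp16"

tokenizer = AutoTokenizer.from_pretrained("SamLowe/roberta-base-go_emotions")
model = ORTModelForSequenceClassification.from_pretrained(
    onnx_dir, provider="CUDAExecutionProvider"
)

classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
print(classifier("I love this!"))
```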
GPU benchmark of the SamLowe/roberta-base-go_emotions model on a dataset of 10k random Reddit comments, comparing the PyTorch (torch), ONNX (onnx), and O4-optimized FP16 ONNX (onnx-fp16) versions.
- The ONNX-FP16 optimized model is up to 3X faster than torch. The gain depends chiefly on the GPU's memory bandwidth and its FP32:FP16 throughput ratio.
- Base ONNX is up to ~40% faster than torch.
GPU results for the normal dataset
| GPU/batch size | 1 | 2 | 4 | 8 | 16 | 32 |
|---|---|---|---|---|---|---|
| H200 (onnx-fp16) | 778.86 | 1188.25 | 1809.96 | 2253.15 | 2138.84 | 1817.96 |
| H200 (onnx) | 425.58 | 760.48 | 1184.77 | 1465.01 | 1554.82 | 1333.68 |
| H200 (torch) | 267.06 | 406.27 | 698.71 | 928.64 | 923.54 | 773.35 |
| L40S (onnx-fp16) | 874.90 | 1387.24 | 2041.68 | 2312.79 | 2052.96 | 1601.76 |
| L40S (onnx) | 512.86 | 810.75 | 1171.87 | 1185.75 | 917.84 | 618.85 |
| L40S (torch) | 271.24 | 426.93 | 700.68 | 812.81 | 697.44 | 548.10 |
| RTX 4090 (onnx-fp16) | 1042.47 | 1042.47 | 2280.61 | 2551.59 | 2346.59 | 2346.59 |
| RTX 4090 (onnx) | 595.40 | 963.06 | 1232.12 | 1183.82 | 919.05 | 646.79 |
| RTX 4090 (torch) | 323.75 | 564.39 | 857.28 | 876.10 | 668.70 | 462.63 |
| Tesla A10G (onnx-fp16) | 600.00 | 879.20 | 1094.11 | 1082.87 | 943.09 | 767.02 |
| Tesla A10G (onnx) | 326.58 | 476.80 | 556.52 | 473.00 | 365.13 | 281.95 |
| Tesla A10G (torch) | 131.10 | 236.48 | 385.63 | 402.36 | 310.15 | 231.54 |
| Tesla P40 (onnx-fp16) | 263.18 | 286.72 | 255.36 | 200.65 | 148.89 | 108.92 |
| Tesla P40 (onnx) | 212.35 | 260.29 | 247.01 | 202.54 | 155.42 | 119.59 |
| Tesla P40 (torch) | 162.19 | 218.12 | 221.68 | 177.85 | 124.72 | 80.36 |
Table 1: GPU benchmark in messages/s for the normal dataset. Results may vary due to CPU tokenizer performance.
GPU results for the filtered (>200 characters) dataset
| GPU/batch size | 1 | 2 | 4 | 8 | 16 | 32 |
|---|---|---|---|---|---|---|
| H200 (onnx-fp16) | 643.63 | 875.59 | 1199.81 | 1302.29 | 1246.55 | 1208.13 |
| H200 (onnx) | 412.22 | 598.89 | 804.16 | 950.46 | 950.46 | 901.41 |
| H200 (torch) | 240.53 | 371.92 | 544.06 | 599.08 | 550.58 | 517.23 |
| L40S (onnx-fp16) | 726.27 | 961.86 | 1273.63 | 1305.42 | 1255.20 | 1079.12 |
| L40S (onnx) | 436.19 | 630.20 | 750.88 | 631.47 | 464.44 | 359.88 |
| L40S (torch) | 255.08 | 380.23 | 490.16 | 451.38 | 392.96 | 340.52 |
| RTX 4090 (onnx-fp16) | 856.65 | 1209.98 | 1438.25 | 1513.05 | 1395.42 | 1221.52 |
| RTX 4090 (onnx) | 494.28 | 673.83 | 740.03 | 610.06 | 472.35 | 382.72 |
| RTX 4090 (torch) | 302.38 | 476.46 | 548.32 | 450.82 | 338.37 | 273.01 |
| Tesla A10G (onnx-fp16) | 463.21 | 584.19 | 624.32 | 612.12 | 554.00 | 498.06 |
| Tesla A10G (onnx) | 255.55 | 312.77 | 290.70 | 239.00 | 200.90 | 176.20 |
| Tesla A10G (torch) | 126.82 | 209.08 | 245.60 | 205.70 | 167.53 | 141.90 |
| Tesla P40 (onnx-fp16) | 154.33 | 150.74 | 126.01 | 101.90 | 81.77 | 68.15 |
| Tesla P40 (onnx) | 138.25 | 142.59 | 125.45 | 103.09 | 86.84 | 75.27 |
| Tesla P40 (torch) | 117.11 | 128.19 | 113.87 | 88.03 | 64.88 | 47.76 |
Table 2: GPU benchmark in messages/s for the filtered dataset. Results may vary due to CPU tokenizer performance.
The dataset consists of 10k randomly sampled Reddit comments from 12/2005 to 03/2023, taken from the Pushshift data dumps. It excludes comments with empty, [deleted] or [removed] content. Two variants are provided:
- `normal`: as described above.
- `filtered`: contains only comments with >200 characters.
To run the benchmarks, use the run_benchmark.py script:
python run_benchmark.py --model [torch, onnx or onnx-fp16] --device [gpu or cpu]
Arguments:
- `--model` (required): Model backend to use: "torch" (PyTorch), "onnx" (ONNX Runtime), or "onnx-fp16" (O4-optimized FP16 ONNX).
- `--device` (required): Device type to use, either "gpu" or "cpu".
- `--dataset`: Dataset variant to use, either "normal" or "filtered" (default: "normal").
- `--gpu`: ID of the GPU to use (default: 0).
- `--batches`: Comma-separated batch sizes to run (default: "1,2,4,8,16,32").
- `--threads`: Number of CPU threads to use (default: 1).
The script will output the number of messages processed per second for each batch size.
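For example, to benchmark the ONNX-FP16 model on GPU 0 with the filtered dataset and a subset of batch sizes:

python run_benchmark.py --model onnx-fp16 --device gpu --dataset filtered --gpu 0 --batches 1,8,32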
To export and optimize a HuggingFace model to ONNX FP16 format, use the export_onnx.py script:
python export_onnx.py <model_id> [OPTIONS]
This script:
- Exports a HuggingFace model to ONNX with FP16 optimization (O4 config)
- Benchmarks it against the original PyTorch model on 10k Reddit comments
- Generates a README with accuracy statistics
- Optionally uploads the optimized model to HuggingFace Hub
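For reference, the export and O4/FP16 optimization step corresponds roughly to the following Optimum calls (a sketch only; export_onnx.py's actual internals, which also include benchmarking and README generation, may differ):

```python
# Rough sketch of an ONNX export with O4 (FP16, GPU-targeted) optimization via Optimum.
from optimum.onnxruntime import ORTModelForSequenceClassification, ORTOptimizer
from optimum.onnxruntime.configuration import AutoOptimizationConfig

model_id = "SamLowe/roberta-base-go_emotions"
save_dir = "./roberta-base-go_emotions-onnx-fp16"

# Export the PyTorch model to ONNX
model = ORTModelForSequenceClassification.from_pretrained(model_id, export=True)

# Apply O4 optimization: graph fusions plus FP16 conversion (intended for GPU execution)
optimizer = ORTOptimizer.from_pretrained(model)
optimizer.optimize(save_dir=save_dir, optimization_config=AutoOptimizationConfig.O4())
```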
Arguments:
- `model_id` (required): HuggingFace model ID (e.g., "SamLowe/roberta-base-go_emotions")
- `--save-dir`: Directory to save the optimized model (default: "./{model_name}-onnx-fp16")
- `--batch-size`: Batch size for benchmarking (default: 1)
- `--hf-token`: HuggingFace API token for upload
- `--no-upload`: Skip the upload prompt and don't upload to HuggingFace
- `--disable-shape-inference`: Disable shape inference during optimization (recommended for very large models)
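For example, to export and benchmark the model locally without uploading:

python export_onnx.py SamLowe/roberta-base-go_emotions --batch-size 8 --no-upload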