Image Loading Benchmark

Overview

Benchmarks how fast popular Python libraries read JPEG images and convert them to RGB NumPy arrays. It targets machine learning training pipelines and is measured on multiple CPU architectures (Intel Xeon, AMD EPYC, ARM Neoverse, Apple M-series) using the ImageNet validation set.

Results

The plots and tables below are generated from output/<platform>/*.json. To refresh after a new run:

imread-benchmark plot --input output --output docs/assets/benchmarks
imread-benchmark render-readme

Each plot cell is labeled with img/s and the percentage of the fastest decoder on that CPU; darker cells mark the winners for that platform.

Single-thread decode throughput (img/s)

Pure decode speed with one thread, bytes pre-loaded to memory. Bold = best per platform.

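| Library | AMD EPYC 9B14 | AMD EPYC 9B45 | Intel Xeon Platinum 8581C | Neoverse-N1 | Neoverse-V2 |
| --- | --- | --- | --- | --- | --- |
| `simplejpeg` | **690** | 857 | **735** | 456 | **662** |
| `turbojpeg` | 640 | 818 | 708 | 426 | 613 |
| `jpeg4py` | 636 | 760 | 699 | 423 | 611 |
| `kornia-rs` | 642 | 761 | 664 | 391 | 629 |
| `opencv` | 664 | 841 | 721 | 445 | 645 |
| `imagecodecs` | 677 | 775 | 723 | **457** | 661 |
| `pyvips` | 420 | 586 | 462 | 261 | 413 |
| `pillow` | 537 | 726 | 577 | 360 | 551 |
| `skimage` | 475 | 661 | 525 | 326 | 499 |
| `imageio` | 496 | 599 | 524 | 335 | 506 |
| `torchvision` | 621 | **864** | 712 | 440 | 643 |
| `tensorflow` | 596 | 836 | 689 | 268 | 391 |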
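
The "% of the fastest decoder" figure used in the plot labels can be recomputed from the single-thread table. The sketch below is an illustration only, not the plotting code from this repository: it copies a few img/s values from the table for two platforms and normalizes each platform by its best decoder. The function name `percent_of_fastest` is an assumption made for this example.

```python
# Single-thread img/s copied from the table above (four libraries, two platforms).
throughput = {
    "AMD EPYC 9B14": {"simplejpeg": 690, "imagecodecs": 677, "opencv": 664, "pillow": 537},
    "Neoverse-N1": {"simplejpeg": 456, "imagecodecs": 457, "opencv": 445, "pillow": 360},
}


def percent_of_fastest(per_library):
    """Map each library to its throughput as a percentage of the best one."""
    best = max(per_library.values())
    return {lib: round(100 * value / best, 1) for lib, value in per_library.items()}


# Normalize each platform independently, since the winner differs per CPU.
relative = {platform: percent_of_fastest(libs) for platform, libs in throughput.items()}

for platform, libs in relative.items():
    print(platform, libs)
```

Normalizing per platform rather than globally keeps a slow CPU from making every decoder on it look bad; the percentage only answers "how close is this library to the best option on this machine".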

Peak DataLoader throughput (img/s)

Best images_per_second across num_workers ∈ {0, 2, 4, 8} for each library × platform, using a PyTorch DataLoader with batch_size=32. Cell format: img/s @ Nw. Bold = best per platform.

| Library | AMD EPYC 9B14 | AMD EPYC 9B45 | Intel Xeon Platinum 8581C | Neoverse-N1 | Neoverse-V2 |
| --- | --- | --- | --- | --- | --- |
| `simplejpeg` | 1,521 @ 4w | 2,739 @ 8w | **1,754 @ 8w** | **1,557 @ 8w** | 2,421 @ 8w |
| `turbojpeg` | 1,535 @ 4w | 2,800 @ 8w | 1,710 @ 8w | 1,347 @ 4w | 2,389 @ 8w |
| `jpeg4py` | 1,443 @ 4w | 2,453 @ 8w | 1,651 @ 8w | 1,411 @ 8w | 2,312 @ 8w |
| `kornia-rs` | 1,327 @ 8w | 2,394 @ 8w | 1,422 @ 8w | 1,260 @ 8w | 1,951 @ 8w |
| `opencv` | 1,457 @ 4w | 2,814 @ 8w | 1,707 @ 8w | 1,419 @ 8w | 2,414 @ 8w |
| `imagecodecs` | 1,543 @ 4w | 2,476 @ 8w | 1,677 @ 8w | 1,443 @ 8w | 2,242 @ 8w |
| `pillow` | 1,283 @ 4w | 2,465 @ 8w | 1,565 @ 8w | 1,387 @ 8w | 2,350 @ 8w |
| `skimage` | 1,238 @ 4w | 2,536 @ 8w | 1,615 @ 8w | 1,388 @ 8w | 2,315 @ 8w |
| `imageio` | 1,273 @ 4w | 2,324 @ 8w | 1,643 @ 8w | 1,466 @ 8w | **2,561 @ 8w** |
| `torchvision` | **1,596 @ 8w** | **2,920 @ 8w** | 1,612 @ 4w | 1,504 @ 8w | 2,557 @ 8w |

5 platforms · 50,000 images · 5 runs each · latest run 2026-04-22
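
Each cell in the DataLoader table is the best result over the num_workers sweep, written as img/s @ Nw. A minimal sketch of that reduction follows. The input dictionary and its numbers are made up for illustration, and `peak_cell` is a hypothetical helper, not a function from this repository.

```python
# Hypothetical sweep for one library on one platform:
# num_workers -> images_per_second. These numbers are invented for the example.
results_by_workers = {0: 400.0, 2: 950.0, 4: 1480.4, 8: 1320.9}


def peak_cell(by_workers):
    """Return the table cell text for the best num_workers setting."""
    workers, img_per_s = max(by_workers.items(), key=lambda item: item[1])
    return f"{img_per_s:,.0f} @ {workers}w"


print(peak_cell(results_by_workers))
```

Reporting the worker count next to the peak matters because the best setting is not always the largest one: several cells in the table peak at 4 workers rather than 8.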

Important Note on Image Conversion

All decoders output (H, W, 3) uint8 RGB numpy arrays for a fair comparison. Libraries that default to other formats (OpenCV → BGR, torchvision → CHW tensor, TensorFlow → EagerTensor) include a conversion step. Note that in real ML pipelines the conversion is often unnecessary.
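
The two most common conversions can be shown with NumPy alone. This is a sketch under stated assumptions, not the benchmark's own conversion code: the arrays below are synthetic stand-ins for what OpenCV (BGR, channels last) and torchvision (RGB, channels first, after converting the tensor to NumPy) would return.

```python
import numpy as np

height, width = 4, 6

# Stand-in for an OpenCV decode result: (H, W, 3) uint8 in BGR channel order.
bgr = np.zeros((height, width, 3), dtype=np.uint8)
bgr[..., 0] = 255  # channel 0 is blue in BGR, so this image is pure blue

# Reversing the last axis swaps B and R, which yields RGB.
# ascontiguousarray makes a real copy instead of a negative-stride view.
rgb_from_bgr = np.ascontiguousarray(bgr[..., ::-1])

# Stand-in for a torchvision decode result as NumPy: (3, H, W) uint8 in RGB order.
chw = np.zeros((3, height, width), dtype=np.uint8)
chw[0] = 255  # channel 0 is red, so this image is pure red

# Moving the channel axis to the end gives the (H, W, 3) layout.
rgb_from_chw = np.ascontiguousarray(chw.transpose(1, 2, 0))

print(rgb_from_bgr.shape, rgb_from_bgr[0, 0])
print(rgb_from_chw.shape, rgb_from_chw[0, 0])
```

Both conversions copy the whole image, which is why they show up in the measured throughput for libraries whose native output is not (H, W, 3) RGB.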

Benchmark Modes

Memory mode (default): images are pre-loaded as bytes before the timed loop. This measures pure decode throughput with no disk I/O.

Disk mode: each decode call reads the file from disk. Includes I/O latency.
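
The difference between the two modes comes down to where the file read happens relative to the timed loop. The sketch below illustrates that structure with the standard library only. It is not the benchmark's implementation: `decode` is a placeholder that uses zlib in place of a JPEG decoder, and the function names are assumptions made for this example.

```python
import tempfile
import time
import zlib
from pathlib import Path


def decode(data: bytes) -> bytes:
    """Placeholder for a JPEG decoder; zlib keeps the sketch dependency-free."""
    return zlib.decompress(data)


def time_memory_mode(blobs):
    """Time decoding of bytes that are already in memory."""
    start = time.perf_counter()
    for blob in blobs:
        decode(blob)
    return time.perf_counter() - start


def time_disk_mode(paths):
    """Time decoding when every call also reads its file from disk."""
    start = time.perf_counter()
    for path in paths:
        decode(path.read_bytes())
    return time.perf_counter() - start


with tempfile.TemporaryDirectory() as tmp:
    # Write a few small compressed files to play the role of the dataset.
    paths = []
    for i in range(20):
        path = Path(tmp) / f"sample_{i}.bin"
        path.write_bytes(zlib.compress(bytes([i]) * 50_000))
        paths.append(path)

    blobs = [p.read_bytes() for p in paths]  # pre-load outside the timed loop
    memory_s = time_memory_mode(blobs)
    disk_s = time_disk_mode(paths)

print(f"memory mode: {len(blobs) / memory_s:.0f} items/s")
print(f"disk mode:   {len(paths) / disk_s:.0f} items/s")
```

On a warm page cache the two numbers can be close, so a disk-mode result says as much about the storage and cache state as about the decoder.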

Dataset

ImageNet validation set — 50,000 JPEG images, ~500×400px.

# Download
wget https://image-net.org/data/ILSVRC/2012/ILSVRC2012_img_val.tar
mkdir -p imagenet/val
tar -xf ILSVRC2012_img_val.tar -C imagenet/val
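
After extracting, it can be worth confirming that the directory holds the expected 50,000 images before starting a long run. The following is a small sketch for that check, assuming the images sit directly in the target directory; the helper names are hypothetical and not part of the imread-benchmark CLI.

```python
from pathlib import Path

EXPECTED_IMAGES = 50_000  # size of the ImageNet validation set, per the text above


def count_jpegs(directory):
    """Count files with a .jpeg or .jpg extension (case-insensitive) directly in directory."""
    return sum(
        1
        for entry in Path(directory).iterdir()
        if entry.is_file() and entry.suffix.lower() in {".jpeg", ".jpg"}
    )


def check_dataset(directory, expected=EXPECTED_IMAGES):
    """Return (found, ok) so the caller can decide how to report a mismatch."""
    found = count_jpegs(directory)
    return found, found == expected
```

For example, `found, ok = check_dataset("imagenet/val")` returns the file count and whether it matches the expected size.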

System Requirements (macOS)

brew install jpeg-turbo   # required by PyTurboJPEG (pure-python ctypes binding)

pyvips ships its own bundled libvips via the pyvips-binary PyPI wheel, so no brew install vips is needed. simplejpeg wheels bundle libjpeg-turbo. On Linux you'll still need apt install libjpeg-turbo8-dev libturbojpeg0 (see gcp/vm_startup.sh), since jpeg4py is built from sdist.

Installation

# Install uv if needed
pip install uv

# Install the orchestrator (control-plane) into a venv.
# Per-library worker venvs (mainstream / tensorflow) are created lazily on
# first run, with the right libjpeg-turbo / libvips deps.
uv venv && source .venv/bin/activate
uv pip install -e .

Running the Benchmark

# What would run on this machine?
imread-benchmark list-libs

# Single + DataLoader for every supported decoder, default 50k images
imread-benchmark run --data-dir /path/to/imagenet/val

# Faster smoke run
imread-benchmark run --data-dir /path/to/imagenet/val \
    --num-images 2000 --num-runs 5 --dataloader-runs 2 \
    --workers 0,2

# Just one library, single-thread benchmark only
imread-benchmark run --data-dir /path/to/imagenet/val \
    --libs opencv --mode single

# Generate README plots from output/ JSONs
imread-benchmark plot --input output --output docs/assets/benchmarks

The CLI sets up venvs/<group>/ for each dependency group it needs. Subsequent runs reuse those venvs, so only the first invocation pays the install cost.

Running on Google Cloud

Spin up a benchmark VM on GCP, run everything against ImageNet from a GCS bucket, and have it self-delete when done:

./gcp/run.sh \
    --imagenet-bucket gs://my-bucket/imagenet/val \
    --results-bucket  gs://my-bucket/imread-results \
    --no-wait

Built venvs are cached in GCS (keyed by sha256(uv.lock)), so reruns on the same machine type skip the ~25-minute install. Use --force-rebuild to re-resolve PyPI without editing uv.lock. Full details, machine-type matrix, cost, and cache semantics: docs/gcp_benchmarks.md.
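
Keying the cache on sha256(uv.lock) means any change to the lock file produces a new key and therefore a cache miss. The sketch below shows how such a key could be derived. Only the hashing step is taken from the text above; the object name layout in `cache_object_name` is a hypothetical example, and the actual scheme is described in docs/gcp_benchmarks.md.

```python
import hashlib
from pathlib import Path


def lockfile_digest(lock_path):
    """Return the hex SHA-256 of the lock file's bytes."""
    return hashlib.sha256(Path(lock_path).read_bytes()).hexdigest()


def cache_object_name(lock_path, group):
    """Build a hypothetical cache object name; the real naming scheme may differ."""
    return f"venvs/{group}-{lockfile_digest(lock_path)}.tar.gz"
```

Because the key depends only on the lock file's contents, two runs with an identical uv.lock resolve to the same cached archive, which is what lets reruns skip the install step.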

Results Structure

output/
└── darwin_Apple-M4-Max/
    ├── opencv_results.json
    ├── pillow_results.json
    ├── opencv_dataloader_results.json
    └── ...
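
Given this layout, the result files can be indexed by platform and benchmark type from their names alone. The sketch below relies only on the file naming shown in the tree and makes no assumption about the JSON contents. `index_results` is a hypothetical helper, not part of the package.

```python
from collections import defaultdict
from pathlib import Path

DATALOADER_SUFFIX = "_dataloader_results.json"
SINGLE_SUFFIX = "_results.json"


def index_results(output_dir):
    """Group result files as {platform: {"single": [libs], "dataloader": [libs]}}."""
    index = defaultdict(lambda: {"single": [], "dataloader": []})
    for path in sorted(Path(output_dir).glob("*/*_results.json")):
        platform = path.parent.name
        # Check the longer suffix first, because it also ends with "_results.json".
        if path.name.endswith(DATALOADER_SUFFIX):
            index[platform]["dataloader"].append(path.name[: -len(DATALOADER_SUFFIX)])
        else:
            index[platform]["single"].append(path.name[: -len(SINGLE_SUFFIX)])
    return dict(index)
```

Calling `index_results("output")` on the tree above would list opencv and pillow under "single" and opencv under "dataloader" for the darwin_Apple-M4-Max platform.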

Libraries Benchmarked

Direct libjpeg-turbo (fastest)

simplejpeg — CFFI binding; zero-copy decode from bytes
turbojpeg (PyTurboJPEG) — Python binding for libjpeg-turbo
jpeg4py — direct libjpeg-turbo binding (Linux only)
kornia-rs — Rust implementation using libjpeg-turbo
OpenCV (opencv-python-headless)

Comprehensive codec libraries

imagecodecs — uses libjpeg-turbo 3.x; prebuilt ARM64 wheels
pyvips — libvips bindings (bundled in wheels). Single-thread only; the libvips threadpool deadlocks under fork-based PyTorch DataLoader, so dataloader benchmarks are skipped on every platform.

Standard libjpeg

Pillow
scikit-image
imageio

Note: Pillow-SIMD was previously included but dropped 2026-04 — upstream is abandoned (last release 2023-05), no Linux wheels, and its historical SIMD speedup is now matched by jpeg4py / simplejpeg / kornia-rs. Full rationale in docs/gcp_benchmarks.md.

ML framework components

torchvision
tensorflow

Performance Considerations

All benchmarks run single-threaded except the DataLoader benchmark, which sweeps num_workers
Memory mode is the recommended baseline — it isolates decode speed from storage
Results based on ImageNet JPEG images (~500×400px)

Recommendations

High-throughput ML training

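Use simplejpeg for maximum single-thread decode speed: in the single-thread table it is the fastest decoder on three of the five platforms and within 1% of the fastest on the other two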
Use the DataLoader benchmark to find the best num_workers for your CPU

Cross-platform

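opencv offers consistent cross-platform performance, staying within 4% of the fastest single-thread decoder on all five platforms; kornia-rs is less even, ranging from 86% to 95% of the fastest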

Feature-rich applications

opencv remains the best choice when you need more than just JPEG decoding

Development

# Run tests
uv run pytest tests/ -v

# Run linters
uv run pre-commit run --all-files

See CONTRIBUTING.md for how to add a new decoder.

Citation

If you found this work useful, please cite:

@misc{iglovikov2025speed,
      title={Need for Speed: A Comprehensive Benchmark of JPEG Decoders in Python},
      author={Vladimir Iglovikov},
      year={2025},
      eprint={2501.13131},
      archivePrefix={arXiv},
      primaryClass={eess.IV},
      doi={10.48550/arXiv.2501.13131}
}
