> [!IMPORTANT]
> This repository is now focused on pipeline evaluation for visual document retrieval tasks. All other functionalities (vision retriever evaluation, legacy benchmarks) are kept for reproducibility purposes but are deprecated and no longer actively maintained.
Pipeline evaluation allows you to evaluate complete end-to-end retrieval systems on the ViDoRe v3 benchmark datasets. Unlike traditional retriever evaluation that focuses on individual model components, pipeline evaluation lets you test:
- Multi-stage retrieval systems (e.g., retrieve + rerank)
- Hybrid approaches (e.g., dense + sparse retrieval fusion)
- Custom preprocessing pipelines (e.g., OCR → chunking → embedding)
- Arbitrary retrieval logic that goes beyond standard dense/sparse retrievers
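As an illustration of the hybrid case, merging dense and sparse rankings can be as simple as reciprocal rank fusion (RRF). The sketch below is generic Python, not part of this framework; the function name and the `k = 60` constant are our own illustrative choices:

```python
from typing import Dict, List

def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> Dict[str, float]:
    """Merge several ranked lists of document ids into one fused score dict.

    Each document earns 1 / (k + rank) from every list it appears in, so
    documents ranked highly by several retrievers float to the top.
    """
    fused: Dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return fused

# Fuse a dense ranking with a sparse (BM25-style) ranking
dense = ["doc_a", "doc_b", "doc_c"]
sparse = ["doc_a", "doc_d", "doc_b"]
scores = reciprocal_rank_fusion([dense, sparse])  # doc_a tops both lists, so it wins
```

The fused dict already has the `{corpus_id: score}` shape that a pipeline's `search()` method is expected to return per query.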
This repository serves as the primary community results repository for visual document retrieval benchmarks using complex pipelines. We encourage researchers and practitioners to submit their pipeline evaluation results to create a centralized location where the community can compare different approaches and track progress on ViDoRe v3 datasets.
To contribute your pipeline results to the leaderboard:

1. **Run evaluations** using this framework on the English splits of the ViDoRe v3 datasets (`--language english` in the CLI). The framework tracks raw scores as well as indexing and search times.

2. **Open a Pull Request** with the following:

   - **Results files**: Add your JSON result files to the `results/metrics` folder, organized as:

     ```
     results/metrics/your_pipeline_name/
     ├── vidore_v3_hr.json
     ├── vidore_v3_finance_en.json
     ├── vidore_v3_industrial.json
     └── ... (other datasets)
     ```

   - **Pipeline description**: Include a `description.json` file in the same PR that describes the architecture used. A pipeline is represented as a graph of modules (OCR, retriever, reranker, MCP server, ...) linked together via edges. Example pipeline description files can be found in `results/pipeline_descriptions`.

We encourage adding as much hardware information as possible in the description so the community can get a feel for the latency of each pipeline.
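For orientation, a hypothetical `description.json` could look like the fragment below. Every field name here is illustrative, invented for this sketch; check the example files in `results/pipeline_descriptions` for the actual schema:

```json
{
  "pipeline_name": "ocr_bm25_rerank",
  "modules": [
    {"id": "ocr", "type": "ocr", "model": "tesseract-5"},
    {"id": "retriever", "type": "retriever", "model": "bm25"},
    {"id": "reranker", "type": "reranker", "model": "my-reranker-v1"}
  ],
  "edges": [
    {"from": "ocr", "to": "retriever"},
    {"from": "retriever", "to": "reranker"}
  ],
  "hardware": {"gpu": "1x A100 80GB", "cpu": "16 vCPU"}
}
```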
Install the package:

```bash
pip install vidore-benchmark
```

List all ViDoRe v3 datasets:

```bash
vidore-benchmark pipeline list-datasets
```

Available datasets:

- `vidore/vidore_v3_hr` - Human Resources documents
- `vidore/vidore_v3_finance_en` - Financial documents (English)
- `vidore/vidore_v3_industrial` - Industrial documents
- `vidore/vidore_v3_pharmaceuticals` - Pharmaceutical documents
- `vidore/vidore_v3_computer_science` - Computer Science documents
- `vidore/vidore_v3_energy` - Energy sector documents
- `vidore/vidore_v3_physics` - Physics documents
- `vidore/vidore_v3_finance_fr` - Financial documents (French)
You can evaluate any pipeline that inherits from `BasePipeline`. Some pipelines are already implemented in the `pipeline_implementations` folder.
Evaluate your own pipeline implementation:

```bash
vidore-benchmark pipeline evaluate \
    --dataset-name vidore/vidore_v3_hr \
    --module-path path/to/my_pipeline.py \
    --class-name MyCustomPipeline \
    --language english \
    --pipeline-args '{"model_name": "my-model"}'
```

Your pipeline file (`my_pipeline.py`):
```python
from vidore_benchmark.pipeline_evaluation import BasePipeline

class MyCustomPipeline(BasePipeline):
    def __init__(self, model_name):
        self.model_name = model_name
        # Initialize your model here

    def index(self, corpus_ids, corpus_images, corpus_texts):
        # Process the corpus; store anything relevant as class attributes
        self.corpus_ids = corpus_ids
        ...

    def search(self, query_ids, queries):
        # Your search logic; returns a scores dict of shape
        # {query_id: {corpus_id: score}} (see the BasePipeline file for details)
        return {query_id: {corpus_id: score}}
```

Some datasets contain multilingual queries. You can filter by language:
```bash
vidore-benchmark pipeline evaluate \
    --dataset-name vidore/vidore_v3_hr \
    --pipeline-type random \
    --language english
```

Evaluate your pipeline on all ViDoRe v3 datasets:
With a built-in pipeline:

```bash
vidore-benchmark pipeline evaluate-all \
    --pipeline-type random \
    --pipeline-args '{"seed": 42}' \
    --output-dir results/
```

With a custom pipeline:

```bash
vidore-benchmark pipeline evaluate-all \
    --module-path my_pipeline.py \
    --class-name MyCustomPipeline \
    --output-dir results/
```

To evaluate a custom pipeline, inherit from `BasePipeline` and implement the `index()` and `search()` methods:
```python
from path_to_pipeline import MyCustomPipeline
from vidore_benchmark.pipeline_evaluation import (
    load_vidore_dataset,
    evaluate_retrieval,
    aggregate_results,
)

# Load dataset
query_ids, queries, corpus_ids, corpus_images, corpus_texts, qrels = load_vidore_dataset(
    dataset_name="vidore/vidore_v3_hr",
    split="test",
)

# Initialize your pipeline
pipeline = MyCustomPipeline(retriever=my_retriever, reranker=my_reranker)

# Run evaluation
results = evaluate_retrieval(
    pipeline=pipeline,
    query_ids=query_ids,
    queries=queries,
    corpus_ids=corpus_ids,
    corpus_images=corpus_images,
    corpus_texts=corpus_texts,
    qrels=qrels,
    metrics=["ndcg_cut_10", "recall_10"],
)

# Get aggregate scores
aggregated = aggregate_results(results)
print(f"NDCG@10: {aggregated['ndcg_cut_10']:.4f}")
```

Some examples of pipeline implementations can be found in the `pipeline_implementations` folder.
Pipelines can optionally return additional tracking information alongside retrieval results. This is useful for monitoring costs, timing, resource usage, or other custom metrics:
```python
from typing import Any, Dict, List, Optional, Tuple

class PipelineWithMetrics(BasePipeline):
    def index(
        self,
        corpus_ids: List[str],
        corpus_images: List[Any],
        corpus_texts: List[str],
    ) -> None:
        # Indexing logic
        ...

    def search(
        self,
        query_ids: List[str],
        queries: List[str],
    ) -> Tuple[Dict[str, Dict[str, float]], Optional[Dict[str, Any]]]:
        """
        Return both retrieval results and optional tracking metrics.

        Returns:
            Tuple of (results, infos), where infos can contain:
            - Cost tracking (e.g., API costs, GPU hours)
            - Granular timing information
            - Resource usage (num_gpus, memory, etc.)
            - Model-specific metadata
        """
        # Your retrieval logic here
        results = {...}

        # Optional: track additional metrics
        infos = {
            "estimated_cost_usd": 0.05,
            "num_gpus": 1,
            "total_time_ms": 1234.5,
            "model_name": "my-model-v1",
        }
        return results, infos
```

The `infos` dictionary will be stored in the evaluation results under the `_infos` key. This is completely optional: pipelines can still return just the results dictionary for backward compatibility:

```python
class SimplePipeline(BasePipeline):
    def search(...) -> Dict[str, Dict[str, float]]:
        # Just return results, no tracking needed
        return results
```

See `example_pipelines/pipeline_with_metrics.py` for a complete example.
```python
from vidore_benchmark.pipeline_evaluation import (
    load_vidore_dataset,
    print_dataset_info,
    get_available_datasets,
)

# List available datasets
datasets = get_available_datasets()
print(datasets)

# Load and inspect a dataset
query_ids, queries, corpus_ids, corpus, qrels = load_vidore_dataset(
    "vidore/vidore_v3_industrial"
)
print_dataset_info(
    dataset_name="vidore/vidore_v3_industrial",
    query_ids=query_ids,
    queries=queries,
    corpus_ids=corpus_ids,
    corpus=corpus,
    qrels=qrels,
)
```

You can specify custom metrics to evaluate:
```python
results = evaluate_retrieval(
    pipeline=pipeline,
    query_ids=query_ids,
    queries=queries,
    corpus_ids=corpus_ids,
    corpus=corpus,
    qrels=qrels,
    metrics=[
        "ndcg_cut_5",
        "ndcg_cut_10",
        "recall_5",
        "recall_10",
        "map",
    ],
)
```

All metrics supported by `pytrec_eval` are available.
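For intuition about what a metric string like `ndcg_cut_10` measures, here is a plain-Python sketch of NDCG@k for a single query (the framework itself delegates this computation to `pytrec_eval`):

```python
import math
from typing import Dict, List

def ndcg_at_k(ranked_ids: List[str], qrels: Dict[str, int], k: int = 10) -> float:
    """NDCG@k: DCG of the returned ranking divided by the best achievable DCG."""
    dcg = sum(
        qrels.get(doc_id, 0) / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
    )
    ideal_dcg = sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(sorted(qrels.values(), reverse=True)[:k], start=1)
    )
    return dcg / ideal_dcg if ideal_dcg > 0 else 0.0

qrels = {"d1": 2, "d2": 1}  # graded relevance judgments for one query
perfect = ndcg_at_k(["d1", "d2", "d3"], qrels)  # relevant docs first -> 1.0
swapped = ndcg_at_k(["d3", "d2", "d1"], qrels)  # relevant docs pushed down -> below 1.0
```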
The pipeline evaluation framework consists of:

- **BasePipeline**: Abstract base class for implementing custom pipelines
- **Dataset loaders**: Functions to load ViDoRe v3 datasets from Hugging Face
- **Evaluator**: Uses `pytrec_eval` to compute retrieval metrics
- **CLI**: Commands for evaluating any custom pipeline
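The contract the evaluator relies on can be summarized as follows. This is an approximation of the interface for orientation, not the actual `base_pipeline.py` source; check that file for the authoritative signatures:

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List

class BasePipelineSketch(ABC):
    """Approximation of the BasePipeline contract: index once, then search."""

    @abstractmethod
    def index(
        self,
        corpus_ids: List[str],
        corpus_images: List[Any],
        corpus_texts: List[str],
    ) -> None:
        """Ingest the corpus; store whatever search() needs as attributes."""

    @abstractmethod
    def search(
        self,
        query_ids: List[str],
        queries: List[str],
    ) -> Dict[str, Dict[str, float]]:
        """Return {query_id: {corpus_id: score}} for every query."""

class ConstantPipeline(BasePipelineSketch):
    """Trivial concrete pipeline: scores every document 0.0."""

    def index(self, corpus_ids, corpus_images, corpus_texts):
        self.corpus_ids = corpus_ids

    def search(self, query_ids, queries):
        return {qid: {cid: 0.0 for cid in self.corpus_ids} for qid in query_ids}
```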
```
vidore_benchmark/
├── pipeline_evaluation/
│   ├── base_pipeline.py         # BasePipeline abstract class
│   ├── dataset_loader.py        # ViDoRe v3 dataset loading
│   ├── evaluator.py             # Evaluation orchestration
│   └── utils.py                 # Helper utilities
└── cli/
    └── pipeline_evaluation.py   # CLI for pipeline evaluation
```
This repository previously focused on evaluating vision retrievers on the ViDoRe v1 and v2 benchmarks. All code related to these functionalities is still available but deprecated:

- **Vision retriever evaluation**: See `README_OLD.md`
- **ViDoRe benchmarks v1/v2**: Now maintained in MTEB
- **Model implementations**: Available in `src/vidore_benchmark/retrievers/` (for reference only)
The recommended setup going forward is:

- Using MTEB for vision retriever evaluation on ViDoRe v1/v2
- Using this framework for pipeline evaluation on ViDoRe v3

For reproducibility of published results, see `REPRODUCIBILITY.md`.
We welcome contributions for:
- New example pipelines
- Additional evaluation results
- Dataset utilities
- Documentation improvements
Please open an issue or PR on GitHub.
If you use this framework or the ViDoRe benchmark in your research, please cite:
**ColPali: Efficient Document Retrieval with Vision Language Models**

```bibtex
@misc{faysse2024colpaliefficientdocumentretrieval,
  title={ColPali: Efficient Document Retrieval with Vision Language Models},
  author={Manuel Faysse and Hugues Sibille and Tony Wu and Bilel Omrani and Gautier Viaud and Céline Hudelot and Pierre Colombo},
  year={2024},
  eprint={2407.01449},
  archivePrefix={arXiv},
  primaryClass={cs.IR},
  url={https://arxiv.org/abs/2407.01449},
}
```

**ViDoRe Benchmark V2: Raising the Bar for Visual Retrieval**

```bibtex
@misc{macé2025vidorebenchmarkv2raising,
  title={ViDoRe Benchmark V2: Raising the Bar for Visual Retrieval},
  author={Quentin Macé and António Loison and Manuel Faysse},
  year={2025},
  eprint={2505.17166},
  archivePrefix={arXiv},
  primaryClass={cs.IR},
  url={https://arxiv.org/abs/2505.17166},
}
```

**ViDoRe V3: A Comprehensive Evaluation of Retrieval Augmented Generation in Complex Real-World Scenarios**

```bibtex
@misc{loison2026vidore,
  title={ViDoRe V3: A Comprehensive Evaluation of Retrieval Augmented Generation in Complex Real-World Scenarios},
  author={Loison, Ant{\'o}nio and Mac{\'e}, Quentin and Edy, Antoine and Xing, Victor and Balough, Tom and Moreira, Gabriel and Liu, Bo and Faysse, Manuel and Hudelot, C{\'e}line and Viaud, Gautier},
  journal={arXiv preprint arXiv:2601.08620},
  year={2026}
}
```

This project is licensed under the MIT License - see the LICENSE file for details.