This folder contains scripts for evaluating the performance of various retrieval models (Single, RRF, Two-Stage) and measuring their energy consumption.
- `Results/`: Directory where evaluation results (JSON files) are typically saved.
- `energy.py`: Utility to measure GPU energy consumption.
- `time_util.py`: Utility to measure execution time (supports CUDA synchronization); a combined sketch of both utilities follows this list.
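For orientation, here is a minimal sketch of how such measurements can be taken with `pynvml` and `torch`. It illustrates the general approach only, not the actual contents of `energy.py`/`time_util.py`; the NVML energy counter requires a Volta-or-newer GPU, and `run_queries` is a placeholder for the measured workload:

```python
import time

import pynvml
import torch

def run_queries():
    pass  # placeholder for the measured workload (e.g., a batch of retrieval queries)

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

# millijoules consumed since driver load (supported on Volta and newer GPUs)
energy_start = pynvml.nvmlDeviceGetTotalEnergyConsumption(handle)
torch.cuda.synchronize()   # flush pending GPU work before starting the clock
t_start = time.perf_counter()

run_queries()

torch.cuda.synchronize()   # wait for all GPU work before stopping the clock
elapsed_s = time.perf_counter() - t_start
energy_j = (pynvml.nvmlDeviceGetTotalEnergyConsumption(handle) - energy_start) / 1000.0

print(f"time: {elapsed_s:.2f} s, energy: {energy_j:.1f} J")
pynvml.nvmlShutdown()
```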
These scripts query the retrieval endpoints and save the results (top-k images, timing, energy) to JSON files.
### `generate_metadata.py`

Purpose: Runs queries against a single endpoint or an RRF fusion of endpoints (a sketch of the RRF formula follows the argument list below). Measures time and energy.

Usage:

```bash
python3 generate_metadata.py --dataset <dataset> --suffix <suffix> [--rerankingmodel <model>] [--k <k>]
```

- Interactive mode: run without arguments to interactively select "single" or "rrf" mode.
- Arguments:
  - `--dataset`: `coco` or `flickr`
  - `--suffix`: suffix for the output filename (e.g., `_0`)
  - `--rerankingmodel`: optional neural reranking model (e.g., `blip2`)
  - `--k`: number of results to fetch (default taken from the config)
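As background, Reciprocal Rank Fusion scores each item by summing `1 / (k + rank)` over the input rankings. A minimal sketch of the standard formula (the constant `k = 60` is the conventional default, not necessarily what the script uses):

```python
from collections import defaultdict

def rrf_fuse(rankings, k=60):
    """Fuse ranked lists of ids (best first) by summing 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# fuse results from two hypothetical endpoints
print(rrf_fuse([["img3", "img1", "img7"], ["img1", "img9", "img3"]]))
```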
### `generate_metadata_all.py`

Purpose: Batch-runs `generate_metadata.py` for all single endpoints defined in the config.

Usage:

```bash
python3 generate_metadata_all.py
```

- Iterates through the endpoints and runs the evaluation for each (sketched below).
- Updates `energy_log.json` with the batch energy consumption.
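A purely illustrative sketch of such a batch loop; the endpoint names and argument wiring are hypothetical, and the real script additionally aggregates energy into `energy_log.json`:

```python
import subprocess

# hypothetical endpoint names; the real list comes from the project config
ENDPOINTS = ["clip-text", "flava-text", "uniir-joint"]

for endpoint in ENDPOINTS:
    # one generate_metadata.py run per endpoint, with a distinct output suffix
    subprocess.run(
        ["python3", "generate_metadata.py", "--dataset", "coco", "--suffix", f"_{endpoint}"],
        check=True,
    )
```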
### `generate_metadata_two_stage.py`

Purpose: Runs queries for a two-stage retrieval pipeline (e.g., Stage 1: CLIP, Stage 2: UniIR).

Usage:

```bash
python3 generate_metadata_two_stage.py --dataset <dataset> \
    --stage1_model <model> --stage1_core_type <type> \
    --stage2_model <model> --stage2_core_type <type> \
    --db <solr|faiss> [--reranking_model <model>]
```

Example:

```bash
python3 generate_metadata_two_stage.py --dataset coco --stage1_model clip --stage1_core_type text --stage2_model uniir --stage2_core_type joint-image-text --db faiss
```
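Conceptually, stage 1 fetches a broad candidate set cheaply and stage 2 re-scores it with a stronger model. A minimal sketch, where `stage1_search`/`stage2_score` are hypothetical stand-ins for the actual endpoint calls:

```python
def two_stage_retrieve(query, stage1_search, stage2_score, k1=100, k=10):
    """Retrieve k1 candidates with the cheap stage-1 model, re-rank with stage 2."""
    candidates = stage1_search(query, top_k=k1)        # e.g., CLIP over a text core
    rescored = [(doc, stage2_score(query, doc)) for doc in candidates]  # e.g., UniIR
    rescored.sort(key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in rescored[:k]]

# toy demo with dummy stand-ins
print(two_stage_retrieve(
    "a dog on a beach",
    stage1_search=lambda q, top_k: ["img1", "img22", "img333"][:top_k],
    stage2_score=lambda q, doc: len(doc),   # dummy relevance score
))
```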
### `two_stage_all.py`

Purpose: Batch runner for `generate_metadata_two_stage.py`.

Usage:

```bash
python3 two_stage_all.py
```

- Configured via the `RUN_LIST` variable inside the script (an illustrative shape is sketched below).
- Supports "SIMPLE" runs and "RERANK" runs.
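The `RUN_LIST` schema isn't documented here; a plausible, purely illustrative shape based on the script's arguments:

```python
# Illustrative RUN_LIST shape; the actual schema in two_stage_all.py may differ.
RUN_LIST = [
    # "SIMPLE" run: plain two-stage retrieval
    {"mode": "SIMPLE", "dataset": "coco",
     "stage1_model": "clip", "stage1_core_type": "text",
     "stage2_model": "uniir", "stage2_core_type": "joint-image-text",
     "db": "faiss"},
    # "RERANK" run: same pipeline plus a neural reranking model
    {"mode": "RERANK", "dataset": "flickr",
     "stage1_model": "clip", "stage1_core_type": "text",
     "stage2_model": "uniir", "stage2_core_type": "joint-image-text",
     "db": "solr", "reranking_model": "blip2"},
]
```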
### `rrf_evaluation.py`

Purpose: Generates RRF results for combinations of endpoints (Solr/Faiss).

Usage:

```bash
python3 rrf_evaluation.py
```

- Automatically generates pairs of endpoints for RRF fusion (see the sketch below).
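Automatic pair generation can be as simple as `itertools.combinations`; a sketch with hypothetical endpoint names:

```python
from itertools import combinations

# hypothetical endpoint names (the real list comes from the config)
endpoints = ["solr-clip-text", "faiss-clip-text", "faiss-uniir-joint"]

# every unordered pair of distinct endpoints becomes one RRF fusion run
for left, right in combinations(endpoints, 2):
    print(f"fusing {left} + {right}")
```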
### `rrf_evaluation2.py`

Purpose: Similar to `rrf_evaluation.py`, but uses an explicit list of RRF pairs defined in `RRF_PAIRS`.

Usage:

```bash
python3 rrf_evaluation2.py
```

These scripts calculate Recall@1, Recall@5, and Recall@10 from the generated JSON result files.
### `evaluation_json.py`

Purpose: Calculates metrics for a single result JSON file.

Usage:

```bash
python3 evaluation_json.py --result <path_to_json> --dataset <dataset>
```
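Recall@k is the fraction of queries whose ground-truth image appears among the top k retrieved results. A self-contained sketch; the JSON field names (`query_id`, `retrieved`) are assumptions, not the repository's actual schema:

```python
def recall_at_k(results, ground_truth, k):
    """Fraction of queries whose ground-truth image is in the top-k list."""
    hits = sum(1 for r in results if ground_truth[r["query_id"]] in r["retrieved"][:k])
    return hits / len(results)

# tiny inline example
results = [
    {"query_id": "q1", "retrieved": ["img3", "img1", "img7"]},
    {"query_id": "q2", "retrieved": ["img9", "img2", "img4"]},
]
ground_truth = {"q1": "img1", "q2": "img5"}
for k in (1, 5, 10):
    print(f"Recall@{k}: {recall_at_k(results, ground_truth, k):.3f}")
```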
### `evaluation_json_all.py`

Purpose: Batch-evaluates all `result_twoStage*.json` files in a directory and saves a CSV report.

Usage:

```bash
python3 evaluation_json_all.py --results_dir <dir> --output_csv <file.csv>
```
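The glob-and-aggregate skeleton of such a batch report might look like the following. The `recall@k` fields are hypothetical; the real script computes the metrics itself (as in the Recall@k sketch above) rather than reading them from the files:

```python
import glob
import json

import pandas as pd

rows = []
for path in sorted(glob.glob("Results/result_twoStage*.json")):
    with open(path) as f:
        data = json.load(f)
    # hypothetical precomputed fields, one CSV row per result file
    rows.append({"file": path, **{k: data.get(k) for k in ("recall@1", "recall@5", "recall@10")}})

pd.DataFrame(rows).to_csv("report.csv", index=False)
```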
### `evaluation_json_rrf_all.py`

Purpose: Batch-evaluates all `results_rrf*.json` files in a directory and saves a CSV report.

Usage:

```bash
python3 evaluation_json_rrf_all.py --results_dir <dir> --output_csv <file.csv>
```

Other evaluation scripts:

- `evaluation.py`: Simple script to evaluate a single service URL (hardcoded in the script) against ground truth.
- `evalation_updated.py`: Updated version of the evaluation logic; supports single and RRF modes via hardcoded endpoints.
- `test_eval.py`: Quick test script for specific hardcoded endpoints (Flava/UniIR).
- `test_evalu.py`: Evaluation script for "preflmr" or similar variants; supports a `variant` parameter.
Dependencies:

- `requests`
- `tqdm`
- `pandas`
- `pynvml` (for energy measurement)
- `torch` (for time synchronization)
- Project config (`Benchmark.config.config_utils`)
Notes:

- Ensure the backend services (Solr/Faiss/Flask) are running before executing the metadata generation scripts.
- Scripts often assume a specific directory structure for config and data (e.g., `/mnt/storage/...`). Check the `sys.path.append` calls and config paths if running in a new environment (see the sketch below).
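For example, pointing the import path at a new checkout location (the path below is a placeholder, not the actual layout):

```python
import sys

# placeholder path; replace with wherever the project root actually lives
sys.path.append("/path/to/project/root")

from Benchmark.config import config_utils  # should now resolve
```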