This repository contains the replication package for the paper:
Beyond Synthetic Benchmarks: Evaluating LLM Performance on Real-World Class-Level Code Generation
Authors: Musfiqur Rahman, SayedHassan Khatoonabadi, Emad Shihab
Published at: arXiv.org, 2025
The goal of this repository is to ensure reproducibility of all experiments, figures, and results presented in the paper.
Clone the repository and move into it:
git clone https://github.com/mrsumitbd/RealClassEval-Replication.git
cd RealClassEval-Replication
The repository is organized as follows:
RealClassEval-Replication/
├── notebooks/ # Jupyter notebooks for experiments, analysis, figures
│ └── plot_generator.ipynb
│
├── src/ # Python source code (modules, utilities, pipelines)
│ ├── __init__.py
│ ├── rag/
│ ├── ...
│ └── utils.py
│
├── data/ # Placeholder for datasets and metadata
│ ├── functional_correctness_data/
│ └── generated_code/
│
├── results/ # Output results (figures, metrics, etc.)
│ ├── rq1/
│ ├── ...
│ └── rq4/
│
├── rag_experiments/ # Stores all files generated while running the RAG experiments
│
├── functional_correctness_test_folder/ # Workspace where the functional correctness tests are run; kept separate for easier access and organization
│
├── setup.sh # Setup script for Linux/macOS
├── .gitignore # Ignore unnecessary files
├── .env.example # Template for environment variables
├── requirements.txt # Python dependencies
├── environment.yml # Conda environment file
├── README.md # Documentation
└── LICENSE # License file
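As a quick sanity check after cloning, the Python sketch below reports which top-level entries from the tree above are missing. The names are copied from the tree; directories that are only filled while experiments run (rag_experiments/, functional_correctness_test_folder/) are deliberately left out, on the assumption that they may not exist in a fresh clone. The helper `missing_entries` is hypothetical and not part of the repository.

```python
from pathlib import Path

# Top-level entries taken from the directory tree above.
# Run-time folders (rag_experiments/, functional_correctness_test_folder/)
# are omitted because they may be empty or absent in a fresh clone (assumption).
EXPECTED = [
    "notebooks", "src", "data", "results",
    "setup.sh", "requirements.txt", "environment.yml",
    ".env.example", "README.md", "LICENSE",
]


def missing_entries(root: Path, expected: list[str] = EXPECTED) -> list[str]:
    """Return the expected entries that do not exist under `root`, in order."""
    return [name for name in expected if not (root / name).exists()]
```

Usage from the repository root: `missing_entries(Path("."))` returns an empty list when the layout matches the tree.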
Option 1: setup script (Linux/macOS)
bash setup.sh
This will:
- Verify that Python 3.11 is installed
- Create a virtual environment in venv/
- Install all dependencies from requirements.txt
After running setup.sh, activate the environment manually:
source venv/bin/activate
Option 2: manual virtual environment
python3.11 -m venv venv
source venv/bin/activate # On macOS/Linux
pip install --upgrade pip
pip install -r requirements.txt
Option 3: Conda
If you prefer Conda, use the provided environment.yml:
# Create the environment
conda env create -f environment.yml
# Activate it
conda activate OpenClassGen-replication
After setting up the environment (using one of the three options above), create a .env file in the root directory by copying .env.example:
cp .env.example .env # Linux/macOS
In addition to the Python environment, some scripts require external tools.
This project uses cloc to count lines of code.
You need to install it separately on your system.
If you use Homebrew (macOS):
brew install cloc
On Debian/Ubuntu:
sudo apt-get update
sudo apt-get install cloc
On Fedora:
sudo dnf install cloc
Once installed, you can verify with:
cloc --version
Datasets are included in the data/ folder.
All results (figures, tables) are stored in the results/ directory.
Pre-generated results are provided for reference where possible.
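The repository's own scripts that call cloc are not reproduced here. As an illustration only, the sketch below shows one way to obtain a code-line count for a file or folder (for example a class under data/generated_code/) using cloc's `--json` and `--quiet` options; the helper names `parse_cloc_json` and `count_code_lines` are hypothetical, and the default language key "Python" is an assumption.

```python
import json
import shutil
import subprocess


def parse_cloc_json(raw: str, language: str = "Python") -> int:
    """Return the 'code' count for one language from cloc --json output.

    Returns 0 when the output is empty (cloc found no countable files)
    or when the language key is absent from the report.
    """
    if not raw.strip():
        return 0
    report = json.loads(raw)
    return int(report.get(language, {}).get("code", 0))


def count_code_lines(path: str, language: str = "Python") -> int:
    """Run cloc on `path` and return its non-blank, non-comment line count."""
    if shutil.which("cloc") is None:
        raise RuntimeError("cloc is not on PATH; install it first")
    # --quiet suppresses progress messages so stdout holds only the JSON report.
    result = subprocess.run(
        ["cloc", "--json", "--quiet", path],
        capture_output=True, text=True, check=True,
    )
    return parse_cloc_json(result.stdout, language)
```

Parsing is kept separate from the subprocess call so the JSON handling can be checked without cloc installed.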
This project is licensed under the MIT License.
If you use this replication package, please cite our paper:
@misc{rahman2025syntheticbenchmarksevaluatingllm,
title={Beyond Synthetic Benchmarks: Evaluating LLM Performance on Real-World Class-Level Code Generation},
author={Musfiqur Rahman and SayedHassan Khatoonabadi and Emad Shihab},
year={2025},
eprint={2510.26130},
archivePrefix={arXiv},
primaryClass={cs.SE},
url={https://arxiv.org/abs/2510.26130},
}