📄 Paper | 🤗 Dataset | 🛠️ Tools | 🌐 Website | 💾 Cache (Baseline) | 💾 Cache (Full)
SimulCost is a cost-aware benchmark and toolkit for evaluating how well LLM agents tune simulation parameters under realistic computational budgets. Unlike prior evaluations that focus on correctness (and sometimes token cost) while implicitly treating tool usage as "free," SimulCost explicitly measures both: (1) whether a proposed configuration meets an accuracy target and (2) how much simulation compute it consumes.
The benchmark covers 12 physics simulators across fluid dynamics, solid mechanics, and plasma physics, with single-round (one-shot initial guess) and multi-round (trial-and-error with feedback) settings. Tool cost is defined in a platform-independent way (analytical cost models / FLOPs) to make results reproducible and comparable.
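As a toy illustration of such an analytical cost model (the formula and numbers below are hypothetical, not the benchmark's actual cost definition): for an explicit 1D scheme, total work scales with grid points times time steps, and the CFL number ties the time step to the grid spacing, so refining the grid raises the cost superlinearly.

```python
def sim_cost(n_space: int, cfl: float, t_end: float = 1.0, domain: float = 1.0) -> float:
    """Toy analytical cost model for an explicit 1D scheme (illustrative only).

    Cost is proportional to (grid points) x (time steps); the CFL number
    fixes the time step as dt = cfl * dx, so halving dx roughly
    quadruples the total work.
    """
    dx = domain / n_space
    dt = cfl * dx                     # explicit stability constraint
    n_steps = round(t_end / dt)       # time steps needed to reach t_end
    return float(n_space * n_steps)   # FLOPs up to a constant factor

# A finer grid is more accurate but costs more: the agent must trade these off.
coarse = sim_cost(n_space=100, cfl=0.5)   # 100 * 200 = 20,000
fine = sim_cost(n_space=400, cfl=0.5)     # 400 * 800 = 320,000
```

This is exactly the kind of trade-off the agent faces in every task: pick parameters accurate enough to pass the target while paying as little simulation compute as possible.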
- Environment Setup
- Tasks and Zero-Shot Support
- Generate Questions
- Generate Benchmark Datasets
- Configure Model Providers
- Customize Simulation Results Directory
- Pre-cached Simulation Results
- Run Inference
- Resume Functionality
- Evaluate Performance
- Tabulate Results
- Script Usage Guide
To clone the repository and initialize the submodule, use the following commands:
# Clone repository with submodule initialization
git clone --recursive https://github.com/Rose-STL-Lab/SimulCost-Bench.git
# If you've already cloned the repository without --recursive, run the following to initialize the submodule:
cd SimulCost-Bench
git submodule update --init --recursive
Use the --recursive flag when cloning to ensure all dependencies are initialized properly.
If you encounter submodule issues after cloning, try these commands:
# Sync submodule URLs (if they've changed)
git submodule sync --recursive
# Re-initialize all submodules
git submodule update --init --recursive
# Create Conda environment
conda env create -f environment.yml
conda activate simulcost
# Install dependencies with Poetry
poetry install --no-root
Note: To run 1D EPOCH PIC simulations, see the EPOCH Setup Guide for additional configuration requirements.
Note: To run 2D Euler gas dynamics simulations, see the Euler 2D Setup Guide for additional configuration requirements.
Note: To run 2D FEM simulations with FastIPC solver, see the FEM 2D Setup Guide for compilation and configuration requirements.
A pre-built Docker image is available with all dependencies and solvers compiled. This lets you skip the Conda / Poetry / solver setup entirely.
1. Clone the repository:
git clone --recursive https://github.com/Rose-STL-Lab/SimulCost-Bench.git
cd SimulCost-Bench
2. Pull the image:
docker pull ghcr.io/leo-lsc/simulcost-bench:latest
docker tag ghcr.io/leo-lsc/simulcost-bench:latest simulcost-bench
Or build locally with docker build -t simulcost-bench .
3. Run the container (Dev / hot-reload):
This mode mounts frequently-edited code directories from the host into the container so changes take effect immediately, without rebuilding the image.
Do NOT mount costsci_tools/ unless you also build the solvers on the host; otherwise you may overwrite the compiled binaries shipped in the image.
docker run --rm -it \
--env-file .env \
-v $(pwd)/scripts:/app/scripts \
-v $(pwd)/inference:/app/inference \
-v $(pwd)/evaluation:/app/evaluation \
-v $(pwd)/configs:/app/configs \
-v $(pwd)/custom_model:/app/custom_model \
-v $(pwd)/sim_res:/app/sim_res \
-v $(pwd)/eval_results:/app/eval_results \
-v $(pwd)/results_model_attempt:/app/results_model_attempt \
-v $(pwd)/log_model_tool_call:/app/log_model_tool_call \
-v $(pwd)/data:/app/data \
simulcost-bench
--env-file .env passes your API keys (e.g. OpenAI / AWS Bedrock) into the container. The -v mounts persist results to the host; without them, all output is lost when the container exits.
Note: The FEM 2D solver (FastIPC) is compiled with -mavx -mavx2 -mfma. Your CPU (both build and run host) must support AVX/AVX2/FMA instructions.
Click to expand the full task support table
The table below summarizes the available tasks for each simulation type and indicates whether each task supports zero-shot inference.
| Simulation Type | Task Type | Iterative & Zero-Shot |
|---|---|---|
| 1D Heat Transfer | `cfl` | ✅ Supported |
| 1D Heat Transfer | `n_space` | ✅ Supported |
| 2D Steady Heat Transfer | `dx` | ✅ Supported |
| 2D Steady Heat Transfer | `error_threshold` | ✅ Supported |
| 2D Steady Heat Transfer | `relax` | ❌ Only Zero-Shot |
| 2D Steady Heat Transfer | `t_init` | ❌ Only Zero-Shot |
| 1D Burgers | `cfl` | ✅ Supported |
| 1D Burgers | `k` | ✅ Supported (Compositional) |
| 1D Burgers | `beta` | ✅ Supported (Compositional) |
| 1D Burgers | `n_space` | ✅ Supported |
| 1D Euler | `cfl` | ✅ Supported |
| 1D Euler | `beta` | ✅ Supported (Compositional) |
| 1D Euler | `k` | ✅ Supported (Compositional) |
| 1D Euler | `n_space` | ✅ Supported |
| 2D Navier-Stokes Channel | `mesh_x` | ✅ Supported |
| 2D Navier-Stokes Channel | `mesh_y` | ✅ Supported |
| 2D Navier-Stokes Channel | `omega_u` | ❌ Only Zero-Shot |
| 2D Navier-Stokes Channel | `omega_v` | ❌ Only Zero-Shot |
| 2D Navier-Stokes Channel | `omega_p` | ❌ Only Zero-Shot |
| 2D Navier-Stokes Channel | `diff_u_threshold` | ❌ Only Zero-Shot |
| 2D Navier-Stokes Channel | `diff_v_threshold` | ❌ Only Zero-Shot |
| 2D Navier-Stokes Channel | `res_iter_v_threshold` | ❌ Only Zero-Shot |
| 2D Navier-Stokes Transient | `resolution` | ✅ Supported |
| 2D Navier-Stokes Transient | `cfl` | ✅ Supported |
| 2D Navier-Stokes Transient | `relaxation_factor` | ❌ Only Zero-Shot |
| 2D Navier-Stokes Transient | `residual_threshold` | ❌ Only Zero-Shot |
| 1D EPOCH PIC | `nx` | ✅ Supported |
| 1D EPOCH PIC | `npart` | ✅ Supported |
| 1D EPOCH PIC | `dt_multiplier` | ✅ Supported (Compositional) |
| 1D EPOCH PIC | `field_order` | ✅ Supported (Compositional) |
| 1D EPOCH PIC | `particle_order` | ✅ Supported (Compositional) |
| 2D MPM | `nx` | ✅ Supported |
| 2D MPM | `n_part` | ✅ Supported |
| 2D MPM | `cfl` | ✅ Supported |
| 1D Diffusion-Reaction | `cfl` | ✅ Supported |
| 1D Diffusion-Reaction | `n_space` | ✅ Supported |
| 1D Diffusion-Reaction | `tol` | ✅ Supported |
| 2D Euler Gas Dynamics | `n_grid_x` | ✅ Supported |
| 2D Euler Gas Dynamics | `cfl` | ✅ Supported |
| 2D Euler Gas Dynamics | `cg_tolerance` | ✅ Supported |
| Hasegawa-Mima Nonlinear | `N` | ✅ Supported |
| Hasegawa-Mima Nonlinear | `dt` | ✅ Supported |
| Hasegawa-Mima Linear | `N` | ✅ Supported |
| Hasegawa-Mima Linear | `dt` | ✅ Supported |
| Hasegawa-Mima Linear | `cg_atol` | ✅ Supported |
| 2D FEM | `dx` | ✅ Supported |
| 2D FEM | `cfl` | ✅ Supported |
Generate question templates for different physics domains and task types.
Click to view commands for all simulation types
# 1D Heat Transfer
python qs_gen/1D_heat_transfer.py
# 2D Steady Heat Transfer
python qs_gen/2D_heat_transfer.py
# Burgers 1D Equation with 2nd Order Roe Method
python qs_gen/1D_burgers.py
# Euler 1D Equations with 2nd Order MUSCL-Roe Method
python qs_gen/1D_euler.py
# 2D Navier-Stokes Channel Flow with SIMPLE Algorithm
python qs_gen/2D_ns.py
# 2D Navier-Stokes Transient Flow with Taichi Framework
python qs_gen/2D_ns_transient.py
# 1D EPOCH Particle-in-Cell Simulation
python qs_gen/1D_epoch.py
# 2D Material Point Method (MPM) Simulation
python qs_gen/2D_mpm.py
# 1D Diffusion-Reaction Equations with Newton Method
python qs_gen/1D_diff_react.py
# 2D Euler Equations with Advection-Projection Method
python qs_gen/2D_euler.py
# Hasegawa-Mima Nonlinear Equation with Pseudo-Spectral Method
python qs_gen/hasegawa_mima_nonlinear.py
# Hasegawa-Mima Linear Equation with RK4 and CG Solver
python qs_gen/hasegawa_mima_linear.py
# 2D Finite Element Method with Implicit Newton Solver
python qs_gen/2D_fem.py
Output: Generated questions are saved to data/{simulation}/{task}/{precision_level}/question.json
Create complete benchmark datasets with problem instances and ground truth solutions.
Click to view commands for all simulation types
# 1D Heat Transfer
python dataset_gen/oneD_heat_transfer.py
# 2D Steady Heat Transfer
python dataset_gen/twoD_heat_transfer.py
# Burgers 1D Equation with 2nd Order Roe Method
python dataset_gen/oneD_burgers.py
# Euler 1D Equations with 2nd Order MUSCL-Roe Method
python dataset_gen/oneD_euler.py
# 2D Navier-Stokes Channel Flow with SIMPLE Algorithm
python dataset_gen/twoD_ns.py
# 2D Navier-Stokes Transient Flow with Taichi Framework
python dataset_gen/twoD_ns_transient.py
# 1D EPOCH Particle-in-Cell Simulation
python dataset_gen/oneD_epoch.py
# 2D Material Point Method (MPM) Simulation
python dataset_gen/twoD_mpm.py
# 1D Diffusion-Reaction Equations with Newton Method
python dataset_gen/oneD_diff_react.py
# 2D Euler Equations with Advection-Projection Method
python dataset_gen/twoD_euler.py
# Hasegawa-Mima Nonlinear Equation with Pseudo-Spectral Method
python dataset_gen/hasegawa_mima_nonlinear.py
# Hasegawa-Mima Linear Equation with RK4 and CG Solver
python dataset_gen/hasegawa_mima_linear.py
# 2D Finite Element Method with Implicit Newton Solver
python dataset_gen/twoD_fem.py
Output: Datasets are saved to the data/{simulation}/{task}/{precision_level}/human_write/ directory
Configure API keys in your .env file:
# OpenAI API key
OPENAI_API_KEY=your_openai_api_key
# Google API key for Gemini
GOOGLE_API_KEY=your_google_api_key
# AWS API credentials for Bedrock
AWS_ACCESS_KEY_ID=your_aws_access_key
AWS_SECRET_ACCESS_KEY=your_aws_secret_key
AWS_REGION_NAME=your_aws_region_name
Two configuration methods are available:
Copy the example file and create your own configuration:
cp configs/custom_models.json.example configs/custom_models.json
Then edit configs/custom_models.json to manage multiple custom models:
{
"custom_models": {
"qwen3_8b": {
"custom_code": "/path/to/custom_model/custom_inference.py",
"model_path": "/data/models/Qwen3-8B",
"custom_class": "Qwen3"
},
"llama3_7b": {
"custom_code": "/path/to/custom_model/custom_inference.py",
"model_path": "/data/models/Llama3-7B",
"custom_class": "Llama3"
}
}
}
Set in your .env file:
custom_code="/path/to/custom_model/custom_inference.py"
model_path="/path/to/your/custom_model"
custom_class="CustomModel"
📋 List Available Models:
python scripts/list_custom_models.py
For a detailed implementation guide, see the Custom Model Integration Guide.
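To give a feel for what a `custom_class` might look like, here is a minimal, hypothetical sketch of the kind of class `custom_code` could point at. The loader's actual expected interface is covered in the Custom Model Integration Guide; the constructor signature and `generate` method below are assumptions for illustration only.

```python
# Hypothetical custom model class of the kind referenced by
# configs/custom_models.json ("custom_class"). The real interface expected
# by SimulCost's loader may differ; see the Custom Model Integration Guide.
class EchoModel:
    def __init__(self, model_path: str):
        # model_path would point at a local checkpoint, e.g. /data/models/Qwen3-8B
        self.model_path = model_path

    def generate(self, prompt: str) -> str:
        # A real implementation would run the model; this stub just echoes.
        return f"[{self.model_path}] {prompt}"

model = EchoModel("/data/models/Qwen3-8B")
print(model.generate("Propose a CFL number for the 1D heat task."))
```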
By default, simulation results are saved to sim_res/ in the current working directory. You can configure a custom absolute path for storing simulation results.
Add the following to your .env file:
# Optional: Specify custom directory for simulation results
# If not set, results will be saved to ./sim_res/ (relative path)
SIM_RES_BASE_DIR=/path/to/your/custom/directory
Use an absolute path:
SIM_RES_BASE_DIR=/data/leo_work_new
Results will be saved to: /data/leo_work_new/sim_res/...
Use the default relative path:
# Comment out or remove the SIM_RES_BASE_DIR line
# SIM_RES_BASE_DIR=/data/leo_work_new
Results will be saved to: ./sim_res/... (relative to the working directory)
Docker users: If you run via Docker, you don't need SIM_RES_BASE_DIR. Simply mount your desired host directory to the container's sim_res/ path when running docker run, e.g. -v /your/host/dir:/app/sim_res.
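The resolution behavior described above amounts to something like the following sketch (an assumption about the implementation, not the repository's actual code):

```python
import os
from pathlib import Path

def resolve_sim_res_dir() -> Path:
    """Return the simulation-results directory: an absolute base taken
    from SIM_RES_BASE_DIR if set, else ./sim_res relative to the working
    directory. (Sketch only; the repo's actual resolution may differ.)"""
    base = os.environ.get("SIM_RES_BASE_DIR")
    return Path(base) / "sim_res" if base else Path("sim_res")

os.environ["SIM_RES_BASE_DIR"] = "/data/leo_work_new"
print(resolve_sim_res_dir())
```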
⚡ Pre-cached results (optional)
To help you skip long simulation runtimes, pre-computed simulation results are available on Hugging Face.
- Baseline cache (≈22.5 GB): LeoLai689/SimulCost-baseline-sim_res
- Contains pre-computed results for all baseline experiments, and is much smaller to download.
| Simulation File | Size | Simulation Type |
|---|---|---|
| `burgers_1d.zip` | 675 MB | 1D Burgers Equation |
| `diff_react_1d.zip` | 293 MB | 1D Diffusion-Reaction |
| `epoch.zip` | 3.65 GB | 1D EPOCH PIC |
| `euler_1d.zip` | 3.86 GB | 1D Euler Equations |
| `euler_2d.zip` | 1.54 GB | 2D Euler Gas Dynamics |
| `fem_2d.zip` | 146 MB | 2D Finite Element Method |
| `hasegawa_mima_linear.zip` | 1.45 GB | Hasegawa-Mima Linear |
| `hasegawa_mima_nonlinear.zip` | 300 MB | Hasegawa-Mima Nonlinear |
| `heat_1d.zip` | 182 MB | 1D Heat Transfer |
| `heat_2d.zip` | 2.66 GB | 2D Steady Heat Transfer |
| `ns_channel_2d.zip` | 96.7 MB | 2D Navier-Stokes Channel |
| `ns_transient_2d.zip` | 6.56 GB | 2D Navier-Stokes Transient |
| `unstruct_mpm.zip` | 1.13 GB | 2D Material Point Method |
- Full cache (≈328 GB): LeoLai689/SimulCost-full-sim_res
- This is the complete simulation cache used in this project.
| Simulation File | Size | Simulation Type |
|---|---|---|
| `burgers_1d.tar.gz` | 1.69 GB | 1D Burgers Equation |
| `diff_react_1d.tar.gz` | 1.35 GB | 1D Diffusion-Reaction |
| `epoch.tar.gz` | 3.65 GB | 1D EPOCH PIC |
| `euler_1d.tar.gz` | 14.1 GB | 1D Euler Equations |
| `euler_2d.tar.gz` | 5.64 GB | 2D Euler Gas Dynamics |
| `fem2d.tar.gz` | 877 MB | 2D Finite Element Method |
| `hasegawa_mima_linear.tar.gz` | 10.8 GB | Hasegawa-Mima Linear |
| `hasegawa_mima_nonlinear.tar.gz` | 1.32 GB | Hasegawa-Mima Nonlinear |
| `heat_1d.tar.gz` | 3.46 GB | 1D Heat Transfer |
| `heat_steady_2d.tar.gz` | 5.19 GB | 2D Steady Heat Transfer |
| `ns_transient_2d.tar.gz` | 58.8 GB | 2D Navier-Stokes Transient |
| `unstruct_mpm.tar.gz` | 221 GB | 2D Material Point Method |
Download the files you need from the dataset page and place/extract them into your simulation results directory
(e.g., ./sim_res/ or your SIM_RES_BASE_DIR).
If you're getting started, begin with the "Fast" group to avoid wasting time on very slow simulations.
- Fast (recommended to start): `burgers_1d`, `diff_react_1d`, `heat_1d`, `fem_2d`
- Moderate: `euler_1d`, `euler_2d`, `ns_transient_2d`, `unstruct_mpm`, `hasegawa_mima_linear`, `hasegawa_mima_nonlinear`, `epoch`
- Slow: `heat_2d`, `ns_channel_2d` (can be very slow ⚠️)
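For example, the fast-group baseline zips can be fetched with huggingface_hub. The helper below just builds file names from the table above; the download call is left commented out because it needs network access, and the assumption that the dataset exposes one zip per simulation comes from the listed file names.

```python
FAST_GROUP = ["burgers_1d", "diff_react_1d", "heat_1d", "fem_2d"]

def baseline_files(sims):
    # The baseline cache stores one zip per simulation (names from the table above).
    return [f"{s}.zip" for s in sims]

# Requires `pip install huggingface_hub` and network access:
# from huggingface_hub import hf_hub_download
# for fname in baseline_files(FAST_GROUP):
#     hf_hub_download(repo_id="LeoLai689/SimulCost-baseline-sim_res",
#                     filename=fname, repo_type="dataset", local_dir="sim_res")
```

After downloading, extract each zip into your simulation results directory as described above.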
Execute LLM inference on benchmark datasets to generate predictions.
# Commercial API Models
python inference/langchain_LLM.py -p openai -m gpt-5-2025-08-07 -d heat_1d -t cfl -l medium -z
# Single Custom Model
python inference/langchain_LLM.py -p custom_model -m qwen3_8b -d heat_1d -t cfl -l medium -z
# Multiple Custom Models (use batch scripts)
bash scripts/inference_eval/inference_eval_heat_1d.sh
Parameters:
- `-p`: LLM provider (`openai`, `gemini`, `bedrock`, `custom_model`)
- `-m`: Model name/identifier
- `-d`: Dataset name
- `-t`: Problem task type
- `-l`: Precision level
- `-z`: Enable zero-shot mode
- `--list-combinations`: Show all valid dataset-task combinations and exit
Outputs:
- Results: `results_model_attempt/{dataset}/{precision_level}/{task}/`
- Logs: `log_model_tool_call/{dataset}/{precision_level}/{task}/`
💡 Tip: Use --list-combinations to see all available dataset-task combinations:
python inference/langchain_LLM.py --list-combinations
OpenAI Reasoning Models: Setting reasoning_effort
For OpenAI reasoning models (e.g., GPT-5), you can control the reasoning effort by appending -re-{effort} to the model name:
- Syntax: `{model_name}-re-{effort}`
- Valid effort levels: `minimal`, `low`, `medium`, `high`
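The suffix convention can be split off a model name along these lines (a sketch; the repository's own parsing may differ):

```python
import re

def split_reasoning_effort(name: str):
    """Split a '-re-{effort}' suffix off a model name, returning
    (base_model_name, effort_or_None). Sketch of the convention
    described above, not the repo's actual parser."""
    m = re.fullmatch(r"(.+)-re-(minimal|low|medium|high)", name)
    return (m.group(1), m.group(2)) if m else (name, None)

print(split_reasoning_effort("gpt-5-2025-08-07-re-minimal"))
# ('gpt-5-2025-08-07', 'minimal')
```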
Example:
# Use GPT-5 with minimal reasoning effort
python inference/langchain_LLM.py -p openai -m gpt-5-2025-08-07-re-minimal -d heat_1d -t cfl -l medium -z
The inference system includes automatic progress tracking and resume capabilities to handle long-running experiments gracefully.
- Automatic Progress Saving: Progress is saved after each completed sample
- Resume from Interruption: Continue from where you left off using the `--resume` flag

python inference/langchain_LLM.py -p custom_model -m qwen3_8b -d heat_1d -t cfl -l medium --resume

- Location: `log_model_tool_call/{dataset}/{precision_level}/{task}/{flag}_{model_name}_progress.json`
- Content: Contains completed results and intermediate data for resuming
Compute performance metrics and accuracy scores for model predictions.
# Example (Heat 1D)
python evaluation/heat_1d/eval.py -m anthropic.claude-3-5-haiku-20241022-v1:0 -d heat_1d -t cfl -l medium -z
Parameters:
- `-m`: Model name/identifier
- `-d`: Dataset name
- `-t`: Problem task type
- `-l`: Precision level (for heat_1d, heat_2d, burgers_1d, and euler_1d: low, medium, high; default: medium)
- `-z`: Enable zero-shot mode
Outputs:
- JSON results: `eval_results/{dataset}/{task}/{precision_level}/`
- Parquet dataframes: `eval_results/{dataset}/dataframes/`
📊 For advanced data analysis workflows and parquet usage, see the Evaluation Documentation.
Generate summary tables and comparative analysis across different models and tasks.
python evaluation/tabulate.py -d heat_1d
python evaluation/tabulate.py -d heat_2d
python evaluation/tabulate.py -d burgers_1d
python evaluation/tabulate.py -d euler_1d
python evaluation/tabulate.py -d ns_2d
python evaluation/tabulate.py -d ns_transient_2d
python evaluation/tabulate.py -d epoch_1d
python evaluation/tabulate.py -d mpm_2d
Parameters:
- `-d`: Dataset name to tabulate results for
Output: Summary tables are generated in Excel/CSV format
After generating task-level results with tabulate.py, you can create simulation-level aggregated summaries that combine all tasks within a simulation (dataset).
# Aggregate task-level results to simulation-level summaries
python evaluation/simul_sum.py -d heat_1d
python evaluation/simul_sum.py -d heat_2d
python evaluation/simul_sum.py -d burgers_1d
python evaluation/simul_sum.py -d euler_1d
python evaluation/simul_sum.py -d ns_2d
python evaluation/simul_sum.py -d ns_transient_2d
python evaluation/simul_sum.py -d epoch_1d
python evaluation/simul_sum.py -d mpm_2d
Parameters:
- `-d`: Dataset name to aggregate results for
Output:
- CSV: `eval_results/{dataset}/{dataset}_sum.csv` (combined results with a precision_level column)
- Excel: `eval_results/{dataset}/{dataset}_sum.xlsx` (clean, professional formatting with visual separators)
Note: Run tabulate.py first to generate the required task-level CSV files before running simul_sum.py.
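Conceptually, the tabulate-then-aggregate step resembles the following sketch: per-task CSV rows are concatenated and tagged with their task name. Column names and the combination logic here are assumptions for illustration; the real simul_sum.py may differ.

```python
import csv
import io

def combine_task_csvs(task_csvs):
    """Concatenate per-task CSV rows, tagging each row with its task name
    (a sketch of the tabulate -> simul_sum aggregation; illustrative only)."""
    rows = []
    for task, text in task_csvs.items():
        for row in csv.DictReader(io.StringIO(text)):
            rows.append({"task": task, **row})
    return rows

combined = combine_task_csvs({
    "cfl": "model,accuracy\ngpt-5,0.9\n",
    "n_space": "model,accuracy\ngpt-5,0.8\n",
})
print(len(combined))  # 2
```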
The scripts/ directory contains automated scripts for streamlined execution of common workflows including inference + evaluation pipelines.
For detailed usage instructions and examples, see the Script Usage Guide.