📄 Paper | 🤗 Dataset | 🛠️ Tools | 🌐 Website | 💾 Cache (Baseline) | 💾 Cache (Full)
SimulCost is a cost-aware benchmark and toolkit for evaluating how well LLM agents tune simulation parameters under realistic computational budgets. Unlike prior evaluations that focus on correctness (and sometimes token cost) while implicitly treating tool usage as "free," SimulCost explicitly measures both: (1) whether a proposed configuration meets an accuracy target and (2) how much simulation compute it consumes.
The benchmark covers 12 physics simulators across fluid dynamics, solid mechanics, and plasma physics, with single-round (one-shot initial guess) and multi-round (trial-and-error with feedback) settings. Tool cost is defined in a platform-independent way (analytical cost models / FLOPs) to make results reproducible and comparable.
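As a toy illustration of such an analytical cost model (the formula and numbers below are hypothetical, not the benchmark's actual cost definition): for an explicit 1D scheme, total work scales with grid points times time steps, and the CFL number ties the time step to the grid spacing, so refining the grid raises the cost superlinearly.

```python
def sim_cost(n_space: int, cfl: float, t_end: float = 1.0, domain: float = 1.0) -> float:
    """Toy analytical cost model for an explicit 1D scheme (illustrative only).

    Cost is proportional to (grid points) x (time steps); the CFL number
    fixes the time step as dt = cfl * dx, so halving dx roughly
    quadruples the total work.
    """
    dx = domain / n_space
    dt = cfl * dx                     # explicit stability constraint
    n_steps = round(t_end / dt)       # time steps needed to reach t_end
    return float(n_space * n_steps)   # FLOPs up to a constant factor

# A finer grid is more accurate but costs more: the agent must trade these off.
coarse = sim_cost(n_space=100, cfl=0.5)   # 100 * 200 = 20,000
fine = sim_cost(n_space=400, cfl=0.5)     # 400 * 800 = 320,000
```

This is exactly the kind of trade-off the agent faces in every task: pick parameters accurate enough to pass the target while paying as little simulation compute as possible.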
- Environment Setup
- Tasks and Zero-Shot Support
- Generate Questions
- Generate Benchmark Datasets
- Configure Model Providers
- Customize Simulation Results Directory
- Pre-cached Simulation Results
- Run Inference
- Resume Functionality
- Evaluate Performance
- Tabulate Results
- Script Usage Guide
To clone the repository and initialize the submodule, use the following commands:
# Clone repository with submodule initialization
git clone --recursive https://github.com/Rose-STL-Lab/SimulCost-Bench.git
# If you've already cloned the repository without --recursive, run the following to initialize the submodule:
cd SimulCost-Bench
git submodule update --init --recursive
Use the --recursive flag when cloning to ensure all dependencies are initialized properly.
If you encounter submodule issues after cloning, try these commands:
# Sync submodule URLs (if they've changed)
git submodule sync --recursive
# Re-initialize all submodules
git submodule update --init --recursive
# Create Conda environment
conda env create -f environment.yml
conda activate simulcost
# Install dependencies with Poetry
poetry install --no-root
Note: To run 1D EPOCH PIC simulations, see the EPOCH Setup Guide for additional configuration requirements.
Note: To run 2D Euler gas dynamics simulations, see the Euler 2D Setup Guide for additional configuration requirements.
Note: To run 2D FEM simulations with FastIPC solver, see the FEM 2D Setup Guide for compilation and configuration requirements.
A pre-built Docker image is available with all dependencies and solvers compiled. This lets you skip the Conda / Poetry / solver setup entirely.
1. Clone the repository:
git clone --recursive https://github.com/Rose-STL-Lab/SimulCost-Bench.git
cd SimulCost-Bench
2. Pull the image:
docker pull ghcr.io/leo-lsc/simulcost-bench:latest
docker tag ghcr.io/leo-lsc/simulcost-bench:latest simulcost-bench
Or build locally with docker build -t simulcost-bench .
3. Run the container (Dev / hot-reload):
This mode mounts frequently-edited code directories from the host into the container so changes take effect immediately, without rebuilding the image.
Do NOT mount costsci_tools/ unless you also build the solvers on the host; otherwise you may overwrite the compiled binaries shipped in the image.
docker run --rm -it \
--env-file .env \
-v $(pwd)/scripts:/app/scripts \
-v $(pwd)/inference:/app/inference \
-v $(pwd)/evaluation:/app/evaluation \
-v $(pwd)/configs:/app/configs \
-v $(pwd)/custom_model:/app/custom_model \
-v $(pwd)/sim_res:/app/sim_res \
-v $(pwd)/eval_results:/app/eval_results \
-v $(pwd)/results_model_attempt:/app/results_model_attempt \
-v $(pwd)/log_model_tool_call:/app/log_model_tool_call \
-v $(pwd)/data:/app/data \
simulcost-bench
--env-file .env passes your API keys (e.g. OpenAI / AWS Bedrock) into the container. The -v mounts persist results to the host; without them, all output is lost when the container exits.
Note: The FEM 2D solver (FastIPC) is compiled with -mavx -mavx2 -mfma. Your CPU (both build and run host) must support AVX/AVX2/FMA instructions.
Click to expand the full task support table
The table below summarizes the available tasks for each simulation type and indicates whether each task supports zero-shot inference.
| Simulation Type | Task Type | Iterative & Zero-Shot |
|---|---|---|
| 1D Heat Transfer | `cfl` | ✅ Supported |
| 1D Heat Transfer | `n_space` | ✅ Supported |
| 2D Steady Heat Transfer | `dx` | ✅ Supported |
| 2D Steady Heat Transfer | `error_threshold` | ✅ Supported |
| 2D Steady Heat Transfer | `relax` | ❌ Only Zero-Shot |
| 2D Steady Heat Transfer | `t_init` | ❌ Only Zero-Shot |
| 1D Burgers | `cfl` | ✅ Supported |
| 1D Burgers | `k` | ✅ Supported (Compositional) |
| 1D Burgers | `beta` | ✅ Supported (Compositional) |
| 1D Burgers | `n_space` | ✅ Supported |
| 1D Euler | `cfl` | ✅ Supported |
| 1D Euler | `beta` | ✅ Supported (Compositional) |
| 1D Euler | `k` | ✅ Supported (Compositional) |
| 1D Euler | `n_space` | ✅ Supported |
| 2D Navier-Stokes Channel | `mesh_x` | ✅ Supported |
| 2D Navier-Stokes Channel | `mesh_y` | ✅ Supported |
| 2D Navier-Stokes Channel | `omega_u` | ❌ Only Zero-Shot |
| 2D Navier-Stokes Channel | `omega_v` | ❌ Only Zero-Shot |
| 2D Navier-Stokes Channel | `omega_p` | ❌ Only Zero-Shot |
| 2D Navier-Stokes Channel | `diff_u_threshold` | ❌ Only Zero-Shot |
| 2D Navier-Stokes Channel | `diff_v_threshold` | ❌ Only Zero-Shot |
| 2D Navier-Stokes Channel | `res_iter_v_threshold` | ❌ Only Zero-Shot |
| 2D Navier-Stokes Transient | `resolution` | ✅ Supported |
| 2D Navier-Stokes Transient | `cfl` | ✅ Supported |
| 2D Navier-Stokes Transient | `relaxation_factor` | ❌ Only Zero-Shot |
| 2D Navier-Stokes Transient | `residual_threshold` | ❌ Only Zero-Shot |
| 1D EPOCH PIC | `nx` | ✅ Supported |
| 1D EPOCH PIC | `npart` | ✅ Supported |
| 1D EPOCH PIC | `dt_multiplier` | ✅ Supported (Compositional) |
| 1D EPOCH PIC | `field_order` | ✅ Supported (Compositional) |
| 1D EPOCH PIC | `particle_order` | ✅ Supported (Compositional) |
| 2D MPM | `nx` | ✅ Supported |
| 2D MPM | `n_part` | ✅ Supported |
| 2D MPM | `cfl` | ✅ Supported |
| 1D Diffusion-Reaction | `cfl` | ✅ Supported |
| 1D Diffusion-Reaction | `n_space` | ✅ Supported |
| 1D Diffusion-Reaction | `tol` | ✅ Supported |
| 2D Euler Gas Dynamics | `n_grid_x` | ✅ Supported |
| 2D Euler Gas Dynamics | `cfl` | ✅ Supported |
| 2D Euler Gas Dynamics | `cg_tolerance` | ✅ Supported |
| Hasegawa-Mima Nonlinear | `N` | ✅ Supported |
| Hasegawa-Mima Nonlinear | `dt` | ✅ Supported |
| Hasegawa-Mima Linear | `N` | ✅ Supported |
| Hasegawa-Mima Linear | `dt` | ✅ Supported |
| Hasegawa-Mima Linear | `cg_atol` | ✅ Supported |
| 2D FEM | `dx` | ✅ Supported |
| 2D FEM | `cfl` | ✅ Supported |
Generate question templates for different physics domains and task types.
Click to view commands for all simulation types
# 1D Heat Transfer
python qs_gen/1D_heat_transfer.py
# 2D Steady Heat Transfer
python qs_gen/2D_heat_transfer.py
# Burgers 1D Equation with 2nd Order Roe Method
python qs_gen/1D_burgers.py
# Euler 1D Equations with 2nd Order MUSCL-Roe Method
python qs_gen/1D_euler.py
# 2D Navier-Stokes Channel Flow with SIMPLE Algorithm
python qs_gen/2D_ns.py
# 2D Navier-Stokes Transient Flow with Taichi Framework
python qs_gen/2D_ns_transient.py
# 1D EPOCH Particle-in-Cell Simulation
python qs_gen/1D_epoch.py
# 2D Material Point Method (MPM) Simulation
python qs_gen/2D_mpm.py
# 1D Diffusion-Reaction Equations with Newton Method
python qs_gen/1D_diff_react.py
# 2D Euler Equations with Advection-Projection Method
python qs_gen/2D_euler.py
# Hasegawa-Mima Nonlinear Equation with Pseudo-Spectral Method
python qs_gen/hasegawa_mima_nonlinear.py
# Hasegawa-Mima Linear Equation with RK4 and CG Solver
python qs_gen/hasegawa_mima_linear.py
# 2D Finite Element Method with Implicit Newton Solver
python qs_gen/2D_fem.py
Output: Generated questions are saved to data/{simulation}/{task}/{precision_level}/question.json
Create complete benchmark datasets with problem instances and ground truth solutions.
Click to view commands for all simulation types
# 1D Heat Transfer
python dataset_gen/oneD_heat_transfer.py
# 2D Steady Heat Transfer
python dataset_gen/twoD_heat_transfer.py
# Burgers 1D Equation with 2nd Order Roe Method
python dataset_gen/oneD_burgers.py
# Euler 1D Equations with 2nd Order MUSCL-Roe Method
python dataset_gen/oneD_euler.py
# 2D Navier-Stokes Channel Flow with SIMPLE Algorithm
python dataset_gen/twoD_ns.py
# 2D Navier-Stokes Transient Flow with Taichi Framework
python dataset_gen/twoD_ns_transient.py
# 1D EPOCH Particle-in-Cell Simulation
python dataset_gen/oneD_epoch.py
# 2D Material Point Method (MPM) Simulation
python dataset_gen/twoD_mpm.py
# 1D Diffusion-Reaction Equations with Newton Method
python dataset_gen/oneD_diff_react.py
# 2D Euler Equations with Advection-Projection Method
python dataset_gen/twoD_euler.py
# Hasegawa-Mima Nonlinear Equation with Pseudo-Spectral Method
python dataset_gen/hasegawa_mima_nonlinear.py
# Hasegawa-Mima Linear Equation with RK4 and CG Solver
python dataset_gen/hasegawa_mima_linear.py
# 2D Finite Element Method with Implicit Newton Solver
python dataset_gen/twoD_fem.py
Output: Datasets are saved to the data/{simulation}/{task}/{precision_level}/human_write/ directory
Configure API keys in your .env file:
# OpenAI API key
OPENAI_API_KEY=your_openai_api_key
# Google API key for Gemini
GOOGLE_API_KEY=your_google_api_key
# AWS API credentials for Bedrock
AWS_ACCESS_KEY_ID=your_aws_access_key
AWS_SECRET_ACCESS_KEY=your_aws_secret_key
AWS_REGION_NAME=your_aws_region_name
Two configuration methods are available:
Copy the example file and create your own configuration:
cp configs/custom_models.json.example configs/custom_models.json
Then edit configs/custom_models.json to manage multiple custom models:
{
"custom_models": {
"qwen3_8b": {
"custom_code": "/path/to/custom_model/custom_inference.py",
"model_path": "/data/models/Qwen3-8B",
"custom_class": "Qwen3"
},
"llama3_7b": {
"custom_code": "/path/to/custom_model/custom_inference.py",
"model_path": "/data/models/Llama3-7B",
"custom_class": "Llama3"
}
}
}
Set in your .env file:
custom_code="/path/to/custom_model/custom_inference.py"
model_path="/path/to/your/custom_model"
custom_class="CustomModel"
📋 List Available Models:
python scripts/list_custom_models.py
For a detailed implementation guide, see the Custom Model Integration Guide.
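To give a feel for what a `custom_class` might look like, here is a minimal, hypothetical sketch of the kind of class `custom_code` could point at. The loader's actual expected interface is covered in the Custom Model Integration Guide; the constructor signature and `generate` method below are assumptions for illustration only.

```python
# Hypothetical custom model class of the kind referenced by
# configs/custom_models.json ("custom_class"). The real interface expected
# by SimulCost's loader may differ; see the Custom Model Integration Guide.
class EchoModel:
    def __init__(self, model_path: str):
        # model_path would point at a local checkpoint, e.g. /data/models/Qwen3-8B
        self.model_path = model_path

    def generate(self, prompt: str) -> str:
        # A real implementation would run the model; this stub just echoes.
        return f"[{self.model_path}] {prompt}"

model = EchoModel("/data/models/Qwen3-8B")
print(model.generate("Propose a CFL number for the 1D heat task."))
```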
By default, simulation results are saved to sim_res/ in the current working directory. You can configure a custom absolute path for storing simulation results.
Add the following to your .env file:
# Optional: Specify custom directory for simulation results
# If not set, results will be saved to ./sim_res/ (relative path)
SIM_RES_BASE_DIR=/path/to/your/custom/directory
Use an absolute path:
SIM_RES_BASE_DIR=/data/leo_work_new
Results will be saved to: /data/leo_work_new/sim_res/...
Use the default relative path:
# Comment out or remove the SIM_RES_BASE_DIR line
# SIM_RES_BASE_DIR=/data/leo_work_new
Results will be saved to: ./sim_res/... (relative to the working directory)
Docker users: If you run via Docker, you don't need SIM_RES_BASE_DIR. Simply mount your desired host directory to the container's sim_res/ path when running docker run, e.g. -v /your/host/dir:/app/sim_res.
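The resolution behavior described above amounts to something like the following sketch (an assumption about the implementation, not the repository's actual code):

```python
import os
from pathlib import Path

def resolve_sim_res_dir() -> Path:
    """Return the simulation-results directory: an absolute base taken
    from SIM_RES_BASE_DIR if set, else ./sim_res relative to the working
    directory. (Sketch only; the repo's actual resolution may differ.)"""
    base = os.environ.get("SIM_RES_BASE_DIR")
    return Path(base) / "sim_res" if base else Path("sim_res")

os.environ["SIM_RES_BASE_DIR"] = "/data/leo_work_new"
print(resolve_sim_res_dir())
```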
⚡ Pre-cached results (optional)
To help you skip long simulation runtimes, pre-computed simulation results are available on Hugging Face.
- Baseline cache (≈22.5 GB): LeoLai689/SimulCost-baseline-sim_res
- Contains pre-computed results for all baseline experiments, and is much smaller to download.
| Simulation File | Size | Simulation Type |
|---|---|---|
| `burgers_1d.zip` | 675 MB | 1D Burgers Equation |
| `diff_react_1d.zip` | 293 MB | 1D Diffusion-Reaction |
| `epoch.zip` | 3.65 GB | 1D EPOCH PIC |
| `euler_1d.zip` | 3.86 GB | 1D Euler Equations |
| `euler_2d.zip` | 1.54 GB | 2D Euler Gas Dynamics |
| `fem_2d.zip` | 146 MB | 2D Finite Element Method |
| `hasegawa_mima_linear.zip` | 1.45 GB | Hasegawa-Mima Linear |
| `hasegawa_mima_nonlinear.zip` | 300 MB | Hasegawa-Mima Nonlinear |
| `heat_1d.zip` | 182 MB | 1D Heat Transfer |
| `heat_2d.zip` | 2.66 GB | 2D Steady Heat Transfer |
| `ns_channel_2d.zip` | 96.7 MB | 2D Navier-Stokes Channel |
| `ns_transient_2d.zip` | 6.56 GB | 2D Navier-Stokes Transient |
| `unstruct_mpm.zip` | 1.13 GB | 2D Material Point Method |
- Full cache (≈328 GB): LeoLai689/SimulCost-full-sim_res
- This is the complete simulation cache used in this project.
| Simulation File | Size | Simulation Type |
|---|---|---|
| `burgers_1d.tar.gz` | 1.69 GB | 1D Burgers Equation |
| `diff_react_1d.tar.gz` | 1.35 GB | 1D Diffusion-Reaction |
| `epoch.tar.gz` | 3.65 GB | 1D EPOCH PIC |
| `euler_1d.tar.gz` | 14.1 GB | 1D Euler Equations |
| `euler_2d.tar.gz` | 5.64 GB | 2D Euler Gas Dynamics |
| `fem2d.tar.gz` | 877 MB | 2D Finite Element Method |
| `hasegawa_mima_linear.tar.gz` | 10.8 GB | Hasegawa-Mima Linear |
| `hasegawa_mima_nonlinear.tar.gz` | 1.32 GB | Hasegawa-Mima Nonlinear |
| `heat_1d.tar.gz` | 3.46 GB | 1D Heat Transfer |
| `heat_steady_2d.tar.gz` | 5.19 GB | 2D Steady Heat Transfer |
| `ns_transient_2d.tar.gz` | 58.8 GB | 2D Navier-Stokes Transient |
| `unstruct_mpm.tar.gz` | 221 GB | 2D Material Point Method |
Download the files you need from the dataset page and place/extract them into your simulation results directory
(e.g., ./sim_res/ or your SIM_RES_BASE_DIR).
If you're getting started, begin with the "Fast" group to avoid wasting time on very slow simulations.
- Fast (recommended to start): `burgers_1d`, `diff_react_1d`, `heat_1d`, `fem_2d`
- Moderate: `euler_1d`, `euler_2d`, `ns_transient_2d`, `unstruct_mpm`, `hasegawa_mima_linear`, `hasegawa_mima_nonlinear`, `epoch`
- Slow: `heat_2d`, `ns_channel_2d` (can be very slow ⚠️)
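For example, the fast-group baseline zips can be fetched with huggingface_hub. The helper below just builds file names from the table above; the download call is left commented out because it needs network access, and the assumption that the dataset exposes one zip per simulation comes from the listed file names.

```python
FAST_GROUP = ["burgers_1d", "diff_react_1d", "heat_1d", "fem_2d"]

def baseline_files(sims):
    # The baseline cache stores one zip per simulation (names from the table above).
    return [f"{s}.zip" for s in sims]

# Requires `pip install huggingface_hub` and network access:
# from huggingface_hub import hf_hub_download
# for fname in baseline_files(FAST_GROUP):
#     hf_hub_download(repo_id="LeoLai689/SimulCost-baseline-sim_res",
#                     filename=fname, repo_type="dataset", local_dir="sim_res")
```

After downloading, extract each zip into your simulation results directory as described above.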
Execute LLM inference on benchmark datasets to generate predictions.
# Commercial API Models
python inference/langchain_LLM.py -p openai -m gpt-5-2025-08-07 -d heat_1d -t cfl -l medium -z
# Single Custom Model
python inference/langchain_LLM.py -p custom_model -m qwen3_8b -d heat_1d -t cfl -l medium -z
# Multiple Custom Models (use batch scripts)
bash scripts/inference_eval/inference_eval_heat_1d.sh
Parameters:
- `-p`: LLM provider (`openai`, `gemini`, `bedrock`, `custom_model`)
- `-m`: Model name/identifier
- `-d`: Dataset name
- `-t`: Problem task type
- `-l`: Precision level
- `-z`: Enable zero-shot mode
- `--list-combinations`: Show all valid dataset-task combinations and exit
Outputs:
- Results: `results_model_attempt/{dataset}/{precision_level}/{task}/`
- Logs: `log_model_tool_call/{dataset}/{precision_level}/{task}/`
💡 Tip: Use --list-combinations to see all available dataset-task combinations:
python inference/langchain_LLM.py --list-combinations
OpenAI Reasoning Models: Setting reasoning_effort
For OpenAI reasoning models (e.g., GPT-5), you can control the reasoning effort by appending -re-{effort} to the model name:
- Syntax: `{model_name}-re-{effort}`
- Valid effort levels: `minimal`, `low`, `medium`, `high`
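The suffix convention can be split off a model name along these lines (a sketch; the repository's own parsing may differ):

```python
import re

def split_reasoning_effort(name: str):
    """Split a '-re-{effort}' suffix off a model name, returning
    (base_model_name, effort_or_None). Sketch of the convention
    described above, not the repo's actual parser."""
    m = re.fullmatch(r"(.+)-re-(minimal|low|medium|high)", name)
    return (m.group(1), m.group(2)) if m else (name, None)

print(split_reasoning_effort("gpt-5-2025-08-07-re-minimal"))
# ('gpt-5-2025-08-07', 'minimal')
```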
Example:
# Use GPT-5 with minimal reasoning effort
python inference/langchain_LLM.py -p openai -m gpt-5-2025-08-07-re-minimal -d heat_1d -t cfl -l medium -z
The inference system includes automatic progress tracking and resume capabilities to handle long-running experiments gracefully.
- Automatic Progress Saving: Progress is saved after each completed sample
- Resume from Interruption: Continue from where you left off using the `--resume` flag

python inference/langchain_LLM.py -p custom_model -m qwen3_8b -d heat_1d -t cfl -l medium --resume

- Location: `log_model_tool_call/{dataset}/{precision_level}/{task}/{flag}_{model_name}_progress.json`
- Content: Contains completed results and intermediate data for resuming
Compute performance metrics and accuracy scores for model predictions.
# Example (Heat 1D)
python evaluation/heat_1d/eval.py -m anthropic.claude-3-5-haiku-20241022-v1:0 -d heat_1d -t cfl -l medium -z
Parameters:
- `-m`: Model name/identifier
- `-d`: Dataset name
- `-t`: Problem task type
- `-l`: Precision level (for heat_1d, heat_2d, burgers_1d, and euler_1d: low, medium, high; default: medium)
- `-z`: Enable zero-shot mode
Outputs:
- JSON results: `eval_results/{dataset}/{task}/{precision_level}/`
- Parquet dataframes: `eval_results/{dataset}/dataframes/`
📊 For advanced data analysis workflows and parquet usage, see the Evaluation Documentation.
Generate summary tables and comparative analysis across different models and tasks.
python evaluation/tabulate.py -d heat_1d
python evaluation/tabulate.py -d heat_2d
python evaluation/tabulate.py -d burgers_1d
python evaluation/tabulate.py -d euler_1d
python evaluation/tabulate.py -d ns_2d
python evaluation/tabulate.py -d ns_transient_2d
python evaluation/tabulate.py -d epoch_1d
python evaluation/tabulate.py -d mpm_2d
Parameters:
- `-d`: Dataset name to tabulate results for
Output: Summary tables are generated in Excel/CSV format
After generating task-level results with tabulate.py, you can create simulation-level aggregated summaries that combine all tasks within a simulation (dataset).
# Aggregate task-level results to simulation-level summaries
python evaluation/simul_sum.py -d heat_1d
python evaluation/simul_sum.py -d heat_2d
python evaluation/simul_sum.py -d burgers_1d
python evaluation/simul_sum.py -d euler_1d
python evaluation/simul_sum.py -d ns_2d
python evaluation/simul_sum.py -d ns_transient_2d
python evaluation/simul_sum.py -d epoch_1d
python evaluation/simul_sum.py -d mpm_2d
Parameters:
- `-d`: Dataset name to aggregate results for
Output:
- CSV: `eval_results/{dataset}/{dataset}_sum.csv` (combined results with a precision_level column)
- Excel: `eval_results/{dataset}/{dataset}_sum.xlsx` (clean, professional formatting with visual separators)
Note: Run tabulate.py first to generate the required task-level CSV files before running simul_sum.py.
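Conceptually, the tabulate-then-aggregate step resembles the following sketch: per-task CSV rows are concatenated and tagged with their task name. Column names and the combination logic here are assumptions for illustration; the real simul_sum.py may differ.

```python
import csv
import io

def combine_task_csvs(task_csvs):
    """Concatenate per-task CSV rows, tagging each row with its task name
    (a sketch of the tabulate -> simul_sum aggregation; illustrative only)."""
    rows = []
    for task, text in task_csvs.items():
        for row in csv.DictReader(io.StringIO(text)):
            rows.append({"task": task, **row})
    return rows

combined = combine_task_csvs({
    "cfl": "model,accuracy\ngpt-5,0.9\n",
    "n_space": "model,accuracy\ngpt-5,0.8\n",
})
print(len(combined))  # 2
```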
The scripts/ directory contains automated scripts for streamlined execution of common workflows including inference + evaluation pipelines.
For detailed usage instructions and examples, see the Script Usage Guide.