SimulCost-Bench

📖 Paper | 🤗 Dataset | 🛠️ Tools | 🌐 Website | 💾 Cache (Baseline) | 💾 Cache (Full)

Introduction

SimulCost is a cost-aware benchmark and toolkit for evaluating how well LLM agents tune simulation parameters under realistic computational budgets. Unlike prior evaluations that focus on correctness (and sometimes token cost) while implicitly treating tool usage as "free," SimulCost explicitly measures both: (1) whether a proposed configuration meets an accuracy target and (2) how much simulation compute it consumes.

The benchmark covers 12 physics simulators across fluid dynamics, solid mechanics, and plasma physics, with single-round (one-shot initial guess) and multi-round (trial-and-error with feedback) settings. Tool cost is defined in a platform-independent way (analytical cost models / FLOPs) to make results reproducible and comparable.
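For intuition, a platform-independent cost of this kind can be written down analytically. The sketch below is purely illustrative (the function name, constants, and grid sizes are hypothetical, not the benchmark's actual accounting): for an explicit 1D stencil solver, cost scales with the number of cells times the number of time steps.

```python
def analytical_cost(n_cells: int, n_steps: int, flops_per_cell_update: int = 10) -> int:
    """Estimate simulation cost in FLOPs for an explicit 1D stencil solver.

    Hypothetical model: each time step updates every cell once, and each
    update costs a fixed number of floating-point operations.
    """
    return n_cells * n_steps * flops_per_cell_update


# Halving dx doubles the cell count; with a CFL-limited dt it also doubles
# the step count, so cost grows roughly 4x per refinement in this model.
coarse = analytical_cost(n_cells=100, n_steps=1_000)
fine = analytical_cost(n_cells=200, n_steps=2_000)
print(fine // coarse)  # -> 4
```

Because the model depends only on cell and step counts, two runs with the same configuration get the same cost regardless of hardware, which is what makes results comparable across platforms.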

SimulCost Overview

📦 Environment Setup

Clone repository

To clone the repository and initialize the submodule, use the following commands:

# Clone repository with submodule initialization
git clone --recursive https://github.com/Rose-STL-Lab/SimulCost-Bench.git

# If you've already cloned the repository without --recursive, run the following to initialize the submodule:
cd SimulCost-Bench
git submodule update --init --recursive

⚠️ Important: This repository uses deeply nested submodules (5 levels deep). Always use the --recursive flag when cloning to ensure all dependencies are initialized properly.

If you encounter submodule issues after cloning, try these commands:

# Sync submodule URLs (if they've changed)
git submodule sync --recursive

# Re-initialize all submodules
git submodule update --init --recursive

Create Conda environment

# Create Conda environment
conda env create -f environment.yml
conda activate simulcost

Install dependencies with Poetry

# Install dependencies with Poetry
poetry install --no-root

Note: To run 1D EPOCH PIC simulations, see the EPOCH Setup Guide for additional configuration requirements.

Note: To run 2D Euler gas dynamics simulations, see the Euler 2D Setup Guide for additional configuration requirements.

Note: To run 2D FEM simulations with FastIPC solver, see the FEM 2D Setup Guide for compilation and configuration requirements.

🐳 Docker Setup (Alternative)

A pre-built Docker image is available with all dependencies and solvers compiled. This lets you skip the Conda / Poetry / solver setup entirely.

1. Clone the repository:

git clone --recursive https://github.com/Rose-STL-Lab/SimulCost-Bench.git
cd SimulCost-Bench

2. Pull the image:

docker pull ghcr.io/leo-lsc/simulcost-bench:latest
docker tag ghcr.io/leo-lsc/simulcost-bench:latest simulcost-bench

Or build locally with docker build -t simulcost-bench .

3. Run the container (Dev / hot-reload): This mode mounts frequently-edited code directories from the host into the container so changes take effect immediately, without rebuilding the image.
Do NOT mount costsci_tools/ unless you also build the solvers on the host; otherwise you may overwrite the compiled binaries shipped in the image.

docker run --rm -it \
  --env-file .env \
  -v "$(pwd)/scripts:/app/scripts" \
  -v "$(pwd)/inference:/app/inference" \
  -v "$(pwd)/evaluation:/app/evaluation" \
  -v "$(pwd)/configs:/app/configs" \
  -v "$(pwd)/custom_model:/app/custom_model" \
  -v "$(pwd)/sim_res:/app/sim_res" \
  -v "$(pwd)/eval_results:/app/eval_results" \
  -v "$(pwd)/results_model_attempt:/app/results_model_attempt" \
  -v "$(pwd)/log_model_tool_call:/app/log_model_tool_call" \
  -v "$(pwd)/data:/app/data" \
  simulcost-bench
  • --env-file .env passes your API keys (e.g. OpenAI / AWS Bedrock) into the container.
  • -v mounts persist results to the host; without them, all output is lost when the container exits.

Note: The FEM 2D solver (FastIPC) is compiled with -mavx -mavx2 -mfma. Your CPU (both build and run host) must support AVX/AVX2/FMA instructions.

📋 Tasks and Zero-Shot Support


The table below summarizes the available tasks for each simulation type and indicates whether each task supports zero-shot inference.

Simulation Type Task Type Iterative & Zero-Shot
1D Heat Transfer cfl ✅ Supported
1D Heat Transfer n_space ✅ Supported
2D Steady Heat Transfer dx ✅ Supported
2D Steady Heat Transfer error_threshold ✅ Supported
2D Steady Heat Transfer relax ❌ Only Zero-Shot
2D Steady Heat Transfer t_init ❌ Only Zero-Shot
1D Burgers cfl ✅ Supported
1D Burgers k ✅ Supported (Compositional)
1D Burgers beta ✅ Supported (Compositional)
1D Burgers n_space ✅ Supported
1D Euler cfl ✅ Supported
1D Euler beta ✅ Supported (Compositional)
1D Euler k ✅ Supported (Compositional)
1D Euler n_space ✅ Supported
2D Navier-Stokes Channel mesh_x ✅ Supported
2D Navier-Stokes Channel mesh_y ✅ Supported
2D Navier-Stokes Channel omega_u ❌ Only Zero-Shot
2D Navier-Stokes Channel omega_v ❌ Only Zero-Shot
2D Navier-Stokes Channel omega_p ❌ Only Zero-Shot
2D Navier-Stokes Channel diff_u_threshold ❌ Only Zero-Shot
2D Navier-Stokes Channel diff_v_threshold ❌ Only Zero-Shot
2D Navier-Stokes Channel res_iter_v_threshold ❌ Only Zero-Shot
2D Navier-Stokes Transient resolution ✅ Supported
2D Navier-Stokes Transient cfl ✅ Supported
2D Navier-Stokes Transient relaxation_factor ❌ Only Zero-Shot
2D Navier-Stokes Transient residual_threshold ❌ Only Zero-Shot
1D EPOCH PIC nx ✅ Supported
1D EPOCH PIC npart ✅ Supported
1D EPOCH PIC dt_multiplier ✅ Supported (Compositional)
1D EPOCH PIC field_order ✅ Supported (Compositional)
1D EPOCH PIC particle_order ✅ Supported (Compositional)
2D MPM nx ✅ Supported
2D MPM n_part ✅ Supported
2D MPM cfl ✅ Supported
1D Diffusion-Reaction cfl ✅ Supported
1D Diffusion-Reaction n_space ✅ Supported
1D Diffusion-Reaction tol ✅ Supported
2D Euler Gas Dynamics n_grid_x ✅ Supported
2D Euler Gas Dynamics cfl ✅ Supported
2D Euler Gas Dynamics cg_tolerance ✅ Supported
Hasegawa-Mima Nonlinear N ✅ Supported
Hasegawa-Mima Nonlinear dt ✅ Supported
Hasegawa-Mima Linear N ✅ Supported
Hasegawa-Mima Linear dt ✅ Supported
Hasegawa-Mima Linear cg_atol ✅ Supported
2D FEM dx ✅ Supported
2D FEM cfl ✅ Supported

πŸ•΅οΈ Generate Questions

Generate question templates for different physics domains and task types.

# 1D Heat Transfer
python qs_gen/1D_heat_transfer.py

# 2D Steady Heat Transfer
python qs_gen/2D_heat_transfer.py

# Burgers 1D Equation with 2nd Order Roe Method
python qs_gen/1D_burgers.py

# Euler 1D Equations with 2nd Order MUSCL-Roe Method
python qs_gen/1D_euler.py

# 2D Navier-Stokes Channel Flow with SIMPLE Algorithm
python qs_gen/2D_ns.py

# 2D Navier-Stokes Transient Flow with Taichi Framework
python qs_gen/2D_ns_transient.py

# 1D EPOCH Particle-in-Cell Simulation
python qs_gen/1D_epoch.py

# 2D Material Point Method (MPM) Simulation
python qs_gen/2D_mpm.py

# 1D Diffusion-Reaction Equations with Newton Method
python qs_gen/1D_diff_react.py

# 2D Euler Equations with Advection-Projection Method
python qs_gen/2D_euler.py

# Hasegawa-Mima Nonlinear Equation with Pseudo-Spectral Method
python qs_gen/hasegawa_mima_nonlinear.py

# Hasegawa-Mima Linear Equation with RK4 and CG Solver
python qs_gen/hasegawa_mima_linear.py

# 2D Finite Element Method with Implicit Newton Solver
python qs_gen/2D_fem.py

Output: Generated questions are saved to data/{simulation}/{task}/{precision_level}/question.json

🚀 Generate Benchmark Datasets

Create complete benchmark datasets with problem instances and ground truth solutions.

# 1D Heat Transfer
python dataset_gen/oneD_heat_transfer.py

# 2D Steady Heat Transfer
python dataset_gen/twoD_heat_transfer.py

# Burgers 1D Equation with 2nd Order Roe Method
python dataset_gen/oneD_burgers.py

# Euler 1D Equations with 2nd Order MUSCL-Roe Method
python dataset_gen/oneD_euler.py

# 2D Navier-Stokes Channel Flow with SIMPLE Algorithm
python dataset_gen/twoD_ns.py

# 2D Navier-Stokes Transient Flow with Taichi Framework
python dataset_gen/twoD_ns_transient.py

# 1D EPOCH Particle-in-Cell Simulation
python dataset_gen/oneD_epoch.py

# 2D Material Point Method (MPM) Simulation
python dataset_gen/twoD_mpm.py

# 1D Diffusion-Reaction Equations with Newton Method
python dataset_gen/oneD_diff_react.py

# 2D Euler Equations with Advection-Projection Method
python dataset_gen/twoD_euler.py

# Hasegawa-Mima Nonlinear Equation with Pseudo-Spectral Method
python dataset_gen/hasegawa_mima_nonlinear.py

# Hasegawa-Mima Linear Equation with RK4 and CG Solver
python dataset_gen/hasegawa_mima_linear.py

# 2D Finite Element Method with Implicit Newton Solver
python dataset_gen/twoD_fem.py

Output: Datasets are saved to the data/{simulation}/{task}/{precision_level}/human_write/ directory

📄 Configure Model Providers

🌐 Commercial API Models

Configure API keys in your .env file:

# OpenAI API key
OPENAI_API_KEY=your_openai_api_key

# Google API key for Gemini
GOOGLE_API_KEY=your_google_api_key

# AWS API credentials for Bedrock
AWS_ACCESS_KEY_ID=your_aws_access_key
AWS_SECRET_ACCESS_KEY=your_aws_secret_key
AWS_REGION_NAME=your_aws_region_name

🧠 Custom Models Configuration

Two configuration methods are available:

Method 1: JSON Configuration (Recommended for Multiple Models)

Copy the example file and create your own configuration:

cp configs/custom_models.json.example configs/custom_models.json

Then edit configs/custom_models.json to manage multiple custom models:

{
  "custom_models": {
    "qwen3_8b": {
      "custom_code": "/path/to/custom_model/custom_inference.py",
      "model_path": "/data/models/Qwen3-8B",
      "custom_class": "Qwen3"
    },
    "llama3_7b": {
      "custom_code": "/path/to/custom_model/custom_inference.py", 
      "model_path": "/data/models/Llama3-7B",
      "custom_class": "Llama3"
    }
  }
}

Method 2: Environment Variables (For Single Model)

Set in your .env file:

custom_code="/path/to/custom_model/custom_inference.py"
model_path="/path/to/your/custom_model"
custom_class="CustomModel"

📋 List Available Models:

python scripts/list_custom_models.py

For a detailed implementation guide, see the Custom Model Integration Guide.
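At a high level, the custom_class named in the configuration is a wrapper around a local checkpoint that the harness can instantiate and query. The skeleton below is a hypothetical sketch assuming a simple prompt-in/text-out interface; the exact class shape SimulCost-Bench expects is defined in the Custom Model Integration Guide, not here.

```python
class CustomModel:
    """Hypothetical skeleton for a class referenced by `custom_class`.

    Only illustrates the idea of wrapping a local checkpoint behind a
    uniform generate() call; the real required interface is documented
    in the Custom Model Integration Guide.
    """

    def __init__(self, model_path: str):
        self.model_path = model_path  # e.g. /data/models/Qwen3-8B
        # A real implementation would load the tokenizer and weights here.

    def generate(self, prompt: str) -> str:
        # A real implementation would run the model; we echo for illustration.
        return f"[{self.model_path}] response to: {prompt}"


model = CustomModel("/data/models/Qwen3-8B")
print(model.generate("Suggest a CFL number for a 1D heat solver."))
```

With this shape, the JSON entry's model_path becomes the constructor argument and custom_class selects which wrapper class to instantiate from the custom_code file.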

📂 Customize Simulation Results Directory

By default, simulation results are saved to sim_res/ in the current working directory. You can configure a custom absolute path for storing simulation results.

Configuration

Add the following to your .env file:

# Optional: Specify custom directory for simulation results
# If not set, results will be saved to ./sim_res/ (relative path)
SIM_RES_BASE_DIR=/path/to/your/custom/directory

Examples

Use absolute path:

SIM_RES_BASE_DIR=/data/leo_work_new

Results will be saved to: /data/leo_work_new/sim_res/...

Use default relative path:

# Comment out or remove the SIM_RES_BASE_DIR line
# SIM_RES_BASE_DIR=/data/leo_work_new

Results will be saved to: ./sim_res/... (relative to working directory)

Docker users: If you run via Docker, you don't need SIM_RES_BASE_DIR. Simply mount your desired host directory to the container's sim_res/ path when running docker run, e.g. -v /your/host/dir:/app/sim_res.
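The resolution rule above can be summarized in a few lines. This is a sketch of the described behavior (env var takes precedence, otherwise ./sim_res), not the repository's actual code:

```python
import os
from pathlib import Path


def resolve_sim_res_dir() -> Path:
    """Return the base directory for simulation results.

    Mirrors the rule above: if SIM_RES_BASE_DIR is set, results go under
    {SIM_RES_BASE_DIR}/sim_res; otherwise under ./sim_res.
    """
    base = os.environ.get("SIM_RES_BASE_DIR")
    if base:
        return Path(base) / "sim_res"
    return Path("sim_res")


os.environ["SIM_RES_BASE_DIR"] = "/data/leo_work_new"
print(resolve_sim_res_dir())  # /data/leo_work_new/sim_res on POSIX
```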

📥 Pre-cached Simulation Results (optional)

To help you skip long simulation runtimes, pre-computed simulation results are available on Hugging Face.

Option A: Baseline cache (recommended for quick start)

Simulation File Size Simulation Type
burgers_1d.zip 675 MB 1D Burgers Equation
diff_react_1d.zip 293 MB 1D Diffusion-Reaction
epoch.zip 3.65 GB 1D EPOCH PIC
euler_1d.zip 3.86 GB 1D Euler Equations
euler_2d.zip 1.54 GB 2D Euler Gas Dynamics
fem_2d.zip 146 MB 2D Finite Element Method
hasegawa_mima_linear.zip 1.45 GB Hasegawa-Mima Linear
hasegawa_mima_nonlinear.zip 300 MB Hasegawa-Mima Nonlinear
heat_1d.zip 182 MB 1D Heat Transfer
heat_2d.zip 2.66 GB 2D Steady Heat Transfer
ns_channel_2d.zip 96.7 MB 2D Navier-Stokes Channel
ns_transient_2d.zip 6.56 GB 2D Navier-Stokes Transient
unstruct_mpm.zip 1.13 GB 2D Material Point Method

Option B: Full cache (complete project cache)

Simulation File Size Simulation Type
burgers_1d.tar.gz 1.69 GB 1D Burgers Equation
diff_react_1d.tar.gz 1.35 GB 1D Diffusion-Reaction
epoch.tar.gz 3.65 GB 1D EPOCH PIC
euler_1d.tar.gz 14.1 GB 1D Euler Equations
euler_2d.tar.gz 5.64 GB 2D Euler Gas Dynamics
fem2d.tar.gz 877 MB 2D Finite Element Method
hasegawa_mima_linear.tar.gz 10.8 GB Hasegawa-Mima Linear
hasegawa_mima_nonlinear.tar.gz 1.32 GB Hasegawa-Mima Nonlinear
heat_1d.tar.gz 3.46 GB 1D Heat Transfer
heat_steady_2d.tar.gz 5.19 GB 2D Steady Heat Transfer
ns_transient_2d.tar.gz 58.8 GB 2D Navier-Stokes Transient
unstruct_mpm.tar.gz 221 GB 2D Material Point Method

How to use

Download the files you need from the dataset page and place/extract them into your simulation results directory (e.g., ./sim_res/ or your SIM_RES_BASE_DIR).
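For the baseline cache (zip archives), installation is a standard extraction into the results directory. The sketch below assumes you have already downloaded an archive; the helper name and directory layout are illustrative:

```python
import zipfile
from pathlib import Path


def install_cache(archive: str, sim_res_dir: str = "sim_res") -> None:
    """Extract a downloaded cache archive (e.g. heat_1d.zip) into sim_res/."""
    target = Path(sim_res_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(target)


# install_cache("heat_1d.zip")  # populates ./sim_res/ with cached results
```

For the full cache (tar.gz archives), the equivalent is `tar -xzf <file>.tar.gz -C sim_res/`.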

🧠 Run Inference

🚀 Simulation Speed Reference

If you're getting started, begin with the "Fast" group to avoid wasting time on very slow simulations.

  • Fast (recommended to start): burgers_1d, diff_react_1d, heat_1d, fem_2d
  • Moderate: euler_1d, euler_2d, ns_transient_2d, unstruct_mpm, hasegawa_mima_linear, hasegawa_mima_nonlinear, epoch
  • Slow: heat_2d, ns_channel_2d (can be very slow ⚠️)

Execute LLM inference on benchmark datasets to generate predictions.

# Commercial API Models
python inference/langchain_LLM.py -p openai -m gpt-5-2025-08-07 -d heat_1d -t cfl -l medium -z

# Single Custom Model
python inference/langchain_LLM.py -p custom_model -m qwen3_8b -d heat_1d -t cfl -l medium -z

# Multiple Custom Models (use batch scripts)
bash scripts/inference_eval/inference_eval_heat_1d.sh

Parameters:

  • -p: LLM provider (openai, gemini, bedrock, custom_model)
  • -m: Model name/identifier
  • -d: Dataset name
  • -t: Problem task type
  • -l: Precision level
  • -z: Enable zero-shot mode
  • --list-combinations: Show all valid dataset-task combinations and exit

Outputs:

  • Results: results_model_attempt/{dataset}/{precision_level}/{task}/
  • Logs: log_model_tool_call/{dataset}/{precision_level}/{task}/

💡 Tip: Use --list-combinations to see all available dataset-task combinations:

python inference/langchain_LLM.py --list-combinations
OpenAI Reasoning Models: Setting reasoning_effort

For OpenAI reasoning models (e.g., GPT-5), you can control the reasoning effort by appending -re-{effort} to the model name:

  • Syntax: {model_name}-re-{effort}
  • Valid effort levels: minimal, low, medium, high

Example:

# Use GPT-5 with minimal reasoning effort
python inference/langchain_LLM.py -p openai -m gpt-5-2025-08-07-re-minimal -d heat_1d -t cfl -l medium -z
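The suffix convention can be parsed mechanically. This sketch is a hypothetical helper, not the repository's parser; it splits a model identifier into its base name and effort level:

```python
VALID_EFFORTS = ("minimal", "low", "medium", "high")


def split_reasoning_effort(model_name: str):
    """Split '{model}-re-{effort}' into (model, effort); effort is None if absent."""
    for effort in VALID_EFFORTS:
        suffix = f"-re-{effort}"
        if model_name.endswith(suffix):
            return model_name[: -len(suffix)], effort
    return model_name, None


print(split_reasoning_effort("gpt-5-2025-08-07-re-minimal"))
# -> ('gpt-5-2025-08-07', 'minimal')
```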

🔄 Resume Functionality

The inference system includes automatic progress tracking and resume capabilities to handle long-running experiments gracefully.

Features

  • Automatic Progress Saving: Progress is saved after each completed sample
  • Resume from Interruption: Continue from where you left off using the --resume flag

Usage

python inference/langchain_LLM.py -p custom_model -m qwen3_8b -d heat_1d -t cfl -l medium --resume

Progress Files

  • Location: log_model_tool_call/{dataset}/{precision_level}/{task}/{flag}_{model_name}_progress.json
  • Content: Contains completed results and intermediate data for resuming
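To check how far an interrupted run got, you can inspect the progress file directly. The sketch below assumes a "completed" list field for illustration; inspect a real *_progress.json to see the exact schema the harness writes.

```python
import json
from pathlib import Path


def count_completed(progress_file: str) -> int:
    """Count completed samples recorded in a progress file.

    The 'completed' field name is an assumption for illustration, not the
    harness's documented schema.
    """
    path = Path(progress_file)
    if not path.exists():
        return 0  # nothing done yet; a fresh run starts from the beginning
    data = json.loads(path.read_text())
    return len(data.get("completed", []))
```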

📊 Evaluate Models' Performance

Compute performance metrics and accuracy scores for model predictions.

# Example (Heat 1D)
python evaluation/heat_1d/eval.py -m anthropic.claude-3-5-haiku-20241022-v1:0 -d heat_1d -t cfl -l medium -z

Parameters:

  • -m: Model name/identifier
  • -d: Dataset name
  • -t: Problem task type
  • -l: Precision level (for heat_1d, heat_2d, burgers_1d and euler_1d: low, medium, high; default: medium)
  • -z: Enable zero-shot mode

Outputs:

  • JSON results: eval_results/{dataset}/{task}/{precision_level}/
  • Parquet dataframes: eval_results/{dataset}/dataframes/

📖 For advanced data analysis workflows and parquet usage, see Evaluation Documentation.

πŸ—‚οΈ Tabulate Evaluation Results

Generate summary tables and comparative analysis across different models and tasks.

python evaluation/tabulate.py -d heat_1d
python evaluation/tabulate.py -d heat_2d
python evaluation/tabulate.py -d burgers_1d
python evaluation/tabulate.py -d euler_1d
python evaluation/tabulate.py -d ns_2d
python evaluation/tabulate.py -d ns_transient_2d
python evaluation/tabulate.py -d epoch_1d
python evaluation/tabulate.py -d mpm_2d

Parameters:

  • -d: Dataset name to tabulate results for

Output: Summary tables are generated in Excel/CSV format

📈 Generate Simulation-Level Summaries

After generating task-level results with tabulate.py, you can create simulation-level aggregated summaries that combine all tasks within a simulation (dataset).

# Aggregate task-level results to simulation-level summaries
python evaluation/simul_sum.py -d heat_1d
python evaluation/simul_sum.py -d heat_2d
python evaluation/simul_sum.py -d burgers_1d
python evaluation/simul_sum.py -d euler_1d
python evaluation/simul_sum.py -d ns_2d
python evaluation/simul_sum.py -d ns_transient_2d
python evaluation/simul_sum.py -d epoch_1d
python evaluation/simul_sum.py -d mpm_2d

Parameters:

  • -d: Dataset name to aggregate results

Output:

  • CSV: eval_results/{dataset}/{dataset}_sum.csv - Combined results with precision_level column
  • Excel: eval_results/{dataset}/{dataset}_sum.xlsx - Clean, professional formatting with visual separators

Note: Run tabulate.py first to generate the required task-level CSV files before running simul_sum.py.
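Conceptually, the aggregation stitches the task-level CSVs produced by tabulate.py into one table and tags each row with its precision level. A stdlib sketch of that combine step, with illustrative file layout and column names (the actual simul_sum.py logic may differ):

```python
import csv
from pathlib import Path


def combine_task_csvs(csv_paths, out_path):
    """Concatenate task-level CSVs, adding a precision_level column taken
    from each file's parent directory name (illustrative layout)."""
    rows, header = [], None
    for p in map(Path, csv_paths):
        with p.open(newline="") as f:
            reader = csv.DictReader(f)
            header = (reader.fieldnames or []) + ["precision_level"]
            for row in reader:
                row["precision_level"] = p.parent.name
                rows.append(row)
    with Path(out_path).open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=header)
        writer.writeheader()
        writer.writerows(rows)
```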

πŸ› οΈ Script Usage Guide

The scripts/ directory contains automated scripts for streamlined execution of common workflows including inference + evaluation pipelines.

For detailed usage instructions and examples, see the Script Usage Guide.

About

SimulCost: A Cost-Aware Benchmark for Automating Physics Simulations with LLMs
