Vahid Majdinasab edited this page Mar 6, 2026 · 5 revisions

PrismBench

Systematic evaluation of language models through Monte Carlo Tree Search



Overview

PrismBench is a comprehensive framework for evaluating Large Language Model capabilities in computer science problem-solving. Using a three-phase Monte Carlo Tree Search approach, it systematically maps model strengths, discovers challenging areas, and provides detailed performance analysis.

Core Approach:

  • Phase 1: Maps initial capabilities across CS concepts
  • Phase 2: Discovers challenging concept combinations
  • Phase 3: Conducts comprehensive evaluation of weaknesses
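The three phases above all rest on MCTS-style node selection. As a rough illustration, here is a minimal sketch of the selection step using the standard UCB1 rule; the class and function names (`ConceptNode`, `ucb1`, `select_child`) and the exploration constant are illustrative assumptions, not PrismBench's actual API:

```python
import math

class ConceptNode:
    """Hypothetical node tracking how a model performs on one CS concept."""

    def __init__(self, concept):
        self.concept = concept
        self.visits = 0      # how many times this concept was evaluated
        self.value = 0.0     # cumulative reward (e.g. challenge solve rate)

def ucb1(node, parent_visits, c=1.4):
    """Upper Confidence Bound: balances exploiting concepts with known
    scores against exploring rarely visited ones."""
    if node.visits == 0:
        return float("inf")  # always try unvisited concepts first
    exploit = node.value / node.visits
    explore = c * math.sqrt(math.log(parent_visits) / node.visits)
    return exploit + explore

def select_child(children, parent_visits):
    """Pick the child concept with the highest UCB1 score."""
    return max(children, key=lambda n: ucb1(n, parent_visits))
```

In Phase 2, for example, a rule like this would steer the search toward concept combinations whose low average reward signals a model weakness.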

Getting Started

New to PrismBench? Follow our quick start guide to get up and running in five minutes.

Quick Start Guide →

Need detailed setup? See our comprehensive configuration documentation.

Configuration Guide →


Core Documentation

Framework Components

| Component | Description | Documentation |
| --- | --- | --- |
| MCTS Algorithm | Three-phase search strategy for capability mapping | MCTS Algorithm → |
| Agent System | Multi-agent architecture for challenge creation and evaluation | Agent System → |
| Environment System | Pluggable evaluation environments for different scenarios | Environment System → |
| Architecture | System design and component interactions | Architecture Overview → |

Analysis & Results

| Topic | Description | Documentation |
| --- | --- | --- |
| Results Analysis | Understanding and interpreting evaluation results | Results Analysis → |
| Tree Structure | Search tree implementation and concept organization | Tree Structure → |

Extending PrismBench

PrismBench is designed to be extensible, allowing you to add custom agents, environments, and MCTS phases.
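To make the extension point concrete, here is a toy sketch of what a pluggable evaluation environment could look like. The base class, method name, and the `StringReversalEnv` example are all hypothetical; the actual interface is defined in the Environment System documentation linked above:

```python
class Environment:
    """Hypothetical stand-in for a pluggable evaluation environment."""

    def run_challenge(self, solution: str) -> bool:
        """Execute a candidate solution and report pass/fail."""
        raise NotImplementedError

class StringReversalEnv(Environment):
    """Toy environment: the challenge passes if the model's code
    defines a `reverse` function that reverses a fixed input."""

    def run_challenge(self, solution: str) -> bool:
        namespace = {}
        exec(solution, namespace)  # load the candidate solution
        return namespace["reverse"]("abc") == "cba"
```

A custom environment in this style would be registered with the framework and exercised by the search engine during evaluation.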


System Architecture

PrismBench follows a microservices architecture with four services:

┌──────────────┐    ┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│ GUI          │    │ Search          │    │ Environment      │    │ LLM Interface   │
│ Port 3000    │◄──►│ Port 8002       │◄──►│ Port 8001        │◄──►│ Port 8000       │
│ Web Frontend │    │ MCTS Engine     │    │ Challenge Exec   │    │ Model Comm      │
└──────────────┘    └─────────────────┘    └──────────────────┘    └─────────────────┘

Detailed Architecture →
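For a default local (Docker Compose) deployment, the diagram above maps to a service table like the following. The base URLs assume everything runs on localhost with the ports shown; this dictionary is illustrative, not PrismBench's actual configuration format:

```python
# Assumed localhost base URLs for the four services in the diagram.
SERVICES = {
    "gui": "http://localhost:3000",            # web frontend
    "search": "http://localhost:8002",         # MCTS engine
    "environment": "http://localhost:8001",    # challenge execution
    "llm_interface": "http://localhost:8000",  # model communication
}
```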


Key Features

  • Systematic Evaluation through MCTS-driven exploration
  • Challenge Discovery that automatically identifies model weaknesses
  • Comprehensive Analysis with detailed performance metrics
  • Containerized Deployment with Docker support
  • API Compatibility with any OpenAI-compatible endpoint
  • Extensible Architecture for custom components

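Because the framework targets OpenAI-compatible endpoints, any server that implements the standard chat-completions API can be evaluated. As a sketch, the request below is built with only the standard library; the base URL points at the LLM Interface port from the diagram above, and the model name and route are assumptions about a generic OpenAI-compatible server, not PrismBench specifics:

```python
import json
import urllib.request

def build_chat_request(base_url, model, prompt):
    """Build a chat-completions request for an OpenAI-compatible server."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

# Example: target the LLM Interface service on its default port.
req = build_chat_request("http://localhost:8000", "my-model", "Reverse a list in Python.")
```

Sending the request (e.g. with `urllib.request.urlopen(req)`) requires the service to be running.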
Support

| Resource | Description |
| --- | --- |
| Troubleshooting | Common issues and solutions |
| GitHub Discussions | Community support and questions |
| Issue Tracker | Bug reports and feature requests |

Contributing

We welcome contributions to PrismBench! Whether you're fixing bugs, adding features, or improving documentation, your help is appreciated.

Contributing Guide →


Related Pages

Get Started

Core Framework

Advanced Usage


Made with enough ☕️ to fell an elephant and a whole lot of ❤️ by Ahura Majdinasab - SWAT Lab - Polytechnique Montreal
