A comprehensive benchmarking tool for MLX models on Apple Silicon.
- 📊 Comprehensive Metrics: Measures throughput, TTFT, token latency
- 🔄 Multiple Test Scenarios: Short, medium, and long generation tests
- 📈 Streaming Benchmarks: Measures real-time streaming performance
- 🎯 Consistent Results: Multiple runs with statistical averaging
- 💻 Apple Silicon Optimized: Built specifically for MLX framework
- macOS with Apple Silicon (M1/M2/M3/M4)
- Python 3.10 or higher
```bash
# Clone the repository
git clone <your-repo-url>
cd mlx-benchmark

# Create virtual environment
python3 -m venv .venv
source .venv/bin/activate

# Install dependencies
pip install -r requirements.txt
```

Run a basic benchmark:

```bash
python benchmark.py --model-path /path/to/your/mlx-model
```

Increase the number of runs for tighter averages:

```bash
python benchmark.py --model-path /path/to/your/mlx-model --runs 5
```

Full example:

```bash
python benchmark.py \
  --model-path ~/.lmstudio/models/lmstudio-community/Qwen3-Coder-Next-MLX-6bit \
  --runs 3
```

The tool runs four types of benchmarks:
1. Short generation: tests quick code completions and snippets
2. Medium generation: tests function and class implementations
3. Long generation: tests complex code generation scenarios
4. Streaming: measures real-time streaming performance, including:
   - Time to First Token (TTFT)
   - Per-token latency
   - Streaming throughput
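Streaming metrics like these can be collected by timestamping each token as it arrives from the model's token stream. A minimal sketch; the `fake_stream` generator below is a stand-in for a real MLX token stream (such as one produced by `mlx_lm.stream_generate`):

```python
import time

def time_stream(token_stream):
    """Record an arrival timestamp (relative to start) for each streamed token."""
    start = time.perf_counter()
    stamps = []
    for _ in token_stream:
        stamps.append(time.perf_counter() - start)
    return stamps

def fake_stream(n, delay=0.001):
    """Stand-in generator simulating a model emitting n tokens."""
    for _ in range(n):
        time.sleep(delay)
        yield "tok"

stamps = time_stream(fake_stream(5))
ttft = stamps[0]                                        # Time to First Token
gaps = [b - a for a, b in zip(stamps, stamps[1:])]
avg_latency = sum(gaps) / len(gaps)                     # average per-token latency
```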
- Throughput: Tokens generated per second (tokens/s)
- TTFT: Time to First Token - how quickly the model starts responding
- Token Latency: Average time between each token during streaming
- Total Time: Complete generation time including prompt processing
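The reported numbers relate to each other in a simple way: throughput is generated tokens divided by total time, and per-token latency is the post-TTFT time spread over the remaining tokens. A sketch of that arithmetic (the `summarize` helper is hypothetical; `benchmark.py`'s actual bookkeeping may differ):

```python
def summarize(prompt_tokens, generated_tokens, total_time, ttft):
    """Derive the headline metrics from raw counts and timings."""
    gen_time = total_time - ttft  # time spent emitting tokens after the first
    return {
        "throughput_tps": generated_tokens / total_time,
        "token_latency_s": gen_time / max(generated_tokens - 1, 1),
        "ttft_s": ttft,
        "total_time_s": total_time,
    }

# Numbers from the sample output below; the TTFT value is illustrative.
m = summarize(prompt_tokens=9, generated_tokens=40, total_time=0.820, ttft=0.12)
```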
Example output:

```
======================================================================
MLX Model Benchmark
======================================================================
Model: Qwen3-Coder-Next-MLX-6bit
Path:  ~/.lmstudio/models/lmstudio-community/Qwen3-Coder-Next-MLX-6bit
Runs:  3

Loading model... Done! (13.82s)

======================================================================
Short Code Generation (50 tokens)
======================================================================
Run 1:
  Prompt tokens:    9
  Generated tokens: 40
  Total time:       0.820s
  Throughput:       48.79 tokens/s

Average (excluding warmup):
  Throughput:   48.88 tokens/s
  Time per run: 0.818s
```
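The averaged figures discard the first run, since it typically includes warmup overhead (cache compilation, memory allocation). A sketch of that averaging, assuming the first run is the warmup:

```python
from statistics import mean

def average_excluding_warmup(per_run_throughput):
    """Average per-run results, dropping the first (warmup) run."""
    if len(per_run_throughput) < 2:
        return mean(per_run_throughput)  # nothing to discard
    return mean(per_run_throughput[1:])

# Illustrative per-run throughputs consistent with the sample output above.
avg = average_excluding_warmup([48.79, 48.92, 48.84])
```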
Typical performance on Apple Silicon:
| Chip | Model Size | Expected Throughput |
|---|---|---|
| M1 | 7B (6-bit) | 30-50 tokens/s |
| M2 | 7B (6-bit) | 40-60 tokens/s |
| M3 Pro/Max | 7B (6-bit) | 50-70 tokens/s |
| M3 Max | 14B (6-bit) | 35-55 tokens/s |
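Throughput and average per-token latency are reciprocals, which makes the table easy to read either way; for example, 50 tokens/s corresponds to 20 ms per token:

```python
def ms_per_token(tokens_per_sec):
    """Convert throughput to average per-token latency in milliseconds."""
    return 1000.0 / tokens_per_sec

ms_per_token(50.0)  # → 20.0 ms/token
```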
```
mlx-benchmark/
├── benchmark.py      # Main benchmark script
├── requirements.txt  # Python dependencies
├── README.md         # This file
└── .gitignore        # Git ignore patterns
```
To add a new benchmark test, edit the `test_cases` list in `benchmark.py`:

```python
test_cases = [
    {
        "name": "Your Test Name",
        "prompt": "Your prompt here",
        "max_tokens": 100
    }
]
```

Contributions are welcome! Please feel free to submit a Pull Request.
MIT License