2 changes: 1 addition & 1 deletion LICENSE
Original file line number Diff line number Diff line change
@@ -186,7 +186,7 @@
same "printed page" as the copyright notice for easier
identification within third-party archives.

Copyright 2026 Erlis Lushtaku, David Salinas
Copyright 2026 Erlis Lushtaku, David Salinas, and GitHub contributors

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
15 changes: 10 additions & 5 deletions README.md
@@ -196,15 +196,20 @@ This override applies to all vLLM models in the run. For remote providers (OpenA
| Dataset | Description |
|-----------------------|------------------------------------------------------------------------------------------------|
| `alpaca-eval` | General instruction-following benchmark |
| `arena-hard` | More challenging evaluation suite |
| `arena-hard-v2.0` | Arena-Hard v2.0 from official `lmarena-ai/arena-hard-auto` source |
| `arena-hard-v0.1` | Legacy Arena-Hard v0.1 from official `lmarena-ai/arena-hard-auto` source |
| `m-arena-hard` | Translated version of Arena-Hard in 23 languages |
| `m-arena-hard-{lang}` | Language-specific variants (e.g., `ar`, `cs`, `de`) |
| `m-arena-hard-EU` | All EU languages combined |
| `fluency-{lang}` | Fluency evaluation for pretrained models (`finnish`, `french`, `german`, `spanish`, `swedish`) |

For Arena-Hard, JudgeArena resolves baseline metadata by dataset version:
- `arena-hard-v0.1`: `gpt-4-0314`
- `arena-hard-v2.0`: `o3-mini-2025-01-31` (standard prompts)
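
The version-to-baseline resolution above amounts to a simple lookup. A minimal sketch, assuming a plain mapping; the helper name below is hypothetical, not JudgeArena's actual API:

```python
# Illustrative mapping only; JudgeArena's internal resolution may differ.
ARENA_HARD_BASELINES = {
    "arena-hard-v0.1": "gpt-4-0314",
    "arena-hard-v2.0": "o3-mini-2025-01-31",  # standard prompts
}

def resolve_baseline(dataset: str) -> str:
    # Hypothetical helper: look up the judged baseline for a dataset version.
    if dataset not in ARENA_HARD_BASELINES:
        raise ValueError(f"no Arena-Hard baseline registered for {dataset!r}")
    return ARENA_HARD_BASELINES[dataset]
```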

## 📈 Estimating ELO Ratings

OpenJury can estimate the ELO rating of a model by running it against opponents sampled from a human preference arena (`LMArena-100k`, `LMArena-140k`, or `ComparIA`).
JudgeArena can estimate the ELO rating of a model by running it against opponents sampled from a human preference arena (`LMArena-100k`, `LMArena-140k`, or `ComparIA`).
The LLM judge scores each battle, and the resulting ratings are computed using the Bradley-Terry model anchored against the human-annotated arena leaderboard.
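
The anchored Bradley-Terry fit can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not JudgeArena's implementation: opponent ratings are held fixed at their human-leaderboard values, and only the evaluated model's rating is estimated by gradient ascent on the Bradley-Terry log-likelihood (base-10, ELO-scaled).

```python
def estimate_elo(battles, opponent_ratings, lr=10.0, iters=2000):
    """Estimate a single model's ELO against fixed, anchored opponent ratings.

    battles: list of (opponent_name, score) where score is 1.0 for a win,
             0.5 for a tie, and 0.0 for a loss (from the judge's verdict).
    opponent_ratings: dict mapping opponent_name -> anchored ELO rating.
    """
    r = 1000.0  # initial guess
    for _ in range(iters):
        grad = 0.0
        for opp, score in battles:
            # Bradley-Terry win probability on the ELO scale.
            expected = 1.0 / (1.0 + 10 ** ((opponent_ratings[opp] - r) / 400.0))
            grad += score - expected
        r += lr * grad / len(battles)  # ascend the log-likelihood
    return r
```

With a single free rating and fixed opponents this is a one-dimensional maximum-likelihood fit; ties count as half a win, and the anchoring keeps the estimate on the same scale as the human-annotated leaderboard.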

### Quick start
@@ -220,7 +225,7 @@ judgearena-elo \
Alternatively, if running directly from the repository without installing:

```bash
uv run python openjury/estimate_elo_ratings.py \
uv run python judgearena/estimate_elo_ratings.py \
--arena ComparIA \
--model Together/meta-llama/Llama-3.3-70B-Instruct-Turbo \
--judge_model OpenRouter/deepseek/deepseek-chat-v3.1 \
@@ -232,8 +237,8 @@ uv run python openjury/estimate_elo_ratings.py \
| Flag | Default | Description |
|---|---|---|
| `--arena` | `ComparIA` | Arena to sample opponents from: `LMArena-100k`, `LMArena-140k`, or `ComparIA` |
| `--model` | *(required)* | Model under evaluation (same format as `openjury`) |
| `--judge_model` | *(required)* | LLM judge (same format as `openjury`) |
| `--model` | *(required)* | Model under evaluation (same format as `judgearena`) |
| `--judge_model` | *(required)* | LLM judge (same format as `judgearena`) |
| `--n_instructions` | all | Number of arena battles to use for evaluation |
| `--n_instructions_per_language` | all | Cap battles per language (useful for balanced multilingual eval) |
| `--languages` | all | Restrict to specific language codes, e.g. `en fr de` |
4 changes: 2 additions & 2 deletions TODOs.md
@@ -1,14 +1,14 @@
TODOs:
* push on pypi
* document on-the-fly evaluations with a custom prompt
* support MT-bench
* handle errors
* CI [high/large]
* implement CI judge option
* implement domain filter in CI (maybe pass a regexp by column?)
* report cost?

Done:
* push on pypi
* support MT-bench
* support alpaca-eval
* support arena-hard
* test together judge
45 changes: 42 additions & 3 deletions judgearena/arenas_utils.py
@@ -18,14 +18,53 @@ def _extract_instruction_text(turn: dict) -> str:
return " ".join(block["text"] for block in content if block.get("type") == "text")


KNOWN_ARENAS = ["LMArena-100k", "LMArena-140k", "ComparIA"]
KNOWN_ARENAS = ["LMArena-100k", "LMArena-55k", "LMArena-140k", "ComparIA"]


def _load_arena_dataframe(
arena: str, comparia_revision: str | None = None
) -> pd.DataFrame:
assert arena in KNOWN_ARENAS
if "LMArena" in arena:
if arena == "LMArena-55k":
path = snapshot_download(
repo_id="lmarena-ai/arena-human-preference-55k",
repo_type="dataset",
allow_patterns="*.csv",
force_download=False,
)
df = pd.read_csv(Path(path) / "train.csv")

def _winner_55k(row) -> str | None:
if row["winner_tie"]:
return "tie"
if row["winner_model_a"]:
return "model_a"
if row["winner_model_b"]:
return "model_b"
return None

df["winner"] = df.apply(_winner_55k, axis=1)
df = df[df["winner"].notna()].copy()

df["conversation_a"] = df.apply(
lambda r: [
{"role": "user", "content": str(r["prompt"])},
{"role": "assistant", "content": str(r["response_a"])},
],
axis=1,
)
df["conversation_b"] = df.apply(
lambda r: [
{"role": "user", "content": str(r["prompt"])},
{"role": "assistant", "content": str(r["response_b"])},
],
axis=1,
)
df["question_id"] = df["id"]
df["tstamp"] = 0
df["benchmark"] = "LMArena-55k"

elif "LMArena" in arena:
size = arena.split("-")[1] # "100k" or "140k"
path = snapshot_download(
repo_id=f"lmarena-ai/arena-human-preference-{size}",
@@ -139,7 +178,7 @@ def load_arena_dataframe(
if arena is None:
arenas = KNOWN_ARENAS
elif arena == "LMArena":
arenas = ["LMArena-100k", "LMArena-140k"]
arenas = ["LMArena-100k", "LMArena-55k", "LMArena-140k"]
else:
return _load_arena_dataframe(arena, comparia_revision)
return pd.concat(