Unify CLI configuration & deduplicate truncate #31
Conversation
…removing duplication across entrypoints
geoalgo
left a comment
This is a good PR, thanks @kargibora! Just a few small comments.
Also @ErlisLushtaku can you let us know if this PR will impact your workflow?
If not, happy to merge after my small comments are addressed.
judgearena/estimate_elo_ratings.py
```python
arena: str = ""
model: str = ""
```

Suggested change:

```diff
-arena: str = ""
-model: str = ""
+arena: str | None = None
+model: str | None = None
```
```python
model: str
judge_model: str
n_instructions: int | None = None

class CliEloArgs(BaseCliArgs):
```
Note: we pay the price of inheriting a dataclass by requiring all members below to have defaults, even when they do not make sense (we cannot leave `arena` uninitialized).
This is OK, but we may want to avoid dataclasses in the future if this becomes too messy.
Could you leave a comment indicating this?
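The constraint being described can be seen in a minimal standalone sketch (class and field names here are illustrative, not the PR's actual code): once a base dataclass defines a field with a default, every field a subclass adds must also carry a default, or `dataclasses` raises a `TypeError` at class-definition time.

```python
from dataclasses import dataclass

@dataclass
class Base:
    swap_mode: str = "fixed"   # base field with a default

# Adding a field WITHOUT a default in a subclass fails at class-creation
# time, because it would follow the base's defaulted field:
try:
    @dataclass
    class BadChild(Base):
        arena: str             # no default -> TypeError
except TypeError as exc:
    print(exc)                 # e.g. "non-default argument 'arena' follows default argument"

# So every subclass field must carry a default, even a meaningless one:
@dataclass
class Child(Base):
    arena: str = ""

print(Child().arena)
```

This is why `arena: str = ""` (or `arena: str | None = None`) is forced on the subclass even though an uninitialized `arena` makes no sense.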
```diff
@@ -137,79 +77,19 @@ def parse_args(cls):
         help="Model name to anchor at 1000 ELO. All other ratings are expressed relative to this model. "
         "Must be one of the models present in the arena battles. If not set, ratings are not anchored.",
     )
-    parser.add_argument(
-        "--truncate_all_input_chars",
-        type=int,
-        required=False,
-        default=8192,
-        help="Character-level truncation applied before tokenization: truncates each instruction "
-        "before model A/B generation and truncates each completion before judge evaluation.",
-    )
-    parser.add_argument(
-        "--max_out_tokens_models",
-        type=int,
-        required=False,
-        default=32768,
-        help=(
-            "Generation token budget for each model A/B response. For VLLM, keep this <= "
-            "--max_model_len (if provided)."
-        ),
-    )
-    parser.add_argument(
-        "--max_out_tokens_judge",
-        type=int,
-        required=False,
-        default=32768,
-        help=(
-            "Generation token budget for the judge response (reasoning + scores). For "
-            "VLLM, keep this <= --max_model_len (if provided)."
-        ),
-    )
-    parser.add_argument(
-        "--max_model_len",
-        type=int,
-        required=False,
-        default=None,
-        help=(
-            "Optional total context window for VLLM models (prompt + generation). This is "
-            "independent from --max_out_tokens_models/--max_out_tokens_judge, which only cap "
-            "generated tokens. This is useful on smaller GPUs to avoid OOM."
-        ),
-    )
-    parser.add_argument(
-        "--chat_template",
-        type=str,
-        required=False,
-        default=None,
-        help="Jinja2 chat template string to use instead of the model's tokenizer template. "
-        "If not provided, ChatML is used as fallback for models without a chat template.",
-    )
-    parser.add_argument(
-        "--engine_kwargs",
-        type=str,
-        required=False,
-        default="{}",
-        help=(
-            "JSON dict of engine-specific kwargs forwarded to the underlying engine. "
-            'Example for vLLM: \'{"tensor_parallel_size": 2, "gpu_memory_utilization": 0.9}\'.'
-        ),
-    )
```
judgearena/generate_and_evaluate.py
```python
class CliArgs(BaseCliArgs):
    """CLI arguments for the generate-and-evaluate entrypoint."""

    dataset: str = ""
```
Same here: better to have `str | None = None`.
ErlisLushtaku
left a comment
LGTM, just a couple of comments to make sure we don't break or duplicate anything after getting the mt-bench changes from main
judgearena/cli_common.py
```python
help=(
    "Model comparison order mode. 'fixed': always use model order A-B. "
    "'both': correct for model order bias by evaluating each instruction "
    "twice, once as A-B and once as B-A, and average. This helps account "
```
nit: We do not actually average (we concatenate); could you please update the description to not mention "averaging", same as here?
I had added a config file as part of the mt-bench PR: https://github.com/OpenEuroLLM/JudgeArena/pull/21/changes#diff-ee48e912ce7c7303506e9f981cc6fb9ade571a043a6acbc6fa77de513b9248fd
Could you please merge main and remove that one so we don't duplicate it?
```python
def truncate(s: str, max_len: int | None = None) -> str:
    """Truncate a string to *max_len* characters.

    Non-string inputs (e.g. ``None`` or ``float('nan')``) are coerced to the
    empty string so that callers don't have to guard against missing data.
    """
    if not isinstance(s, str):
        return ""
    if max_len is not None:
        return s[:max_len]
    return s
```
Also, we added an additional function `safe_text` after this here; make sure we don't lose it after merging main, as that would break mt-bench.
We can merge this branch after the mt-bench one, or vice versa, and simply re-insert `safe_text`.
I have inserted `safe_text`, thanks for the heads up!
Force-pushed from 69c83dc to 9dc32b0.
Problem
The codebase had two kinds of duplication that made maintenance error-prone:

1. **Two independent `truncate` implementations.** One was a top-level function in `generate.py`, the other nested inside `annotate_battles()` in `evaluate.py`. The two behaved differently: the nested version handled non-string inputs gracefully while the top-level one did not. A bug fix in one would never reach the other.
2. **12 identical CLI flags copy-pasted across entrypoints.** `generate_and_evaluate.py` and `estimate_elo_ratings.py` each defined the same `--engine`, `--judge`, `--swap_mode`, `--max_new_tokens`, etc. with their own argparse blocks and dataclass fields. Any default change or new shared flag required editing both files in lock-step, with no compiler or test to catch drift.

Additionally, `swap_mode` accepted arbitrary strings with no validation: a typo like `"boht"` would silently propagate through the pipeline and produce unexpected results.

Solution
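The construction-time validation and safe engine-kwargs parsing this PR adds can be sketched roughly as follows (a minimal sketch: only two of the 12 shared fields are shown, and the exact field names and signatures are assumptions, not the PR's actual code):

```python
import json
from dataclasses import dataclass

@dataclass
class BaseCliArgs:
    # Two illustrative shared fields; the real class carries all 12.
    engine: str = "vllm"
    swap_mode: str = "fixed"

    def __post_init__(self) -> None:
        # Reject typos like "boht" at construction time instead of letting
        # them silently propagate through the pipeline.
        if self.swap_mode not in {"fixed", "both"}:
            raise ValueError(
                f"swap_mode must be 'fixed' or 'both', got {self.swap_mode!r}"
            )

def parse_engine_kwargs(raw: str) -> dict:
    """Parse the --engine_kwargs JSON string, failing loudly on bad input."""
    parsed = json.loads(raw)
    if not isinstance(parsed, dict):
        raise ValueError("--engine_kwargs must be a JSON object")
    return parsed

print(BaseCliArgs(swap_mode="both").swap_mode)            # both
print(parse_engine_kwargs('{"tensor_parallel_size": 2}')) # {'tensor_parallel_size': 2}
```

With this shape, `BaseCliArgs(swap_mode="boht")` raises immediately rather than producing unexpected results downstream.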
- `judgearena/cli_common.py` introduces `BaseCliArgs` (a dataclass with the 12 shared fields), `add_common_arguments()` (a single argparse registration point), and `parse_engine_kwargs()` (safe JSON parsing). `BaseCliArgs.__post_init__` validates `swap_mode ∈ {"fixed", "both"}` at construction time.
- `judgearena/utils.py` now owns the single canonical `truncate(s, max_len)`, using the safer non-string guard from `evaluate.py`.
generate_and_evaluate.py,estimate_elo_ratings.py) inherit fromBaseCliArgsand declare only their unique fields, cutting ~200 lines of duplicated argument definitions.Files changed
| File | Changes |
|---|---|
| `judgearena/cli_common.py` | `BaseCliArgs`, `add_common_arguments()`, `parse_engine_kwargs()` |
| `judgearena/utils.py` | `truncate()` |
| `judgearena/generate.py` | `truncate()`; imports from `utils` |
| `judgearena/evaluate.py` | `truncate()`; imports from `utils` |
| `judgearena/generate_and_evaluate.py` | `CliArgs` → inherits `BaseCliArgs` (4 unique fields remain) |
| `judgearena/estimate_elo_ratings.py` | `CliEloArgs` → inherits `BaseCliArgs` (7 unique fields remain); removed unused `json`/`field` imports |

Testing
The two warnings are pre-existing deprecations in `transformers` and `langchain`, unrelated to this change.