Unify CLI configuration & deduplicate truncate#31

Open
kargibora wants to merge 4 commits into main from refactor/unify-cli-config-dedup-truncate

Conversation

@kargibora (Collaborator)

Problem

The codebase had two kinds of duplication that made maintenance error-prone:

  1. Two independent truncate implementations — one as a top-level function in generate.py and another nested inside annotate_battles() in evaluate.py. The two behaved differently: the nested version handled non-string inputs gracefully while the top-level one did not. A bug fix in one would never reach the other.

  2. 12 identical CLI flags copy-pasted across entrypoints: generate_and_evaluate.py and estimate_elo_ratings.py each defined the same --engine, --judge, --swap_mode, --max_new_tokens, etc. in their own argparse blocks and dataclass fields. Any default change or new shared flag required editing both files in lock-step, with no compiler or test to catch drift.

Additionally, swap_mode accepted arbitrary strings with no validation — a typo like "boht" would silently propagate through the pipeline and produce unexpected results.

Solution

  • judgearena/cli_common.py introduces BaseCliArgs (a dataclass with the 12 shared fields), add_common_arguments() (a single argparse registration point), and parse_engine_kwargs() (safe JSON parsing). BaseCliArgs.__post_init__ validates swap_mode ∈ {"fixed", "both"} at construction time.

  • judgearena/utils.py now owns the single canonical truncate(s, max_len) using the safer non-string guard from evaluate.py.

  • Both entrypoints (generate_and_evaluate.py, estimate_elo_ratings.py) inherit from BaseCliArgs and declare only their unique fields, cutting ~200 lines of duplicated argument definitions.
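As a sketch of the cli_common.py pattern described above (the field names and defaults here are illustrative, not the actual 12 shared flags):

```python
import argparse
import json
from dataclasses import dataclass

VALID_SWAP_MODES = {"fixed", "both"}


@dataclass
class BaseCliArgs:
    # Illustrative subset of the shared fields; the real class has 12.
    engine: str = "vllm"
    judge: str = ""
    swap_mode: str = "fixed"
    max_new_tokens: int = 32768
    engine_kwargs: str = "{}"

    def __post_init__(self) -> None:
        # Fail fast at construction time instead of letting a typo
        # like "boht" propagate silently through the pipeline.
        if self.swap_mode not in VALID_SWAP_MODES:
            raise ValueError(
                f"swap_mode must be one of {sorted(VALID_SWAP_MODES)}, "
                f"got {self.swap_mode!r}"
            )


def add_common_arguments(parser: argparse.ArgumentParser) -> None:
    # Single registration point: both entrypoints call this instead of
    # copy-pasting the same parser.add_argument() blocks.
    parser.add_argument("--engine", type=str, default="vllm")
    parser.add_argument("--judge", type=str, default="")
    parser.add_argument("--swap_mode", type=str, default="fixed")
    parser.add_argument("--max_new_tokens", type=int, default=32768)
    parser.add_argument("--engine_kwargs", type=str, default="{}")


def parse_engine_kwargs(raw: str) -> dict:
    # Safe JSON parsing: reject payloads that are valid JSON but not a dict.
    parsed = json.loads(raw)
    if not isinstance(parsed, dict):
        raise ValueError("--engine_kwargs must be a JSON object")
    return parsed
```

With this shape, `BaseCliArgs(swap_mode="boht")` raises immediately rather than producing unexpected results downstream.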

Files changed

| File | What changed |
| --- | --- |
| judgearena/cli_common.py | New: BaseCliArgs, add_common_arguments(), parse_engine_kwargs() |
| judgearena/utils.py | Added canonical truncate() |
| judgearena/generate.py | Removed local truncate(); imports from utils |
| judgearena/evaluate.py | Removed nested truncate(); imports from utils |
| judgearena/generate_and_evaluate.py | CliArgs → inherits BaseCliArgs (4 unique fields remain) |
| judgearena/estimate_elo_ratings.py | CliEloArgs → inherits BaseCliArgs (7 unique fields remain); removed unused json/field imports |

Testing

$ uv run pytest
27 passed, 2 warnings in 60.85s

The two warnings are pre-existing deprecations in transformers and langchain — unrelated to this change.

@geoalgo (Collaborator) left a comment

This is a good PR, thanks @kargibora! Just a few small comments.
Also @ErlisLushtaku, can you let us know if this PR will impact your workflow?
If not, I'm happy to merge after my small comments are addressed.

Comment on lines +21 to +22
arena: str = ""
model: str = ""

Suggested change
arena: str = ""
model: str = ""
arena: str | None = None
model: str | None = None

model: str
judge_model: str
n_instructions: int | None = None
class CliEloArgs(BaseCliArgs):
Note: we pay a price for inheriting a dataclass: every member below must be given a default, even when one does not make sense (we cannot leave arena uninitialized).
This is OK, but we may want to avoid dataclasses in the future if it becomes too messy.
Could you leave a comment indicating this?
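(A minimal illustration of the constraint being described: once the base dataclass has defaulted fields, a subclass cannot add a default-less field.)

```python
from dataclasses import dataclass


@dataclass
class Base:
    shared: str = "x"  # base field with a default


def define_child_without_default():
    # In the generated __init__, the subclass field comes after the base's
    # defaulted field, so leaving it default-less raises TypeError at
    # class-definition time: "non-default argument 'arena' follows
    # default argument".
    @dataclass
    class Child(Base):
        arena: str  # no default

    return Child


try:
    define_child_without_default()
    raised = False
except TypeError:
    raised = True
```

Hence every subclass field must carry a default (e.g. `arena: str = ""`), even when "uninitialized" would be the honest state.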

Comment on lines 69 to -196
@@ -137,79 +77,19 @@ def parse_args(cls):
help="Model name to anchor at 1000 ELO. All other ratings are expressed relative to this model. "
"Must be one of the models present in the arena battles. If not set, ratings are not anchored.",
)
parser.add_argument(
"--truncate_all_input_chars",
type=int,
required=False,
default=8192,
help="Character-level truncation applied before tokenization: truncates each instruction "
"before model A/B generation and truncates each completion before judge evaluation.",
)
parser.add_argument(
"--max_out_tokens_models",
type=int,
required=False,
default=32768,
help=(
"Generation token budget for each model A/B response. For VLLM, keep this <= "
"--max_model_len (if provided)."
),
)
parser.add_argument(
"--max_out_tokens_judge",
type=int,
required=False,
default=32768,
help=(
"Generation token budget for the judge response (reasoning + scores). For "
"VLLM, keep this <= --max_model_len (if provided)."
),
)
parser.add_argument(
"--max_model_len",
type=int,
required=False,
default=None,
help=(
"Optional total context window for VLLM models (prompt + generation). This is "
"independent from --max_out_tokens_models/--max_out_tokens_judge, which only cap "
"generated tokens. This is useful on smaller GPUs to avoid OOM."
),
)
parser.add_argument(
"--chat_template",
type=str,
required=False,
default=None,
help="Jinja2 chat template string to use instead of the model's tokenizer template. "
"If not provided, ChatML is used as fallback for models without a chat template.",
)
parser.add_argument(
"--engine_kwargs",
type=str,
required=False,
default="{}",
help=(
"JSON dict of engine-specific kwargs forwarded to the underlying engine. "
'Example for vLLM: \'{"tensor_parallel_size": 2, "gpu_memory_utilization": 0.9}\'.'
),
)
nice :)

class CliArgs(BaseCliArgs):
"""CLI arguments for the generate-and-evaluate entrypoint."""

dataset: str = ""
same here, better have str | None = None

@ErlisLushtaku (Collaborator) left a comment
LGTM, just a couple of comments to make sure we don't break or duplicate anything after getting the mt-bench changes from main

help=(
"Model comparison order mode. 'fixed': always use model order A-B. "
"'both': correct for model order bias by evaluating each instruction "
"twice, once as A-B and once as B-A, and average. This helps account "
nit: We do not actually average (we concatenate), could you please update the description to not mention "averaging" same as here

I had added a config file as part of the mt-bench PR: https://github.com/OpenEuroLLM/JudgeArena/pull/21/changes#diff-ee48e912ce7c7303506e9f981cc6fb9ade571a043a6acbc6fa77de513b9248fd

Could you please merge main and remove that so we don't duplicate it

Comment on lines +72 to +84
def truncate(s: str, max_len: int | None = None) -> str:
"""Truncate a string to *max_len* characters.

Non-string inputs (e.g. ``None`` or ``float('nan')``) are coerced to the
empty string so that callers don't have to guard against missing data.
"""
if not isinstance(s, str):
return ""
if max_len is not None:
return s[:max_len]
return s


Also we added an additional function safe_text after this here, make sure we don't lose that after merging main, as it would break mt-bench

@kargibora (Author)
We can merge this branch after the mt-bench PR (or vice versa) and simply insert safe_text.

@kargibora (Author)
I have inserted safe_text, thanks for the heads up.

@kargibora force-pushed the refactor/unify-cli-config-dedup-truncate branch from 69c83dc to 9dc32b0 on April 7, 2026 08:35