Unify CLI configuration & deduplicate truncate #31
Conversation
…removing duplication across entrypoints
geoalgo
left a comment
This is a good PR, thanks @kargibora! Just a few small comments.
Also @ErlisLushtaku can you let us know if this PR will impact your workflow?
If not, happy to merge after my small comments are addressed.
judgearena/estimate_elo_ratings.py
```python
arena: str = ""
model: str = ""
```

Suggested change:

```diff
-arena: str = ""
-model: str = ""
+arena: str | None = None
+model: str | None = None
```
```python
model: str
judge_model: str
n_instructions: int | None = None

class CliEloArgs(BaseCliArgs):
```
Note: we pay the price of inheriting a dataclass by requiring all members below to have defaults, even when they do not make sense (we cannot leave `arena` uninitialized).
This is OK, but we may want to avoid dataclasses in the future if this becomes too messy.
Could you leave a comment indicating this?
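The constraint being described can be seen in a minimal standalone sketch (class and field names here are illustrative, not the PR's actual code): once a base dataclass defines a field with a default, every field a subclass adds must also carry a default, or `dataclasses` raises a `TypeError` at class-definition time.

```python
from dataclasses import dataclass

@dataclass
class Base:
    swap_mode: str = "fixed"   # base field with a default

# Adding a field WITHOUT a default in a subclass fails at class-creation
# time, because it would follow the base's defaulted field:
try:
    @dataclass
    class BadChild(Base):
        arena: str             # no default -> TypeError
except TypeError as exc:
    print(exc)                 # e.g. "non-default argument 'arena' follows default argument"

# So every subclass field must carry a default, even a meaningless one:
@dataclass
class Child(Base):
    arena: str = ""

print(Child().arena)
```

This is why `arena: str = ""` (or `arena: str | None = None`) is forced on the subclass even though an uninitialized `arena` makes no sense.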
```diff
@@ -137,79 +77,19 @@ def parse_args(cls):
         help="Model name to anchor at 1000 ELO. All other ratings are expressed relative to this model. "
         "Must be one of the models present in the arena battles. If not set, ratings are not anchored.",
     )
-    parser.add_argument(
-        "--truncate_all_input_chars",
-        type=int,
-        required=False,
-        default=8192,
-        help="Character-level truncation applied before tokenization: truncates each instruction "
-        "before model A/B generation and truncates each completion before judge evaluation.",
-    )
-    parser.add_argument(
-        "--max_out_tokens_models",
-        type=int,
-        required=False,
-        default=32768,
-        help=(
-            "Generation token budget for each model A/B response. For VLLM, keep this <= "
-            "--max_model_len (if provided)."
-        ),
-    )
-    parser.add_argument(
-        "--max_out_tokens_judge",
-        type=int,
-        required=False,
-        default=32768,
-        help=(
-            "Generation token budget for the judge response (reasoning + scores). For "
-            "VLLM, keep this <= --max_model_len (if provided)."
-        ),
-    )
-    parser.add_argument(
-        "--max_model_len",
-        type=int,
-        required=False,
-        default=None,
-        help=(
-            "Optional total context window for VLLM models (prompt + generation). This is "
-            "independent from --max_out_tokens_models/--max_out_tokens_judge, which only cap "
-            "generated tokens. This is useful on smaller GPUs to avoid OOM."
-        ),
-    )
-    parser.add_argument(
-        "--chat_template",
-        type=str,
-        required=False,
-        default=None,
-        help="Jinja2 chat template string to use instead of the model's tokenizer template. "
-        "If not provided, ChatML is used as fallback for models without a chat template.",
-    )
-    parser.add_argument(
-        "--engine_kwargs",
-        type=str,
-        required=False,
-        default="{}",
-        help=(
-            "JSON dict of engine-specific kwargs forwarded to the underlying engine. "
-            'Example for vLLM: \'{"tensor_parallel_size": 2, "gpu_memory_utilization": 0.9}\'.'
-        ),
-    )
```
judgearena/generate_and_evaluate.py
```python
class CliArgs(BaseCliArgs):
    """CLI arguments for the generate-and-evaluate entrypoint."""

    dataset: str = ""
```
Same here: better to have `str | None = None`.
ErlisLushtaku
left a comment
LGTM, just a couple of comments to make sure we don't break or duplicate anything after getting the mt-bench changes from main
judgearena/cli_common.py
```python
help=(
    "Model comparison order mode. 'fixed': always use model order A-B. "
    "'both': correct for model order bias by evaluating each instruction "
    "twice, once as A-B and once as B-A, and average. This helps account "
```
nit: We do not actually average (we concatenate); could you please update the description to not mention "averaging", same as here?
I had added a config file as part of the mt-bench PR: https://github.com/OpenEuroLLM/JudgeArena/pull/21/changes#diff-ee48e912ce7c7303506e9f981cc6fb9ade571a043a6acbc6fa77de513b9248fd
Could you please merge main and remove that one so we don't duplicate it?
```python
def truncate(s: str, max_len: int | None = None) -> str:
    """Truncate a string to *max_len* characters.

    Non-string inputs (e.g. ``None`` or ``float('nan')``) are coerced to the
    empty string so that callers don't have to guard against missing data.
    """
    if not isinstance(s, str):
        return ""
    if max_len is not None:
        return s[:max_len]
    return s
```
Also, we added an additional function `safe_text` after this here; make sure we don't lose it after merging main, as that would break mt-bench.
We can merge this branch after the mt-bench one, or vice versa, and simply re-insert `safe_text`.
I have inserted `safe_text`, thanks for the heads up!
Force-pushed from 69c83dc to 9dc32b0.
Problem
The codebase had two kinds of duplication that made maintenance error-prone:

1. **Two independent `truncate` implementations.** One was a top-level function in `generate.py`, the other nested inside `annotate_battles()` in `evaluate.py`. The two behaved differently: the nested version handled non-string inputs gracefully while the top-level one did not. A bug fix in one would never reach the other.
2. **12 identical CLI flags copy-pasted across entrypoints.** `generate_and_evaluate.py` and `estimate_elo_ratings.py` each defined the same `--engine`, `--judge`, `--swap_mode`, `--max_new_tokens`, etc. with their own argparse blocks and dataclass fields. Any default change or new shared flag required editing both files in lock-step, with no compiler or test to catch drift.

Additionally, `swap_mode` accepted arbitrary strings with no validation: a typo like `"boht"` would silently propagate through the pipeline and produce unexpected results.

Solution
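The construction-time validation and safe engine-kwargs parsing this PR adds can be sketched roughly as follows (a minimal sketch: only two of the 12 shared fields are shown, and the exact field names and signatures are assumptions, not the PR's actual code):

```python
import json
from dataclasses import dataclass

@dataclass
class BaseCliArgs:
    # Two illustrative shared fields; the real class carries all 12.
    engine: str = "vllm"
    swap_mode: str = "fixed"

    def __post_init__(self) -> None:
        # Reject typos like "boht" at construction time instead of letting
        # them silently propagate through the pipeline.
        if self.swap_mode not in {"fixed", "both"}:
            raise ValueError(
                f"swap_mode must be 'fixed' or 'both', got {self.swap_mode!r}"
            )

def parse_engine_kwargs(raw: str) -> dict:
    """Parse the --engine_kwargs JSON string, failing loudly on bad input."""
    parsed = json.loads(raw)
    if not isinstance(parsed, dict):
        raise ValueError("--engine_kwargs must be a JSON object")
    return parsed

print(BaseCliArgs(swap_mode="both").swap_mode)            # both
print(parse_engine_kwargs('{"tensor_parallel_size": 2}')) # {'tensor_parallel_size': 2}
```

With this shape, `BaseCliArgs(swap_mode="boht")` raises immediately rather than producing unexpected results downstream.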
- `judgearena/cli_common.py` introduces `BaseCliArgs` (a dataclass with the 12 shared fields), `add_common_arguments()` (a single argparse registration point), and `parse_engine_kwargs()` (safe JSON parsing). `BaseCliArgs.__post_init__` validates `swap_mode ∈ {"fixed", "both"}` at construction time.
- `judgearena/utils.py` now owns the single canonical `truncate(s, max_len)`, using the safer non-string guard from `evaluate.py`.
generate_and_evaluate.py,estimate_elo_ratings.py) inherit fromBaseCliArgsand declare only their unique fields, cutting ~200 lines of duplicated argument definitions.Files changed
| File | Changes |
|---|---|
| `judgearena/cli_common.py` | `BaseCliArgs`, `add_common_arguments()`, `parse_engine_kwargs()` |
| `judgearena/utils.py` | `truncate()` |
| `judgearena/generate.py` | `truncate()`; imports from `utils` |
| `judgearena/evaluate.py` | `truncate()`; imports from `utils` |
| `judgearena/generate_and_evaluate.py` | `CliArgs` → inherits `BaseCliArgs` (4 unique fields remain) |
| `judgearena/estimate_elo_ratings.py` | `CliEloArgs` → inherits `BaseCliArgs` (7 unique fields remain); removed unused `json`/`field` imports |

Testing
The two warnings are pre-existing deprecations in `transformers` and `langchain`, unrelated to this change.