Refactor/consolidate eval pipelines by kargibora · Pull Request #33 · OpenEuroLLM/JudgeArena

kargibora · 2026-04-07T09:05:45Z

Consolidate evaluation pipelines

Summary

Both evaluate.py (evaluate_completions) and generate_and_evaluate.py (main)
independently build annotation DataFrames, compute preference summaries, serialize
results to JSON/CSV, and write run metadata — with slightly different code each time.
This PR extracts that shared logic into reusable helpers so both pipelines go through
the same path.

Depends on: #31

What changed

| New / changed | Description |
| --- | --- |
| `EvaluationResult` dataclass | Pure-data container holding `annotations_df`, `prefs`, `summary`, and `run_config`. No I/O. |
| `build_annotation_dataframe()` | Builds a single DataFrame from forward (and optionally reversed) annotations, replacing duplicated concat-and-label logic in both pipelines. |
| `build_evaluation_result()` | Computes the preference summary and packages everything into an `EvaluationResult`. Pure logic, no disk access. |
| `save_evaluation_result()` | Writes an `EvaluationResult` to disk (CSV + JSON + run metadata). Single place for all serialization. |
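The split between pure construction and disk I/O described above can be sketched roughly as follows. This is a minimal, standard-library-only illustration and not the code from this PR: a list of row dicts stands in for the pandas DataFrame, `summarize_prefs` is a hypothetical stand-in for `compute_pref_summary`, the preference encoding (below 0.5 means A wins, above 0.5 means B wins) and the output file names are assumptions, and only the keyword names `annotations_df`, `prefs`, `run_config`, and `output_dir` come from the PR description.

```python
import csv
import json
from dataclasses import dataclass
from pathlib import Path
from typing import Any


@dataclass(frozen=True)
class EvaluationResult:
    """Pure-data container: holds results, performs no I/O."""

    annotations_df: list[dict[str, Any]]  # assumption: row dicts stand in for a DataFrame
    prefs: list[float]
    summary: dict[str, Any]
    run_config: dict[str, Any]


def summarize_prefs(prefs: list[float]) -> dict[str, Any]:
    """Hypothetical stand-in for compute_pref_summary (encoding is assumed)."""
    wins_a = sum(p < 0.5 for p in prefs)
    wins_b = sum(p > 0.5 for p in prefs)
    ties = len(prefs) - wins_a - wins_b
    return {"num_battles": len(prefs), "wins_A": wins_a, "wins_B": wins_b, "ties": ties}


def build_evaluation_result(annotations_df, prefs, run_config) -> EvaluationResult:
    """Pure logic: compute the summary and package everything, no disk access."""
    prefs = list(prefs)
    return EvaluationResult(
        annotations_df=list(annotations_df),
        prefs=prefs,
        summary=summarize_prefs(prefs),
        run_config=dict(run_config),
    )


def save_evaluation_result(result: EvaluationResult, output_dir) -> None:
    """Single place for serialization; the file names here are assumptions."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    if result.annotations_df:
        with open(out / "annotations.csv", "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=list(result.annotations_df[0].keys()))
            writer.writeheader()
            writer.writerows(result.annotations_df)
    (out / "summary.json").write_text(json.dumps(result.summary, indent=2), encoding="utf-8")
    (out / "run_config.json").write_text(json.dumps(result.run_config, indent=2), encoding="utf-8")
```

Keeping `build_evaluation_result` free of disk access means the summary logic can be unit-tested without a temporary directory, while `save_evaluation_result` is the only function that needs one.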

Files touched (relative to PR 1)

judgearena/evaluate.py — Added the four helpers above; refactored evaluate_completions to use them (+220 / −6)
judgearena/generate_and_evaluate.py — Replaced ~60 lines of inline DataFrame building, summary computation, JSON writing, and metadata writing with calls to the new helpers (+39 / −55)
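The inline DataFrame building that both files previously duplicated is essentially a concat-and-label step. A rough sketch of that idea is shown below, under stated assumptions: `build_annotation_rows` is a hypothetical, standard-library stand-in for `build_annotation_dataframe()`, plain row dicts replace the DataFrame, the reversed pass is assumed to swap the `model_A` and `model_B` labels, and the `swapped` column is invented for illustration.

```python
from typing import Any, Optional


def build_annotation_rows(
    annotations: list[dict[str, Any]],
    instruction_indices: list[int],
    model_a: str,
    model_b: str,
    annotations_reversed: Optional[list[dict[str, Any]]] = None,
) -> list[dict[str, Any]]:
    """Label forward (and optionally reversed) annotations and concatenate them."""

    def label(items, first, second, swapped):
        # Attach the instruction index and model labels to each annotation.
        return [
            {**ann, "instruction_index": idx, "model_A": first, "model_B": second, "swapped": swapped}
            for idx, ann in zip(instruction_indices, items)
        ]

    rows = label(annotations, model_a, model_b, swapped=False)
    if annotations_reversed is not None:
        # Assumption: in the reversed pass the two models trade positions.
        rows += label(annotations_reversed, model_b, model_a, swapped=True)
    return rows
```

With one function owning this step, a change to the labelling (for example a new column) only has to be made once rather than in each pipeline.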

Before → After

**Before:** each pipeline had its own copy of:

```python
df = pd.DataFrame(annotations)
df["instruction_index"] = ...
df["model_A"] = ...
# ... swap_mode == "both" handling ...
df.to_csv(...)
summary = compute_pref_summary(prefs)
json.dump(...)
try:
    write_run_metadata(...)
except OSError: ...
```

**After:**

```python
annotations_df = build_annotation_dataframe(...)
eval_result = build_evaluation_result(annotations_df=..., prefs=..., run_config=...)
save_evaluation_result(eval_result, output_dir=...)  # further arguments elided
```

ErlisLushtaku · 2026-04-07T13:36:33Z

judgearena/evaluate.py

+class EvaluationResult:
+    """Pure-data container for an evaluation run's outputs.
+
+    Holds the computed summary, preferences, annotation DataFrame, and the


nit: this explanation might be redundant

kargibora added 4 commits April 2, 2026 15:32

Refactor evaluation logic

3e272ac

update inheritence and solve mt-bench merge problems

69c83dc

Merge branch 'refactor/unify-cli-config-dedup-truncate' into refactor/consolidate-eval-pipelines

6e119a8

Merge branch 'refactor/unify-cli-config-dedup-truncate' into refactor/consolidate-eval-pipelines

003adf0

ErlisLushtaku reviewed Apr 7, 2026

ErlisLushtaku approved these changes Apr 7, 2026

kargibora wants to merge 4 commits into refactor/unify-cli-config-dedup-truncate from refactor/consolidate-eval-pipelines
