
Refactor/consolidate eval pipelines #33

Open

kargibora wants to merge 4 commits into refactor/unify-cli-config-dedup-truncate from
refactor/consolidate-eval-pipelines

Conversation

@kargibora (Collaborator)

Consolidate evaluation pipelines

Summary

Both evaluate.py (evaluate_completions) and generate_and_evaluate.py (main)
independently build annotation DataFrames, compute preference summaries, serialize
results to JSON/CSV, and write run metadata — with slightly different code each time.
This PR extracts that shared logic into reusable helpers so both pipelines go through
the same path.

Depends on: #31

What changed

| New / changed | Description |
| --- | --- |
| `EvaluationResult` dataclass | Pure-data container holding `annotations_df`, `prefs`, `summary`, and `run_config`. No I/O. |
| `build_annotation_dataframe()` | Builds a single DataFrame from forward (and optionally reversed) annotations, replacing the duplicated concat-and-label logic in both pipelines. |
| `build_evaluation_result()` | Computes the preference summary and packages everything into an `EvaluationResult`. Pure logic — no disk access. |
| `save_evaluation_result()` | Writes an `EvaluationResult` to disk (CSV + JSON + run metadata). Single place for all serialization. |
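For reviewers, a minimal sketch of what the pure-logic pieces could look like. The field names come from this PR's description; everything else (the win-rate summary, exact signatures) is an illustrative assumption, not the merged implementation:

```python
# Illustrative sketch only; exact signatures in judgearena may differ.
from dataclasses import dataclass

import pandas as pd


@dataclass
class EvaluationResult:
    """Pure-data container for an evaluation run's outputs. No I/O."""

    annotations_df: pd.DataFrame  # one row per judged comparison
    prefs: list                   # per-example preference labels
    summary: dict                 # aggregate preference statistics
    run_config: dict              # settings the run was launched with


def build_evaluation_result(annotations_df, prefs, run_config):
    """Compute the preference summary and package everything.

    Pure logic — nothing here touches disk; serialization lives in
    save_evaluation_result().
    """
    total = len(prefs) or 1
    # Simple share-of-votes summary (assumption; the real compute_pref_summary
    # may report more than this).
    summary = {label: prefs.count(label) / total for label in set(prefs)}
    return EvaluationResult(annotations_df, list(prefs), summary, dict(run_config))
```

Keeping `build_evaluation_result()` free of I/O is what lets both pipelines share it: each caller decides where (and whether) to persist the result.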

Files touched (relative to PR 1)

  • `judgearena/evaluate.py` — Added the four helpers above; refactored `evaluate_completions` to use them (+220 / −6)
  • `judgearena/generate_and_evaluate.py` — Replaced ~60 lines of inline DataFrame building, summary computation, JSON writing, and metadata writing with calls to the new helpers (+39 / −55)
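The concat-and-label logic that both files previously duplicated could be centralized along these lines. This is a sketch: the `orientation` column name and the keyword shape are assumptions beyond what the PR text states:

```python
# Illustrative sketch of build_annotation_dataframe; column and parameter
# names beyond the PR description are assumptions.
import pandas as pd


def build_annotation_dataframe(forward, reversed_annotations=None):
    """Build a single DataFrame from forward (and optionally reversed)
    annotations, replacing the duplicated concat-and-label logic."""
    frames = [pd.DataFrame(forward).assign(orientation="forward")]
    if reversed_annotations is not None:  # e.g. when swap_mode == "both"
        frames.append(pd.DataFrame(reversed_annotations).assign(orientation="reversed"))
    return pd.concat(frames, ignore_index=True)
```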

Before → After

**Before:** each pipeline had its own copy of:

```python
df = pd.DataFrame(annotations)
df["instruction_index"] = ...
df["model_A"] = ...
# ... swap_mode == "both" handling ...
df.to_csv(...)
summary = compute_pref_summary(prefs)
json.dump(...)
try:
    write_run_metadata(...)
except OSError: ...
```

**After:**
```python
annotations_df = build_annotation_dataframe(...)
eval_result    = build_evaluation_result(annotations_df=..., prefs=..., run_config=...)
save_evaluation_result(eval_result, output_dir=..., ...)
```

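One plausible shape for the serialization half. The output file names and the non-fatal metadata write are assumptions modeled on the "Before" snippet above, not the PR's exact code:

```python
# Hypothetical sketch of save_evaluation_result; file names are assumptions.
import json
from pathlib import Path


def save_evaluation_result(result, output_dir):
    """Single place for all serialization: CSV + JSON + run metadata."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    result.annotations_df.to_csv(out / "annotations.csv", index=False)
    (out / "summary.json").write_text(json.dumps(result.summary, indent=2))
    try:
        (out / "run_metadata.json").write_text(json.dumps(result.run_config, indent=2))
    except OSError:
        # Mirrors the old pipelines: a failed metadata write is non-fatal.
        pass
```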
A review comment on the new docstring (the quoted snippet is truncated in the page capture):

```python
class EvaluationResult:
    """Pure-data container for an evaluation run's outputs.

    Holds the computed summary, preferences, annotation DataFrame, and the
```

**@kargibora** (Collaborator): nit: this explanation might be redundant
