feat: structured logging#37

Open
kargibora wants to merge 4 commits into main from feat/structured-logging

Conversation

@kargibora
Collaborator

Summary

Replace ad-hoc print() calls across the codebase with Python's logging module under a unified judgearena logger namespace. This gives users control over verbosity and enables persistent debug logs without code changes.

I think this change is necessary, and it is better to switch to the logging package now rather than later, since it is common practice and developer-friendly. My only concern is the log files: logs are hard to read directly from the console, so it is better to write them to rotated files for debugging and for keeping results.
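The rotation concern above could be addressed with the standard library's `RotatingFileHandler`. A minimal sketch, assuming the `judgearena` root logger from this PR; the file name, size limit, and backup count are illustrative, not part of the diff:

```python
import logging
from logging.handlers import RotatingFileHandler

# Attach a rotating DEBUG handler to the judgearena root logger.
# 5 MB per file with 3 backups is an assumed policy, not the PR's.
logger = logging.getLogger("judgearena")
handler = RotatingFileHandler("run.log", maxBytes=5_000_000, backupCount=3)
handler.setLevel(logging.DEBUG)
handler.setFormatter(
    logging.Formatter("%(asctime)s [%(levelname)s] %(name)s: %(message)s")
)
logger.addHandler(handler)
```

When `run.log` exceeds `maxBytes`, it is rolled over to `run.log.1`, `run.log.2`, and so on, so old runs stay readable without growing a single file forever.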

Changes

New: judgearena/log.py

  • get_logger(name) — returns a child logger under the judgearena namespace; auto-prefixes bare module names.
  • configure_logging(verbosity, log_file) — sets up console + optional file handlers. Supports JUDGEARENA_LOG_LEVEL env-var override.
  • attach_file_handler(path) — adds a DEBUG-level file handler (always captures full trace regardless of console verbosity).
  • make_run_log_path(folder) — generates a timestamped run-YYYYMMDD_HHMMSS.log path.
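The four helpers above could look roughly like this. Signatures are inferred from the summary; the actual `judgearena/log.py` may differ in details:

```python
import logging
import os
import sys
from datetime import datetime
from typing import Optional

NAMESPACE = "judgearena"  # assumed root namespace from the PR

def get_logger(name: str) -> logging.Logger:
    """Return a child logger, prefixing bare module names with the namespace."""
    if not name.startswith(NAMESPACE):
        name = f"{NAMESPACE}.{name}"
    return logging.getLogger(name)

def configure_logging(verbosity: int = 0, log_file: Optional[str] = None) -> None:
    """Console handler on stderr; JUDGEARENA_LOG_LEVEL overrides verbosity."""
    level_name = os.environ.get("JUDGEARENA_LOG_LEVEL")
    if level_name:
        level = getattr(logging, level_name.upper(), logging.INFO)
    else:
        level = {-1: logging.WARNING, 0: logging.INFO}.get(verbosity, logging.DEBUG)
    root = logging.getLogger(NAMESPACE)
    root.setLevel(logging.DEBUG)  # handlers filter; the logger itself does not
    console = logging.StreamHandler(sys.stderr)
    console.setLevel(level)
    root.addHandler(console)
    if log_file:
        attach_file_handler(log_file)

def attach_file_handler(path: str) -> None:
    """File handler always captures DEBUG, regardless of console verbosity."""
    fh = logging.FileHandler(path)
    fh.setLevel(logging.DEBUG)
    logging.getLogger(NAMESPACE).addHandler(fh)

def make_run_log_path(folder: str) -> str:
    """Timestamped run-YYYYMMDD_HHMMSS.log inside *folder*."""
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    return os.path.join(folder, f"run-{stamp}.log")
```

Keeping the logger at DEBUG and filtering per-handler is what lets the file handler capture the full trace while the console stays at INFO.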

New CLI flags (judgearena/cli_common.py)

| Flag | Effect |
| --- | --- |
| `-v` / `--verbose` | Set console to DEBUG |
| `-q` / `--quiet` | Suppress everything below WARNING |
| `--log-file PATH` | Explicit log file location |
| `--no-log-file` | Disable automatic `run-*.log` in the result folder |

Added resolve_verbosity(args) helper — -q takes precedence over -v.
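The flag wiring might be sketched as follows. The flag names and the `-q`-over-`-v` precedence come from this PR; the registration helper `add_logging_args` and the function bodies are assumptions:

```python
import argparse

def add_logging_args(parser: argparse.ArgumentParser) -> None:
    # Hypothetical helper name; flags match the table above.
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="set console to DEBUG")
    parser.add_argument("-q", "--quiet", action="store_true",
                        help="suppress everything below WARNING")
    parser.add_argument("--log-file", metavar="PATH",
                        help="explicit log file location")
    parser.add_argument("--no-log-file", action="store_true",
                        help="disable the automatic run-*.log")

def resolve_verbosity(args: argparse.Namespace) -> int:
    """-q takes precedence over -v; 0 is the unchanged default."""
    if args.quiet:
        return -1
    return 1 if args.verbose else 0
```

For example, `resolve_verbosity` on `-q -v` returns the quiet level, matching the stated precedence.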

Codebase migration

Replaced print() with logger.info() / logger.debug() in:

  • evaluate.py, generate_and_evaluate.py, estimate_elo_ratings.py
  • arenas_utils.py, eval_utils.py, utils.py
  • instruction_dataset/__init__.py, mt_bench/mt_bench_utils.py
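A typical migration in the files above might look like this; the snippet is illustrative (the `report_loaded` wrapper is hypothetical), showing the old `print()` style as a comment:

```python
# Before (old style):
#     print(f"Loaded {n} instructions for {dataset}.")
import logging

# Module-level logger under the judgearena namespace.
logger = logging.getLogger("judgearena.instruction_dataset")

def report_loaded(n: int, dataset: str) -> None:
    # Lazy %-formatting defers string building until the record is emitted.
    logger.info("Loaded %d instructions for %s.", n, dataset)
```

Using `%`-style arguments rather than f-strings means the message is only formatted when the level is actually enabled.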

Tests (tests/test_logging.py)

Behaviour

Default behaviour (-v 0) matches existing output — INFO messages print to stderr just like the old print() calls. No visible change for users who don't pass new flags.

Example Log

2026-04-14 10:00:37 [INFO] judgearena.__main__: Using dataset alpaca-eval and evaluating models VLLM/Qwen/Qwen2.5-0.5B-Instruct and VLLM/Qwen/Qwen2.5-1.5B-Instruct.
2026-04-14 10:00:37 [INFO] judgearena.instruction_dataset: Loaded 805 instructions for alpaca-eval.
2026-04-14 10:00:37 [INFO] judgearena.__main__: Generating completions for dataset alpaca-eval with model VLLM/Qwen/Qwen2.5-0.5B-Instruct and VLLM/Qwen/Qwen2.5-1.5B-Instruct (or loading them directly if present)
2026-04-14 10:00:40 [INFO] judgearena.utils: Loading cache /leonardo_work/OELLM_prod2026/users/bkargi00/openjury-eval-data/cache/alpaca-eval_VLLM/Qwen/Qwen2.5-0.5B-Instruct_25.csv.zip
2026-04-14 10:00:43 [INFO] judgearena.utils: Loading cache /leonardo_work/OELLM_prod2026/users/bkargi00/openjury-eval-data/cache/alpaca-eval_VLLM/Qwen/Qwen2.5-1.5B-Instruct_25.csv.zip
2026-04-14 10:00:43 [DEBUG] judgearena.__main__: First instruction/context: "I am trying to win over a new client for my writing services and skinny brown dog media to as as a ghost writer for their book Unbreakable Confidence. Can you help me write a persuasive proposal that highlights the benefits and value of having a editor/publisher"
2026-04-14 10:00:43 [DEBUG] judgearena.__main__: First completion of VLLM/Qwen/Qwen2.5-0.5B-Instruct:

