feat: structured logging#37

Open
kargibora wants to merge 4 commits into main from feat/structured-logging

Conversation

@kargibora
Collaborator

Summary

Replace ad-hoc print() calls across the codebase with Python's logging module under a unified judgearena logger namespace. This gives users control over verbosity and enables persistent debug logs without code changes.

I think this change is necessary, and it is better to switch to the logging package now rather than later, since it is common practice and developer-friendly. My only concern is the log files: logs are hard to read directly from the console, so it is better to write them to rotated files for debugging and for keeping results.
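The rotation concern above could be addressed with the standard library's `RotatingFileHandler`. A minimal sketch, assuming the `judgearena` root logger from this PR; the file name, size limit, and backup count are illustrative, not part of the diff:

```python
import logging
from logging.handlers import RotatingFileHandler

# Attach a rotating DEBUG handler to the judgearena root logger.
# 5 MB per file with 3 backups is an assumed policy, not the PR's.
logger = logging.getLogger("judgearena")
handler = RotatingFileHandler("run.log", maxBytes=5_000_000, backupCount=3)
handler.setLevel(logging.DEBUG)
handler.setFormatter(
    logging.Formatter("%(asctime)s [%(levelname)s] %(name)s: %(message)s")
)
logger.addHandler(handler)
```

When `run.log` exceeds `maxBytes`, it is rolled over to `run.log.1`, `run.log.2`, and so on, so old runs stay readable without growing a single file forever.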

Changes

New: judgearena/log.py

  • get_logger(name) — returns a child logger under the judgearena namespace; auto-prefixes bare module names.
  • configure_logging(verbosity, log_file) — sets up console + optional file handlers. Supports JUDGEARENA_LOG_LEVEL env-var override.
  • attach_file_handler(path) — adds a DEBUG-level file handler (always captures full trace regardless of console verbosity).
  • make_run_log_path(folder) — generates a timestamped run-YYYYMMDD_HHMMSS.log path.
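The four helpers above could look roughly like this. Signatures are inferred from the summary; the actual `judgearena/log.py` may differ in details:

```python
import logging
import os
import sys
from datetime import datetime
from typing import Optional

NAMESPACE = "judgearena"  # assumed root namespace from the PR

def get_logger(name: str) -> logging.Logger:
    """Return a child logger, prefixing bare module names with the namespace."""
    if not name.startswith(NAMESPACE):
        name = f"{NAMESPACE}.{name}"
    return logging.getLogger(name)

def configure_logging(verbosity: int = 0, log_file: Optional[str] = None) -> None:
    """Console handler on stderr; JUDGEARENA_LOG_LEVEL overrides verbosity."""
    level_name = os.environ.get("JUDGEARENA_LOG_LEVEL")
    if level_name:
        level = getattr(logging, level_name.upper(), logging.INFO)
    else:
        level = {-1: logging.WARNING, 0: logging.INFO}.get(verbosity, logging.DEBUG)
    root = logging.getLogger(NAMESPACE)
    root.setLevel(logging.DEBUG)  # handlers filter; the logger itself does not
    console = logging.StreamHandler(sys.stderr)
    console.setLevel(level)
    root.addHandler(console)
    if log_file:
        attach_file_handler(log_file)

def attach_file_handler(path: str) -> None:
    """File handler always captures DEBUG, regardless of console verbosity."""
    fh = logging.FileHandler(path)
    fh.setLevel(logging.DEBUG)
    logging.getLogger(NAMESPACE).addHandler(fh)

def make_run_log_path(folder: str) -> str:
    """Timestamped run-YYYYMMDD_HHMMSS.log inside *folder*."""
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    return os.path.join(folder, f"run-{stamp}.log")
```

Keeping the logger at DEBUG and filtering per-handler is what lets the file handler capture the full trace while the console stays at INFO.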

New CLI flags (judgearena/cli_common.py)

| Flag | Effect |
| --- | --- |
| `-v` / `--verbose` | Set console to DEBUG |
| `-q` / `--quiet` | Suppress everything below WARNING |
| `--log-file PATH` | Explicit log file location |
| `--no-log-file` | Disable automatic `run-*.log` in the result folder |

Added resolve_verbosity(args) helper — -q takes precedence over -v.
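The flag wiring might be sketched as follows. The flag names and the `-q`-over-`-v` precedence come from this PR; the registration helper `add_logging_args` and the function bodies are assumptions:

```python
import argparse

def add_logging_args(parser: argparse.ArgumentParser) -> None:
    # Hypothetical helper name; flags match the table above.
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="set console to DEBUG")
    parser.add_argument("-q", "--quiet", action="store_true",
                        help="suppress everything below WARNING")
    parser.add_argument("--log-file", metavar="PATH",
                        help="explicit log file location")
    parser.add_argument("--no-log-file", action="store_true",
                        help="disable the automatic run-*.log")

def resolve_verbosity(args: argparse.Namespace) -> int:
    """-q takes precedence over -v; 0 is the unchanged default."""
    if args.quiet:
        return -1
    return 1 if args.verbose else 0
```

For example, `resolve_verbosity` on `-q -v` returns the quiet level, matching the stated precedence.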

Codebase migration

Replaced print() with logger.info() / logger.debug() in:

  • evaluate.py, generate_and_evaluate.py, estimate_elo_ratings.py
  • arenas_utils.py, eval_utils.py, utils.py
  • instruction_dataset/__init__.py, mt_bench/mt_bench_utils.py
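A typical migration in the files above might look like this; the snippet is illustrative (the `report_loaded` wrapper is hypothetical), showing the old `print()` style as a comment:

```python
# Before (old style):
#     print(f"Loaded {n} instructions for {dataset}.")
import logging

# Module-level logger under the judgearena namespace.
logger = logging.getLogger("judgearena.instruction_dataset")

def report_loaded(n: int, dataset: str) -> None:
    # Lazy %-formatting defers string building until the record is emitted.
    logger.info("Loaded %d instructions for %s.", n, dataset)
```

Using `%`-style arguments rather than f-strings means the message is only formatted when the level is actually enabled.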

Tests (tests/test_logging.py)

Behaviour

Default behaviour (-v 0) matches existing output — INFO messages print to stderr just like the old print() calls. No visible change for users who don't pass new flags.

Example Log

2026-04-14 10:00:37 [INFO] judgearena.__main__: Using dataset alpaca-eval and evaluating models VLLM/Qwen/Qwen2.5-0.5B-Instruct and VLLM/Qwen/Qwen2.5-1.5B-Instruct.
2026-04-14 10:00:37 [INFO] judgearena.instruction_dataset: Loaded 805 instructions for alpaca-eval.
2026-04-14 10:00:37 [INFO] judgearena.__main__: Generating completions for dataset alpaca-eval with model VLLM/Qwen/Qwen2.5-0.5B-Instruct and VLLM/Qwen/Qwen2.5-1.5B-Instruct (or loading them directly if present)
2026-04-14 10:00:40 [INFO] judgearena.utils: Loading cache /leonardo_work/OELLM_prod2026/users/bkargi00/openjury-eval-data/cache/alpaca-eval_VLLM/Qwen/Qwen2.5-0.5B-Instruct_25.csv.zip
2026-04-14 10:00:43 [INFO] judgearena.utils: Loading cache /leonardo_work/OELLM_prod2026/users/bkargi00/openjury-eval-data/cache/alpaca-eval_VLLM/Qwen/Qwen2.5-1.5B-Instruct_25.csv.zip
2026-04-14 10:00:43 [DEBUG] judgearena.__main__: First instruction/context: "I am trying to win over a new client for my writing services and skinny brown dog media to as as a ghost writer for their book Unbreakable Confidence. Can you help me write a persuasive proposal that highlights the benefits and value of having a editor/publisher"
2026-04-14 10:00:43 [DEBUG] judgearena.__main__: First completion of VLLM/Qwen/Qwen2.5-0.5B-Instruct:

