
Add directory-style agent evaluation workflow and helper scripts #16

Open
wzzll123 wants to merge 2 commits into main from codex/add-agent-evaluation-metrics-vdte8r

Conversation

@wzzll123
Owner

Motivation

  • Provide a minimal, directory-based workflow for running code-generation agents against per-op tasks and evaluating outputs with the existing MultiKernelBench evaluator.
  • Make it simple to prepare isolated task workspaces that contain only reference.py and agent_prompt.txt, so that each agent writes a single final_response.txt.
  • Automate launching agents over many tasks and collecting structured evaluation summaries to speed up large-scale agent experiments.

Description

  • Add scripts/agent_eval/prepare_workspaces.py which creates per-op task directories with reference.py, agent_prompt.txt, and a manifest.json, and supports --readonly to mark files read-only.
  • Add scripts/agent_eval/launch_agent.py which runs an agent command template across task directories, supports placeholders ({prompt_file}, {prompt}, {workdir}), optional --auto-yolo behaviour, writes agent_run.json per task, and an overall agent_launch_summary.json.
  • Add scripts/agent_eval/collect_results.py which invokes eval_single_runner.py on each agent-produced final_response.txt, writes eval_result.json per task, and emits aggregated JSON/CSV summaries.
  • Add scripts/agent_eval/README.md and update top-level README.md with instructions and example invocations showing the three-step flow (prepare_workspaces.py, launch_agent.py, collect_results.py).
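The command-template substitution that launch_agent.py performs per task directory can be sketched roughly as follows. This is a minimal illustration only, assuming the placeholders listed above; the function name `render_agent_command` and the exact file layout (`agent_prompt.txt` inside each task directory) are assumptions for illustration, not the PR's actual code:

```python
from pathlib import Path


def render_agent_command(template: str, workdir: Path) -> str:
    """Fill launch_agent.py-style placeholders for one task directory.

    {prompt_file} -> path to the task's agent_prompt.txt
    {prompt}      -> the prompt text itself
    {workdir}     -> the task directory the agent should run in
    """
    prompt_file = workdir / "agent_prompt.txt"
    return template.format(
        prompt_file=prompt_file,
        prompt=prompt_file.read_text(),
        workdir=workdir,
    )
```

With a template such as `mycli --prompt-file {prompt_file} --cwd {workdir}`, each task directory yields one concrete shell command, which the launcher can then run and record in that task's agent_run.json.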

Testing

  • No automated tests were added or run as part of this change.

Codex Task

