Add directory-style agent evaluation workflow and helper scripts by wzzll123 · Pull Request #16 · wzzll123/MultiKernelBench

wzzll123 · 2026-03-23T08:10:50Z

Provide a minimal, directory-based workflow for running code-generation agents against per-op tasks and evaluating outputs with the existing MultiKernelBench evaluator.
Make it simple to prepare isolated task workspaces that contain only reference.py and agent_prompt.txt so agents write a single final_response.txt.
Automate launching agents over many tasks and collecting structured evaluation summaries to speed up large-scale agent experiments.

Add scripts/agent_eval/prepare_workspaces.py which creates per-op task directories with reference.py, agent_prompt.txt, and a manifest.json, and supports --readonly to mark files read-only.
Add scripts/agent_eval/launch_agent.py, which runs an agent command template across task directories, supports the placeholders {prompt_file}, {prompt}, and {workdir} plus optional --auto-yolo behaviour, and writes agent_run.json per task along with an overall agent_launch_summary.json.
Add scripts/agent_eval/collect_results.py which invokes eval_single_runner.py on each agent-produced final_response.txt, writes eval_result.json per task, and emits aggregated JSON/CSV summaries.
Add scripts/agent_eval/README.md and update top-level README.md with instructions and example invocations showing the three-step flow (prepare_workspaces.py, launch_agent.py, collect_results.py).
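
To illustrate the first two steps, here is a minimal sketch of how a task workspace might be prepared and how a command template with the {prompt_file}, {prompt}, and {workdir} placeholders could be expanded. It is not the code in this PR: the function names `prepare_task_dir` and `render_command`, the interpretation of --readonly as clearing write bits, and the choice to split the template into arguments before substituting are all assumptions made for illustration.

```python
import re
import shlex
import stat
from pathlib import Path

# Placeholder names taken from the PR description.
_PLACEHOLDER = re.compile(r"\{(prompt_file|prompt|workdir)\}")


def prepare_task_dir(root: Path, op_name: str, reference_src: str,
                     prompt: str, readonly: bool = False) -> Path:
    """Create one isolated task directory holding only the two input files."""
    task_dir = root / op_name
    task_dir.mkdir(parents=True, exist_ok=True)
    files = (("reference.py", reference_src), ("agent_prompt.txt", prompt))
    for name, text in files:
        path = task_dir / name
        path.write_text(text, encoding="utf-8")
        if readonly:
            # Assumed meaning of --readonly: drop write permission on inputs.
            path.chmod(stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)
    return task_dir


def render_command(template: str, task_dir: Path) -> list[str]:
    """Expand the placeholders in a command template into an argv list."""
    prompt_file = task_dir / "agent_prompt.txt"
    values = {
        "prompt_file": str(prompt_file),
        "prompt": prompt_file.read_text(encoding="utf-8"),
        "workdir": str(task_dir),
    }
    # Split the template first and substitute per token, so a prompt that
    # contains spaces or quotes still ends up as a single argument.
    # A single regex pass also avoids re-expanding placeholder-like text
    # that happens to appear inside the prompt itself.
    return [
        _PLACEHOLDER.sub(lambda m: values[m.group(1)], token)
        for token in shlex.split(template)
    ]
```

A launcher built this way could pass the resulting list to `subprocess.run(argv, cwd=task_dir)` without involving a shell. For example, with a hypothetical agent binary `my-agent`, the template `my-agent --cwd {workdir} -p {prompt}` would expand to five arguments, the last being the full prompt text.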

Generate agent prompts dynamically from prompt_generators

09c9438

wzzll123 added the codex label Mar 23, 2026 — with ChatGPT Codex Connector

Clarify max-tasks purpose in CLI help and docs

4dcab22
