Pull request overview
Adds an RH-Virt “monolit” evaluation task (Sonnet + Haiku instruction variants) to benchmark skill usage against a mocked OpenShift Virtualization MCP server, using deterministic pytest checks plus an LLM-as-judge pass/fail rubric.
Changes:
- Introduces a new SkillsBench-style task bundle for RH-Virt fleet operations (clone, migrate, provision, decommission) with instructions + verifier config.
- Adds deterministic pytest validations for required deliverables and key expected values.
- Adds an Anthropic-backed LLM judge to score skill-specific knowledge, plus a mock OpenShift Virtualization MCP server and supporting Dockerfiles.
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| rh-virt/evaluation/sonnet/monolit/tests/test.sh | Verifier entrypoint to install deps, run pytest + optional LLM judge, and emit reward/artifacts. |
| rh-virt/evaluation/sonnet/monolit/tests/test_outputs.py | Deterministic pytest suite validating required files and key outputs for all 4 parts. |
| rh-virt/evaluation/sonnet/monolit/tests/llm_judge.py | LLM-as-judge criteria evaluation and JSON scoring output. |
| rh-virt/evaluation/sonnet/monolit/task.toml | Task metadata + verifier/agent env configuration (including judge env vars). |
| rh-virt/evaluation/sonnet/monolit/instruction.md | Sonnet task instructions for the 4-part scenario. |
| rh-virt/evaluation/README.md | Overview of the evaluation approach, structure, and how to run Harbor sweeps. |
| rh-virt/evaluation/haiku/README.md | Documentation describing how Haiku variants differ from Sonnet. |
| rh-virt/evaluation/haiku/monolit/instruction.md | Haiku instruction variant with stronger “read skills/docs” reminders. |
| rh-virt/evaluation/common/mcp-servers/mock-virt-mcp.py | Mock MCP server implementing config/core/kubevirt tool interfaces with simulated cluster data. |
| rh-virt/evaluation/common/Dockerfile.no-skills | Container build for the “no-skills” evaluation variant. |
| rh-virt/evaluation/common/Dockerfile | Container build for the “with-skills” evaluation variant (copies skills/docs into standard locations). |
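As a rough illustration of the deterministic checks described for `test_outputs.py`, a required-file validation might look like the sketch below. This is not the code in the PR: the helper name `missing_deliverables` and the `REQUIRED_FILES` list are hypothetical, and `migration_assessment.md` is the only deliverable name visible in this conversation.

```python
from pathlib import Path

# Hypothetical list: only migration_assessment.md is named in this PR conversation.
REQUIRED_FILES = ["migration_assessment.md"]


def missing_deliverables(output_dir: Path, required=REQUIRED_FILES) -> list[str]:
    """Return the required files that are absent or empty in output_dir."""
    missing = []
    for name in required:
        path = output_dir / name
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing


def test_required_deliverables_exist(tmp_path):
    # tmp_path is pytest's built-in temporary directory fixture; a real
    # verifier would point at the agent's output directory instead.
    (tmp_path / "migration_assessment.md").write_text("# Assessment\n")
    assert missing_deliverables(tmp_path) == []
```

Returning the list of missing names, rather than asserting file by file, keeps the failure message readable when several deliverables are absent at once.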
dmartinol
left a comment
Thanks for your contribution!
Before merging I think we need to agree on a way to let the readers replicate the same evaluations.
Resolve test files relative to the script directory instead of expensive find / calls. Create /logs/verifier before writing. Remove unused pytest_exit variable.
Change default LLM_JUDGE_MODEL from claude-haiku-4-5 to claude-sonnet-4-5. Create /logs/verifier dir before writing results. Clarify docstring that LLM_JUDGE_MODEL is optional.
Explain that Haiku folders contain only instruction.md files and that Harbor/SkillsBench require a full self-contained task folder when running evaluations.
Clarify the actual directory layout, explain that Haiku reuses Sonnet tests, add step-by-step reproduction instructions, and include a summary of experiment results.
Remove per-section REMINDER blocks that name specific skills to avoid biasing the agent. Remove explicit MCP tool reference from Part 3 — the agent should discover tools implicitly. Keep general skill-awareness prompts at top and bottom only.
Rename CLONE_JSON to CLONED_VM_SPEC and CREATE_JSON to VM_CREATE_PARAMS to better reflect their contents.
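The judge-related commits above (an optional `LLM_JUDGE_MODEL` with a `claude-sonnet-4-5` default, and creating `/logs/verifier` before writing) could be sketched roughly as follows. The function names and the `llm_judge.json` filename are assumptions for illustration, not taken from `llm_judge.py`.

```python
import json
import os
from pathlib import Path

DEFAULT_JUDGE_MODEL = "claude-sonnet-4-5"


def judge_model() -> str:
    """LLM_JUDGE_MODEL is optional; fall back to the default when unset or empty."""
    return os.environ.get("LLM_JUDGE_MODEL") or DEFAULT_JUDGE_MODEL


def write_results(results: dict, log_dir: Path = Path("/logs/verifier")) -> Path:
    """Create the log directory if needed, then write the judge results as JSON."""
    log_dir.mkdir(parents=True, exist_ok=True)
    out = log_dir / "llm_judge.json"  # hypothetical filename
    out.write_text(json.dumps(results, indent=2))
    return out
```

Using `mkdir(parents=True, exist_ok=True)` makes the write safe whether or not an earlier verifier step already created the directory.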
| ### Prerequisites |
| 1. [Harbor](https://github.com/redhat-et/agent-frameworks-bench) — the benchmark runner |
The actual URL is: https://github.com/harbor-framework/harbor
r2dedios
left a comment
Great job! I just have three tiny comments
| **Deliverables:** |
| - **`migration_assessment.md`** — Document: |
To avoid mixing concepts and to help the AI agent differentiate skills, tools, and tasks, we use the following terms:
- **Migration:** moving a VM from a different virtualization provider (such as VMware) to OCP Virtualization
- **Rebalance:** rescheduling VMs onto a different node
So I'm just wondering whether this could confuse our agents.
| name = "RH-Virt Inventory and Advisory Analysis" |
| difficulty = "medium" |
| category = "virtualization-operations" |
| tags = ["python", "red-hat", "virtualization", "mcp", "rhv", "oVirt", "skills"] |
I would replace oVirt with VMware. We are not considering oVirt at the moment.
ping
@GuyZivRH Any news about this PR? Thanks
Summary
Pack(s) affected
rh-virt
Change type