adding evaluation and readme for rh-virt by GuyZivRH · Pull Request #33 · RHEcosystemAppEng/agentic-collections

GuyZivRH · 2026-03-12T11:31:12Z

Summary

Pack(s) affected

rh-virt

Change type

Update existing skill / agent
Docs / README

Copilot

Pull request overview

Adds an RH-Virt “monolit” evaluation task (Sonnet + Haiku instruction variants) to benchmark skill usage against a mocked OpenShift Virtualization MCP server, using deterministic pytest checks plus an LLM-as-judge pass/fail rubric.

Changes:

Introduces a new SkillsBench-style task bundle for RH-Virt fleet operations (clone, migrate, provision, decommission) with instructions + verifier config.
Adds deterministic pytest validations for required deliverables and key expected values.
Adds an Anthropic-backed LLM judge to score skill-specific knowledge, plus a mock OpenShift Virtualization MCP server and supporting Dockerfiles.
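
A deterministic check for a required deliverable could look roughly like the sketch below. This is an illustration only, not the PR's actual `test_outputs.py`: the output directory, the helper name `check_deliverable`, and the required term are assumptions. Only the `migration_assessment.md` file name comes from this PR.

```python
from pathlib import Path

# Hypothetical output directory; the real suite may read from another location.
OUTPUT_DIR = Path("/root/output")


def check_deliverable(path: Path, required_terms: list[str]) -> list[str]:
    """Return the problems found for one deliverable (an empty list means pass)."""
    if not path.is_file():
        return [f"missing file: {path.name}"]
    text = path.read_text(encoding="utf-8").lower()
    # Case-insensitive substring match for each required term.
    return [f"{path.name} lacks '{t}'" for t in required_terms if t.lower() not in text]


def test_migration_assessment():
    # pytest-style test; "node" is an assumed required term, for illustration.
    problems = check_deliverable(OUTPUT_DIR / "migration_assessment.md", ["node"])
    assert not problems, "; ".join(problems)
```

Returning a list of problems rather than asserting on the first one keeps the failure message complete when several expected values are missing.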

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| rh-virt/evaluation/sonnet/monolit/tests/test.sh | Verifier entrypoint to install deps, run pytest + optional LLM judge, and emit reward/artifacts. |
| rh-virt/evaluation/sonnet/monolit/tests/test_outputs.py | Deterministic pytest suite validating required files and key outputs for all 4 parts. |
| rh-virt/evaluation/sonnet/monolit/tests/llm_judge.py | LLM-as-judge criteria evaluation and JSON scoring output. |
| rh-virt/evaluation/sonnet/monolit/task.toml | Task metadata + verifier/agent env configuration (including judge env vars). |
| rh-virt/evaluation/sonnet/monolit/instruction.md | Sonnet task instructions for the 4-part scenario. |
| rh-virt/evaluation/README.md | Overview of the evaluation approach, structure, and how to run Harbor sweeps. |
| rh-virt/evaluation/haiku/README.md | Documentation describing how Haiku variants differ from Sonnet. |
| rh-virt/evaluation/haiku/monolit/instruction.md | Haiku instruction variant with stronger “read skills/docs” reminders. |
| rh-virt/evaluation/common/mcp-servers/mock-virt-mcp.py | Mock MCP server implementing config/core/kubevirt tool interfaces with simulated cluster data. |
| rh-virt/evaluation/common/Dockerfile.no-skills | Container build for the “no-skills” evaluation variant. |
| rh-virt/evaluation/common/Dockerfile | Container build for the “with-skills” evaluation variant (copies skills/docs into standard locations). |

dmartinol

Thanks for your contribution!
Before merging I think we need to agree on a way to let the readers replicate the same evaluations.

Remove per-section REMINDER blocks that name specific skills to avoid biasing the agent. Remove explicit MCP tool reference from Part 3 — the agent should discover tools implicitly. Keep general skill-awareness prompts at top and bottom only.

dmartinol · 2026-03-17T08:53:44Z

+
+### Prerequisites
+
+1. [Harbor](https://github.com/redhat-et/agent-frameworks-bench) — the benchmark runner


The actual URL is: https://github.com/harbor-framework/harbor

r2dedios

Great job! I just have three tiny comments

r2dedios · 2026-03-12T16:24:42Z

+
+**Deliverables:**
+
+- **`migration_assessment.md`** — Document:


To avoid mixing concepts and to help the AI agent differentiate between skills, tools, and tasks, we use the following terms:

Migration: migrating a VM from a different virtualization provider (such as VMware) to OCP Virtualization.

Rebalance: rescheduling VMs onto a different node.

So I'm just wondering whether this could confuse our agents.

r2dedios · 2026-03-24T12:13:14Z

+name = "RH-Virt Inventory and Advisory Analysis"
+difficulty = "medium"
+category = "virtualization-operations"
+tags = ["python", "red-hat", "virtualization", "mcp", "rhv", "oVirt", "skills"]


I would replace oVirt with VMware. We are not considering oVirt at the moment.

r2dedios · 2026-04-10T11:03:14Z

ping

r2dedios · 2026-04-13T16:49:43Z

@GuyZivRH Any news about this PR? Thanks

adding evaluation and readme for rh-virt

c5ba8b7

GuyZivRH requested review from dmartinol and r2dedios March 12, 2026 11:31

dmartinol requested a review from Copilot March 12, 2026 12:38

Copilot started reviewing on behalf of dmartinol March 12, 2026 12:38

Copilot AI reviewed Mar 12, 2026

dmartinol reviewed Mar 12, 2026

GuyZivRH added 7 commits March 15, 2026 15:43

fix(test.sh): use relative paths and ensure output dir exists

b7ce44d

Resolve test files relative to the script directory instead of expensive find / calls. Create /logs/verifier before writing. Remove unused pytest_exit variable.

fix(llm_judge): default to sonnet, ensure output dir exists

dd5698c

Change default LLM_JUDGE_MODEL from claude-haiku-4-5 to claude-sonnet-4-5. Create /logs/verifier dir before writing results. Clarify docstring that LLM_JUDGE_MODEL is optional.
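
The two behaviours described in this commit (an optional `LLM_JUDGE_MODEL` with a Sonnet default, and creating `/logs/verifier` before writing) might be sketched as follows. The function names and the output file name are hypothetical and are not taken from `llm_judge.py`; only the environment variable, the default model, and the directory come from the commit message.

```python
import json
import os
from pathlib import Path

# Default named in the commit message; LLM_JUDGE_MODEL is optional.
DEFAULT_JUDGE_MODEL = "claude-sonnet-4-5"


def judge_model() -> str:
    """Return the judge model, falling back to the default when unset or empty."""
    return os.environ.get("LLM_JUDGE_MODEL") or DEFAULT_JUDGE_MODEL


def write_results(results: dict, out_dir: Path = Path("/logs/verifier")) -> Path:
    """Create the output directory if needed, then write the results as JSON."""
    out_dir.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    out_file = out_dir / "llm_judge.json"  # hypothetical file name
    out_file.write_text(json.dumps(results, indent=2), encoding="utf-8")
    return out_file
```

Using `or` rather than a plain `get` default means an empty `LLM_JUDGE_MODEL` also falls back to the default, which is one possible reading of "optional".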

fix(haiku/README): clarify only differing files are included

9716054

Explain that Haiku folders contain only instruction.md files and that Harbor/SkillsBench require a full self-contained task folder when running evaluations.

fix(task.toml): update author to Guy Ziv

0c9fbfd

fix(README): add folder structure, repro steps, and results summary

490e0a3

Clarify the actual directory layout, explain that Haiku reuses Sonnet tests, add step-by-step reproduction instructions, and include a summary of experiment results.

fix(test_outputs): rename misleading variable names

daa684c

Rename CLONE_JSON to CLONED_VM_SPEC and CREATE_JSON to VM_CREATE_PARAMS to better reflect their contents.

GuyZivRH requested a review from dmartinol March 15, 2026 14:40

dmartinol reviewed Mar 17, 2026

r2dedios reviewed Mar 24, 2026

r2dedios assigned GuyZivRH Mar 24, 2026

r2dedios added the enhancement label Mar 24, 2026


GuyZivRH commented Mar 12, 2026 • edited by dmartinol
