adding evaluation and readme for rh-virt #33

Open
GuyZivRH wants to merge 8 commits into main from APPENG-4717/merge_eval_to_rh_virt

Conversation

@GuyZivRH GuyZivRH commented Mar 12, 2026

Summary

Pack(s) affected

  • rh-virt

Change type

  • Update existing skill / agent
  • Docs / README

Contributor

Copilot AI left a comment

Pull request overview

Adds an RH-Virt “monolit” evaluation task (Sonnet + Haiku instruction variants) to benchmark skill usage against a mocked OpenShift Virtualization MCP server, using deterministic pytest checks plus an LLM-as-judge pass/fail rubric.

Changes:

  • Introduces a new SkillsBench-style task bundle for RH-Virt fleet operations (clone, migrate, provision, decommission) with instructions + verifier config.
  • Adds deterministic pytest validations for required deliverables and key expected values.
  • Adds an Anthropic-backed LLM judge to score skill-specific knowledge, plus a mock OpenShift Virtualization MCP server and supporting Dockerfiles.
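
A deterministic check of the kind described above could be sketched as follows. This is only an illustration: the helper `missing_deliverable_terms`, the sample report content, and the expected terms are assumptions made for the example and are not taken from the PR's `test_outputs.py`.

```python
from pathlib import Path


def missing_deliverable_terms(path, required_terms):
    """Return the required terms that are absent from the deliverable.

    A file that does not exist is treated as missing every term.
    The comparison is case-insensitive.
    """
    path = Path(path)
    if not path.is_file():
        return list(required_terms)
    text = path.read_text(encoding="utf-8").lower()
    return [term for term in required_terms if term.lower() not in text]


def test_migration_assessment_mentions_key_values(tmp_path):
    # Hypothetical deliverable content; a real verifier would read the
    # file the agent actually produced instead of writing one here.
    report = tmp_path / "migration_assessment.md"
    report.write_text("# Migration assessment\n\nTarget node: worker-2\n")
    assert missing_deliverable_terms(report, ["target node", "worker-2"]) == []
```

Returning the list of missing terms, rather than a bare boolean, keeps the pytest failure message informative about which expected value was not found.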

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| rh-virt/evaluation/sonnet/monolit/tests/test.sh | Verifier entrypoint to install deps, run pytest + optional LLM judge, and emit reward/artifacts. |
| rh-virt/evaluation/sonnet/monolit/tests/test_outputs.py | Deterministic pytest suite validating required files and key outputs for all 4 parts. |
| rh-virt/evaluation/sonnet/monolit/tests/llm_judge.py | LLM-as-judge criteria evaluation and JSON scoring output. |
| rh-virt/evaluation/sonnet/monolit/task.toml | Task metadata + verifier/agent env configuration (including judge env vars). |
| rh-virt/evaluation/sonnet/monolit/instruction.md | Sonnet task instructions for the 4-part scenario. |
| rh-virt/evaluation/README.md | Overview of the evaluation approach, structure, and how to run Harbor sweeps. |
| rh-virt/evaluation/haiku/README.md | Documentation describing how Haiku variants differ from Sonnet. |
| rh-virt/evaluation/haiku/monolit/instruction.md | Haiku instruction variant with stronger “read skills/docs” reminders. |
| rh-virt/evaluation/common/mcp-servers/mock-virt-mcp.py | Mock MCP server implementing config/core/kubevirt tool interfaces with simulated cluster data. |
| rh-virt/evaluation/common/Dockerfile.no-skills | Container build for the “no-skills” evaluation variant. |
| rh-virt/evaluation/common/Dockerfile | Container build for the “with-skills” evaluation variant (copies skills/docs into standard locations). |

Comment thread rh-virt/evaluation/sonnet/monolit/tests/test.sh
Comment thread rh-virt/evaluation/sonnet/monolit/tests/test.sh Outdated
Comment thread rh-virt/evaluation/sonnet/monolit/tests/test.sh Outdated
Comment thread rh-virt/evaluation/sonnet/monolit/tests/llm_judge.py
Comment thread rh-virt/evaluation/sonnet/monolit/tests/llm_judge.py Outdated
Comment thread rh-virt/evaluation/haiku/README.md Outdated
Collaborator

@dmartinol dmartinol left a comment

Thanks for your contribution!
Before merging I think we need to agree on a way to let the readers replicate the same evaluations.

Comment thread rh-virt/evaluation/haiku/monolit/instruction.md
Comment thread rh-virt/evaluation/haiku/monolit/instruction.md Outdated
Comment thread rh-virt/evaluation/sonnet/monolit/tests/test_outputs.py Outdated
Comment thread rh-virt/evaluation/sonnet/monolit/task.toml Outdated
Comment thread rh-virt/evaluation/sonnet/monolit/task.toml
Comment thread rh-virt/evaluation/README.md Outdated
Comment thread rh-virt/evaluation/README.md
Comment thread rh-virt/evaluation/README.md
Comment thread rh-virt/evaluation/README.md Outdated
Resolve test files relative to the script directory instead of expensive find / calls. Create /logs/verifier before writing. Remove unused pytest_exit variable.

Change default LLM_JUDGE_MODEL from claude-haiku-4-5 to claude-sonnet-4-5. Create /logs/verifier dir before writing results. Clarify docstring that LLM_JUDGE_MODEL is optional.

Explain that Haiku folders contain only instruction.md files and that Harbor/SkillsBench require a full self-contained task folder when running evaluations.

Clarify the actual directory layout, explain that Haiku reuses Sonnet tests, add step-by-step reproduction instructions, and include a summary of experiment results.

Remove per-section REMINDER blocks that name specific skills to avoid biasing the agent. Remove explicit MCP tool reference from Part 3; the agent should discover tools implicitly. Keep general skill-awareness prompts at top and bottom only.

Rename CLONE_JSON to CLONED_VM_SPEC and CREATE_JSON to VM_CREATE_PARAMS to better reflect their contents.
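
The judge-related commit above (optional `LLM_JUDGE_MODEL` with a `claude-sonnet-4-5` default, and creating `/logs/verifier` before writing) might look roughly like the sketch below. The function names and the output file name are hypothetical and chosen for illustration; the actual `llm_judge.py` in the PR may be structured differently.

```python
import json
import os
from pathlib import Path

# Default named in the commit message; LLM_JUDGE_MODEL overrides it when set.
DEFAULT_JUDGE_MODEL = "claude-sonnet-4-5"


def resolve_judge_model(env=None):
    """Return LLM_JUDGE_MODEL if it is set and non-empty, else the default."""
    env = os.environ if env is None else env
    return env.get("LLM_JUDGE_MODEL") or DEFAULT_JUDGE_MODEL


def write_judge_results(results, log_dir="/logs/verifier"):
    """Create log_dir if needed, write results as JSON, and return the path."""
    log_dir = Path(log_dir)
    # parents/exist_ok make this safe whether or not the directory exists yet.
    log_dir.mkdir(parents=True, exist_ok=True)
    out_path = log_dir / "llm_judge_results.json"  # hypothetical file name
    out_path.write_text(json.dumps(results, indent=2), encoding="utf-8")
    return out_path
```

Accepting the environment mapping and the log directory as parameters is a design choice for the sketch: it lets the behavior be exercised without touching the real environment or `/logs`.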
@GuyZivRH GuyZivRH requested a review from dmartinol March 15, 2026 14:40

### Prerequisites

1. [Harbor](https://github.com/redhat-et/agent-frameworks-bench) — the benchmark runner
Collaborator

Contributor

@r2dedios r2dedios left a comment

Great job! I just have three tiny comments

Comment thread rh-virt/evaluation/haiku/monolit/instruction.md

**Deliverables:**

- **`migration_assessment.md`** — Document:
Contributor

To avoid mixing concepts and to help the AI agent differentiate skills/tools/tasks, we are using the following terms:

  • Migration: Migrating a VM from a different virtualization provider (like VMware) to OCP Virtualization.
  • Rebalance: Rescheduling VMs onto a different node.

So I'm just wondering whether this could confuse our agents.

name = "RH-Virt Inventory and Advisory Analysis"
difficulty = "medium"
category = "virtualization-operations"
tags = ["python", "red-hat", "virtualization", "mcp", "rhv", "oVirt", "skills"]
Contributor

I would replace oVirt with VMware. We are not considering oVirt at the moment.

@r2dedios r2dedios added the enhancement New feature or request label Mar 24, 2026
@r2dedios
Contributor

ping

@r2dedios
Contributor

@GuyZivRH Any news about this PR? Thanks
