
Appeng 4747/run eval per skill#49

Open
GuyZivRH wants to merge 2 commits into main from APPENG-4747/run_eval_per_skill
Conversation

@GuyZivRH

Summary

Pack(s) affected

  • rh-sre
  • rh-developer
  • ocp-admin
  • rh-virt
  • rh-ai-engineer
  • Other / repo-wide

Change type

  • New skill
  • New agent
  • New pack
  • Update existing skill / agent
  • MCP server config (.mcp.json)
  • Docs / README
  • CI / tooling

CLAUDE.md compliance

  • Agents orchestrate skills; no direct MCP/tool calls in agents
  • Skills are single-purpose task executors
  • Skills encapsulate all tool access (MCP tools invoked only inside skills)
  • Document consultation: file is read with the Read tool, then declared to the user
  • No credentials hardcoded; env vars used via ${VAR} references
  • Human-in-the-loop confirmation added for any destructive or critical operations

Validation

  • make validate passes locally
  • New/changed skills have valid YAML frontmatter (name, description)
  • New/changed agents have valid YAML frontmatter (name, description)
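
The frontmatter requirement above can be checked mechanically. A minimal sketch in POSIX shell, assuming `make validate` does something similar (the helper name `check_frontmatter` is hypothetical, not the repo's actual implementation):

```shell
#!/usr/bin/env sh
# Verify that a skill/agent markdown file opens with a '---' YAML block
# declaring both 'name:' and 'description:'.
check_frontmatter() {
  f="$1"
  # Extract the first frontmatter block (lines between the first two '---').
  fm=$(awk '/^---$/{n++; next} n==1{print} n>=2{exit}' "$f")
  case "$fm" in
    *name:*) : ;;
    *) echo "missing name: $f"; return 1 ;;
  esac
  case "$fm" in
    *description:*) : ;;
    *) echo "missing description: $f"; return 1 ;;
  esac
  echo "ok: $f"
}
```

Run over every `SKILL.md` / agent file with `find … -exec` to get a pass/fail list before pushing.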

…s/ directories

from both with_skills/ and without_skills/ evaluation environments. These folders
are not needed in the evaluation setup.
@dmartinol self-requested a review on March 24, 2026 at 11:05
Collaborator

@dmartinol left a comment


Thanks for the great work!
My concerns:

  • Most of the files are identical between the with_skills and without_skills folders (for a given test); we should try to minimize the duplication
  • I can't see instructions to run the evaluations. Are they in the other PR that you raised?


Document your methodology, impact analysis, and risk assessment in `/root/report.md`.

Use MCP tools to query vulnerability data. If reference documentation or skills are available in this environment, consult them before beginning work. Complete the entire analysis autonomously — do not stop to ask for user confirmation or input at any checkpoint. Use reasonable defaults (e.g., fetch all available data) and proceed through every step to produce the final report.
Collaborator


shouldn't we remove the MCP instructions from the without_skills tests?

Author


Claude Opus suggested keeping it for fairness, and noted two things: 1) I did a run without those instructions and the performance was the same for the unskilled agent; 2) token-wise and speed-wise the cost is minor.

Collaborator


let's keep it then

Collaborator


Why isn't this one generated from within the Dockerfile, as I see in other tests?
example:

RUN echo '{ \
  "mcpServers": { \
    "lightspeed-mcp": { \
      "command": "python3", \
      "args": ["/root/.mcp-servers/mock-lightspeed-mcp.py"] \
    } \
  } \
}' > /root/.mcp.json
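
As a readability aside, the same JSON could be produced with a heredoc rather than a backslash-continued echo. A minimal sketch in plain shell (the target path is parameterized here, and the mock server path is the one quoted above; `write_mcp_config` is a hypothetical helper, not something from the repo):

```shell
#!/usr/bin/env sh
# Write the mock lightspeed-mcp config to the given path using a quoted
# heredoc, which keeps the JSON readable and free of escape characters.
write_mcp_config() {
  cat > "$1" <<'EOF'
{
  "mcpServers": {
    "lightspeed-mcp": {
      "command": "python3",
      "args": ["/root/.mcp-servers/mock-lightspeed-mcp.py"]
    }
  }
}
EOF
}
```

In a Dockerfile this would run as a single `RUN` step; with BuildKit, a `RUN <<EOF` heredoc achieves the same effect without a helper function.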

Author


will check that.

@GuyZivRH
Author

I agree with @dmartinol's remark about the duplication here, so I will elaborate:
This is a first edition of the PR, for the team's use. Keeping the duplicate files makes it easier to replicate the results if someone wants to run the eval. The final PR will be far more concise.
I will mark this as a draft; if you still find the duplication too annoying, or there is no real need to replicate the results on your end, I can attend to the deduplication right away.

Regarding the absence of an instructions file on how to run the eval: I posted the skillsbench URL and an example sweep file in the Slack channel. I can add a file here (in the draft) or attach one after the deduplication.

@GuyZivRH
Author

GuyZivRH commented Mar 24, 2026

I will clarify after checking the files:

Task-Specific Files (228 - CANNOT deduplicate):

  • instruction.md - 45 (unique per task)
  • task.toml - 45 (unique per task)
  • solve.sh - 45 (unique per task)
  • llm_judge.py - 45 (unique per task)
  • test_outputs.py - 45 (unique per task)
  • PER_SKILL_REVIEW_REPORT.md - 1
  • SKILL_PATH_FIXES.md - 1
  • .mcp.json - 1

Shared Files (44 - after deduplication):

  • test.sh - 1 (currently 90 identical copies)
  • Dockerfiles - 13 variants (currently 92 files)
  • Mock MCP servers - 11 unique (currently ~120 files)
  • K8s/Helm templates - 12 (currently 88 files)
    • 6 single variants (imagestream, buildconfig, values, Chart, NOTES, _helpers)
    • 6 as pairs (service×2, route×2, deployment×2)
  • Systemd configs - 3 variants (currently 24 files)
  • app.py - 1 (currently 2)
  • test_app.py - 1 (currently 2)
  • requirements.txt - 1 (currently 2)
  • .s2i/environment - 1 (currently 2)

Summary:

  • Current: 885 files
  • Minimal: 272 files
  • Savings: 613 files (~70% reduction)
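
A quick way to confirm which of the files counted above are byte-identical is to group them by checksum. A sketch assuming GNU coreutils (`md5sum`, `uniq -w`); `list_duplicates` is a hypothetical helper, not part of the repo:

```shell
#!/usr/bin/env sh
# Print groups of byte-identical files under a directory, so duplicated
# copies (e.g. the 90 identical test.sh files counted above) can be
# collapsed into a single shared file.
list_duplicates() {
  dir="$1"
  find "$dir" -type f -exec md5sum {} + \
    | sort \
    | uniq -w32 --all-repeated=separate   # group lines whose 32-char hash matches
}
```

Each blank-line-separated group in the output is one set of candidates for deduplication.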

@r2dedios r2dedios added the enhancement New feature or request label Mar 24, 2026