
Appeng 4747/run eval per skill#49

Open
GuyZivRH wants to merge 2 commits into main from APPENG-4747/run_eval_per_skill
Conversation

@GuyZivRH

Summary

Pack(s) affected

  • rh-sre
  • rh-developer
  • ocp-admin
  • rh-virt
  • rh-ai-engineer
  • Other / repo-wide

Change type

  • New skill
  • New agent
  • New pack
  • Update existing skill / agent
  • MCP server config (.mcp.json)
  • Docs / README
  • CI / tooling

CLAUDE.md compliance

  • Agents orchestrate skills; no direct MCP/tool calls in agents
  • Skills are single-purpose task executors
  • Skills encapsulate all tool access (MCP tools invoked only inside skills)
  • Document consultation: file is read with the Read tool, then declared to the user
  • No credentials hardcoded; env vars used via ${VAR} references
  • Human-in-the-loop confirmation added for any destructive or critical operations

Validation

  • make validate passes locally
  • New/changed skills have valid YAML frontmatter (name, description)
  • New/changed agents have valid YAML frontmatter (name, description)
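
The frontmatter requirement above can be checked mechanically. A minimal sketch in POSIX shell, assuming `make validate` does something similar (the helper name `check_frontmatter` is hypothetical, not the repo's actual implementation):

```shell
#!/usr/bin/env sh
# Verify that a skill/agent markdown file opens with a '---' YAML block
# declaring both 'name:' and 'description:'.
check_frontmatter() {
  f="$1"
  # Extract the first frontmatter block (lines between the first two '---').
  fm=$(awk '/^---$/{n++; next} n==1{print} n>=2{exit}' "$f")
  case "$fm" in
    *name:*) : ;;
    *) echo "missing name: $f"; return 1 ;;
  esac
  case "$fm" in
    *description:*) : ;;
    *) echo "missing description: $f"; return 1 ;;
  esac
  echo "ok: $f"
}
```

Run over every `SKILL.md` / agent file with `find … -exec` to get a pass/fail list before pushing.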

…s/ directories

from both with_skills/ and without_skills/ evaluation environments. These folders
are not needed in the evaluation setup.
@dmartinol self-requested a review on March 24, 2026 at 11:05
Collaborator

@dmartinol left a comment


Thanks for the great work!
My concerns:

  • Most of the files are identical between the with_skills and without_skills folders (for a given test); we should try to minimize the duplication
  • I can't see instructions to run the evaluations. Are they in the other PR that you raised?


Document your methodology, impact analysis, and risk assessment in `/root/report.md`.

Use MCP tools to query vulnerability data. If reference documentation or skills are available in this environment, consult them before beginning work. Complete the entire analysis autonomously — do not stop to ask for user confirmation or input at any checkpoint. Use reasonable defaults (e.g., fetch all available data) and proceed through every step to produce the final report.
Collaborator


shouldn't we remove the MCP instructions from the without_skills tests?

Author


Claude Opus suggested keeping it for fairness, and noted two things: 1) I did a run without those instructions and the performance was the same for the unskilled agent; 2) token-wise and speed-wise the cost is minor.

Collaborator


let's keep it then

Collaborator


Why isn't this one generated from within the Dockerfile, as I see in other tests?
example:

RUN echo '{ \
  "mcpServers": { \
    "lightspeed-mcp": { \
      "command": "python3", \
      "args": ["/root/.mcp-servers/mock-lightspeed-mcp.py"] \
    } \
  } \
}' > /root/.mcp.json
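
As a readability aside, the same JSON could be produced with a heredoc rather than a backslash-continued echo. A minimal sketch in plain shell (the target path is parameterized here, and the mock server path is the one quoted above; `write_mcp_config` is a hypothetical helper, not something from the repo):

```shell
#!/usr/bin/env sh
# Write the mock lightspeed-mcp config to the given path using a quoted
# heredoc, which keeps the JSON readable and free of escape characters.
write_mcp_config() {
  cat > "$1" <<'EOF'
{
  "mcpServers": {
    "lightspeed-mcp": {
      "command": "python3",
      "args": ["/root/.mcp-servers/mock-lightspeed-mcp.py"]
    }
  }
}
EOF
}
```

In a Dockerfile this would run as a single `RUN` step; with BuildKit, a `RUN <<EOF` heredoc achieves the same effect without a helper function.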

Author


will check that.

@GuyZivRH
Author

I agree with @dmartinol's remark about the duplication here, so I will elaborate:
This is a first edition of the PR, for the team's use. Keeping the duplicate files makes it easier to replicate the results if someone wants to run the eval. The final PR will be far more concise.
I will mark this as a draft; if you still find the duplication too annoying, or there is no real need to replicate the results on your end, I can attend to the deduplication right away.

Regarding the absence of an instructions file on how to run the eval: I posted the skillsbench URL and an example sweep file in the Slack channel. I can add a file here (in the draft) or attach one after the deduplication.

@GuyZivRH
Author

GuyZivRH commented Mar 24, 2026

I will clarify after checking the files:

Task-Specific Files (228 - CANNOT deduplicate):

  • instruction.md - 45 (unique per task)
  • task.toml - 45 (unique per task)
  • solve.sh - 45 (unique per task)
  • llm_judge.py - 45 (unique per task)
  • test_outputs.py - 45 (unique per task)
  • PER_SKILL_REVIEW_REPORT.md - 1
  • SKILL_PATH_FIXES.md - 1
  • .mcp.json - 1

Shared Files (44 - after deduplication):

  • test.sh - 1 (currently 90 identical copies)
  • Dockerfiles - 13 variants (currently 92 files)
  • Mock MCP servers - 11 unique (currently ~120 files)
  • K8s/Helm templates - 12 (currently 88 files)
    • 6 single variants (imagestream, buildconfig, values, Chart, NOTES, _helpers)
    • 6 as pairs (service×2, route×2, deployment×2)
  • Systemd configs - 3 variants (currently 24 files)
  • app.py - 1 (currently 2)
  • test_app.py - 1 (currently 2)
  • requirements.txt - 1 (currently 2)
  • .s2i/environment - 1 (currently 2)

Summary:

  • Current: 885 files
  • Minimal: 272 files
  • Savings: 613 files (~70% reduction)
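
A quick way to confirm which of the files counted above are byte-identical is to group them by checksum. A sketch assuming GNU coreutils (`md5sum`, `uniq -w`); `list_duplicates` is a hypothetical helper, not part of the repo:

```shell
#!/usr/bin/env sh
# Print groups of byte-identical files under a directory, so duplicated
# copies (e.g. the 90 identical test.sh files counted above) can be
# collapsed into a single shared file.
list_duplicates() {
  dir="$1"
  find "$dir" -type f -exec md5sum {} + \
    | sort \
    | uniq -w32 --all-repeated=separate   # group lines whose 32-char hash matches
}
```

Each blank-line-separated group in the output is one set of candidates for deduplication.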

@r2dedios r2dedios added the enhancement New feature or request label Mar 24, 2026