…s/ directories from both with_skills/ and without_skills/ evaluation environments. These folders are not needed in the evaluation setup.
dmartinol
left a comment
Thanks for the great work!
My concerns:
- Most of the files are identical between the with_skills and without_skills folders (for a given test); we should try to minimize the duplication.
- I can't see instructions for running the evaluations. Are they in the other PR that you raised?
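To put a number on the duplication concern, the two folders of a test can be compared file by file. The following is a minimal sketch, not part of this PR: it assumes each test has sibling `with_skills/` and `without_skills/` directories, and the helper names `file_digest` and `compare_trees` are hypothetical.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def compare_trees(left: Path, right: Path) -> tuple[list[str], list[str]]:
    """Split relative file paths into (identical, different_or_one_sided)."""
    left_files = {p.relative_to(left).as_posix(): p
                  for p in left.rglob("*") if p.is_file()}
    right_files = {p.relative_to(right).as_posix(): p
                   for p in right.rglob("*") if p.is_file()}

    identical, different = [], []
    for rel in sorted(left_files.keys() | right_files.keys()):
        a, b = left_files.get(rel), right_files.get(rel)
        # A file is a deduplication candidate only if it exists on both
        # sides with byte-identical content.
        if a is not None and b is not None and file_digest(a) == file_digest(b):
            identical.append(rel)
        else:
            different.append(rel)
    return identical, different
```

Calling `compare_trees(Path("with_skills"), Path("without_skills"))` from inside a test directory returns the files that could be shared and the files that have to stay per-variant.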
Document your methodology, impact analysis, and risk assessment in `/root/report.md`.
Use MCP tools to query vulnerability data. If reference documentation or skills are available in this environment, consult them before beginning work. Complete the entire analysis autonomously; do not stop to ask for user confirmation or input at any checkpoint. Use reasonable defaults (e.g., fetch all available data) and proceed through every step to produce the final report.
Shouldn't we remove the MCP instructions from the without_skills tests?
Claude Opus suggested keeping them for fairness. Two further points: 1) I did a run without those instructions and the unskilled agent performed the same; 2) the cost in tokens and speed is minor.
Why isn't this one generated from within the Dockerfile, as I see in other tests?
Example:
RUN echo '{ \
  "mcpServers": { \
    "lightspeed-mcp": { \
      "command": "python3", \
      "args": ["/root/.mcp-servers/mock-lightspeed-mcp.py"] \
    } \
  } \
}' > /root/.mcp.json
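For comparison, the same configuration can be rendered from a single definition, which is one way to keep a checked-in `.mcp.json` and a Dockerfile-generated one from drifting apart. This is only an illustrative sketch: the server name, command, and path are copied from the example above, while `MCP_CONFIG` and `render_mcp_config` are hypothetical names that do not exist in the repository.

```python
import json

# Same structure as the JSON echoed by the Dockerfile example above.
MCP_CONFIG = {
    "mcpServers": {
        "lightspeed-mcp": {
            "command": "python3",
            "args": ["/root/.mcp-servers/mock-lightspeed-mcp.py"],
        }
    }
}


def render_mcp_config(config: dict = MCP_CONFIG) -> str:
    """Serialize the MCP server configuration as indented JSON text."""
    return json.dumps(config, indent=2) + "\n"


if __name__ == "__main__":
    # Print the text that would be written to /root/.mcp.json.
    print(render_mcp_config(), end="")
```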
I agree with @dmartinol's remark about the duplication here, so I will elaborate. Regarding the missing instructions on how to run the evaluation: I described this in the Slack channel and attached the skillsbench URL and an example sweep file. I can add a file here (the draft) or attach one after the de-duplication process.
I will clarify after checking the files.
Task-Specific Files (228 - CANNOT deduplicate):
Shared Files (44 - after deduplication):
Summary:
Summary
Pack(s) affected
- rh-sre
- rh-developer
- ocp-admin
- rh-virt
- rh-ai-engineer
Change type
- … `.mcp.json`)
CLAUDE.md compliance
- … `${VAR}` references
Validation
- `make validate` passes locally
- … `name`, `description`)
- … `name`, `description`)