Skip to content

Add skills for C# MCP Server Development#317

Merged
danmoseley merged 40 commits intodotnet:mainfrom
leslierichardson95:lerich/mcp-skills
Mar 25, 2026
Merged

Add skills for C# MCP Server Development#317
danmoseley merged 40 commits intodotnet:mainfrom
leslierichardson95:lerich/mcp-skills

Conversation

@leslierichardson95
Copy link
Copy Markdown
Contributor

This pull request introduces comprehensive documentation for creating MCP servers (creating, debugging, testing, publishing) using the C# SDK and .NET project templates. These documents provide step-by-step instructions, attribute references, implementation patterns, and advanced configuration guidance for developers building, debugging, testing, and publishing MCP server projects.

Reference documentation for implementation and configuration:

  • Added references/api-patterns.md, detailing attribute usage, tool return types, dependency injection, builder API, dynamic tool creation, server options, experimental APIs, and NuGet package selection.
  • Added references/transport-config.md, covering stdio and HTTP transport setup, custom path prefixes, stateless mode, authentication/authorization, accessing HTTP context, OAuth flows, idle timeout, port configuration, and OpenTelemetry observability.

All skills successfully passed skills-validator testing.

Four new skills for the C# MCP server development lifecycle:
- mcp-csharp-create: Scaffolding with dotnet new mcpserver, tools/prompts/resources, transport config
- mcp-csharp-debug: MCP Inspector, VS Code integration, breakpoint debugging, logging
- mcp-csharp-test: Unit tests with ClientServerTestBase, integration with WebApplicationFactory, evals
- mcp-csharp-publish: NuGet packaging, Docker/Azure deployment, MCP Registry publishing

Each skill includes SKILL.md with progressive disclosure references/ and eval.yaml tests.
…te syntax

Replace scaffolding-heavy scenarios with implementation-focused ones that
test MCP-specific features (resources, prompts, logging). Fix assertion
patterns to match combined C# attribute syntax [McpServerTool, Description()]
instead of requiring standalone [McpServerTool]. Increase timeouts to 180s
to account for skill-reading overhead.

Validator result: passed=True, improvement=44.6% (threshold=10%)
Copilot AI review requested due to automatic review settings March 10, 2026 18:56
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a set of new .NET skill documents (create/debug/test/publish) for building MCP servers with the C# SDK, along with corresponding eval scenarios under tests/dotnet/ to validate the skills via skill-validator.

Changes:

  • Added four new MCP C# skills: creation, debugging, testing, and publishing/deployment.
  • Added reference guides covering SDK API patterns, transport configuration, testing patterns, and publishing/registry workflows.
  • Added eval scenarios for each new skill under tests/dotnet/.

Reviewed changes

Copilot reviewed 18 out of 18 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
tests/dotnet/mcp-csharp-create/eval.yaml Adds eval scenarios for MCP server scaffolding, attributes/DI, and HTTP setup.
tests/dotnet/mcp-csharp-debug/eval.yaml Adds eval scenarios for Inspector usage and IDE/Copilot configuration.
tests/dotnet/mcp-csharp-test/eval.yaml Adds eval scenarios for unit/integration testing and evaluation authoring.
tests/dotnet/mcp-csharp-publish/eval.yaml Adds eval scenarios for NuGet tool publishing, Azure deployment, and registry publishing.
plugins/dotnet/skills/mcp-csharp-create/SKILL.md New skill doc for creating MCP servers with C# SDK and templates.
plugins/dotnet/skills/mcp-csharp-create/references/api-patterns.md Reference for C# MCP SDK attributes, return types, DI, and builder patterns.
plugins/dotnet/skills/mcp-csharp-create/references/transport-config.md Reference for stdio/HTTP transport configuration, auth, and observability.
plugins/dotnet/skills/mcp-csharp-debug/SKILL.md New skill doc for running/debugging MCP servers and configuring IDEs.
plugins/dotnet/skills/mcp-csharp-debug/references/ide-config.md Detailed VS Code/Visual Studio MCP + debugger configuration examples.
plugins/dotnet/skills/mcp-csharp-debug/references/mcp-inspector.md Reference for using MCP Inspector across stdio/HTTP scenarios.
plugins/dotnet/skills/mcp-csharp-test/SKILL.md New skill doc for unit/integration testing and evaluations for MCP servers.
plugins/dotnet/skills/mcp-csharp-test/references/test-patterns.md Reference test patterns (in-memory, WebApplicationFactory, mocking).
plugins/dotnet/skills/mcp-csharp-test/references/evaluation-guide.md Reference guidance for creating deterministic, verifiable eval sets.
plugins/dotnet/skills/mcp-csharp-publish/SKILL.md New skill doc for packaging, Docker/Azure deployment, and registry publishing.
plugins/dotnet/skills/mcp-csharp-publish/references/nuget-packaging.md Reference for .csproj tool packaging and NuGet publishing flow.
plugins/dotnet/skills/mcp-csharp-publish/references/docker-azure.md Reference for Docker + Azure deployment commands and secret handling.
plugins/dotnet/skills/mcp-csharp-publish/references/mcp-registry.md Reference for server.json and mcp-publisher workflow/CI guidance.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread plugins/dotnet/skills/mcp-csharp-test/references/test-patterns.md Outdated
Comment thread plugins/dotnet/skills/mcp-csharp-debug/SKILL.md Outdated
Comment thread plugins/dotnet/skills/mcp-csharp-publish/SKILL.md Outdated
Comment thread plugins/dotnet-ai/skills/mcp-csharp-publish/SKILL.md
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings March 10, 2026 19:02
leslierichardson95 and others added 2 commits March 10, 2026 12:03
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 18 out of 18 changed files in this pull request and generated 5 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread tests/dotnet-ai/mcp-csharp-create/eval.yaml
Comment thread plugins/dotnet-ai/skills/mcp-csharp-publish/SKILL.md
Comment thread plugins/dotnet/skills/mcp-csharp-debug/SKILL.md Outdated
Comment thread plugins/dotnet/skills/mcp-csharp-publish/references/nuget-packaging.md Outdated
…aging.md

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings March 10, 2026 19:10
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 18 out of 18 changed files in this pull request and generated 1 comment.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings March 10, 2026 19:16
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 18 out of 18 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread plugins/dotnet/skills/mcp-csharp-test/references/evaluation-guide.md Outdated
Comment thread plugins/dotnet/skills/mcp-csharp-test/references/evaluation-guide.md Outdated
Copilot AI review requested due to automatic review settings March 10, 2026 19:47
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 17 out of 17 changed files in this pull request and generated 3 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread .github/CODEOWNERS
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 18 out of 18 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread eng/known-domains.txt Outdated
Comment thread .github/CODEOWNERS
@github-actions
Copy link
Copy Markdown
Contributor

Skill Validation Results

Skill Scenario Quality (Isolated) Quality (Plugin) Skills Loaded Agents Invoked Overfit Verdict
mcp-csharp-create Implement MCP tools with proper attributes and DI 3.0/5 ⏰ → 4.0/5 ⏰ 🟢 3.0/5 ⏰ → 5.0/5 🟢 ✅ mcp-csharp-create; tools: skill, view / ✅ mcp-csharp-create; tools: skill, view — / — ✅ 0.10
mcp-csharp-create Create an HTTP MCP server with tools and resources 1.7/5 ⏰ → 4.3/5 ⏰ 🟢 1.7/5 ⏰ → 5.0/5 🟢 ✅ mcp-csharp-create; tools: skill, create, edit, view / ✅ mcp-csharp-create; tools: skill, edit, create, view — / — ✅ 0.10
mcp-csharp-create Create an MCP server with tools, prompts, and proper logging 3.0/5 ⏰ → 4.7/5 ⏰ 🟢 3.0/5 ⏰ → 5.0/5 🟢 ✅ mcp-csharp-create; tools: skill, edit, create / ✅ mcp-csharp-create; tools: skill, read_bash, create, edit — / — ✅ 0.10 [1]
mcp-csharp-publish Publish an MCP server as a NuGet tool package 3.0/5 → 4.0/5 🟢 3.0/5 → 4.0/5 🟢 ✅ mcp-csharp-publish; tools: skill, glob / ✅ mcp-csharp-publish; tools: skill, glob — / — ✅ 0.18
mcp-csharp-publish Deploy an HTTP MCP server to Azure Container Apps 3.0/5 → 5.0/5 🟢 3.0/5 → 5.0/5 🟢 ✅ mcp-csharp-publish; tools: skill, report_intent, view / ✅ mcp-csharp-publish; tools: skill, report_intent, view — / — ✅ 0.18
mcp-csharp-publish Publish to the MCP Registry 4.0/5 → 4.3/5 🟢 4.0/5 → 4.0/5 ✅ mcp-csharp-publish; tools: skill, view / ✅ mcp-csharp-publish; tools: skill, view — / — ✅ 0.18 [2]
mcp-csharp-debug Debug an MCP server with MCP Inspector 4.7/5 → 4.0/5 🔴 4.7/5 → 4.0/5 🔴 ✅ mcp-csharp-debug; tools: report_intent, skill, view / ✅ mcp-csharp-debug; tools: skill — / — ✅ 0.10
mcp-csharp-debug Configure VS Code to use an MCP server 4.0/5 → 5.0/5 🟢 4.0/5 → 4.7/5 🟢 ✅ mcp-csharp-debug; tools: skill, view, glob / ✅ mcp-csharp-debug; tools: skill, view, glob — / — ✅ 0.10
mcp-csharp-debug Debug a failing MCP server tool 3.3/5 → 3.7/5 🟢 3.3/5 → 3.3/5 ✅ mcp-csharp-debug; tools: report_intent, skill / ✅ mcp-csharp-debug; tools: skill — / — ✅ 0.10 [3]
mcp-csharp-test Write unit and integration tests for an MCP server 2.0/5 → 4.7/5 🟢 2.0/5 → 5.0/5 🟢 ✅ mcp-csharp-test; tools: skill, report_intent, view / ✅ mcp-csharp-test; tools: skill, report_intent, view — / — ✅ 0.18
mcp-csharp-test Test an HTTP MCP server with WebApplicationFactory 3.7/5 → 3.3/5 🔴 3.7/5 → 4.0/5 🟢 ✅ mcp-csharp-test; tools: report_intent, skill, view / ✅ mcp-csharp-test; tools: skill, report_intent, view — / — ✅ 0.18
mcp-csharp-test Create evaluations for an MCP server 2.0/5 → 2.0/5 2.0/5 → 2.0/5 ✅ mcp-csharp-test; tools: task, bash, grep, glob, skill / ✅ mcp-csharp-test; tools: skill explore / — ✅ 0.18 [4]

[1] (Isolated) Quality improved but weighted score is -47.0% due to: judgment, quality
[2] (Isolated) Quality improved but weighted score is -4.6% due to: judgment
[3] (Plugin) Quality unchanged but weighted score is -8.3% due to: tokens (11750 → 28911), tool calls (0 → 1), time (14.2s → 18.6s)
[4] (Isolated) Quality unchanged but weighted score is -23.7% due to: judgment, quality, tool calls (5 → 10), tokens (40305 → 46265)

timeout — run hit the scenario timeout limit; scoring may be impacted by aborting model execution before it could produce its full output

Model: claude-opus-4.6 | Judge: claude-opus-4.6

Full results

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 18 out of 18 changed files in this pull request and generated no new comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@danmoseley
Copy link
Copy Markdown
Member

Created #440 so that on future PR's, local agent can help figure out next steps given an evaluation.

…oaches

The rubric criterion 'Shows how to attach a debugger' was too narrow.
The skilled answer correctly focused on the dotnet#1 cause (stdout pollution)
but scored low because it didn't show a specific 'attach to process' flow.
Broadened to accept any valid debugging approach: attaching, Debugger.Launch(),
or launch.json configuration.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@danmoseley
Copy link
Copy Markdown
Member

Note

This comment was AI/Copilot-generated.

Eval Results Analysis (run 23522050726)

The timeout increase to 360s fixed the create scenarios which were previously all 1.0/5. I also just pushed a debug rubric fix (511cb8d). Here's the full picture across all three eval runs:

Trend across runs

Skill Scenario Mar 20 Mar 25 (pre-fix) Mar 25 (post-fix) Status
create MCP tools + DI 3.0→4.3 ✅ 1.0→1.0 ❌ 3.0→4.0 ✅ Fixed (timeout)
create HTTP server 1.7→3.3 ✅ 1.0→1.0 ❌ 1.7→4.3 ✅ Fixed (timeout)
create Tools+prompts+logging 1.7→3.7 ✅ 2.3→1.0 ❌ 3.0→4.7 ❌ Pairwise variance¹
publish NuGet tool 3.0→4.0 ✅ Stable
publish Azure Container Apps 3.0→5.0 ✅ Stable
publish MCP Registry 4.0→4.3 ❌ Pairwise variance¹
debug MCP Inspector 4.7→4.0 ❌ Skill content²
debug VS Code config 4.0→5.0 ✅ Stable
debug Failing tool 3.3→3.7 ❌ Fixed (rubric)
test Unit + integration 2.0→4.7 ✅ Stable
test WebApplicationFactory 3.7→3.3 ❌ Skill content²
test Evaluations 2.0→2.0 ❌ Skill content²

Summary: 7/12 passing, likely 8–9 on re-run with fixes just pushed

¹ Pairwise variance = isolated quality improved but the pairwise LLM judge preferred baseline on this roll. No action needed; will fluctuate run-to-run.
² Skill content = the skill references don't cover the topic well enough for the agent to produce a strong answer.


What I fixed (just pushed to lerich/mcp-skills)

  1. Create timeouts (earlier commit 1289126): 180s/180s/default(120s) → 360s/360s/360s. This fixed all three create scenarios.
  2. Debug "failing tool" rubric (commit 511cb8d): Broadened "Shows how to attach a debugger to the running server process" to accept any valid debugging approach (attach, Debugger.Launch(), launch.json). The skilled answer correctly focused on stdout/stderr pollution (the Initial documentation and validation workflow #1 real-world cause) but was penalized for not showing a specific PID-attach workflow.

Remaining failures — what needs attention

Pairwise judge variance (no action needed)

These two scenarios actually improved in quality but the pairwise judge preferred baseline on this particular roll:

  • Create s3 "tools+prompts+logging": Isolated rubric scores went to 5/5 on all 4 criteria (up from 3.7 avg baseline). But pairwise judge preferred baseline's IHttpClientFactory pattern over skilled's bare HttpClient, and baseline's richer prompt construction. Will likely pass on re-run.
  • Publish "MCP Registry": Skilled scored 4.3 vs baseline 4.0, used 55% fewer tokens (88K→40K), 40% less time. Pairwise judge just disagreed. Will fluctuate.

Skill content gaps (for @leslierichardson95)

1. Debug "Inspector" (4.7→4.0)

The skilled answer only mentions tools when describing what Inspector shows. The rubric expects "tools, prompts, and resources."

  • File: plugins/dotnet-ai/skills/mcp-csharp-debug/references/mcp-inspector.md
  • What to change: The file has separate "Tool Testing", "Prompt Testing", "Resource Browsing" sections, but the intro only says "listing tools, calling them with custom parameters, and inspecting protocol messages." Add an intro line like: "Provides a web UI for testing tools, prompts, and resources" so the agent picks up all three when summarizing.

2. Test "WebApplicationFactory" (3.7→3.3)

The skilled answer shows WebApplicationFactory<Program> setup but never demonstrates an actual tool call through the HTTP endpoint. The rubric criterion "Tests tool invocation through the HTTP endpoint" scored 1.7/5 vs baseline's 3.7/5.

  • File: plugins/dotnet-ai/skills/mcp-csharp-test/references/test-patterns.md
  • What to change: The "HTTP Testing with WebApplicationFactory" section shows CreateClient() and PostAsJsonAsync for raw HTTP, but doesn't show a tool invocation example. Add an example showing how to send a tools/call JSON-RPC request through the HTTP endpoint and verify the response — similar to how the ClientServerTestBase section shows client.CallToolAsync("my_tool", ...) but adapted for HTTP.

3. Test "evaluations" (2.0→2.0, both baseline AND skilled fail)

The skilled answer literally said: "This is about LLM evaluations... outside the scope of the MCP server testing skill." Neither baseline nor skilled can answer this well — the skill simply doesn't have content about evaluation authoring.

  • File to create: plugins/dotnet-ai/skills/mcp-csharp-test/references/evaluation-authoring.md (new)
  • What to add: Content covering the XML qa_pair evaluation format, what makes good evaluation questions (read-only, deterministic, require multi-tool reasoning), and example questions for a product catalog scenario. Then link it from the skill's SKILL.md.
  • This is the biggest gap — without reference content, the agent has nothing to draw on.

Recommended next steps

  1. Re-run eval (/evaluate) to pick up the two fixes just pushed (debug rubric + timeouts). The 2 pairwise-variance failures may also resolve on a fresh roll.
  2. Address the 3 skill content gaps above — these are the systemic issues that won't resolve with re-runs.

@danmoseley
Copy link
Copy Markdown
Member

Next action here on @leslierichardson95 -- hopefully above is helpful -- I'll try to get the "improved analysis guidance tailored for agents" merged in parallel

Copilot AI review requested due to automatic review settings March 25, 2026 16:57
@danmoseley
Copy link
Copy Markdown
Member

Merged my part, reevaluating

@danmoseley
Copy link
Copy Markdown
Member

/evaluate

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 18 out of 18 changed files in this pull request and generated 1 comment.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread tests/dotnet-ai/mcp-csharp-test/eval.yaml
@github-actions
Copy link
Copy Markdown
Contributor

Skill Validation Results

Skill Scenario Quality (Isolated) Quality (Plugin) Skills Loaded Agents Invoked Overfit Verdict
mcp-csharp-create Implement MCP tools with proper attributes and DI 2.7/5 ⏰ → 4.0/5 🟢 2.7/5 ⏰ → 4.3/5 🟢 ✅ mcp-csharp-create; tools: skill / ✅ mcp-csharp-create; tools: skill — / — ✅ 0.07
mcp-csharp-create Create an HTTP MCP server with tools and resources 1.7/5 ⏰ → 4.0/5 🟢 1.7/5 ⏰ → 4.0/5 🟢 ✅ mcp-csharp-create; tools: skill, stop_bash / ✅ mcp-csharp-create; tools: skill, read_bash, stop_bash — / — ✅ 0.07
mcp-csharp-create Create an MCP server with tools, prompts, and proper logging 2.3/5 ⏰ → 4.0/5 🟢 2.3/5 ⏰ → 5.0/5 🟢 ✅ mcp-csharp-create; tools: skill, create, write_bash, stop_bash / ✅ mcp-csharp-create; tools: skill, create — / — ✅ 0.07
mcp-csharp-test Write unit and integration tests for an MCP server 2.0/5 → 4.7/5 🟢 2.0/5 → 5.0/5 🟢 ✅ mcp-csharp-test; tools: skill, report_intent, view / ✅ mcp-csharp-test; tools: skill, report_intent, view — / — 🟡 0.21
mcp-csharp-test Test an HTTP MCP server with WebApplicationFactory 3.7/5 → 3.7/5 3.7/5 → 4.0/5 🟢 ✅ mcp-csharp-test; tools: skill, report_intent, view / ✅ mcp-csharp-test; tools: skill, report_intent, view — / — 🟡 0.21 [1]
mcp-csharp-test Create evaluations for an MCP server 2.0/5 → 1.7/5 ⏰ 🔴 2.0/5 → 2.0/5 ⚠️ NOT ACTIVATED / ✅ mcp-csharp-test; tools: report_intent, skill, view explore / — 🟡 0.21
mcp-csharp-debug Debug an MCP server with MCP Inspector 4.7/5 → 4.3/5 🔴 4.7/5 → 4.7/5 ✅ mcp-csharp-debug; tools: report_intent, skill / ✅ mcp-csharp-debug; tools: skill — / — ✅ 0.16 [2]
mcp-csharp-debug Configure VS Code to use an MCP server 4.0/5 → 4.0/5 4.0/5 → 5.0/5 🟢 ✅ mcp-csharp-debug; tools: skill, view, glob / ✅ mcp-csharp-debug; tools: skill, view, glob — / — ✅ 0.16
mcp-csharp-debug Debug a failing MCP server tool 4.7/5 → 4.0/5 🔴 4.7/5 → 3.3/5 🔴 ✅ mcp-csharp-debug; tools: report_intent, skill / ✅ mcp-csharp-debug; tools: skill — / — ✅ 0.16
mcp-csharp-publish Publish an MCP server as a NuGet tool package 3.0/5 → 4.0/5 🟢 3.0/5 → 4.0/5 🟢 ✅ mcp-csharp-publish; tools: skill, glob / ✅ mcp-csharp-publish; tools: skill, glob — / — 🟡 0.22
mcp-csharp-publish Deploy an HTTP MCP server to Azure Container Apps 3.0/5 → 5.0/5 🟢 3.0/5 → 5.0/5 🟢 ✅ mcp-csharp-publish; tools: skill, report_intent, view / ✅ mcp-csharp-publish; tools: skill, report_intent, view — / — 🟡 0.22
mcp-csharp-publish Publish to the MCP Registry 3.7/5 → 4.0/5 🟢 3.7/5 → 4.0/5 🟢 ✅ mcp-csharp-publish; tools: skill, view / ✅ mcp-csharp-publish; tools: skill, view — / — 🟡 0.22

[1] (Plugin) Quality improved but weighted score is -5.8% due to: tokens (12684 → 45818), tool calls (0 → 3)
[2] (Plugin) Quality unchanged but weighted score is -7.9% due to: tokens (12015 → 29458), tool calls (0 → 1)

timeout — run(s) hit the (120s, 360s) scenario timeout limit; scoring may be impacted by aborting model execution before it could produce its full output (increase via timeout in eval.yaml)

Model: claude-opus-4.6 | Judge: claude-opus-4.6

📖 See InvestigatingResults.md for how to diagnose failures. Additional debugging guidance may be provided by your workflow.

Full results

To investigate failures, paste this to your AI coding agent:

Download eval artifacts with gh run download 23553427829 --repo dotnet/skills --dir /tmp/eval-results, then fetch https://raw.githubusercontent.com/dotnet/skills/8a95d954e9d05b5b6120c39259d27d96bf9e1987/eng/skill-validator/InvestigatingResults.md and follow it to analyze the results.json files. Diagnose each failure, suggest fixes to the eval.yaml and skill content, and tell me what to fix first.

@danmoseley
Copy link
Copy Markdown
Member

Note

This analysis was generated by GitHub Copilot, following the "Download eval artifacts" investigation guidance from the evaluation table comment.

Eval Failure Analysis

I downloaded the eval artifacts (gh run download 23553427829 --repo dotnet/skills --pattern "skill-validator-results-*") and followed InvestigatingResults.md to diagnose the 4 failing scenarios. Here's what I found, using the guide's failure pattern taxonomy.


Failure 1: mcp-csharp-debug / "Debug an MCP server with MCP Inspector"

Score: iso=+4.3%, plug=‑7.9% · Pattern #4 — Quality unchanged, token overhead kills it

Rubrics were a wash (2 ties, baseline won 1, skill won 1). But tokens inflated 12K→29K (2.4×) and tool calls went 0→1. The ‑1.00 token and tool reduction penalties drag the weighted score below zero despite neutral quality.

Fix: Trim skill content size. Quality is genuinely neutral — the skill needs to be more concise or produce clearly differentiated output to offset its token cost.


Failure 2: mcp-csharp-debug / "Debug a failing MCP server tool"

Score: iso=‑21.1%, plug=‑29.7% · Pattern #4 + actual quality regression

Most concerning failure. All 3 per-run scores are negative [‑0.45, ‑0.40, ‑0.33] — consistently worse, not variance. Baseline won 2 rubrics (attaching a debugger, broader debugging approach), skill won 1 (stderr recommendation). The skilled output was shorter (1,331 chars vs 1,998 baseline) and narrower — it focused almost entirely on stdout corruption and stale builds, while the baseline covered a broader range of approaches (file logging, VS Code output panel, common culprits).

Fix: The skill is over-indexing on stderr/stdout as the debugging narrative. It should also cover attaching debuggers, checking VS Code MCP output channels, and other diagnostic approaches the rubric expects. This is a skill content quality issue.


Failure 3: mcp-csharp-test / "Test HTTP MCP server with WebApplicationFactory"

Score: iso=‑5.1%, plug=‑5.8% · Pattern #4 — Slight quality improvement overwhelmed by overhead

Quality actually improved slightly (qual=+0.05–0.07), but tokens went 12K→43–45K (3.5×!) and tool calls 0→3. Rubrics split: skill won on InternalsVisibleTo, baseline won on MCP initialize requests and HTTP tool invocation testing. Per-run scores [‑0.06, ‑0.13, ‑0.30] show variance.

Fix: Borderline case. Reducing skill content size would help the token penalty. The rubric also expects "MCP initialize request" and "tool invocation through HTTP" coverage — the skill should ensure it covers those patterns, not just the InternalsVisibleTo setup detail.


Failure 4: mcp-csharp-test / "Create evaluations for an MCP server"

Score: iso=‑30.0%, plug=‑24.0% · Patterns #5 (Not activated) + #1 (Timeout) + #2 (Baseline already weak)

Multi-pattern pile-up:

  • Isolated: Skill was NOT ACTIVATED. The agent went exploring on its own — 26 tool calls (7 bash, 3 glob, 2 grep), 163K tokens, timed out. It never loaded the skill.
  • Plugin: Skill activated, but a key assertion failed — output didn't mention "read-only/non-destructive/deterministic" evaluation guidance.
  • Baseline quality was already low (~2.0/5).

Fix priority:

  1. Fix activation — the skill's description frontmatter likely doesn't mention "evaluations" or "eval", so the runtime doesn't select it. Add those keywords.
  2. Fix content — even when activated (plugin run), the skill didn't cover the "read-only/deterministic" requirement. Add a section on writing good evaluation questions.
  3. Consider increasing timeout from 120s, though fixing activation is the real fix.

Summary — What to fix first

Priority Scenario Action
1 Create evaluations Fix skill activation (add "evaluation" to description), add eval-writing guidance to skill content
2 Debug a failing tool Broaden skill content beyond stderr — cover debugger attachment, VS Code output panel
3 WebApplicationFactory Reduce skill size; add MCP initialize request / HTTP invocation patterns
4 MCP Inspector Trim skill content to reduce token overhead

The dominant theme across 3 of 4 failures is Pattern #4 (token overhead) — the skills are too large relative to the quality improvement they deliver.


Side note on the investigation flow itself: the gh run download command in the eval table instructions fails with exit code 1 because the workflow run includes a skill-validator-dist.tar.gz artifact that gh can't extract as zip. Adding --pattern "skill-validator-results-*" to the download command avoids this. The InvestigatingResults.md guide was excellent — the failure pattern taxonomy mapped cleanly to every issue found.

mcp-csharp-debug:
- Trim SKILL.md verbosity (183->160 lines) and mcp-inspector.md (67->54 lines)
- Add Diagnosing Tool Errors section covering debugger, output panel, Inspector, common culprits
- Rebalance debugging narrative away from stderr-only focus
- Move HTTP logging config to ide-config.md reference

mcp-csharp-test:
- Add eval/evaluations keywords to frontmatter for activation
- Add HTTP tool invocation test pattern (tools/call via WebApplicationFactory)
- Trim test-patterns.md bloat (remove Test Categories, Coverage, Input Validation)
- Create references/evaluations.md with qa_pair format and read-only/deterministic guidance

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@leslierichardson95
Copy link
Copy Markdown
Contributor Author

/evaluate

@github-actions
Copy link
Copy Markdown
Contributor

Skill Validation Results

Skill Scenario Quality Skills Loaded Overfit Verdict
mcp-csharp-create Implement MCP tools with proper attributes and DI 3.7/5 → 4.0/5 🟢 ✅ mcp-csharp-create; tools: skill, edit, view / ✅ mcp-csharp-create; tools: skill, edit, view ✅ 0.06
mcp-csharp-create Create an HTTP MCP server with tools and resources 2.3/5 ⏰ → 4.3/5 🟢 ✅ mcp-csharp-create; tools: skill, stop_bash / ✅ mcp-csharp-create; tools: skill, stop_bash ✅ 0.06
mcp-csharp-create Create an MCP server with tools, prompts, and proper logging 3.0/5 ⏰ → 4.3/5 🟢 ✅ mcp-csharp-create; tools: skill, edit, create, read_bash / ✅ mcp-csharp-create; tools: skill, edit, create ✅ 0.06
mcp-csharp-test Write unit and integration tests for an MCP server 2.0/5 → 4.0/5 🟢 ✅ mcp-csharp-test; tools: skill, report_intent, view / ✅ mcp-csharp-test; tools: skill, report_intent, view ✅ 0.16
mcp-csharp-test Test an HTTP MCP server with WebApplicationFactory 3.3/5 → 5.0/5 🟢 ✅ mcp-csharp-test; tools: skill, report_intent, view / ✅ mcp-csharp-test; tools: skill, report_intent, view ✅ 0.16
mcp-csharp-test Create evaluations for an MCP server 2.7/5 → 5.0/5 🟢 ✅ mcp-csharp-test; tools: skill / ✅ mcp-csharp-test; tools: skill ✅ 0.16
mcp-csharp-debug Debug an MCP server with MCP Inspector 4.0/5 → 4.0/5 ✅ mcp-csharp-debug; tools: report_intent, skill / ✅ mcp-csharp-debug; tools: report_intent, skill ✅ 0.07 [1]
mcp-csharp-debug Configure VS Code to use an MCP server 4.3/5 → 4.7/5 🟢 ✅ mcp-csharp-debug; tools: skill, view, glob / ✅ mcp-csharp-debug; tools: skill ✅ 0.07
mcp-csharp-debug Debug a failing MCP server tool 3.7/5 → 4.0/5 🟢 ✅ mcp-csharp-debug; tools: report_intent, skill / ✅ mcp-csharp-debug; tools: skill ✅ 0.07 [2]
mcp-csharp-publish Publish an MCP server as a NuGet tool package 3.0/5 → 4.0/5 🟢 ✅ mcp-csharp-publish; tools: skill, glob / ✅ mcp-csharp-publish; tools: skill, glob ✅ 0.11
mcp-csharp-publish Deploy an HTTP MCP server to Azure Container Apps 3.0/5 → 5.0/5 🟢 ✅ mcp-csharp-publish; tools: skill, report_intent, view / ✅ mcp-csharp-publish; tools: skill, report_intent, view ✅ 0.11
mcp-csharp-publish Publish to the MCP Registry 2.7/5 → 4.0/5 🟢 ✅ mcp-csharp-publish; tools: skill, view, report_intent / ✅ mcp-csharp-publish; tools: skill, view, report_intent ✅ 0.11

[1] (Plugin) Quality unchanged but weighted score is -10.7% due to: tokens (12072 → 29517), quality, tool calls (0 → 2)
[2] (Isolated) Quality improved but weighted score is -31.4% due to: quality, judgment, tokens (12175 → 27900), tool calls (0 → 2)

timeout — run(s) hit the (360s) scenario timeout limit; scoring may be impacted by aborting model execution before it could produce its full output (increase via timeout in eval.yaml)

Model: claude-opus-4.6 | Judge: claude-opus-4.6

📖 See InvestigatingResults.md for how to diagnose failures. Additional debugging guidance may be provided by your workflow.

🔍 Full results — includes quality and agent details

To investigate failures, paste this to your AI coding agent:

Download eval artifacts with gh run download 23560926225 --repo dotnet/skills --dir /tmp/eval-results, then fetch https://raw.githubusercontent.com/dotnet/skills/c7f8110f791582174804a80f6a2ce1e0d656cfb7/eng/skill-validator/InvestigatingResults.md and follow it to analyze the results.json files. Diagnose each failure, suggest fixes to the eval.yaml and skill content, and tell me what to fix first.

@danmoseley danmoseley enabled auto-merge (squash) March 25, 2026 20:11
@danmoseley danmoseley merged commit 7286c31 into dotnet:main Mar 25, 2026
30 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants