Add exp-test-maintainability skill for test structure review #418
Evangelink merged 7 commits into main from
Conversation
New experimental skill that assesses test maintainability: duplication, test size, data-driven patterns, display names, builder/helper extraction, and shared setup.

Evaluation results (3 runs, claude-opus-4.6):
- Data-driven patterns: 4.0 -> 5.0 (isolated), 4.7 (plugin) - PASS
- Well-maintained recognition: 4.0 -> 5.0 (both modes) - PASS
- Builder extraction: 4.7 -> 5.0 (both) - overhead penalty (baseline too strong)
- Oversized tests: 5.0 -> 5.0 (both) - overhead penalty (baseline too strong)

The skill's strongest value is on judgment calls (recognizing well-maintained tests and calibrating when NOT to recommend changes) rather than refactoring suggestions, where the baseline LLM already excels.
/evaluate
Skill Validation Results
[1] (Plugin) Quality improved but weighted score is -3.2% due to: tokens (13988 → 30984), tool calls (0 → 2), time (25.0s → 71.2s)
Model: claude-opus-4.6 | Judge: claude-opus-4.6
Replace pure-refactoring scenarios (where baseline scores 5.0) with mixed-quality scenarios that test selective judgment. Results:
- Mixed-quality selective assessment: baseline 5.0 = skill 5.0 (overhead penalty)
- Data-driven + display names: baseline 4.0 -> skill 4.7 (overhead penalty on plugin)
- Well-maintained recognition: baseline 4.0 -> skill 5.0 - PASS
- Oversized among well-sized: baseline 5.0 = skill 5.0 (overhead penalty)
Moving back to draft; it seems it will need more work and thought.
…esign notes
- Trim SKILL.md: remove Steps 4-5, the validation checklist, and the pitfalls table (~25% token reduction). Add a calibration rule preferring DataRow+DisplayName over DynamicData for compile-time constants.
- Remove the 'Oversized tests' eval scenario (baseline at 5.0/5 ceiling).
- Add design notes documenting evaluation rounds and rationale.
/evaluate
Skill Validation Results
[1] (Plugin) Quality unchanged but weighted score is -11.0% due to: tokens (12942 → 29566), tool calls (0 → 2), time (10.4s → 31.1s), quality
Model: claude-opus-4.6 | Judge: claude-opus-4.6
/evaluate
Skill Validation Results
[1] (Plugin) Quality unchanged but weighted score is -11.0% due to: tokens (13388 → 35555), tool calls (0 → 2), time (16.9s → 34.6s), quality
Model: claude-opus-4.6 | Judge: claude-opus-4.6
- SKILL.md: Remove When to Use/Not to Use, Inputs, the Steps 1-2 detection tables, and the Step 4 report format. Keep only the calibration rules (~75% token reduction).
- eval.yaml: Remove the 'Selectively recommend changes' scenario (baseline at ceiling 5.0/5, cannot pass). Reword rubric items to test outcomes, not skill vocabulary. Remove the narrow 'builder pattern' assertion.
- Design notes: Record rounds 3-4 results and decisions.

Results: 2/2 scenarios pass, overfit 0.13 (green), quality +0.7-1.0/5.
/evaluate
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
Comments suppressed due to low confidence (1)
plugins/dotnet-experimental/skills/exp-test-maintainability/SKILL.md:18
- These calibration rules recommend adding `DisplayName` and preferring `DataRow` over `DynamicData`, which is correct for MSTest but doesn't map cleanly to xUnit/NUnit/TUnit as written. Consider rephrasing in framework-agnostic terms (e.g., "prefer inline/attribute-based test cases over source-based/dynamic case sources for compile-time constants") and mention the framework-specific naming mechanisms so the skill doesn't push `DisplayName`/`DataRow` in non-MSTest test suites.
- **Display names matter most for non-obvious values.** `[DataRow("Gold", 100.0, 90.0)]` is self-explanatory. `[DataRow(3, 7, 42)]` is not — add `DisplayName`.
- **Prefer `[DataRow]` with `DisplayName` over `[DynamicData]`** when all values are compile-time constants. `[DataRow]` is simpler. Reserve `[DynamicData]` for computed or complex values.
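The two calibration rules above can be illustrated with a minimal MSTest sketch. The test class, method names, and values here are hypothetical, chosen only to show the contrast: inline `[DataRow]` with `DisplayName` for non-obvious compile-time constants, and `[DynamicData]` reserved for computed values.

```csharp
using System;
using System.Collections.Generic;
using Microsoft.VisualStudio.TestTools.UnitTesting;

[TestClass]
public class CalibrationExampleTests
{
    // Compile-time constants: [DataRow] keeps cases inline and simple.
    // The raw numbers are not self-explanatory, so DisplayName is added.
    [TestMethod]
    [DataRow(3, 7, 42, DisplayName = "3 items x 7 units at double rate = 42")]
    [DataRow(0, 7, 0, DisplayName = "zero items yields zero total")]
    public void Total_IsComputedFromItemsAndUnits(int items, int units, int expected)
    {
        Assert.AreEqual(expected, items * units * 2);
    }

    // Computed values (here, a DateTime) cannot be DataRow constants:
    // this is the case [DynamicData] is reserved for.
    public static IEnumerable<object[]> ComputedCases()
    {
        yield return new object[] { DateTime.UnixEpoch, 0L };
    }

    [TestMethod]
    [DynamicData(nameof(ComputedCases), DynamicDataSourceType.Method)]
    public void Timestamp_RoundTrips(DateTime value, long expectedSeconds)
    {
        Assert.AreEqual(expectedSeconds, new DateTimeOffset(value).ToUnixTimeSeconds());
    }
}
```

Note how the first test would be strictly worse as `[DynamicData]` (an extra member, indirection, no readability gain), while the second cannot be expressed as `[DataRow]` at all, which is the boundary the calibration rule draws.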
Skill Validation Results
Model: claude-opus-4.6 | Judge: claude-opus-4.6
Summary
New experimental skill that assesses test maintainability: duplication, test size, data-driven patterns, display names, builder/helper extraction, and shared setup.
Key features
- `[DataRow]`/`[Theory]` or builder extraction
- `DisplayName` for non-obvious parameterized values

Evaluation results (3 runs, claude-opus-4.6)
¹ Quality improved or matched but weighted score penalized by token overhead. The baseline LLM already excels at refactoring recommendations — the skill's strongest value is on judgment calls (recognizing well-maintained tests, calibrating when NOT to recommend changes).