
Add exp-test-maintainability skill for test structure review#418

Merged
Evangelink merged 7 commits into main from dev/amauryleve/test-maintainability
Mar 26, 2026

Conversation

@Evangelink
Member

Summary

New experimental skill that assesses test maintainability: duplication, test size, data-driven patterns, display names, builder/helper extraction, and shared setup.

Key features

  • Test size and focus — flags oversized tests (>20 lines) and multi-concern tests
  • Duplication detection — identifies 3+ copy-pasted test bodies or setup blocks, recommends [DataRow]/[Theory] or builder extraction
  • Data-driven quality — recommends DisplayName for non-obvious parameterized values
  • Test data construction — identifies scattered object builders, recommends centralized factories
  • Honest calibration — respects intentional verbosity, only recommends extraction for 3+ occurrences, doesn't suggest builders for trivial objects
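
As an illustration of the duplication-to-data-driven refactor the skill recommends, here is a minimal MSTest sketch (all names and values invented for this example, not taken from the skill):

```csharp
using Microsoft.VisualStudio.TestTools.UnitTesting;

[TestClass]
public class DiscountTests
{
    // Before: three copy-pasted tests differing only in values.
    // After: one data-driven test; DisplayName documents the non-obvious numbers.
    [DataTestMethod]
    [DataRow(100.0, 0.10, 90.0, DisplayName = "10% discount on 100 yields 90")]
    [DataRow(100.0, 0.00, 100.0, DisplayName = "Zero discount leaves price unchanged")]
    [DataRow(100.0, 1.00, 0.0, DisplayName = "Full discount reduces price to zero")]
    public void ApplyDiscount_ReturnsExpectedPrice(double price, double rate, double expected)
    {
        Assert.AreEqual(expected, price * (1 - rate), 0.001);
    }
}
```

Per the calibration rules, this extraction would only be recommended once three or more near-identical test bodies exist.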

Evaluation results (3 runs, claude-opus-4.6)

| Scenario | Baseline | Isolated | Plugin | Verdict |
|---|---|---|---|---|
| Data-driven patterns + display names | 4.0/5 | 5.0/5 | 4.7/5 | ✅ |
| Well-maintained recognition | 4.0/5 | 5.0/5 | 5.0/5 | ✅ |
| Builder extraction | 4.7/5 | 5.0/5 | 5.0/5 | ❌ ¹ |
| Oversized tests | 5.0/5 | 5.0/5 | 5.0/5 | ❌ ¹ |

¹ Quality improved or matched but weighted score penalized by token overhead. The baseline LLM already excels at refactoring recommendations — the skill's strongest value is on judgment calls (recognizing well-maintained tests, calibrating when NOT to recommend changes).
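
The builder extraction feature called out above could look like this minimal sketch (every type and member name here is hypothetical):

```csharp
// Hypothetical domain type and centralized test-data builder, replacing
// object construction scattered across many test bodies.
public sealed record Order(string Customer, decimal Total);

public sealed class OrderBuilder
{
    // Sensible defaults so each test spells out only the detail under test.
    private string _customer = "default-customer";
    private decimal _total = 10m;

    public OrderBuilder WithCustomer(string customer) { _customer = customer; return this; }
    public OrderBuilder WithTotal(decimal total) { _total = total; return this; }
    public Order Build() => new(_customer, _total);
}
```

A test then reads `new OrderBuilder().WithTotal(0m).Build()`, keeping only the field relevant to the assertion visible; per the calibration rules, the skill would not suggest this for trivial objects.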

Copilot AI review requested due to automatic review settings March 20, 2026 15:00
@Evangelink
Member Author

/evaluate

@github-actions
Contributor

Skill Validation Results

| Skill | Scenario | Quality (Isolated) | Quality (Plugin) | Skills Loaded | Overfit | Verdict |
|---|---|---|---|---|---|---|
| exp-test-maintainability | Identify duplicated test setup and recommend builder extraction | 4.7/5 → 5.0/5 🟢 | 4.7/5 → 5.0/5 🟢 | ✅ exp-test-maintainability; tools: report_intent, skill / ✅ exp-test-maintainability; tools: report_intent, skill | 0.12 | ✅ [1] |
| exp-test-maintainability | Recommend data-driven patterns with display names for unclear parameters | 4.0/5 → 4.7/5 🟢 | 4.0/5 → 5.0/5 🟢 | ✅ exp-test-maintainability; tools: report_intent, skill / ✅ exp-test-maintainability; tools: report_intent, skill | 0.12 | ✅ [2] |
| exp-test-maintainability | Recognize well-maintained tests that need minimal changes | 4.0/5 → 5.0/5 🟢 | 4.0/5 → 5.0/5 🟢 | ✅ exp-test-maintainability; tools: report_intent, skill / ✅ exp-test-maintainability; tools: skill, report_intent | 0.12 | ✅ |
| exp-test-maintainability | Identify oversized tests that test multiple concerns | 5.0/5 → 5.0/5 | 5.0/5 → 5.0/5 | ✅ exp-test-maintainability; tools: report_intent, skill / ✅ exp-test-maintainability; tools: report_intent, skill | 0.12 | ✅ [3] |

[1] (Plugin) Quality improved but weighted score is -3.2% due to: tokens (13988 → 30984), tool calls (0 → 2), time (25.0s → 71.2s)
[2] (Isolated) Quality improved but weighted score is -10.0% due to: tokens (12911 → 28606), tool calls (0 → 2), time (16.5s → 53.1s)
[3] (Isolated) Quality unchanged but weighted score is -26.8% due to: judgment, quality, tokens (14007 → 29787), tool calls (0 → 2), time (57.7s → 78.7s)

Model: claude-opus-4.6 | Judge: claude-opus-4.6

Full results

Replace pure-refactoring scenarios (where baseline scores 5.0) with
mixed-quality scenarios that test selective judgment. Results:
- Mixed-quality selective assessment: baseline 5.0 = skill 5.0 (overhead penalty)
- Data-driven + display names: baseline 4.0 -> skill 4.7 (overhead penalty on plugin)
- Well-maintained recognition: baseline 4.0 -> skill 5.0 - PASS
- Oversized among well-sized: baseline 5.0 = skill 5.0 (overhead penalty)
Contributor

Copilot AI left a comment


Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@Evangelink
Member Author

Moving back to draft; it seems this will need more work/thought.

@Evangelink Evangelink marked this pull request as draft March 20, 2026 15:54
…esign notes

- Trim SKILL.md: remove Steps 4-5, validation checklist, and pitfalls table
  (~25% token reduction). Add calibration rule preferring DataRow+DisplayName
  over DynamicData for compile-time constants.
- Remove 'Oversized tests' eval scenario (baseline at 5.0/5 ceiling).
- Add design notes documenting evaluation rounds and rationale.
@Evangelink
Member Author

/evaluate

@github-actions
Contributor

Skill Validation Results

| Skill | Scenario | Quality (Isolated) | Quality (Plugin) | Skills Loaded | Overfit | Verdict |
|---|---|---|---|---|---|---|
| exp-test-maintainability | Selectively recommend changes in mixed-quality test suite | 5.0/5 → 5.0/5 | 5.0/5 → 5.0/5 | ✅ exp-test-maintainability; tools: report_intent, skill / ✅ exp-test-maintainability; tools: report_intent, skill | 0.10 | ✅ [1] |
| exp-test-maintainability | Recommend data-driven patterns with display names for unclear parameters | 4.0/5 → 5.0/5 🟢 | 4.0/5 → 4.0/5 | ✅ exp-test-maintainability; tools: report_intent, skill / ✅ exp-test-maintainability; tools: report_intent, skill | 0.10 | ✅ |
| exp-test-maintainability | Recognize well-maintained tests that need minimal changes | 3.7/5 → 5.0/5 🟢 | 3.7/5 → 5.0/5 🟢 | ✅ exp-test-maintainability; tools: skill, report_intent / ✅ exp-test-maintainability; tools: skill, report_intent | 0.10 | ✅ |

[1] (Plugin) Quality unchanged but weighted score is -11.0% due to: tokens (12942 → 29566), tool calls (0 → 2), time (10.4s → 31.1s), quality

Model: claude-opus-4.6 | Judge: claude-opus-4.6

Full results

@Evangelink
Member Author

/evaluate

@github-actions
Contributor

Skill Validation Results

| Skill | Scenario | Quality (Isolated) | Quality (Plugin) | Skills Loaded | Overfit | Verdict |
|---|---|---|---|---|---|---|
| exp-test-maintainability | Selectively recommend changes in mixed-quality test suite | 5.0/5 → 5.0/5 | 5.0/5 → 5.0/5 | ✅ exp-test-maintainability; tools: report_intent, skill / ✅ exp-test-maintainability; tools: report_intent, skill | 0.09 | ✅ [1] |
| exp-test-maintainability | Recommend data-driven patterns with display names for unclear parameters | 4.0/5 → 4.3/5 🟢 | 4.0/5 → 4.3/5 🟢 | ✅ exp-test-maintainability; tools: skill, report_intent / ✅ exp-test-maintainability; tools: skill, report_intent | 0.09 | ✅ [2] |
| exp-test-maintainability | Recognize well-maintained tests that need minimal changes | 4.7/5 → 5.0/5 🟢 | 4.7/5 → 5.0/5 🟢 | ✅ exp-test-maintainability; tools: report_intent, skill / ✅ exp-test-maintainability; tools: skill, report_intent | 0.09 | ✅ [3] |

[1] (Plugin) Quality unchanged but weighted score is -11.0% due to: tokens (13388 → 35555), tool calls (0 → 2), time (16.9s → 34.6s), quality
[2] (Plugin) Quality improved but weighted score is -4.2% due to: tokens (13148 → 30114), tool calls (0 → 1), time (15.7s → 33.6s)
[3] (Isolated) Quality improved but weighted score is -15.0% due to: judgment, tokens (12736 → 37977), tool calls (0 → 2), time (17.3s → 30.4s)

Model: claude-opus-4.6 | Judge: claude-opus-4.6

📖 See InvestigatingResults.md for how to diagnose failures. Additional debugging guidance may be provided by your workflow.

Full results

To investigate failures, paste this to your AI coding agent:

Download eval artifacts with gh run download 23554538807 --repo dotnet/skills --dir /tmp/eval-results, then fetch https://raw.githubusercontent.com/dotnet/skills/c7ee0838d78a9f94bdddfc23654a63729a3549e6/eng/skill-validator/InvestigatingResults.md and follow it to analyze the results.json files. Diagnose each failure, suggest fixes to the eval.yaml and skill content, and tell me what to fix first.

- SKILL.md: Remove When to Use/Not to Use, Inputs, Steps 1-2 detection tables,
  Step 4 report format. Keep only calibration rules (~75% token reduction).
- eval.yaml: Remove 'Selectively recommend changes' scenario (baseline at
  ceiling 5.0/5, cannot pass). Reword rubric items to test outcomes not
  skill vocabulary. Remove narrow 'builder pattern' assertion.
- Design notes: Record rounds 3-4 results and decisions.

Results: 2/2 scenarios pass, overfit 0.13 (green), quality +0.7-1.0/5.
@Evangelink Evangelink marked this pull request as ready for review March 26, 2026 10:24
Copilot AI review requested due to automatic review settings March 26, 2026 10:24
@Evangelink Evangelink enabled auto-merge (squash) March 26, 2026 10:24
@Evangelink
Member Author

/evaluate

Comment thread docs/dotnet-experimental/exp-test-maintainability-design-notes.md
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (1)

plugins/dotnet-experimental/skills/exp-test-maintainability/SKILL.md:18

  • These calibration rules recommend adding DisplayName and preferring DataRow over DynamicData, which is correct for MSTest but doesn’t map cleanly to xUnit/NUnit/TUnit as-written. Consider rephrasing in framework-agnostic terms (e.g., “prefer inline/attribute-based test cases over source-based/dynamic case sources for compile-time constants”) and mention the framework-specific naming mechanisms so the skill doesn’t push DisplayName/DataRow in non-MSTest test suites.
- **Display names matter most for non-obvious values.** `[DataRow("Gold", 100.0, 90.0)]` is self-explanatory. `[DataRow(3, 7, 42)]` is not — add `DisplayName`.
- **Prefer `[DataRow]` with `DisplayName` over `[DynamicData]`** when all values are compile-time constants. `[DataRow]` is simpler. Reserve `[DynamicData]` for computed or complex values.
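
The calibration rule quoted above could be sketched in MSTest as follows, with framework-agnostic analogues noted in comments (all method names and values here are hypothetical):

```csharp
using System;
using System.Collections.Generic;
using Microsoft.VisualStudio.TestTools.UnitTesting;

[TestClass]
public class CalibrationExamples
{
    // Compile-time constants: inline attribute cases, named when values
    // are non-obvious. (xUnit analogue: [InlineData] on a [Theory];
    // NUnit: [TestCase(..., TestName = ...)].)
    [DataTestMethod]
    [DataRow(3, 7, 42, DisplayName = "Non-obvious inputs get an explanatory name")]
    public void MultiplyThenDouble_ReturnsExpected(int a, int b, int expected)
    {
        Assert.AreEqual(expected, a * b * 2);
    }

    // Computed or complex values: a dynamic case source.
    // (xUnit analogue: [MemberData]; NUnit: [TestCaseSource].)
    public static IEnumerable<object[]> DateCases()
    {
        yield return new object[] { new DateTime(2026, 3, 26), DayOfWeek.Thursday };
    }

    [DataTestMethod]
    [DynamicData(nameof(DateCases), DynamicDataSourceType.Method)]
    public void DayOfWeek_IsComputed(DateTime date, DayOfWeek expected)
    {
        Assert.AreEqual(expected, date.DayOfWeek);
    }
}
```
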


@github-actions
Contributor

Skill Validation Results

| Skill | Scenario | Quality | Skills Loaded | Overfit | Verdict |
|---|---|---|---|---|---|
| exp-test-maintainability | Recommend data-driven patterns with display names for unclear parameters | 3.7/5 → 4.7/5 🟢 | ✅ exp-test-maintainability; tools: skill, report_intent / ✅ exp-test-maintainability; tools: skill | 0.09 | ✅ |
| exp-test-maintainability | Recognize well-maintained tests that need minimal changes | 4.7/5 → 5.0/5 🟢 | ✅ exp-test-maintainability; tools: skill, report_intent / ✅ exp-test-maintainability; tools: skill, report_intent | 0.09 | ✅ |

Model: claude-opus-4.6 | Judge: claude-opus-4.6

📖 See InvestigatingResults.md for how to diagnose failures. Additional debugging guidance may be provided by your workflow.

🔍 Full results — includes quality and agent details

To investigate failures, paste this to your AI coding agent:

Download eval artifacts with gh run download 23589391216 --repo dotnet/skills --pattern "skill-validator-results-*" --dir ./eval-results, then fetch https://raw.githubusercontent.com/dotnet/skills/3668aec951209561e2f394ceb1721b5459e5c9e6/eng/skill-validator/InvestigatingResults.md and follow it to analyze the results.json files. Diagnose each failure, suggest fixes to the eval.yaml and skill content, and tell me what to fix first.

@Evangelink Evangelink merged commit a220f36 into main Mar 26, 2026
34 checks passed
@Evangelink Evangelink deleted the dev/amauryleve/test-maintainability branch March 26, 2026 13:28
