Add exp-test-maintainability skill for test structure review #418
Evangelink merged 7 commits into main from
Conversation
New experimental skill that assesses test maintainability: duplication, test size, data-driven patterns, display names, builder/helper extraction, and shared setup.

Evaluation results (3 runs, claude-opus-4.6):
- Data-driven patterns: 4.0 -> 5.0 (isolated), 4.7 (plugin) - PASS
- Well-maintained recognition: 4.0 -> 5.0 (both modes) - PASS
- Builder extraction: 4.7 -> 5.0 (both) - overhead penalty (baseline too strong)
- Oversized tests: 5.0 -> 5.0 (both) - overhead penalty (baseline too strong)

The skill's strongest value is on judgment calls (recognizing well-maintained tests and calibrating when NOT to recommend changes) rather than refactoring suggestions, where the baseline LLM already excels.
/evaluate
Skill Validation Results
[1] (Plugin) Quality improved but weighted score is -3.2% due to: tokens (13988 → 30984), tool calls (0 → 2), time (25.0s → 71.2s)
Model: claude-opus-4.6 | Judge: claude-opus-4.6
Replace pure-refactoring scenarios (where baseline scores 5.0) with mixed-quality scenarios that test selective judgment. Results:
- Mixed-quality selective assessment: baseline 5.0 = skill 5.0 (overhead penalty)
- Data-driven + display names: baseline 4.0 -> skill 4.7 (overhead penalty on plugin)
- Well-maintained recognition: baseline 4.0 -> skill 5.0 - PASS
- Oversized among well-sized: baseline 5.0 = skill 5.0 (overhead penalty)
Moving back to draft; it seems it will need more work and thought.
…esign notes
- Trim SKILL.md: remove Steps 4-5, the validation checklist, and the pitfalls table (~25% token reduction). Add a calibration rule preferring DataRow+DisplayName over DynamicData for compile-time constants.
- Remove the 'Oversized tests' eval scenario (baseline at 5.0/5 ceiling).
- Add design notes documenting evaluation rounds and rationale.
/evaluate
Skill Validation Results
[1] (Plugin) Quality unchanged but weighted score is -11.0% due to: tokens (12942 → 29566), tool calls (0 → 2), time (10.4s → 31.1s), quality
Model: claude-opus-4.6 | Judge: claude-opus-4.6
/evaluate
Skill Validation Results
[1] (Plugin) Quality unchanged but weighted score is -11.0% due to: tokens (13388 → 35555), tool calls (0 → 2), time (16.9s → 34.6s), quality
Model: claude-opus-4.6 | Judge: claude-opus-4.6
- SKILL.md: Remove When to Use/Not to Use, Inputs, the Steps 1-2 detection tables, and the Step 4 report format. Keep only the calibration rules (~75% token reduction).
- eval.yaml: Remove the 'Selectively recommend changes' scenario (baseline at ceiling 5.0/5, cannot pass). Reword rubric items to test outcomes, not skill vocabulary. Remove the narrow 'builder pattern' assertion.
- Design notes: Record rounds 3-4 results and decisions.

Results: 2/2 scenarios pass, overfit 0.13 (green), quality +0.7-1.0/5.
/evaluate
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
Comments suppressed due to low confidence (1)
plugins/dotnet-experimental/skills/exp-test-maintainability/SKILL.md:18
- These calibration rules recommend adding `DisplayName` and preferring `DataRow` over `DynamicData`, which is correct for MSTest but doesn't map cleanly to xUnit/NUnit/TUnit as written. Consider rephrasing in framework-agnostic terms (e.g., "prefer inline/attribute-based test cases over source-based/dynamic case sources for compile-time constants") and mention the framework-specific naming mechanisms so the skill doesn't push `DisplayName`/`DataRow` in non-MSTest test suites.
- **Display names matter most for non-obvious values.** `[DataRow("Gold", 100.0, 90.0)]` is self-explanatory. `[DataRow(3, 7, 42)]` is not — add `DisplayName`.
- **Prefer `[DataRow]` with `DisplayName` over `[DynamicData]`** when all values are compile-time constants. `[DataRow]` is simpler. Reserve `[DynamicData]` for computed or complex values.
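The two calibration rules above can be illustrated with a minimal MSTest sketch. The test class, method names, and values here are hypothetical, chosen only to show the contrast: inline `[DataRow]` with `DisplayName` for non-obvious compile-time constants, and `[DynamicData]` reserved for computed values.

```csharp
using System;
using System.Collections.Generic;
using Microsoft.VisualStudio.TestTools.UnitTesting;

[TestClass]
public class CalibrationExampleTests
{
    // Compile-time constants: [DataRow] keeps cases inline and simple.
    // The raw numbers are not self-explanatory, so DisplayName is added.
    [TestMethod]
    [DataRow(3, 7, 42, DisplayName = "3 items x 7 units at double rate = 42")]
    [DataRow(0, 7, 0, DisplayName = "zero items yields zero total")]
    public void Total_IsComputedFromItemsAndUnits(int items, int units, int expected)
    {
        Assert.AreEqual(expected, items * units * 2);
    }

    // Computed values (here, a DateTime) cannot be DataRow constants:
    // this is the case [DynamicData] is reserved for.
    public static IEnumerable<object[]> ComputedCases()
    {
        yield return new object[] { DateTime.UnixEpoch, 0L };
    }

    [TestMethod]
    [DynamicData(nameof(ComputedCases), DynamicDataSourceType.Method)]
    public void Timestamp_RoundTrips(DateTime value, long expectedSeconds)
    {
        Assert.AreEqual(expectedSeconds, new DateTimeOffset(value).ToUnixTimeSeconds());
    }
}
```

Note how the first test would be strictly worse as `[DynamicData]` (an extra member, indirection, no readability gain), while the second cannot be expressed as `[DataRow]` at all, which is the boundary the calibration rule draws.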
Skill Validation Results
Model: claude-opus-4.6 | Judge: claude-opus-4.6
Summary
New experimental skill that assesses test maintainability: duplication, test size, data-driven patterns, display names, builder/helper extraction, and shared setup.
Key features
- `[DataRow]`/`[Theory]` or builder extraction
- `DisplayName` for non-obvious parameterized values

Evaluation results (3 runs, claude-opus-4.6)
¹ Quality improved or matched but weighted score penalized by token overhead. The baseline LLM already excels at refactoring recommendations — the skill's strongest value is on judgment calls (recognizing well-maintained tests, calibrating when NOT to recommend changes).