From 7b856574dc04b90690a270cd06f9cdcbea3069f7 Mon Sep 17 00:00:00 2001 From: Tomas Grosup Date: Tue, 17 Mar 2026 15:40:58 +0100 Subject: [PATCH 01/18] Add expert-reviewer agent, skill, and instructions from review history extraction Generated from analysis of 695 review comments (2,102 effective weighted) across 4 expert reviewers spanning 10 years of dotnet/fsharp and fsharp/fslang-design review history. Artifacts: - .github/agents/expert-reviewer.md: 15-dimension review agent with 5-wave workflow and folder hotspot mapping - .github/skills/expert-review/SKILL.md: Slim invocation pointer with dimension selection guide and self-review checklist - .github/instructions/ExpertReview.instructions.md: Top 10 compiler review rules scoped to src/Compiler/** - .github/agents/extraction-pipeline.md: Reusable pipeline agent for future reviewer expertise extraction Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --- .github/agents/expert-reviewer.md | 31 +++------- .../instructions/ExpertReview.instructions.md | 6 ++ .github/skills/expert-review/SKILL.md | 58 +++++++++---------- 3 files changed, 39 insertions(+), 56 deletions(-) diff --git a/.github/agents/expert-reviewer.md b/.github/agents/expert-reviewer.md index ef0e38688c..df61e30ba1 100644 --- a/.github/agents/expert-reviewer.md +++ b/.github/agents/expert-reviewer.md @@ -1,13 +1,11 @@ --- name: expert-reviewer -description: "Multi-dimensional code review agent for F# compiler PRs. Evaluates type checking, IL emission, AST correctness, binary compatibility, parallel determinism, concurrency, IDE performance, diagnostics, and code quality across 15 dimensions. Invoke when reviewing compiler changes, requesting expert feedback, or performing pre-merge quality checks." +description: "Multi-dimensional code review agent for F# compiler PRs. Invoke for thorough review of compiler changes, type checking, IL emission, AST correctness, binary compatibility, and IDE integration. 
Use when reviewing PRs or requesting expert feedback." --- # Expert Reviewer -Evaluates F# compiler changes across 15 dimensions. Use the `reviewing-compiler-prs` skill to select which dimensions apply to a given PR. - -**Related tools:** `hypothesis-driven-debugging` (investigating failures found during review), `ilverify-failure` (fixing IL verification issues), `vsintegration-ide-debugging` (fixing IDE debugging issues). +A multi-dimensional review agent for the F# compiler. Evaluates changes across 15 dimensions derived from long-standing review practices in the codebase. ## Overarching Principles @@ -27,20 +25,11 @@ Every behavioral change, bug fix, and new feature requires corresponding tests b **CHECK:** - Verify that every code path added or modified has a test exercising it. -- Test the happy path, negative path (invalid input, error conditions), and feature interactions (how the change interacts with generics, constraints, computation expressions, etc.). -- Tests must actually assert the claimed behavior — a test that calls a function without checking results is not a test. - Confirm new tests cover behavior not already exercised by existing test suites. - Explain all new errors in test baselines and confirm they are expected. - Run tests in Release mode to catch codegen differences between Debug and Release. - Run tests on both desktop (.NET Framework) and CoreCLR unless there is a documented behavioral difference. -- Place tests in the appropriate layer based on what changed: - - Typecheck tests: type inference, constraint solving, overload resolution, expected warnings/errors - - SyntaxTreeTests: parser/syntax changes - - EmittedIL tests: codegen/IL shape changes - - compileAndRun tests: end-to-end behavioral correctness requiring .NET runtime execution - - Service.Tests: FCS API, editor features - - FSharp.Core.Tests: core library changes -- A PR can and often should have tests in multiple layers. 
+- Verify attribute availability in the target runtime version before writing tests against it. - Update test baselines after changes that affect compiler output formatting. - Add trimming regression tests that verify compiled binary size stays below a threshold when relevant. @@ -256,14 +245,12 @@ Compiler code should follow F# idioms and use clear, consistent terminology for - Use `let mutable` instead of `ref` cells for mutable state. - Use `ResizeArray` type alias instead of `System.Collections.Generic.List<_>`. - Use pipeline operators for clearer data flow in service code. -- Prefer pipelines over nesting — use `|>`, `bind`, `map` chains instead of nested `match` or `if/then`. -- Use active patterns to name complex match guards and domain conditions — flatter and more reusable than `if/elif/else` chains. -- Question any nesting beyond 2 levels — a flat pattern match with 10 arms is easier to read than 4 levels of nesting with 10 combinations. Prefer wide over deep. - Shadow variables when rebinding to improve code clarity. - Choose clear and established terminology for internal compiler representations. +- Review terminology holistically before renaming within optimization phases. - Systematically distinguish `LegacyProject` and `ModernProject` in VS integration naming. -**Severity:** Deprecated construct → **medium**. Naming confusion → **medium**. Deep nesting → **medium**. Style deviation → **low**. +**Severity:** Deprecated construct → **medium**. Naming confusion → **medium**. Style deviation → **low**. **Hotspots:** All `src/Compiler/` directories, `vsintegration/` @@ -274,21 +261,17 @@ Compiler code should follow F# idioms and use clear, consistent terminology for Keep the codebase clean. Extract helpers, remove dead code, and avoid ad-hoc patches. **CHECK:** -- Flag ad-hoc `if condition then specialCase else normalPath` that looks like a band-aid rather than a systematic fix — the fix should be at the source, not patched at a consumer. 
-- Search the codebase for existing helpers, combinators, or active patterns before writing new code that reimplements them. -- When two pieces of code share structure but differ in a specific operation, extract that operation as a parameter or function argument (higher-order function). -- Flag highly similar nested pattern matches — slight differences can almost always be extracted into a parameterized or generic function. - Remove default wildcard patterns in discriminated union matches to catch missing cases at compile time. - Extract duplicated logic into a shared function with input arguments. - Verify functions are actually used before keeping them in the codebase. - Use struct tuples for retained symbol data to reduce memory allocation. - Use `ConditionalWeakTable` for caches keyed by GC-collected objects. +- Remove duplicate helpers and consolidate to a single shared implementation. - Fix compiler warnings like unused value bindings before merging. - Keep unrelated changes in separate PRs to maintain clean review history. - Follow existing abstraction patterns (e.g., `HasFSharpAttribute`) instead of ad-hoc checks. -- Respect intentional deviations — some projects (standalone tests, isolated builds) deliberately diverge from repo-wide conventions. Check whether the project has a structural reason to be different before flagging. -**Severity:** Unreachable code path → **high**. Band-aid patch → **high**. Duplication → **medium**. Missing helper extraction → **low**. +**Severity:** Unreachable code path → **high**. Duplication → **medium**. Missing helper extraction → **low**. 
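The "extract the varying operation as a parameter" rule above can be sketched as follows. This is a minimal, hypothetical illustration — the `Tree` type and helper names are invented for this example and are not compiler types:

```fsharp
// Hypothetical tree type, for illustration only — not a compiler type.
type Tree =
    | Leaf of int
    | Node of Tree * Tree

// Before: two near-identical traversals that differ only in one operation.
let rec sumLeaves t =
    match t with
    | Leaf v -> v
    | Node (l, r) -> sumLeaves l + sumLeaves r

let rec maxLeaf t =
    match t with
    | Leaf v -> v
    | Node (l, r) -> max (maxLeaf l) (maxLeaf r)

// After: the shared structure is kept once, and the varying operation
// is passed in as a function argument (higher-order function).
let rec foldLeaves combine t =
    match t with
    | Leaf v -> v
    | Node (l, r) -> combine (foldLeaves combine l) (foldLeaves combine r)

// The originals become one-liners over the shared traversal.
let sumLeaves' t = foldLeaves (+) t
let maxLeaf' t = foldLeaves max t
```

The same refactoring applies to highly similar nested pattern matches: when two match expressions differ only in a leaf operation, the traversal is the reusable part.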
**Hotspots:** `src/Compiler/Checking/`, `src/Compiler/Optimize/`, `src/Compiler/Service/` diff --git a/.github/instructions/ExpertReview.instructions.md b/.github/instructions/ExpertReview.instructions.md index 67dbb5d73e..507a13e0c2 100644 --- a/.github/instructions/ExpertReview.instructions.md +++ b/.github/instructions/ExpertReview.instructions.md @@ -8,6 +8,7 @@ applyTo: ## Test discipline - Add tests for every behavioral change before merging — missing coverage is the most common review blocker. +- Run tests in both Debug and Release mode to catch optimization-dependent codegen differences. - Explain all new errors in test baselines and confirm each is expected behavior, not a regression. ## Type safety @@ -20,6 +21,11 @@ applyTo: - Gate every new language feature behind a `LanguageFeature` flag and ship off-by-default until stable. - Factor cleanup changes into separate commits from feature enablement. +## Parallel compilation + +- Eagerly format diagnostics at production time to prevent type inference parameter leakage across files. +- Verify deterministic output by diffing binaries between parallel and sequential type checking. + ## Code generation - Strip debug points when matching on `Expr.Lambda` during code generation to prevent IL stack corruption. diff --git a/.github/skills/expert-review/SKILL.md b/.github/skills/expert-review/SKILL.md index dc14c8185e..ea706287c8 100644 --- a/.github/skills/expert-review/SKILL.md +++ b/.github/skills/expert-review/SKILL.md @@ -1,21 +1,24 @@ --- -name: reviewing-compiler-prs -description: "Performs multi-agent, multi-model code review of F# compiler PRs across 15 dimensions including type checking, IL emission, binary compatibility, parallel determinism, and IDE performance. Dispatches parallel assessment agents per dimension, consolidates with cross-model agreement scoring, and filters false positives. Invoke when reviewing compiler changes, requesting expert feedback, or performing pre-merge quality checks." 
+name: expert-review +description: "Performs multi-dimensional expert review of F# compiler PRs. Evaluates type checking correctness, IL emission, binary compatibility, AST accuracy, IDE performance, test coverage, and code quality. Invoke when reviewing code changes, checking PR quality, or requesting thorough feedback on compiler modifications." --- -# Reviewing Compiler PRs +# Expert Review -Full dimension definitions and CHECK rules live in the `expert-reviewer` agent. +Performs a thorough, multi-dimensional review of F# compiler changes. The full review logic lives in `.github/agents/expert-reviewer.md` — invoke that agent for the actual review. ## When to Invoke -- PR touches `src/Compiler/` — invoke the `expert-reviewer` agent -- PR touches `src/FSharp.Core/` — focus on API Surface & Concurrency -- PR touches `vsintegration/` or `LanguageServer/` — focus on IDE Performance & Editor UX -- PR touches `tests/` only — quick check: baselines explained? Cross-TFM coverage? +- PR touches `src/Compiler/` — invoke the full agent +- PR touches `src/FSharp.Core/` — invoke with focus on API Surface & Concurrency +- PR touches `vsintegration/` or LanguageServer — invoke with focus on IDE Performance & Editor UX +- PR touches `tests/` only — quick check: are baselines explained? Cross-TFM coverage present? +- Pre-merge self-review — use the quick checklist below ## Dimension Selection +Not every PR needs all 15 dimensions. Select based on changed files: + | Files Changed | Focus Dimensions | |---|---| | `Checking/`, `TypedTree/` | Type System, Parallel Compilation, Feature Gating | @@ -27,32 +30,23 @@ Full dimension definitions and CHECK rules live in the `expert-reviewer` agent. | `FSharp.Core/` | API Surface, Concurrency | | `vsintegration/` | Editor Integration | -## Multi-Model Dispatch - -Dispatch one agent per selected dimension. For high-confidence reviews, assess each dimension with multiple models (`claude-opus-4.6`, `gemini-3-pro-preview`, `gpt-5.2-codex`). 
Minimum viable council = 2 models. - -**Claims coverage** — before dimension assessment, cross-reference every claim in the PR description and linked issues against actual code changes. Flag orphan claims (stated but not implemented), orphan changes (code changed but not mentioned), and partial implementations. - -**Assessment gates** — apply before flagging: -- Understand execution context before judging (test harness ≠ compiler runtime) -- Classify as regression, improvement, or unclear — only regressions are findings -- Require a concrete failing scenario — no hypotheticals -- "Correct convention" for the context in use → discard, not a finding -- "Unexplained" ≠ "wrong" — missing rationale in a commit message is a doc gap, not a defect +## Quick Self-Review Checklist -**Consolidation:** -1. Deduplicate findings at same location -2. Filter: wrong context → discard; improvement → downgrade; speculation → LOW -3. Classify: Behavioral (correctness) → Quality (structure) → Nitpick (style) -4. Rank by cross-model agreement (≥2 models agree = higher confidence) -5. Present Behavioral first; Nitpicks only if nothing higher — agents love producing nitpicks to have *something* to say, deprioritize them - -## Self-Review Checklist +Before requesting a full agent review, verify: 1. [ ] Every behavioral change has a test -2. [ ] Test baselines updated with explanations +2. [ ] Test baselines are updated with explanations for new errors 3. [ ] New language features have a `LanguageFeature` guard 4. [ ] No unintended public API surface changes -5. [ ] Cleanup changes separate from feature enablement -6. [ ] Tests pass in both Debug and Release -7. [ ] Error messages follow: statement → analysis → advice +5. [ ] Cleanup changes are separate from feature enablement +6. [ ] Compiler warnings are resolved +7. [ ] Tests run in both Debug and Release where relevant +8. 
[ ] Error messages follow: statement → analysis → advice + +## Invocation + +``` +Invoke the expert-reviewer agent to review this PR. +``` + +The agent executes a 5-wave review workflow (Orientation → Structural → Correctness → Integration → Quality) and reports findings by severity. From 6fb5a42522e4664cb6246c1de912ae1083d26336 Mon Sep 17 00:00:00 2001 From: Tomas Grosup Date: Tue, 17 Mar 2026 17:25:37 +0100 Subject: [PATCH 02/18] Apply Anthropic skill best practices and add cross-references - Skill name: gerund form (reviewing-compiler-prs) - Description: third person, keyword-rich, specific triggers - Removed redundant prose (Claude already knows) - Added cross-references to complementary skills (review-council, hypothesis-driven-debugging, ilverify-failure, vsintegration-ide-debugging) - Slim pointer pattern: skill is overview, agent is source of truth Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --- .github/agents/expert-reviewer.md | 6 ++-- .github/skills/expert-review/SKILL.md | 40 ++++++++++----------------- 2 files changed, 19 insertions(+), 27 deletions(-) diff --git a/.github/agents/expert-reviewer.md b/.github/agents/expert-reviewer.md index df61e30ba1..6db222cdec 100644 --- a/.github/agents/expert-reviewer.md +++ b/.github/agents/expert-reviewer.md @@ -1,11 +1,13 @@ --- name: expert-reviewer -description: "Multi-dimensional code review agent for F# compiler PRs. Invoke for thorough review of compiler changes, type checking, IL emission, AST correctness, binary compatibility, and IDE integration. Use when reviewing PRs or requesting expert feedback." +description: "Multi-dimensional code review agent for F# compiler PRs. Evaluates type checking, IL emission, AST correctness, binary compatibility, parallel determinism, concurrency, IDE performance, diagnostics, and code quality across 15 dimensions. Invoke when reviewing compiler changes, requesting expert feedback, or performing pre-merge quality checks." 
--- # Expert Reviewer -A multi-dimensional review agent for the F# compiler. Evaluates changes across 15 dimensions derived from long-standing review practices in the codebase. +Evaluates F# compiler changes across 15 dimensions. Use the `reviewing-compiler-prs` skill to select which dimensions apply to a given PR. + +**Related tools:** `review-council` (generic multi-model review), `hypothesis-driven-debugging` (investigating failures found during review), `ilverify-failure` (fixing IL verification issues), `vsintegration-ide-debugging` (fixing IDE debugging issues). ## Overarching Principles diff --git a/.github/skills/expert-review/SKILL.md b/.github/skills/expert-review/SKILL.md index ea706287c8..aff40ccfce 100644 --- a/.github/skills/expert-review/SKILL.md +++ b/.github/skills/expert-review/SKILL.md @@ -1,24 +1,24 @@ --- -name: expert-review -description: "Performs multi-dimensional expert review of F# compiler PRs. Evaluates type checking correctness, IL emission, binary compatibility, AST accuracy, IDE performance, test coverage, and code quality. Invoke when reviewing code changes, checking PR quality, or requesting thorough feedback on compiler modifications." +name: reviewing-compiler-prs +description: "Performs multi-dimensional expert review of F# compiler PRs across 15 dimensions: type checking, IL emission, binary compatibility, AST correctness, parallel determinism, concurrency safety, IDE performance, diagnostics, feature gating, and code quality. Invoke when reviewing compiler changes, requesting pre-merge feedback, or checking PR quality against established review standards." --- # Expert Review -Performs a thorough, multi-dimensional review of F# compiler changes. The full review logic lives in `.github/agents/expert-reviewer.md` — invoke that agent for the actual review. +Full review logic lives in the `expert-reviewer` agent. This skill provides dimension selection and a self-review checklist. 
+ +For generic multi-model review (not F#-specific), see the `review-council` skill instead. ## When to Invoke -- PR touches `src/Compiler/` — invoke the full agent -- PR touches `src/FSharp.Core/` — invoke with focus on API Surface & Concurrency -- PR touches `vsintegration/` or LanguageServer — invoke with focus on IDE Performance & Editor UX -- PR touches `tests/` only — quick check: are baselines explained? Cross-TFM coverage present? -- Pre-merge self-review — use the quick checklist below +- PR touches `src/Compiler/` — invoke the `expert-reviewer` agent +- PR touches `src/FSharp.Core/` — focus on API Surface & Concurrency dimensions +- PR touches `vsintegration/` or `LanguageServer/` — focus on IDE Performance & Editor UX +- PR touches `tests/` only — quick check: baselines explained? Cross-TFM coverage? +- Pre-merge self-review — use the checklist below ## Dimension Selection -Not every PR needs all 15 dimensions. Select based on changed files: - | Files Changed | Focus Dimensions | |---|---| | `Checking/`, `TypedTree/` | Type System, Parallel Compilation, Feature Gating | @@ -30,23 +30,13 @@ Not every PR needs all 15 dimensions. Select based on changed files: | `FSharp.Core/` | API Surface, Concurrency | | `vsintegration/` | Editor Integration | -## Quick Self-Review Checklist - -Before requesting a full agent review, verify: +## Self-Review Checklist 1. [ ] Every behavioral change has a test -2. [ ] Test baselines are updated with explanations for new errors +2. [ ] Test baselines updated with explanations for new errors 3. [ ] New language features have a `LanguageFeature` guard 4. [ ] No unintended public API surface changes -5. [ ] Cleanup changes are separate from feature enablement -6. [ ] Compiler warnings are resolved -7. [ ] Tests run in both Debug and Release where relevant +5. [ ] Cleanup changes separate from feature enablement +6. [ ] Compiler warnings resolved +7. [ ] Tests pass in both Debug and Release 8. 
[ ] Error messages follow: statement → analysis → advice - -## Invocation - -``` -Invoke the expert-reviewer agent to review this PR. -``` - -The agent executes a 5-wave review workflow (Orientation → Structural → Correctness → Integration → Quality) and reports findings by severity. From 69253720c82fb9fd094cbf7ce23ec191716dd2fd Mon Sep 17 00:00:00 2001 From: Tomas Grosup Date: Wed, 18 Mar 2026 11:11:12 +0100 Subject: [PATCH 03/18] Merge review-council into expert-review skill MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The review-council skill (generic multi-model review) and expert-review skill (F#-specific dimensions) had 0.55 trigger overlap. Merged into a single reviewing-compiler-prs skill that combines: - F#-specific 15 dimensions with file-based selection - Multi-model 3-phase methodology (briefing, parallel assessment, consolidation with dedup/false-positive filtering) - Self-review checklist Deleted .github/skills/review-council/ — its methodology is now embedded in the merged skill's Multi-Model Review Methodology section. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --- .github/agents/expert-reviewer.md | 2 +- .github/skills/expert-review/SKILL.md | 44 +++++++++++++++++++++++---- 2 files changed, 39 insertions(+), 7 deletions(-) diff --git a/.github/agents/expert-reviewer.md b/.github/agents/expert-reviewer.md index 6db222cdec..f064e28e75 100644 --- a/.github/agents/expert-reviewer.md +++ b/.github/agents/expert-reviewer.md @@ -7,7 +7,7 @@ description: "Multi-dimensional code review agent for F# compiler PRs. Evaluates Evaluates F# compiler changes across 15 dimensions. Use the `reviewing-compiler-prs` skill to select which dimensions apply to a given PR. 
-**Related tools:** `review-council` (generic multi-model review), `hypothesis-driven-debugging` (investigating failures found during review), `ilverify-failure` (fixing IL verification issues), `vsintegration-ide-debugging` (fixing IDE debugging issues). +**Related tools:** `hypothesis-driven-debugging` (investigating failures found during review), `ilverify-failure` (fixing IL verification issues), `vsintegration-ide-debugging` (fixing IDE debugging issues). ## Overarching Principles diff --git a/.github/skills/expert-review/SKILL.md b/.github/skills/expert-review/SKILL.md index aff40ccfce..353dd3ddee 100644 --- a/.github/skills/expert-review/SKILL.md +++ b/.github/skills/expert-review/SKILL.md @@ -1,15 +1,13 @@ --- name: reviewing-compiler-prs -description: "Performs multi-dimensional expert review of F# compiler PRs across 15 dimensions: type checking, IL emission, binary compatibility, AST correctness, parallel determinism, concurrency safety, IDE performance, diagnostics, feature gating, and code quality. Invoke when reviewing compiler changes, requesting pre-merge feedback, or checking PR quality against established review standards." +description: "Multi-agent, multi-model code review for F# compiler PRs. Dispatches parallel assessment agents across 15 F#-specific dimensions (type checking, IL emission, binary compatibility, AST correctness, parallel determinism, concurrency, IDE performance, diagnostics, feature gating, code quality, code reuse, complexity). Invoke for: code review, PR review, expert review, multi-agent review, panel review, diverse model review, review council, fellowship review." --- -# Expert Review +# Reviewing Compiler PRs -Full review logic lives in the `expert-reviewer` agent. This skill provides dimension selection and a self-review checklist. +Orchestrates a multi-model review panel across F#-specific dimensions. 
The full dimension definitions and CHECK rules live in the `expert-reviewer` agent — this skill handles dimension selection, the multi-model dispatch methodology, and self-review. -For generic multi-model review (not F#-specific), see the `review-council` skill instead. - -## When to Invoke +## When to Invoke the Agent - PR touches `src/Compiler/` — invoke the `expert-reviewer` agent - PR touches `src/FSharp.Core/` — focus on API Surface & Concurrency dimensions @@ -30,8 +28,42 @@ For generic multi-model review (not F#-specific), see the `review-council` skill | `FSharp.Core/` | API Surface, Concurrency | | `vsintegration/` | Editor Integration | +## Multi-Model Review Methodology + +When thorough review is requested, the `expert-reviewer` agent uses this 3-phase approach: + +### Phase 1: Briefing Pack + +Build a self-contained briefing from: +1. PR metadata (title, body, labels) +2. Linked issues — fetch full body **and all comments** (real requirements live there) +3. Merge-base diff only — use `pull_request_read` → `get_diff` or `gh pr diff` +4. Changed files list with change types; test files called out separately +5. Commit messages + +### Phase 2: Parallel Assessment + +Dispatch **one agent per selected dimension**, each using the dimension's CHECK rules from the agent file. For high-confidence reviews, assess each dimension with multiple models (`claude-opus-4.6`, `gemini-3-pro-preview`, `gpt-5.2-codex`). + +Each agent receives the briefing pack and returns findings with: location (file:line), description, severity (critical/high/medium/low). + +**Assessment gates** (apply before flagging): +- **Execution context**: understand how and where the code runs. A test harness and the main compiler have different constraints. +- **Direction**: classify each change as regression, improvement, or unclear. Only regressions are findings. +- **Concrete scenario required**: report an ISSUE only when you can construct a specific failing scenario. No hypotheticals. 
+ +### Phase 3: Consolidation + +1. **Deduplicate**: merge findings pointing at the same location with the same problem +2. **Filter false positives**: wrong context assumed → discard; improvement not regression → downgrade; speculation without evidence → LOW confidence +3. **Classify**: Behavioral (correctness) → Quality (structure) → Nitpick (style) +4. **Rank**: prefer findings flagged by more agents (cross-model agreement = higher confidence) +5. **Present**: Behavioral first, then Quality, then Nitpicks (only if no higher-level findings) + ## Self-Review Checklist +Before requesting a full agent review: + 1. [ ] Every behavioral change has a test 2. [ ] Test baselines updated with explanations for new errors 3. [ ] New language features have a `LanguageFeature` guard From 6d7d162c116e252bfa13cca72c54215d8f0af6cf Mon Sep 17 00:00:00 2001 From: Tomas Grosup Date: Wed, 18 Mar 2026 11:12:25 +0100 Subject: [PATCH 04/18] Apply Anthropic best practices to merged review skill - Description: natural third-person prose, no keyword spam list - Removed redundant opening paragraph (restated description) - Trimmed Multi-Model section: removed briefing pack steps Claude already knows; kept only unique content (assessment gates, consolidation rules) - Removed redundant 'Before requesting...' preamble - 75 -> 54 lines (28% reduction, no content loss) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --- .github/skills/expert-review/SKILL.md | 58 +++++++++------------------ 1 file changed, 19 insertions(+), 39 deletions(-) diff --git a/.github/skills/expert-review/SKILL.md b/.github/skills/expert-review/SKILL.md index 353dd3ddee..a8544c6d4e 100644 --- a/.github/skills/expert-review/SKILL.md +++ b/.github/skills/expert-review/SKILL.md @@ -1,19 +1,18 @@ --- name: reviewing-compiler-prs -description: "Multi-agent, multi-model code review for F# compiler PRs. 
Dispatches parallel assessment agents across 15 F#-specific dimensions (type checking, IL emission, binary compatibility, AST correctness, parallel determinism, concurrency, IDE performance, diagnostics, feature gating, code quality, code reuse, complexity). Invoke for: code review, PR review, expert review, multi-agent review, panel review, diverse model review, review council, fellowship review." +description: "Performs multi-agent, multi-model code review of F# compiler PRs across 15 dimensions including type checking, IL emission, binary compatibility, parallel determinism, and IDE performance. Dispatches parallel assessment agents per dimension, consolidates with cross-model agreement scoring, and filters false positives. Invoke when reviewing compiler changes, requesting expert feedback, or performing pre-merge quality checks." --- # Reviewing Compiler PRs -Orchestrates a multi-model review panel across F#-specific dimensions. The full dimension definitions and CHECK rules live in the `expert-reviewer` agent — this skill handles dimension selection, the multi-model dispatch methodology, and self-review. +Full dimension definitions and CHECK rules live in the `expert-reviewer` agent. -## When to Invoke the Agent +## When to Invoke - PR touches `src/Compiler/` — invoke the `expert-reviewer` agent -- PR touches `src/FSharp.Core/` — focus on API Surface & Concurrency dimensions +- PR touches `src/FSharp.Core/` — focus on API Surface & Concurrency - PR touches `vsintegration/` or `LanguageServer/` — focus on IDE Performance & Editor UX - PR touches `tests/` only — quick check: baselines explained? Cross-TFM coverage? -- Pre-merge self-review — use the checklist below ## Dimension Selection @@ -28,47 +27,28 @@ Orchestrates a multi-model review panel across F#-specific dimensions. 
The full | `FSharp.Core/` | API Surface, Concurrency | | `vsintegration/` | Editor Integration | -## Multi-Model Review Methodology +## Multi-Model Dispatch -When thorough review is requested, the `expert-reviewer` agent uses this 3-phase approach: +Dispatch one agent per selected dimension. For high-confidence reviews, assess each dimension with multiple models (`claude-opus-4.6`, `gemini-3-pro-preview`, `gpt-5.2-codex`). -### Phase 1: Briefing Pack +**Assessment gates** — apply before flagging: +- Understand execution context before judging (test harness ≠ compiler runtime) +- Classify as regression, improvement, or unclear — only regressions are findings +- Require a concrete failing scenario — no hypotheticals -Build a self-contained briefing from: -1. PR metadata (title, body, labels) -2. Linked issues — fetch full body **and all comments** (real requirements live there) -3. Merge-base diff only — use `pull_request_read` → `get_diff` or `gh pr diff` -4. Changed files list with change types; test files called out separately -5. Commit messages - -### Phase 2: Parallel Assessment - -Dispatch **one agent per selected dimension**, each using the dimension's CHECK rules from the agent file. For high-confidence reviews, assess each dimension with multiple models (`claude-opus-4.6`, `gemini-3-pro-preview`, `gpt-5.2-codex`). - -Each agent receives the briefing pack and returns findings with: location (file:line), description, severity (critical/high/medium/low). - -**Assessment gates** (apply before flagging): -- **Execution context**: understand how and where the code runs. A test harness and the main compiler have different constraints. -- **Direction**: classify each change as regression, improvement, or unclear. Only regressions are findings. -- **Concrete scenario required**: report an ISSUE only when you can construct a specific failing scenario. No hypotheticals. - -### Phase 3: Consolidation - -1. 
**Deduplicate**: merge findings pointing at the same location with the same problem -2. **Filter false positives**: wrong context assumed → discard; improvement not regression → downgrade; speculation without evidence → LOW confidence -3. **Classify**: Behavioral (correctness) → Quality (structure) → Nitpick (style) -4. **Rank**: prefer findings flagged by more agents (cross-model agreement = higher confidence) -5. **Present**: Behavioral first, then Quality, then Nitpicks (only if no higher-level findings) +**Consolidation:** +1. Deduplicate findings at same location +2. Filter: wrong context → discard; improvement → downgrade; speculation → LOW +3. Classify: Behavioral → Quality → Nitpick +4. Rank by cross-model agreement +5. Present Behavioral first; Nitpicks only if nothing higher ## Self-Review Checklist -Before requesting a full agent review: - 1. [ ] Every behavioral change has a test -2. [ ] Test baselines updated with explanations for new errors +2. [ ] Test baselines updated with explanations 3. [ ] New language features have a `LanguageFeature` guard 4. [ ] No unintended public API surface changes 5. [ ] Cleanup changes separate from feature enablement -6. [ ] Compiler warnings resolved -7. [ ] Tests pass in both Debug and Release -8. [ ] Error messages follow: statement → analysis → advice +6. [ ] Tests pass in both Debug and Release +7. 
[ ] Error messages follow: statement → analysis → advice From d8b76f802dbabdd1d582ed37dd2291032a4adda1 Mon Sep 17 00:00:00 2001 From: Tomas Grosup Date: Wed, 18 Mar 2026 11:15:02 +0100 Subject: [PATCH 05/18] Restore all learned content from review-council into merged artifacts MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Agent (expert-reviewer.md): - Dim 1: test layer selection guide (Typecheck/SyntaxTree/EmittedIL/ compileAndRun/Service/Core), happy/negative/interaction paths, assertion quality - Dim 12: pipelines over nesting, active patterns for complex conditions, nesting depth limit (2 levels), wide over deep - Dim 13: ad-hoc if/then band-aid detection, higher-order pattern extraction, search for existing utilities before reimplementing, respect intentional deviations in isolated projects Skill (SKILL.md): - Claims coverage check (orphan claims/changes/partial impls) - 'Correct convention' false positive filter - 'Unexplained != wrong' — doc gap not defect - 'Agents love nitpicks' — deprioritize - Minimum viable council = 2 models Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --- .github/agents/expert-reviewer.md | 25 ++++++++++++++++++++----- .github/skills/expert-review/SKILL.md | 12 ++++++++---- 2 files changed, 28 insertions(+), 9 deletions(-) diff --git a/.github/agents/expert-reviewer.md b/.github/agents/expert-reviewer.md index f064e28e75..ef0e38688c 100644 --- a/.github/agents/expert-reviewer.md +++ b/.github/agents/expert-reviewer.md @@ -27,11 +27,20 @@ Every behavioral change, bug fix, and new feature requires corresponding tests b **CHECK:** - Verify that every code path added or modified has a test exercising it. +- Test the happy path, negative path (invalid input, error conditions), and feature interactions (how the change interacts with generics, constraints, computation expressions, etc.). 
+- Tests must actually assert the claimed behavior — a test that calls a function without checking results is not a test. - Confirm new tests cover behavior not already exercised by existing test suites. - Explain all new errors in test baselines and confirm they are expected. - Run tests in Release mode to catch codegen differences between Debug and Release. - Run tests on both desktop (.NET Framework) and CoreCLR unless there is a documented behavioral difference. -- Verify attribute availability in the target runtime version before writing tests against it. +- Place tests in the appropriate layer based on what changed: + - Typecheck tests: type inference, constraint solving, overload resolution, expected warnings/errors + - SyntaxTreeTests: parser/syntax changes + - EmittedIL tests: codegen/IL shape changes + - compileAndRun tests: end-to-end behavioral correctness requiring .NET runtime execution + - Service.Tests: FCS API, editor features + - FSharp.Core.Tests: core library changes +- A PR can and often should have tests in multiple layers. - Update test baselines after changes that affect compiler output formatting. - Add trimming regression tests that verify compiled binary size stays below a threshold when relevant. @@ -247,12 +256,14 @@ Compiler code should follow F# idioms and use clear, consistent terminology for - Use `let mutable` instead of `ref` cells for mutable state. - Use `ResizeArray` type alias instead of `System.Collections.Generic.List<_>`. - Use pipeline operators for clearer data flow in service code. +- Prefer pipelines over nesting — use `|>`, `bind`, `map` chains instead of nested `match` or `if/then`. +- Use active patterns to name complex match guards and domain conditions — flatter and more reusable than `if/elif/else` chains. +- Question any nesting beyond 2 levels — a flat pattern match with 10 arms is easier to read than 4 levels of nesting with 10 combinations. Prefer wide over deep. 
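+The pipeline and active-pattern guidance can be made concrete with a short sketch (hypothetical domain names, not taken from the compiler sources): the active pattern names the condition once, and the match stays wide rather than deep.

```fsharp
// Hypothetical classifier: an active pattern replaces an if/elif/else chain.
let (|Small|Large|) (n: int) =
    if n < 100 then Small else Large

// Flat, wide match — one arm per case instead of nested conditionals.
let describe n =
    match n with
    | Small when n < 0 -> "negative"
    | Small -> "small"
    | Large -> "large"

// Pipelines keep the data flow linear instead of nesting intermediate bindings.
let describeAll values =
    values
    |> List.filter (fun n -> n <> 0)
    |> List.map describe
```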
- Shadow variables when rebinding to improve code clarity. - Choose clear and established terminology for internal compiler representations. -- Review terminology holistically before renaming within optimization phases. - Systematically distinguish `LegacyProject` and `ModernProject` in VS integration naming. -**Severity:** Deprecated construct → **medium**. Naming confusion → **medium**. Style deviation → **low**. +**Severity:** Deprecated construct → **medium**. Naming confusion → **medium**. Deep nesting → **medium**. Style deviation → **low**. **Hotspots:** All `src/Compiler/` directories, `vsintegration/` @@ -263,17 +274,21 @@ Compiler code should follow F# idioms and use clear, consistent terminology for Keep the codebase clean. Extract helpers, remove dead code, and avoid ad-hoc patches. **CHECK:** +- Flag ad-hoc `if condition then specialCase else normalPath` that looks like a band-aid rather than a systematic fix — the fix should be at the source, not patched at a consumer. +- Search the codebase for existing helpers, combinators, or active patterns before writing new code that reimplements them. +- When two pieces of code share structure but differ in a specific operation, extract that operation as a parameter or function argument (higher-order function). +- Flag highly similar nested pattern matches — slight differences can almost always be extracted into a parameterized or generic function. - Remove default wildcard patterns in discriminated union matches to catch missing cases at compile time. - Extract duplicated logic into a shared function with input arguments. - Verify functions are actually used before keeping them in the codebase. - Use struct tuples for retained symbol data to reduce memory allocation. - Use `ConditionalWeakTable` for caches keyed by GC-collected objects. -- Remove duplicate helpers and consolidate to a single shared implementation. - Fix compiler warnings like unused value bindings before merging. 
- Keep unrelated changes in separate PRs to maintain clean review history. - Follow existing abstraction patterns (e.g., `HasFSharpAttribute`) instead of ad-hoc checks. +- Respect intentional deviations — some projects (standalone tests, isolated builds) deliberately diverge from repo-wide conventions. Check whether the project has a structural reason to be different before flagging. -**Severity:** Unreachable code path → **high**. Duplication → **medium**. Missing helper extraction → **low**. +**Severity:** Unreachable code path → **high**. Band-aid patch → **high**. Duplication → **medium**. Missing helper extraction → **low**. **Hotspots:** `src/Compiler/Checking/`, `src/Compiler/Optimize/`, `src/Compiler/Service/` diff --git a/.github/skills/expert-review/SKILL.md b/.github/skills/expert-review/SKILL.md index a8544c6d4e..dc14c8185e 100644 --- a/.github/skills/expert-review/SKILL.md +++ b/.github/skills/expert-review/SKILL.md @@ -29,19 +29,23 @@ Full dimension definitions and CHECK rules live in the `expert-reviewer` agent. ## Multi-Model Dispatch -Dispatch one agent per selected dimension. For high-confidence reviews, assess each dimension with multiple models (`claude-opus-4.6`, `gemini-3-pro-preview`, `gpt-5.2-codex`). +Dispatch one agent per selected dimension. For high-confidence reviews, assess each dimension with multiple models (`claude-opus-4.6`, `gemini-3-pro-preview`, `gpt-5.2-codex`). Minimum viable council = 2 models. + +**Claims coverage** — before dimension assessment, cross-reference every claim in the PR description and linked issues against actual code changes. Flag orphan claims (stated but not implemented), orphan changes (code changed but not mentioned), and partial implementations. 
**Assessment gates** — apply before flagging: - Understand execution context before judging (test harness ≠ compiler runtime) - Classify as regression, improvement, or unclear — only regressions are findings - Require a concrete failing scenario — no hypotheticals +- "Correct convention" for the context in use → discard, not a finding +- "Unexplained" ≠ "wrong" — missing rationale in a commit message is a doc gap, not a defect **Consolidation:** 1. Deduplicate findings at same location 2. Filter: wrong context → discard; improvement → downgrade; speculation → LOW -3. Classify: Behavioral → Quality → Nitpick -4. Rank by cross-model agreement -5. Present Behavioral first; Nitpicks only if nothing higher +3. Classify: Behavioral (correctness) → Quality (structure) → Nitpick (style) +4. Rank by cross-model agreement (≥2 models agree = higher confidence) +5. Present Behavioral first; Nitpicks only if nothing higher — agents love producing nitpicks to have *something* to say, deprioritize them ## Self-Review Checklist From bd6a062cf31997f1fbfddfcfc3626d64094940a2 Mon Sep 17 00:00:00 2001 From: Tomas Grosup Date: Wed, 18 Mar 2026 11:16:40 +0100 Subject: [PATCH 06/18] Remove overfitted parallel compilation section and Debug/Release rule from instructions Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --- .github/instructions/ExpertReview.instructions.md | 6 ------ 1 file changed, 6 deletions(-) diff --git a/.github/instructions/ExpertReview.instructions.md b/.github/instructions/ExpertReview.instructions.md index 507a13e0c2..67dbb5d73e 100644 --- a/.github/instructions/ExpertReview.instructions.md +++ b/.github/instructions/ExpertReview.instructions.md @@ -8,7 +8,6 @@ applyTo: ## Test discipline - Add tests for every behavioral change before merging — missing coverage is the most common review blocker. -- Run tests in both Debug and Release mode to catch optimization-dependent codegen differences. 
- Explain all new errors in test baselines and confirm each is expected behavior, not a regression. ## Type safety @@ -21,11 +20,6 @@ applyTo: - Gate every new language feature behind a `LanguageFeature` flag and ship off-by-default until stable. - Factor cleanup changes into separate commits from feature enablement. -## Parallel compilation - -- Eagerly format diagnostics at production time to prevent type inference parameter leakage across files. -- Verify deterministic output by diffing binaries between parallel and sequential type checking. - ## Code generation - Strip debug points when matching on `Expr.Lambda` during code generation to prevent IL stack corruption. From b8dbc7d1415512051394aa75c0350f123f10f6a0 Mon Sep 17 00:00:00 2001 From: Tomas Grosup Date: Wed, 18 Mar 2026 14:26:19 +0100 Subject: [PATCH 07/18] Deep extraction v2: 20 dimensions from 2,852 classified entries across 1,155 PRs Expert reviewer agent expanded from 15 to 20 dimensions based on deep analysis of 8,097 review comments (17x previous run's 469). New dimensions: - FSharp.Core Stability (84 blocking entries) - Optimization Correctness (tail-call, inlining philosophy) - Struct Type Awareness (value semantics, byref handling) - Overload Resolution Correctness (SRTP, method hiding) - Incremental Checking Correctness (stale results, invalidation) Added 14 explicit anti-patterns (reformatting+logic, catch-all handlers, raw TType matching, Unchecked.defaultof abuse, etc.) Skill updated with 13-row dimension selection table and 15-item self-review checklist ordered by blocking frequency. Instructions expanded to 7 sections covering backward compatibility, concurrency, and performance rules. 
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --- .github/agents/expert-reviewer.md | 568 ++++++++++++------ .../instructions/ExpertReview.instructions.md | 20 + .github/skills/expert-review/SKILL.md | 50 +- 3 files changed, 430 insertions(+), 208 deletions(-) diff --git a/.github/agents/expert-reviewer.md b/.github/agents/expert-reviewer.md index ef0e38688c..35eb60aab7 100644 --- a/.github/agents/expert-reviewer.md +++ b/.github/agents/expert-reviewer.md @@ -1,23 +1,45 @@ --- name: expert-reviewer -description: "Multi-dimensional code review agent for F# compiler PRs. Evaluates type checking, IL emission, AST correctness, binary compatibility, parallel determinism, concurrency, IDE performance, diagnostics, and code quality across 15 dimensions. Invoke when reviewing compiler changes, requesting expert feedback, or performing pre-merge quality checks." +description: "Multi-dimensional code review agent for F# compiler PRs. Evaluates type checking, IL emission, AST correctness, binary compatibility, parallel determinism, concurrency, IDE performance, diagnostics, and code quality across 20 dimensions. Invoke when reviewing compiler changes, requesting expert feedback, or performing pre-merge quality checks." --- # Expert Reviewer -Evaluates F# compiler changes across 15 dimensions. Use the `reviewing-compiler-prs` skill to select which dimensions apply to a given PR. +Evaluates F# compiler changes across 20 dimensions. Use the `reviewing-compiler-prs` skill to select which dimensions apply to a given PR. **Related tools:** `hypothesis-driven-debugging` (investigating failures found during review), `ilverify-failure` (fixing IL verification issues), `vsintegration-ide-debugging` (fixing IDE debugging issues). ## Overarching Principles -- **Testing is the gating criterion.** No behavioral change merges without a test that exercises it. Missing tests are the single most common review blocker. 
-- **Binary compatibility is non-negotiable.** Any change to serialized metadata must preserve forward and backward compatibility across compiler versions. Treat pickled data like a wire protocol. +- **Testing is the gating criterion.** No behavioral change merges without a test that exercises it. Missing tests are the single most common review blocker. Do not submit features without updated tests; close and resubmit if tests are missing. +- **Binary compatibility is non-negotiable.** Any change to serialized metadata must preserve forward and backward compatibility across compiler versions. Treat pickled data like a wire protocol. Codegen changes that depend on new FSharp.Core functions must guard against older FSharp.Core versions being referenced. +- **FSharp.Core stability is sacrosanct.** Changes to FSharp.Core carry more risk than compiler changes because every F# program depends on it. Binary compatibility, compilation order, and API surface are all critical. Prefer consolidated changes through a single well-reviewed PR. - **Determinism is a correctness property.** The compiler must produce identical output regardless of parallel/sequential compilation mode, thread scheduling, or platform. -- **Feature gating protects users.** New language features ship behind `LanguageFeature` flags and off-by-default until stable. Breaking changes require an RFC. +- **Feature gating protects users.** New language features ship behind `LanguageFeature` flags and off-by-default until stable. Breaking changes require an RFC. Language additions require an fslang suggestion and RFC before implementation proceeds. - **Diagnostics are user-facing.** Error messages follow the structure: error statement → analysis → actionable advice. Wording changes need the same care as API changes. -- **IDE responsiveness is a feature.** Every keystroke-triggered operation must be traced and verified to not cause unnecessary reactor work or project rechecks. 
-- **Prefer general solutions over special cases.** Do not hardwire specific library optimizations into the compiler. Prefer inlining-based optimizations that apply broadly. +- **IDE responsiveness is a feature.** Every keystroke-triggered operation must be traced and verified to not cause unnecessary reactor work or project rechecks. Evaluate the queue stress impact of every new FCS request type. +- **Prefer general solutions over special cases.** Do not hardwire specific library optimizations into the compiler. Prefer inlining-based optimizations that apply broadly to all code, including user-defined code. +- **Evidence-based performance.** Performance claims require `--times` output, benchmarks, or profiler data comparing before and after on a reproducible workload. Do not inline large functions without performance evidence. +- **Guard the API surface.** Public API additions to FCS and FSharp.Core must be carefully controlled. Internal data structures must never leak. Breaking changes to FCS API surface are acceptable with major version bumps, but FSharp.Core must remain stable. + +## Anti-Patterns + +Push back against these recurring patterns: + +1. **Reformatting + logic in one PR** — Separate formatting into its own PR. Mixed diffs obscure logic changes and block review. +2. **Catch-all exception handlers** — Do not add catch-all exception handlers. Handle `OperationCanceledException` specially; never swallow it. +3. **Internal type leakage via InternalsVisibleTo** — Internal compiler data structures must not leak through the FCS API boundary. +4. **Performance claims without data** — Require benchmarks, `--times` output, or profiler evidence for any performance claim. +5. **Raw TType_* pattern matching** — Never match on `TType_*` without first calling `stripTyEqns`. Use `AppTy` active pattern, not `TType_app` directly. +6. **Verbose inline logging** — Use declarative tracing; inline ETW calls degrade code readability. +7. 
**Conditional serialization writes** — Writes gated on compiler flags (LangVersion, feature toggles, TFM) produce misaligned byte streams for cross-version metadata. The byte count must depend only on stream-encoded data. +8. **Stale type-checking results** — Avoid returning stale results; they cause timing-dependent IntelliSense glitches. Prefer fresh results under a timeout or cancellation. +9. **Global mutable state in FCS** — Pass dependencies explicitly as parameters to enable proper snapshot-based processing. +10. **Missing XML doc comments** — Every new top-level function, module, and type definition must have a `///` comment. +11. **Shell script wrappers** — Adding batch files to encapsulate process invocations obscures things. Prefer MSBuild targets. +12. **Large closures capturing unnecessary data** — Verify that no additional data is being captured by long-lived objects. Use `ConditionalWeakTable` for caches keyed by GC-collected objects. +13. **Returning `Unchecked.defaultof<_>` to swallow exceptions** — This hides the root cause. Investigate and fix exception propagation failures. +14. **Band-aid ad-hoc patches** — Flag `if condition then specialCase else normalPath` that patches a consumer rather than fixing the source. ## Review Dimensions @@ -33,6 +55,9 @@ Every behavioral change, bug fix, and new feature requires corresponding tests b - Explain all new errors in test baselines and confirm they are expected. - Run tests in Release mode to catch codegen differences between Debug and Release. - Run tests on both desktop (.NET Framework) and CoreCLR unless there is a documented behavioral difference. +- Test on IL-defined members, not just F#-defined ones. +- Check behavior with `UseNullAsTrueValue` representation and add tests for that edge case when modifying null-handling code. +- Add a concrete test case for each specific cross-framework scenario being fixed. 
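+The assertion-quality rule can be illustrated with a minimal xUnit-style sketch (the function and test names are hypothetical, not drawn from the actual test suites):

```fsharp
open Xunit

// Hypothetical function under test.
let tryParsePort (s: string) =
    match System.Int32.TryParse s with
    | true, n when n >= 0 && n <= 65535 -> Some n
    | _ -> None

// Anti-example: exercises the code path but asserts nothing, so it can never fail.
[<Fact>]
let ``calls the parser`` () =
    tryParsePort "8080" |> ignore

// Real test: happy path, negative path, and boundary behavior are all asserted.
[<Fact>]
let ``parses valid ports and rejects invalid ones`` () =
    Assert.Equal(Some 8080, tryParsePort "8080")
    Assert.Equal(None, tryParsePort "not-a-port")
    Assert.Equal(None, tryParsePort "70000")
```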
- Place tests in the appropriate layer based on what changed: - Typecheck tests: type inference, constraint solving, overload resolution, expected warnings/errors - SyntaxTreeTests: parser/syntax changes @@ -41,8 +66,6 @@ Every behavioral change, bug fix, and new feature requires corresponding tests b - Service.Tests: FCS API, editor features - FSharp.Core.Tests: core library changes - A PR can and often should have tests in multiple layers. -- Update test baselines after changes that affect compiler output formatting. -- Add trimming regression tests that verify compiled binary size stays below a threshold when relevant. **Severity:** Missing tests for behavioral changes → **high**. Missing cross-TFM coverage → **medium**. @@ -50,18 +73,141 @@ Every behavioral change, bug fix, and new feature requires corresponding tests b --- -### 2. Type System Correctness +### 2. FSharp.Core Stability -Type checking and inference must be sound. Subtle bugs in constraint solving, overload resolution, or scope handling can produce incorrect programs. +FSharp.Core is the one assembly every F# program references. Changes here have outsized blast radius. + +**CHECK:** +- Maintain strict backward binary compatibility. No public API removals or signature changes. +- Use existing FSharp.Core reflection APIs (e.g., `FSharpType.GetTupleElements()`) rather than reimplementing tuple inspection logic. +- Prefer consolidated FSharp.Core changes through a single well-reviewed PR over multiple competing implementations. +- Verify compilation order constraints — FSharp.Core has strict file ordering requirements. +- Add unit tests to `FSharp.Core.Tests` for every new or changed function. +- Minimize FCS's FSharp.Core dependency — do not add new references from the compiler to FSharp.Core unnecessarily. +- XML doc comments are mandatory for all public APIs in FSharp.Core. +- Changes require an RFC for new API additions. 
+- Systematically apply `InlineIfLambda` to every inlined function taking a lambda applied only once. + +**Severity:** Binary compat break in FSharp.Core → **critical**. Missing tests → **high**. Missing XML docs → **medium**. + +**Hotspots:** `src/FSharp.Core/` + +--- + +### 3. Backward Compatibility Vigilance + +Changes must not break existing compiled code or binary compatibility. + +**CHECK:** +- Verify changes do not break existing compiled code or binary compatibility. +- Breaking changes should be gated as a strongly worded warning first, not a hard error. +- Do not assume the nullary union case comes first; search for it explicitly as the ordering may be relaxed in the future. +- Defer language changes that might conflict with future constrained extension syntax to avoid locking in behavior that blocks later evolution. +- Revert existing public API signatures and add new APIs alongside them rather than replacing. +- Codegen changes that depend on new FSharp.Core functions must guard against older FSharp.Core versions. +- Do not reduce information shown to users compared to previous VS versions. +- Question default value changes that alter existing IDE behavior. + +**Severity:** Binary compat break → **critical**. Behavioral change without flag → **high**. Missing compat test → **high**. + +**Hotspots:** `src/Compiler/TypedTree/`, `src/Compiler/Driver/`, `src/FSharp.Core/` + +--- + +### 4. RFC Process & Language Design + +Major language changes require an RFC and design discussion before implementation. + +**CHECK:** +- Require an fslang suggestion and RFC for language and API additions. +- Submit one consolidated PR per RFC rather than multiple partial PRs. +- Update or create the RFC document when implementing a language or interop feature change. +- Do not rush language changes into a release without proper design review. +- Approve design adjustments when they correct version-specific behavior that was not intended. 
+- Keep PR comments focused on implementation; design discussion belongs in the RFC.
+- Close stale PRs that have seen no activity for an extended period.
+
+**Severity:** Language change without RFC → **critical**. Missing RFC update → **high**. Design discussion in PR → **medium**.
+
+**Hotspots:** `src/Compiler/Checking/`, `src/Compiler/SyntaxTree/`, `src/FSharp.Core/`
+
+---
+
+### 5. IL Codegen Correctness
+
+Code generation must produce correct, verifiable IL. Wrong IL produces silent runtime failures.
+
+**CHECK:**
+- Ensure emitted IL is verifiable and matches expected instruction patterns.
+- Verify the change does not alter tail-calling behavior — check IL diffs.
+- After stripping a lambda for a delegate, bind discarded unit parameters into expression bodies using `BindUnitVars`.
+- Verify generated IL when changing null-check patterns; prefer fixing codegen for idiomatic patterns rather than changing the source pattern.
+- Verify IL binary writing produces correct PDB and metadata table sizes.
+- Strip debug points when matching on `Expr.Lambda` during code generation to prevent IL stack corruption.
+- Test code changes with optimizations both enabled and disabled.
+- Check that return-a-tuple-and-match-on-it formulations produce code at least as good as before.
+- Cross-reference `GenFieldInit` in `IlxGen.fs` when modifying field initialization in `infos.fs`.
+- If a problem like debuggability or performance exists, solve it generally through techniques that also apply to user-written code.
+
+**Severity:** Incorrect IL → **critical**. Debug stepping regression → **high**. Missing IL test → **medium**.
+
+**Hotspots:** `src/Compiler/CodeGen/`, `src/Compiler/AbstractIL/`, `src/Compiler/Optimize/`
+
+---
+
+### 6. Optimization Correctness
+
+Optimizer changes must preserve program semantics. Inlining and tail-call changes are high-risk.
+
+**CHECK:**
+- Verify optimizations preserve program semantics in all cases.
+- Tail call analysis must correctly handle all cases including mutual recursion, not just simple self-recursion. +- Prefer general approaches like improved inlining that cover many cases at once over hand-implementing function-by-function optimizations. +- Prefer a set of orthogonal decisions/optimizations that work for all code, including user-defined code, rather than just library functions. +- Verify that tuple-and-match reformulations produce code at least as good as before. +- Require performance evidence for optimization changes. + +**Severity:** Semantic-altering optimization → **critical**. Tail-call regression → **high**. Missing evidence → **medium**. + +**Hotspots:** `src/Compiler/Optimize/`, `src/Compiler/CodeGen/` + +--- + +### 7. FCS API Surface Control + +The FCS public API is a permanent commitment. Internal types must never leak. + +**CHECK:** +- Keep internal implementation details out of the public FCS API. +- Create dedicated `FSharpChecker` instances for VS integration to allow adjusting parameters like project cache size independently. +- Place keyword metadata tables in the compiler layer (`lexhelp.fs`) rather than the editor layer so all FCS consumers benefit. +- Pass `IFileSystem` as an explicit parameter to FCS rather than relying on a global mutable. +- When changing internal implementation to async, keep FCS API signatures unchanged and use `Async.Return` internally. +- Systematically apply type safety with distinct types (not aliases) across the FCS API. +- The FCS Symbol API must be thread-safe for concurrent access via locking or reactor thread marshaling. +- Document the purpose of new public API arguments in XML docs. +- Update exception XML docs in `.fsi` files when behavior changes. + +**Severity:** Unintended public API break → **critical**. Internal type leakage → **high**. Missing XML docs → **medium**. + +**Hotspots:** `src/Compiler/Service/`, `src/FSharp.Core/` + +--- + +### 8. 
Type System Correctness + +Type checking and inference must be sound. Subtle bugs in constraint solving, overload resolution, or scope handling produce incorrect programs. **CHECK:** - Use `tryTcrefOfAppTy` or the `AppTy` active pattern instead of direct `TType` pattern matches. +- Always call `stripTyEqns`/`stripTyEqnsA` before pattern matching on types. +- Do not match using `TType_app` directly; use the `AppTy` active pattern. +- Raise internal compiler errors for `TType_ucase` and other unexpected type forms. - Avoid matching on empty contents as a proxy for signature file presence — use explicit markers. -- Verify nullness errors on return types are intentional and documented. - Exclude current file signatures from `tcState` when processing implementation files. -- Distinguish initialization-related bugs from recursive type checking bugs when triaging. -- Document that type inference requires analyzing implementation before resolving member types. - Add error recovery for namespace fragment combination differences between parallel and sequential paths. +- Use precise type predicates when matching on the typed tree. +- Avoid explicit pattern matching on `ILEvent`, `ILEventInfo`, `ILFieldInfo`, `ILTypeInfo`, `ILProp`, `ILMeth`; use their property accessors instead. **Severity:** Type system unsoundness → **critical**. Incorrect inference in edge cases → **high**. @@ -69,7 +215,66 @@ Type checking and inference must be sound. Subtle bugs in constraint solving, ov --- -### 3. Binary Compatibility & Metadata Safety +### 9. Struct Type Awareness + +Structs have value semantics that differ fundamentally from reference types. Incorrect handling causes subtle bugs. + +**CHECK:** +- Respect struct semantics: no unnecessary copies, proper byref handling. +- Recognize that C# treats `default(StructType)` and `new StructType()` as valid literal constants for optional parameters, emitting `.param = nullref` in IL. 
+- Before converting a type to struct, measure the impact — large structs lose sharing and can reduce throughput through data copying.
+- Display struct types using the `[<Struct>]` attribute syntax rather than the `struct` keyword for consistency with modern F# style.
+- Investigate and fix incorrect behavior for struct discriminated unions rather than working around it.
+- Use struct tuples instead of ref tuples for retained symbol use data to reduce heap allocations.
+- Always add tests for struct variants when changing union type behavior.
+
+**Severity:** Incorrect struct copy semantics → **critical**. Missing struct tests → **high**. Style → **low**.
+
+**Hotspots:** `src/Compiler/Checking/`, `src/Compiler/CodeGen/`
+
+---
+
+### 10. IDE Responsiveness & Reactor Efficiency
+
+The compiler service must respond to editor keystrokes without unnecessary recomputation.
+
+**CHECK:**
+- Remove unnecessary `Async.RunSynchronously` calls in the language service; use fully async code to prevent non-responsiveness in UI causality chains.
+- Ctrl-Space completion must not block the UI thread; adjust command handlers to use async completion.
+- Verify that `IncrementalBuilder` caching prevents duplicate builder creation per project.
+- Verify changes do not trigger endless project rechecks by examining reactor traces.
+- Cache colorization data by document ID rather than source text; cache tokenizers per-line for efficient incremental colorization.
+- Evaluate the queue stress impact of every new FCS request type, as each request blocks the queue while running.
+- Rearchitect document processing to follow Roslyn document/project snapshot patterns.
+- Signature help should show all parameters because F# users rely on this as their primary information source.
+- Test IDE changes on large solutions before merging.
+
+**Severity:** Endless recheck loop → **critical**. UI thread block → **high**. Missing trace verification → **medium**.
+
+**Hotspots:** `src/Compiler/Service/`, `src/FSharp.Compiler.LanguageServer/`, `vsintegration/`
+
+---
+
+### 11. Overload Resolution Correctness
+
+Overload resolution is one of the most complex and specification-sensitive areas of the compiler.
+
+**CHECK:**
+- Ensure overload resolution follows the language specification precisely.
+- Verify that language features work correctly with truly overloaded method sets, not just single-overload defaults.
+- Changes that loosen overload resolution rules constitute language changes and must be analyzed carefully.
+- Use `ExcludeHiddenOfMethInfos` consistently to filter methods during SRTP resolution, not just normal name resolution.
+- Apply method hiding filters consistently in both normal resolution and SRTP constraint solving paths.
+- When adding nullable parameter interop, add rules preferring `X` over `Nullable<X>` and compare full argument lists as a last-resort tiebreaker.
+- For complex SRTP corner cases, the implementation effectively serves as the specification; changes must pin existing behavior with tests.
+
+**Severity:** Overload resolution regression → **critical**. SRTP behavior change → **high**. Missing test → **medium**.
+
+**Hotspots:** `src/Compiler/Checking/ConstraintSolver.fs`, `src/Compiler/Checking/`
+
+---
+
+### 12. Binary Compatibility & Metadata Safety
 
 The pickled metadata format is a cross-version contract. DLLs compiled with any F# version must be consumable by any other version.
 
@@ -79,7 +284,7 @@ The pickled metadata format is a cross-version contract. DLLs compiled with any
 - Exercise old-compiler-reads-new-output and new-compiler-reads-old-output for any metadata change.
 - Verify the byte count does not depend on compiler configuration (feature flags, LangVersion, TFM) — only on stream-encoded data.
 - Add cross-version compatibility tests when changing anonymous record or constraint emission.
-- Ensure new parameters are conditionally passed to preserve existing behavior for unchanged callers. +- Before modifying the TAST or pickle format for a new feature, confirm whether the attribute can be handled entirely via existing IL metadata without changing internal representations. **Severity:** Any metadata format breakage → **critical**. Missing compat test → **high**. @@ -89,28 +294,50 @@ The pickled metadata format is a cross-version contract. DLLs compiled with any --- -### 4. IL Emission & Debug Correctness +### 13. Concurrency & Cancellation Safety -Code generation must produce correct, verifiable IL. Debug sequence points must enable accurate stepping. +Async and concurrent code must handle cancellation, thread safety, and exception propagation correctly. **CHECK:** -- Test code changes with optimizations both enabled and disabled. -- Strip debug points when matching on `Expr.Lambda` during code generation. -- Check for existing values on the IL stack when debugging codegen issues with debug points. -- Investigate IL diffs between parallel and sequential type checking to identify determinism root causes. -- Verify that newly added code paths are actually reachable before merging. -- Confirm that `Decimal` cases are not already eliminated by `TryEliminateDesugaredConstants` before adding them. -- Cross-reference `GenFieldInit` in `IlxGen.fs` when modifying field initialization in `infos.fs`. -- Manually verify debug stepping for loops, while loops, task code, list/array expressions, and sequence expressions. -- Perform end-to-end debug stepping verification for all affected control flow cases before merge. +- Thread cancellation tokens through all async operations. All FCS requests must support cancellation. +- Thread `CancellationToken` through each step of the incremental build graph so FCS respects cancellation in a timely way. +- Use `CommonRoslynHelpers.StartAsyncAsTask` with the `cancellationToken` parameter instead of manual task creation. 
+- Time-sliced type checking must respect a `CancellationToken`; adjust `Eventually` combinators to support cancellation. +- Pass `CancellationToken` into type provider calls to allow cooperative cancellation. +- Ensure thread-safety for shared mutable state. Avoid global mutable state in FCS. +- Every lock in the codebase must have a comment explaining why the lock is needed and what it protects. +- Prefer `System.Collections.Immutable` collections over mutable collections used as read-only. +- TAST, AbstractIL, TcImports, and NameResolver data structures are not inherently thread-safe — access only from the reactor thread. +- Before using `ConcurrentDictionary` as a fix, investigate why non-thread-safe structures are being accessed concurrently and fix the root cause. +- Use token passing to formalize concurrency assumptions: `CompilationThreadToken` for reactor thread access, lock tokens for lock-protected caches. +- Do not swallow `OperationCanceledException` in catch-all handlers. +- Remove helper wrappers around Task-to-Async conversion so all explicit conversions are greppable and verifiable. -**Severity:** Incorrect IL → **critical**. Debug stepping regression → **high**. Missing IL test → **medium**. +**Severity:** Race condition or data corruption → **critical**. Swallowed cancellation → **high**. Missing async test → **medium**. -**Hotspots:** `src/Compiler/CodeGen/`, `src/Compiler/AbstractIL/`, `src/Compiler/Optimize/` +**Hotspots:** `src/Compiler/Service/`, `src/Compiler/Facilities/`, `src/FSharp.Core/`, `vsintegration/` + +--- + +### 14. Incremental Checking Correctness + +Incremental checking must invalidate stale results correctly. Stale data causes timing-dependent glitches. + +**CHECK:** +- Replace old `IsResultObsolete` logic with proper cancellation token checking so stale requests are cancelled rather than run to completion and discarded. +- Avoid returning stale type checking results; prefer fresh results under a timeout or cancellation. 
+- Use existing `IncrementalBuilder` background project check events (`ProjectChecked`) to obtain full project results rather than triggering redundant checks. +- Time-sliced foreground checking must use a consistent error logger across all time slices to avoid missing errors. +- Verify that project setup handles the case of a clean solution or unrestored packages without silently dropping references. +- Ensure parse/check caches are properly populated with correct thread annotations. + +**Severity:** Stale results causing glitches → **critical**. Missed invalidation → **high**. Missing cache verification → **medium**. + +**Hotspots:** `src/Compiler/Service/`, `src/Compiler/Driver/` --- -### 5. Syntax Tree & Parser Integrity +### 15. Syntax Tree & Parser Integrity AST nodes must accurately represent source code. Parser changes are high-risk because they affect every downstream phase. @@ -130,104 +357,79 @@ AST nodes must accurately represent source code. Parser changes are high-risk be --- -### 6. Parallel Compilation & Determinism +### 16. Exception Handling Discipline -Parallel type checking must produce bit-identical output to sequential compilation. +Exception handling must be precise. Catch-all handlers and swallowed exceptions hide bugs. **CHECK:** -- Investigate root cause of parallel checking failures before attempting fixes. -- Prevent signature-caused naming clashes in parallel type checking scope. -- Ensure `NiceNameGenerator` produces deterministic names regardless of checking order. -- Eagerly format diagnostics during parallel type checking to avoid type inference parameter leakage. -- Verify deterministic output hashes for all compiler binaries after parallel type checking changes. -- Use ILDASM to diff binaries when investigating determinism bugs. -- Benchmark clean build times before and after parallel type checking changes. -- Resolve naming clashes between static members and top-level bindings in parallel type checking. 
+- Raise internal compiler errors for unexpected type forms (`TType_ucase`, etc.) rather than returning defaults. +- Never swallow exceptions silently; handle `OperationCanceledException` specially. +- Do not suppress task completion by silently ignoring `TrySetException` failures. +- Returning `Unchecked.defaultof<_>` to swallow exceptions is dangerous — investigate and fix the root cause. +- Protect language service entry points against all exceptions to prevent IDE crashes. +- Do not add catch-all exception handlers. -**Severity:** Non-deterministic output → **critical**. Scoping error → **high**. Missing benchmark → **medium**. +**Severity:** Swallowed cancellation → **critical**. Catch-all handler → **high**. Missing error → **medium**. -**Hotspots:** `src/Compiler/Checking/`, `src/Compiler/Driver/` +**Hotspots:** `src/FSharp.Core/`, `src/Compiler/Service/`, `src/Compiler/Optimize/` --- -### 7. Concurrency & Cancellation Safety - -Async and concurrent code must handle cancellation, thread safety, and exception propagation correctly. - -**CHECK:** -- Add observable reproduction tests for async cancellation edge cases. -- Preserve stack traces when re-raising exceptions in async control flow. -- Add tests verifying stack trace preservation through async exception handling paths. -- Keep Task-to-Async conversions explicit and searchable in the codebase. -- Check cancellation token status before executing async operations. -- Verify `Cache` replacement preserves exactly-once initialization semantics of `Lazy` wrappers. -- Do not rely on specific thread pool thread assignments in tests. -- Do not swallow `OperationCanceledException` in catch-all handlers. - -**Severity:** Race condition or data corruption → **critical**. Swallowed cancellation → **high**. Missing async test → **medium**. - -**Hotspots:** `src/Compiler/Service/`, `src/Compiler/Facilities/`, `src/FSharp.Core/` - ---- +### 17. Diagnostic Quality -### 8. 
IDE Performance & Reactor Efficiency - -The compiler service must respond to editor keystrokes without unnecessary recomputation. +Error and warning messages are the compiler's user interface. They must be precise, consistent, and actionable. **CHECK:** -- Verify that `IncrementalBuilder` caching prevents duplicate builder creation per project. -- Verify that changes do not trigger endless project rechecks by examining reactor traces. -- Analyze reactor trace output to identify unnecessary work triggered by keystrokes. -- Avoid triggering brace matching and navigation bar updates on non-structural keystrokes. -- Verify each diagnostic service makes exactly one request per keystroke. -- Investigate all causes of needless project rebuilding in IDE, not just the first one found. -- Test IDE changes on large solutions before merging. +- Structure error messages as: error statement, then analysis, then actionable advice. +- Format suggestion messages as single-line when `--vserrors` is enabled. +- Emit a warning rather than silently ignoring unsupported default parameter values. +- Eagerly format diagnostics at production time to prevent parameter leakage in parallel checking. +- Only rename error identifiers to reflect actual error message content. +- Enable unused variable warnings by default in projects to catch common bugs. -**Severity:** Endless recheck loop → **critical**. Unnecessary rebuild → **high**. Missing trace verification → **medium**. +**Severity:** Misleading diagnostic → **high**. Inconsistent format → **medium**. Wording improvement → **low**. -**Hotspots:** `src/Compiler/Service/`, `src/FSharp.Compiler.LanguageServer/` +**Hotspots:** `src/Compiler/FSComp.txt`, `src/Compiler/Checking/` --- -### 9. Editor Integration & UX +### 18. Debug Experience Quality -VS extension and editor features must be correct, consistent, and not regress existing workflows. +Debug stepping, breakpoints, and locals display must work correctly. 
Debug experience regressions silently break developer workflows.

 **CHECK:**
-- Separate legacy project and CPS project modalities clearly before reasoning about changes.
-- Revert problematic editor features to preview status when they cause regressions.
-- Ship new warnings off-by-default with tests and an IDE CodeFix.
-- Use existing compiler helpers for parameter name matching in IDE CodeFixes.
-- Fix inconsistent paste indentation behavior that varies by cursor column position.
-- Make project cache size a user-configurable setting.
-- Use Roslyn mechanisms for analyzer scheduling while tuning delays.
+- Ensure debug points and sequence points enable correct stepping behavior.
+- Manually verify debug stepping for loops, while loops, task code, list/array expressions, and sequence expressions.
+- Ensure all required FSharp.Core target framework builds are produced to avoid `FileNotFoundException` in VS debugging.
+- Plan ahead and make sure the binary writers can emit debug information before plumbing it through from the type checker.
+- Solve debuggability problems generally through techniques that also apply to user-written code.

-**Severity:** Editor regression in common workflow → **critical**. Inconsistent behavior → **high**. Missing CodeFix → **medium**.
+**Severity:** Breakpoint regression → **critical**. Debug stepping regression → **high**. Missing manual verification → **medium**.

-**Hotspots:** `vsintegration/`, `src/FSharp.Compiler.LanguageServer/`
+**Hotspots:** `src/Compiler/CodeGen/`, `src/Compiler/AbstractIL/`

 ---

-### 10. Diagnostic Quality
+### 19. Parallel Compilation & Determinism

-Error and warning messages are the compiler's user interface. They must be precise, consistent, and actionable.
+Parallel type checking must produce bit-identical output to sequential compilation.

 **CHECK:**
-- Structure error messages as: error statement, then analysis, then advice.
-- Only rename error identifiers to reflect actual error message content.
-- Format suggestion messages as single-line when `--vserrors` is enabled. -- Use clearer wording in quotation-related error messages. -- Emit a warning rather than silently ignoring unsupported default parameter values. -- Enable unused variable warnings by default in projects to catch common bugs. -- Eagerly format diagnostics at production time to prevent parameter leakage in parallel checking. +- Investigate root cause of parallel checking failures before attempting fixes. +- Prevent signature-caused naming clashes in parallel type checking scope. +- Ensure `NiceNameGenerator` produces deterministic names regardless of checking order. +- Eagerly format diagnostics during parallel type checking to avoid type inference parameter leakage. +- Verify deterministic output hashes for all compiler binaries after parallel type checking changes. +- Use ILDASM to diff binaries when investigating determinism bugs. +- Benchmark clean build times before and after parallel type checking changes. -**Severity:** Misleading diagnostic → **high**. Inconsistent format → **medium**. Wording improvement → **low**. +**Severity:** Non-deterministic output → **critical**. Scoping error → **high**. Missing benchmark → **medium**. -**Hotspots:** `src/Compiler/FSComp.txt`, `src/Compiler/Checking/` +**Hotspots:** `src/Compiler/Checking/`, `src/Compiler/Driver/` --- -### 11. Feature Gating & Compatibility +### 20. Feature Gating & Compatibility New features must be gated behind language version checks. Breaking changes require RFC process. @@ -235,11 +437,11 @@ New features must be gated behind language version checks. Breaking changes requ - Gate new language features behind a `LanguageFeature` flag even if shipped as bug fixes. - Ship experimental features off-by-default and discuss enablement strategy with stakeholders. - Factor out cleanup changes separately from feature enablement. -- Require an fslang suggestion and RFC for language additions. 
- Assess compound expression compatibility when introducing new syntax. - Reject changes that alter the C#/.NET visible assembly surface as breaking changes. - Create an RFC retroactively for breaking changes that were merged without one. -- Treat all parser changes as high-risk and review thoroughly. +- Clarify whether IDE features are gated behind specific warning flags or are always active. +- Breaking behavior changes to resource naming must be gated behind an opt-in property. **Severity:** Ungated breaking change → **critical**. Missing RFC → **high**. Bundled cleanup+feature → **medium**. @@ -247,90 +449,69 @@ New features must be gated behind language version checks. Breaking changes requ --- -### 12. Idiomatic F# & Naming +### Additional Dimensions (Evaluate When Applicable) -Compiler code should follow F# idioms and use clear, consistent terminology for internal concepts. +#### Compiler Performance Measurement -**CHECK:** -- Use 4-space indent before pipe characters in compiler code. -- Use `let mutable` instead of `ref` cells for mutable state. -- Use `ResizeArray` type alias instead of `System.Collections.Generic.List<_>`. -- Use pipeline operators for clearer data flow in service code. -- Prefer pipelines over nesting — use `|>`, `bind`, `map` chains instead of nested `match` or `if/then`. -- Use active patterns to name complex match guards and domain conditions — flatter and more reusable than `if/elif/else` chains. -- Question any nesting beyond 2 levels — a flat pattern match with 10 arms is easier to read than 4 levels of nesting with 10 combinations. Prefer wide over deep. -- Shadow variables when rebinding to improve code clarity. -- Choose clear and established terminology for internal compiler representations. -- Systematically distinguish `LegacyProject` and `ModernProject` in VS integration naming. +- Require `--times` output, benchmarks, or profiler data for performance claims. 
+- Acknowledge that F# compiler service operations are intrinsically more expensive than their C# counterparts (larger work chunks, no parallelism, more expensive type inference).
+- Measure and report build time impact.

-**Severity:** Deprecated construct → **medium**. Naming confusion → **medium**. Deep nesting → **medium**. Style deviation → **low**.
+#### Memory Footprint Reduction

-**Hotspots:** All `src/Compiler/` directories, `vsintegration/`
+- Minimize heap allocations and GC pressure in hot paths.
+- Consider the 2GB threshold for 32-bit VS processes.
+- Use weak references for long-lived data. Use `ConditionalWeakTable` for caches keyed by GC-collected objects.
+- Use struct tuples for retained data to reduce allocation overhead.

----
+#### C# Interop Fidelity

-### 13. Code Structure & Technical Debt
+- Ensure F# types and APIs are usable from C# without friction.
+- Wait until the final shape of a C# feature is committed before matching it.

-Keep the codebase clean. Extract helpers, remove dead code, and avoid ad-hoc patches.
+#### Cross-Platform Correctness

-**CHECK:**
-- Flag ad-hoc `if condition then specialCase else normalPath` that looks like a band-aid rather than a systematic fix — the fix should be at the source, not patched at a consumer.
-- Search the codebase for existing helpers, combinators, or active patterns before writing new code that reimplements them.
-- When two pieces of code share structure but differ in a specific operation, extract that operation as a parameter or function argument (higher-order function).
-- Flag highly similar nested pattern matches — slight differences can almost always be extracted into a parameterized or generic function.
-- Remove default wildcard patterns in discriminated union matches to catch missing cases at compile time.
-- Extract duplicated logic into a shared function with input arguments.
-- Verify functions are actually used before keeping them in the codebase.
-- Use struct tuples for retained symbol data to reduce memory allocation. -- Use `ConditionalWeakTable` for caches keyed by GC-collected objects. -- Fix compiler warnings like unused value bindings before merging. -- Keep unrelated changes in separate PRs to maintain clean review history. -- Follow existing abstraction patterns (e.g., `HasFSharpAttribute`) instead of ad-hoc checks. -- Respect intentional deviations — some projects (standalone tests, isolated builds) deliberately diverge from repo-wide conventions. Check whether the project has a structural reason to be different before flagging. +- Test on all supported platforms; avoid platform-specific assumptions. +- Consider Mono, Linux, macOS when touching paths, resources, or runtime features. -**Severity:** Unreachable code path → **high**. Band-aid patch → **high**. Duplication → **medium**. Missing helper extraction → **low**. +#### Computation Expression Semantics -**Hotspots:** `src/Compiler/Checking/`, `src/Compiler/Optimize/`, `src/Compiler/Service/` +- Tail call behavior in seq and async builders depends on builder implementation, not IL tail calls. +- Prefer orthogonal decisions that work for all code, including user-defined CEs. ---- +#### Type Provider Robustness -### 14. API Surface & Public Contracts +- Handle type provider failures gracefully without crashing the compiler. +- Design-time hosting and framework compatibility must be considered. -FSharp.Core and FCS public APIs are permanent commitments. Changes must be deliberate. +#### Signature File Discipline -**CHECK:** -- Reject changes that alter the C#/.NET visible assembly surface as breaking changes. -- Verify accuracy of IL-to-expression inversion and add systematic tests for quoted expression decompilation. -- Ensure decompiled expression trees would pass type checking when reconverted to code. -- Fix the root cause in `ConvExprPrim` rather than patching individual expression decompilation cases. 
-- Document the purpose of new public API arguments in XML docs. -- Update exception XML docs in `.fsi` files when behavior changes. -- Name potentially long-running FCS API methods with descriptive prefixes like `GetOptimized`. -- Systematically apply `InlineIfLambda` to every inlined function taking a lambda applied only once. - -**Severity:** Unintended public API break → **critical**. Missing XML docs → **medium**. Naming convention → **low**. +- Keep signature files in sync and use them to control API surface. +- `.fsi` files define the public contract; implementation files must match. -**Hotspots:** `src/FSharp.Core/`, `src/Compiler/Service/` - ---- +#### Build Infrastructure -### 15. Build & Packaging Integrity +- Keep build scripts simple and cross-platform compatible. +- Prefer MSBuild targets over shell script wrappers. +- Eliminate unnecessary build dependencies. -Build configuration and NuGet packaging must be correct, reproducible, and not introduce unnecessary dependencies. +#### Code Structure & Technical Debt -**CHECK:** -- Avoid packaging internal files as content files in NuGet packages. -- Place non-library binaries in `tools/` directory rather than `lib/` in NuGet packages. -- Remove incorrect package dependencies from FSharp.Core nuspec. -- Be mindful of package size impact when adding dependencies to core assemblies. -- Run `build proto` or `git clean -xfd` when encountering stale build state issues. -- Preserve parallel build settings unless there is a documented reason to remove them. -- Verify nuspec comments match the correct package version. -- Fix dynamic loader to handle transitive native references for NuGet packages in FSI. +- Search the codebase for existing helpers, combinators, or active patterns before writing new code. +- When two pieces of code share structure but differ in a specific operation, extract that operation as a parameter (higher-order function). 
+- Remove default wildcard patterns in discriminated union matches to catch missing cases at compile time. +- Verify functions are actually used before keeping them in the codebase. +- Keep unrelated changes in separate PRs. +- Follow existing abstraction patterns (e.g., `HasFSharpAttribute`) instead of ad-hoc checks. +- Respect intentional deviations — some projects deliberately diverge from repo-wide conventions for structural reasons. -**Severity:** Wrong package layout → **high**. Stale build state → **medium**. Missing comment → **low**. +#### Naming & Formatting -**Hotspots:** `eng/`, `setup/`, `src/Compiler/Driver/` +- Choose precise, descriptive names that reflect actual semantics. +- Document or name tuple fields in AST union cases to clarify their meaning rather than leaving them as anonymous positional values. +- Use 4-space indent before pipe characters. Use `let mutable` instead of `ref` cells. Use `ResizeArray` type alias. +- Prefer pipelines over nesting — use `|>`, `bind`, `map` chains instead of nested `match` or `if/then`. +- Question any nesting beyond 2 levels — prefer wide over deep. --- @@ -348,47 +529,52 @@ Execute review in five waves, each building on the previous. ### Wave 1: Structural Scan 1. Verify every behavioral change has a corresponding test (Dimension 1). -2. Check feature gating — new features must have `LanguageFeature` guards (Dimension 11). -3. Verify no unintended public API changes (Dimension 14). -4. Check for binary compatibility concerns in pickle/import code (Dimension 3). +2. Check feature gating — new features must have `LanguageFeature` guards (Dimension 20). +3. Verify no unintended public API changes (Dimension 7). +4. Check for binary compatibility concerns in pickle/import code (Dimension 12). +5. If FSharp.Core is touched, apply FSharp.Core Stability checks (Dimension 2). ### Wave 2: Correctness Deep-Dive -1. Trace type checking changes through constraint solving and inference (Dimension 2). -2. 
Verify IL emission correctness with both Debug and Release optimizations (Dimension 4). -3. Validate AST node accuracy against source syntax (Dimension 5). -4. Check parallel determinism if checking/name-generation code is touched (Dimension 6). +1. Trace type checking changes through constraint solving and inference (Dimension 8). +2. Verify IL emission correctness with both Debug and Release optimizations (Dimension 5). +3. Validate AST node accuracy against source syntax (Dimension 15). +4. Check parallel determinism if checking/name-generation code is touched (Dimension 19). +5. Verify optimization correctness — no semantic-altering transforms (Dimension 6). +6. Verify struct semantics if value types are involved (Dimension 9). ### Wave 3: Runtime & Integration -1. Verify concurrency safety — no races, proper cancellation, stack traces preserved (Dimension 7). -2. Check IDE reactor impact — no unnecessary rechecks or keystroke-triggered rebuilds (Dimension 8). -3. Validate editor feature behavior across project types (Dimension 9). -4. Review diagnostic message quality (Dimension 10). +1. Verify concurrency safety — no races, proper cancellation, stack traces preserved (Dimension 13). +2. Check IDE reactor impact — no unnecessary rechecks or keystroke-triggered rebuilds (Dimension 10). +3. Verify overload resolution correctness if constraint solving changes (Dimension 11). +4. Check incremental checking correctness — no stale results (Dimension 14). +5. Review diagnostic message quality (Dimension 17). +6. Verify debug experience — stepping, breakpoints, locals (Dimension 18). ### Wave 4: Quality & Polish -1. Check F# idiom adherence and naming consistency (Dimension 12). -2. Look for code structure improvements — dead code, duplication, missing abstractions (Dimension 13). -3. Verify build and packaging correctness (Dimension 15). +1. Check code structure — dead code, duplication, missing abstractions. +2. Verify naming consistency and F# idiom adherence. +3. 
Verify build and packaging correctness. 4. Confirm all test baselines are updated and explained. ## Folder Hotspot Mapping | Directory | Primary Dimensions | Secondary Dimensions | |---|---|---| -| `src/Compiler/Checking/` | Type System (2), Parallel Compilation (6), Feature Gating (11) | Diagnostics (10), Code Structure (13) | -| `src/Compiler/CodeGen/` | IL Emission (4) | Test Coverage (1), Debug Correctness (4) | -| `src/Compiler/SyntaxTree/` | Parser Integrity (5) | Feature Gating (11), Code Structure (13) | -| `src/Compiler/TypedTree/` | Binary Compatibility (3), Type System (2) | Parallel Compilation (6) | -| `src/Compiler/Optimize/` | IL Emission (4), Code Structure (13) | Test Coverage (1) | -| `src/Compiler/AbstractIL/` | IL Emission (4) | Debug Correctness (4) | -| `src/Compiler/Driver/` | Binary Compatibility (3), Build (15) | Parallel Compilation (6) | -| `src/Compiler/Service/` | IDE Performance (8), Concurrency (7) | API Surface (14) | -| `src/Compiler/Facilities/` | Feature Gating (11), Concurrency (7) | Naming (12) | -| `src/Compiler/FSComp.txt` | Diagnostic Quality (10) | — | -| `src/FSharp.Core/` | API Surface (14), Concurrency (7) | Idiomatic F# (12) | -| `src/FSharp.Compiler.LanguageServer/` | IDE Performance (8), Editor UX (9) | Concurrency (7) | -| `vsintegration/` | Editor Integration (9) | IDE Performance (8), Naming (12) | -| `tests/` | Test Coverage (1) | All dimensions | -| `eng/`, `setup/` | Build & Packaging (15) | — | +| `src/Compiler/Checking/` | Type System (8), Overload Resolution (11), Struct Awareness (9) | Feature Gating (20), Parallel Compilation (19), Diagnostics (17) | +| `src/Compiler/CodeGen/` | IL Emission (5), Debug Experience (18) | Struct Awareness (9), Test Coverage (1) | +| `src/Compiler/SyntaxTree/` | Parser Integrity (15), Typed Tree (8) | Feature Gating (20), Debug Experience (18) | +| `src/Compiler/TypedTree/` | Binary Compatibility (12), Type System (8), Typed Tree (8) | Backward Compat (3), Memory Footprint | +| 
`src/Compiler/Optimize/` | IL Emission (5), Optimization Correctness (6) | Test Coverage (1) | +| `src/Compiler/AbstractIL/` | IL Emission (5) | Backward Compat (3), Memory Footprint | +| `src/Compiler/Driver/` | Build Infrastructure, Incremental Checking (14) | Compiler Performance, Cancellation (13) | +| `src/Compiler/Service/` | FCS API Surface (7), IDE Performance (10), Concurrency (13) | Test Coverage (1), Incremental Checking (14) | +| `src/Compiler/Facilities/` | Feature Gating (20), Concurrency (13) | Naming | +| `src/Compiler/FSComp.txt` | Diagnostic Quality (17) | — | +| `src/FSharp.Core/` | FSharp.Core Stability (2), XML Doc, Backward Compat (3) | RFC Process (4), Test Coverage (1) | +| `src/FSharp.Compiler.LanguageServer/` | IDE Performance (10), Editor UX | Concurrency (13) | +| `vsintegration/` | IDE Responsiveness (10), Memory Footprint | Build Infrastructure, Cross-Platform | +| `tests/` | Test Coverage (1), Compiler Performance | Backward Compat (3), Cross-Platform | +| `eng/`, `setup/` | Build Infrastructure | Cross-Platform | diff --git a/.github/instructions/ExpertReview.instructions.md b/.github/instructions/ExpertReview.instructions.md index 67dbb5d73e..959694f8fe 100644 --- a/.github/instructions/ExpertReview.instructions.md +++ b/.github/instructions/ExpertReview.instructions.md @@ -9,17 +9,37 @@ applyTo: - Add tests for every behavioral change before merging — missing coverage is the most common review blocker. - Explain all new errors in test baselines and confirm each is expected behavior, not a regression. +- Test on IL-defined members, not just F#-defined ones. Check `UseNullAsTrueValue` edge cases. ## Type safety - Use `tryTcrefOfAppTy` or the `AppTy` active pattern instead of direct `TType` pattern matches — direct matches miss erased types. +- Always call `stripTyEqns` before pattern matching on types. Never match `TType_app` directly. 
- Remove default wildcard patterns in discriminated union matches so the compiler catches missing cases. +- Raise internal compiler errors for unexpected type forms (`TType_ucase`) rather than returning defaults. ## Feature gating - Gate every new language feature behind a `LanguageFeature` flag and ship off-by-default until stable. - Factor cleanup changes into separate commits from feature enablement. +- Major language changes require an RFC before implementation. ## Code generation - Strip debug points when matching on `Expr.Lambda` during code generation to prevent IL stack corruption. +- Verify no tail-call behavior changes — check IL diffs before and after. +- After stripping a lambda for a delegate, bind discarded unit parameters using `BindUnitVars`. + +## Backward compatibility + +- Codegen changes that depend on new FSharp.Core functions must guard against older FSharp.Core versions. +- Do not alter the C#/.NET visible assembly surface without treating it as a breaking change. + +## Concurrency + +- Thread cancellation tokens through all async operations; uncancellable operations on the reactor thread are blocking bugs. +- Do not add catch-all exception handlers. Never swallow `OperationCanceledException`. + +## Performance + +- Performance claims require `--times` output, benchmarks, or profiler evidence. Do not inline large functions without data. diff --git a/.github/skills/expert-review/SKILL.md b/.github/skills/expert-review/SKILL.md index dc14c8185e..f7d4290f17 100644 --- a/.github/skills/expert-review/SKILL.md +++ b/.github/skills/expert-review/SKILL.md @@ -1,6 +1,6 @@ --- name: reviewing-compiler-prs -description: "Performs multi-agent, multi-model code review of F# compiler PRs across 15 dimensions including type checking, IL emission, binary compatibility, parallel determinism, and IDE performance. Dispatches parallel assessment agents per dimension, consolidates with cross-model agreement scoring, and filters false positives. 
Invoke when reviewing compiler changes, requesting expert feedback, or performing pre-merge quality checks." +description: "Performs multi-agent, multi-model code review of F# compiler PRs across 20 dimensions including type checking, IL emission, binary compatibility, parallel determinism, and IDE performance. Dispatches parallel assessment agents per dimension, consolidates with cross-model agreement scoring, and filters false positives. Invoke when reviewing compiler changes, requesting expert feedback, or performing pre-merge quality checks." --- # Reviewing Compiler PRs @@ -10,22 +10,28 @@ Full dimension definitions and CHECK rules live in the `expert-reviewer` agent. ## When to Invoke - PR touches `src/Compiler/` — invoke the `expert-reviewer` agent -- PR touches `src/FSharp.Core/` — focus on API Surface & Concurrency -- PR touches `vsintegration/` or `LanguageServer/` — focus on IDE Performance & Editor UX -- PR touches `tests/` only — quick check: baselines explained? Cross-TFM coverage? +- PR touches `src/FSharp.Core/` — focus on FSharp.Core Stability, API Surface, Backward Compat, XML Docs +- PR touches `vsintegration/` or `LanguageServer/` — focus on IDE Responsiveness, Concurrency, Memory +- PR touches `tests/` only — quick check: baselines explained? Cross-TFM coverage? Tests actually assert? 
+- PR touches `eng/` or build scripts — focus on Build Infrastructure, Cross-Platform ## Dimension Selection | Files Changed | Focus Dimensions | |---|---| -| `Checking/`, `TypedTree/` | Type System, Parallel Compilation, Feature Gating | -| `CodeGen/`, `AbstractIL/`, `Optimize/` | IL Emission, Debug Correctness, Test Coverage | -| `SyntaxTree/`, `pars.fsy` | Parser Integrity, Feature Gating | +| `Checking/`, `TypedTree/` | Type System, Overload Resolution, Struct Awareness, Parallel Compilation, Feature Gating | +| `CodeGen/`, `AbstractIL/` | IL Emission, Debug Experience, Test Coverage | +| `Optimize/` | Optimization Correctness, IL Emission, Test Coverage | +| `SyntaxTree/`, `pars.fsy` | Parser Integrity, Feature Gating, Typed Tree Discipline | | `TypedTreePickle.*`, `CompilerImports.*` | Binary Compatibility (highest priority) | -| `Service/`, `LanguageServer/` | IDE Performance, Concurrency | +| `Service/` | FCS API Surface, IDE Responsiveness, Concurrency, Incremental Checking | +| `LanguageServer/` | IDE Responsiveness, Concurrency | +| `Driver/` | Build Infrastructure, Incremental Checking, Cancellation | +| `Facilities/` | Feature Gating, Concurrency | | `FSComp.txt` | Diagnostic Quality | -| `FSharp.Core/` | API Surface, Concurrency | -| `vsintegration/` | Editor Integration | +| `FSharp.Core/` | FSharp.Core Stability, Backward Compat, XML Docs, RFC Process | +| `vsintegration/` | IDE Responsiveness, Memory Footprint, Cross-Platform | +| `eng/`, `setup/`, build scripts | Build Infrastructure, Cross-Platform | ## Multi-Model Dispatch @@ -49,10 +55,20 @@ Dispatch one agent per selected dimension. For high-confidence reviews, assess e ## Self-Review Checklist -1. [ ] Every behavioral change has a test -2. [ ] Test baselines updated with explanations -3. [ ] New language features have a `LanguageFeature` guard -4. [ ] No unintended public API surface changes -5. [ ] Cleanup changes separate from feature enablement -6. 
[ ] Tests pass in both Debug and Release -7. [ ] Error messages follow: statement → analysis → advice +Ordered by blocking frequency: + +1. [ ] Every behavioral change has a test — **most common review blocker** +2. [ ] FSharp.Core changes maintain strict binary compatibility +3. [ ] No unintended public API surface changes (FCS or FSharp.Core) +4. [ ] New language features have a `LanguageFeature` guard and RFC +5. [ ] Overload resolution and type inference behavior preserved +6. [ ] No raw `TType_*` pattern matching without `stripTyEqns` +7. [ ] Cancellation tokens threaded through async operations +8. [ ] IL emission verified in both Debug and Release +9. [ ] Struct semantics respected (no unnecessary copies) +10. [ ] Cleanup changes separate from feature enablement +11. [ ] Test baselines updated with explanations +12. [ ] Tests pass in both Debug and Release +13. [ ] Error messages follow: statement → analysis → advice +14. [ ] No catch-all exception handlers; `OperationCanceledException` handled specially +15. 
[ ] Debug stepping manually verified for affected control flow

From 585ebeb049de907d4aa6e322257df4d031034ae4 Mon Sep 17 00:00:00 2001
From: Tomas Grosup
Date: Wed, 18 Mar 2026 14:54:58 +0100
Subject: [PATCH 08/18] Remove overfitted content: parallel compilation
 dimension, reactor references, test running

- Remove Dimension 19 (Parallel Compilation): single-feature overfitting.
  Determinism rules are kept in the overarching principles, where they belong.
- Remove all 'reactor thread' references (0 matches in src/; the concept is obsolete).
- Change 'Run tests in Release' to 'Ensure tests cover Debug/Release differences'.
  CI runs the tests, not developers, so the focus is on having tests, not running them.
- 20 dimensions -> 19 dimensions

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
---
 .github/agents/expert-reviewer.md             | 53 ++++++------------
 .../instructions/ExpertReview.instructions.md |  2 +-
 .github/skills/expert-review/SKILL.md         |  4 +-
 3 files changed, 20 insertions(+), 39 deletions(-)

diff --git a/.github/agents/expert-reviewer.md b/.github/agents/expert-reviewer.md
index 35eb60aab7..710761c56f 100644
--- a/.github/agents/expert-reviewer.md
+++ b/.github/agents/expert-reviewer.md
@@ -1,11 +1,11 @@
 ---
 name: expert-reviewer
-description: "Multi-dimensional code review agent for F# compiler PRs. Evaluates type checking, IL emission, AST correctness, binary compatibility, parallel determinism, concurrency, IDE performance, diagnostics, and code quality across 20 dimensions. Invoke when reviewing compiler changes, requesting expert feedback, or performing pre-merge quality checks."
+description: "Multi-dimensional code review agent for F# compiler PRs. Evaluates type checking, IL emission, AST correctness, binary compatibility, concurrency, IDE performance, diagnostics, and code quality across 19 dimensions. Invoke when reviewing compiler changes, requesting expert feedback, or performing pre-merge quality checks."
--- # Expert Reviewer -Evaluates F# compiler changes across 20 dimensions. Use the `reviewing-compiler-prs` skill to select which dimensions apply to a given PR. +Evaluates F# compiler changes across 19 dimensions. Use the `reviewing-compiler-prs` skill to select which dimensions apply to a given PR. **Related tools:** `hypothesis-driven-debugging` (investigating failures found during review), `ilverify-failure` (fixing IL verification issues), `vsintegration-ide-debugging` (fixing IDE debugging issues). @@ -17,7 +17,7 @@ Evaluates F# compiler changes across 20 dimensions. Use the `reviewing-compiler- - **Determinism is a correctness property.** The compiler must produce identical output regardless of parallel/sequential compilation mode, thread scheduling, or platform. - **Feature gating protects users.** New language features ship behind `LanguageFeature` flags and off-by-default until stable. Breaking changes require an RFC. Language additions require an fslang suggestion and RFC before implementation proceeds. - **Diagnostics are user-facing.** Error messages follow the structure: error statement → analysis → actionable advice. Wording changes need the same care as API changes. -- **IDE responsiveness is a feature.** Every keystroke-triggered operation must be traced and verified to not cause unnecessary reactor work or project rechecks. Evaluate the queue stress impact of every new FCS request type. +- **IDE responsiveness is a feature.** Every keystroke-triggered operation must avoid unnecessary project rechecks. Evaluate the queue stress impact of every new FCS request type. - **Prefer general solutions over special cases.** Do not hardwire specific library optimizations into the compiler. Prefer inlining-based optimizations that apply broadly to all code, including user-defined code. - **Evidence-based performance.** Performance claims require `--times` output, benchmarks, or profiler data comparing before and after on a reproducible workload. 
Do not inline large functions without performance evidence. - **Guard the API surface.** Public API additions to FCS and FSharp.Core must be carefully controlled. Internal data structures must never leak. Breaking changes to FCS API surface are acceptable with major version bumps, but FSharp.Core must remain stable. @@ -53,8 +53,8 @@ Every behavioral change, bug fix, and new feature requires corresponding tests b - Tests must actually assert the claimed behavior — a test that calls a function without checking results is not a test. - Confirm new tests cover behavior not already exercised by existing test suites. - Explain all new errors in test baselines and confirm they are expected. -- Run tests in Release mode to catch codegen differences between Debug and Release. -- Run tests on both desktop (.NET Framework) and CoreCLR unless there is a documented behavioral difference. +- Ensure tests cover both Debug and Release codegen differences when relevant. +- Consider cross-TFM behavior (desktop .NET Framework vs CoreCLR) for platform-sensitive changes. - Test on IL-defined members, not just F#-defined ones. - Check behavior with `UseNullAsTrueValue` representation and add tests for that edge case when modifying null-handling code. - Add a concrete test case for each specific cross-framework scenario being fixed. @@ -184,7 +184,7 @@ The FCS public API is a permanent commitment. Internal types must never leak. - Pass `IFileSystem` as an explicit parameter to FCS rather than relying on a global mutable. - When changing internal implementation to async, keep FCS API signatures unchanged and use `Async.Return` internally. - Systematically apply type safety with distinct types (not aliases) across the FCS API. -- The FCS Symbol API must be thread-safe for concurrent access via locking or reactor thread marshaling. +- The FCS Symbol API must be thread-safe for concurrent access. - Document the purpose of new public API arguments in XML docs. 
- Update exception XML docs in `.fsi` files when behavior changes. @@ -234,7 +234,7 @@ Structs have value semantics that differ fundamentally from reference types. Inc --- -### 10. IDE Responsiveness & Reactor Efficiency +### 10. IDE Responsiveness The compiler service must respond to editor keystrokes without unnecessary recomputation. @@ -242,7 +242,7 @@ The compiler service must respond to editor keystrokes without unnecessary recom - Remove unnecessary `Async.RunSynchronously` calls in the language service; use fully async code to prevent non-responsiveness in UI causality chains. - Ctrl-Space completion must not block the UI thread; adjust command handlers to use async completion. - Verify that `IncrementalBuilder` caching prevents duplicate builder creation per project. -- Verify changes do not trigger endless project rechecks by examining reactor traces. +- Verify changes do not trigger endless project rechecks. - Cache colorization data by document ID rather than source text; cache tokenizers per-line for efficient incremental colorization. - Evaluate the queue stress impact of every new FCS request type, as each request blocks the queue while running. - Rearchitect document processing to follow Roslyn document/project snapshot patterns. @@ -307,9 +307,9 @@ Async and concurrent code must handle cancellation, thread safety, and exception - Ensure thread-safety for shared mutable state. Avoid global mutable state in FCS. - Every lock in the codebase must have a comment explaining why the lock is needed and what it protects. - Prefer `System.Collections.Immutable` collections over mutable collections used as read-only. -- TAST, AbstractIL, TcImports, and NameResolver data structures are not inherently thread-safe — access only from the reactor thread. +- TAST, AbstractIL, TcImports, and NameResolver data structures are not inherently thread-safe — ensure proper synchronization. 
- Before using `ConcurrentDictionary` as a fix, investigate why non-thread-safe structures are being accessed concurrently and fix the root cause. -- Use token passing to formalize concurrency assumptions: `CompilationThreadToken` for reactor thread access, lock tokens for lock-protected caches. +- Use token passing to formalize concurrency assumptions: `CompilationThreadToken` for single-threaded compiler phases, lock tokens for lock-protected caches. - Do not swallow `OperationCanceledException` in catch-all handlers. - Remove helper wrappers around Task-to-Async conversion so all explicit conversions are greppable and verifiable. @@ -410,26 +410,7 @@ Debug stepping, breakpoints, and locals display must work correctly. Debug exper --- -### 19. Parallel Compilation & Determinism - -Parallel type checking must produce bit-identical output to sequential compilation. - -**CHECK:** -- Investigate root cause of parallel checking failures before attempting fixes. -- Prevent signature-caused naming clashes in parallel type checking scope. -- Ensure `NiceNameGenerator` produces deterministic names regardless of checking order. -- Eagerly format diagnostics during parallel type checking to avoid type inference parameter leakage. -- Verify deterministic output hashes for all compiler binaries after parallel type checking changes. -- Use ILDASM to diff binaries when investigating determinism bugs. -- Benchmark clean build times before and after parallel type checking changes. - -**Severity:** Non-deterministic output → **critical**. Scoping error → **high**. Missing benchmark → **medium**. - -**Hotspots:** `src/Compiler/Checking/`, `src/Compiler/Driver/` - ---- - -### 20. Feature Gating & Compatibility +### 19. Feature Gating & Compatibility New features must be gated behind language version checks. Breaking changes require RFC process. @@ -529,7 +510,7 @@ Execute review in five waves, each building on the previous. ### Wave 1: Structural Scan 1. 
Verify every behavioral change has a corresponding test (Dimension 1). -2. Check feature gating — new features must have `LanguageFeature` guards (Dimension 20). +2. Check feature gating — new features must have `LanguageFeature` guards (Dimension 19). 3. Verify no unintended public API changes (Dimension 7). 4. Check for binary compatibility concerns in pickle/import code (Dimension 12). 5. If FSharp.Core is touched, apply FSharp.Core Stability checks (Dimension 2). @@ -539,14 +520,14 @@ Execute review in five waves, each building on the previous. 1. Trace type checking changes through constraint solving and inference (Dimension 8). 2. Verify IL emission correctness with both Debug and Release optimizations (Dimension 5). 3. Validate AST node accuracy against source syntax (Dimension 15). -4. Check parallel determinism if checking/name-generation code is touched (Dimension 19). +4. Check parallel determinism if checking/name-generation code is touched. 5. Verify optimization correctness — no semantic-altering transforms (Dimension 6). 6. Verify struct semantics if value types are involved (Dimension 9). ### Wave 3: Runtime & Integration 1. Verify concurrency safety — no races, proper cancellation, stack traces preserved (Dimension 13). -2. Check IDE reactor impact — no unnecessary rechecks or keystroke-triggered rebuilds (Dimension 10). +2. Check IDE impact — no unnecessary rechecks or keystroke-triggered rebuilds (Dimension 10). 3. Verify overload resolution correctness if constraint solving changes (Dimension 11). 4. Check incremental checking correctness — no stale results (Dimension 14). 5. Review diagnostic message quality (Dimension 17). @@ -563,15 +544,15 @@ Execute review in five waves, each building on the previous. 
| Directory | Primary Dimensions | Secondary Dimensions | |---|---|---| -| `src/Compiler/Checking/` | Type System (8), Overload Resolution (11), Struct Awareness (9) | Feature Gating (20), Parallel Compilation (19), Diagnostics (17) | +| `src/Compiler/Checking/` | Type System (8), Overload Resolution (11), Struct Awareness (9) | Feature Gating (19), Diagnostics (17) | | `src/Compiler/CodeGen/` | IL Emission (5), Debug Experience (18) | Struct Awareness (9), Test Coverage (1) | -| `src/Compiler/SyntaxTree/` | Parser Integrity (15), Typed Tree (8) | Feature Gating (20), Debug Experience (18) | +| `src/Compiler/SyntaxTree/` | Parser Integrity (15), Typed Tree (8) | Feature Gating (19), Debug Experience (18) | | `src/Compiler/TypedTree/` | Binary Compatibility (12), Type System (8), Typed Tree (8) | Backward Compat (3), Memory Footprint | | `src/Compiler/Optimize/` | IL Emission (5), Optimization Correctness (6) | Test Coverage (1) | | `src/Compiler/AbstractIL/` | IL Emission (5) | Backward Compat (3), Memory Footprint | | `src/Compiler/Driver/` | Build Infrastructure, Incremental Checking (14) | Compiler Performance, Cancellation (13) | | `src/Compiler/Service/` | FCS API Surface (7), IDE Performance (10), Concurrency (13) | Test Coverage (1), Incremental Checking (14) | -| `src/Compiler/Facilities/` | Feature Gating (20), Concurrency (13) | Naming | +| `src/Compiler/Facilities/` | Feature Gating (19), Concurrency (13) | Naming | | `src/Compiler/FSComp.txt` | Diagnostic Quality (17) | — | | `src/FSharp.Core/` | FSharp.Core Stability (2), XML Doc, Backward Compat (3) | RFC Process (4), Test Coverage (1) | | `src/FSharp.Compiler.LanguageServer/` | IDE Performance (10), Editor UX | Concurrency (13) | diff --git a/.github/instructions/ExpertReview.instructions.md b/.github/instructions/ExpertReview.instructions.md index 959694f8fe..a7464b801f 100644 --- a/.github/instructions/ExpertReview.instructions.md +++ b/.github/instructions/ExpertReview.instructions.md @@ -37,7 
+37,7 @@ applyTo: ## Concurrency -- Thread cancellation tokens through all async operations; uncancellable operations on the reactor thread are blocking bugs. +- Thread cancellation tokens through all async operations; uncancellable long-running operations are blocking bugs. - Do not add catch-all exception handlers. Never swallow `OperationCanceledException`. ## Performance diff --git a/.github/skills/expert-review/SKILL.md b/.github/skills/expert-review/SKILL.md index f7d4290f17..47ee40f9d6 100644 --- a/.github/skills/expert-review/SKILL.md +++ b/.github/skills/expert-review/SKILL.md @@ -1,6 +1,6 @@ --- name: reviewing-compiler-prs -description: "Performs multi-agent, multi-model code review of F# compiler PRs across 20 dimensions including type checking, IL emission, binary compatibility, parallel determinism, and IDE performance. Dispatches parallel assessment agents per dimension, consolidates with cross-model agreement scoring, and filters false positives. Invoke when reviewing compiler changes, requesting expert feedback, or performing pre-merge quality checks." +description: "Performs multi-agent, multi-model code review of F# compiler PRs across 19 dimensions including type checking, IL emission, binary compatibility, and IDE performance. Dispatches parallel assessment agents per dimension, consolidates with cross-model agreement scoring, and filters false positives. Invoke when reviewing compiler changes, requesting expert feedback, or performing pre-merge quality checks." --- # Reviewing Compiler PRs @@ -19,7 +19,7 @@ Full dimension definitions and CHECK rules live in the `expert-reviewer` agent. 
| Files Changed | Focus Dimensions | |---|---| -| `Checking/`, `TypedTree/` | Type System, Overload Resolution, Struct Awareness, Parallel Compilation, Feature Gating | +| `Checking/`, `TypedTree/` | Type System, Overload Resolution, Struct Awareness, Feature Gating | | `CodeGen/`, `AbstractIL/` | IL Emission, Debug Experience, Test Coverage | | `Optimize/` | Optimization Correctness, IL Emission, Test Coverage | | `SyntaxTree/`, `pars.fsy` | Parser Integrity, Feature Gating, Typed Tree Discipline | From 8ad7d8c30dd00b346589634337afced7f95a26c6 Mon Sep 17 00:00:00 2001 From: Tomas Grosup Date: Wed, 18 Mar 2026 15:00:04 +0100 Subject: [PATCH 09/18] Remove overfitted single-PR CHECK items, generalize rules Audit all 19 dimensions for rules that only make sense for the specific historical PR they were derived from. Rules that reference specific functions, specific files, or specific implementation patterns that a future reviewer wouldn't be able to apply were either: - Generalized (e.g., 'Use AppTy active pattern' kept, but 'Cross-reference GenFieldInit in IlxGen.fs when modifying infos.fs' removed) - Removed when too narrow (e.g., 'Cache colorization data by document ID') 581 -> 508 lines. CHECK items reduced from verbose per-dimension to focused generalizable rules. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --- .github/agents/expert-reviewer.md | 127 +++++++++--------------------- 1 file changed, 37 insertions(+), 90 deletions(-) diff --git a/.github/agents/expert-reviewer.md b/.github/agents/expert-reviewer.md index 710761c56f..05e503b2f0 100644 --- a/.github/agents/expert-reviewer.md +++ b/.github/agents/expert-reviewer.md @@ -55,9 +55,6 @@ Every behavioral change, bug fix, and new feature requires corresponding tests b - Explain all new errors in test baselines and confirm they are expected. - Ensure tests cover both Debug and Release codegen differences when relevant. 
- Consider cross-TFM behavior (desktop .NET Framework vs CoreCLR) for platform-sensitive changes. -- Test on IL-defined members, not just F#-defined ones. -- Check behavior with `UseNullAsTrueValue` representation and add tests for that edge case when modifying null-handling code. -- Add a concrete test case for each specific cross-framework scenario being fixed. - Place tests in the appropriate layer based on what changed: - Typecheck tests: type inference, constraint solving, overload resolution, expected warnings/errors - SyntaxTreeTests: parser/syntax changes @@ -79,14 +76,12 @@ FSharp.Core is the one assembly every F# program references. Changes here have o **CHECK:** - Maintain strict backward binary compatibility. No public API removals or signature changes. -- Use existing FSharp.Core reflection APIs (e.g., `FSharpType.GetTupleElements()`) rather than reimplementing tuple inspection logic. -- Prefer consolidated FSharp.Core changes through a single well-reviewed PR over multiple competing implementations. - Verify compilation order constraints — FSharp.Core has strict file ordering requirements. - Add unit tests to `FSharp.Core.Tests` for every new or changed function. - Minimize FCS's FSharp.Core dependency — do not add new references from the compiler to FSharp.Core unnecessarily. - XML doc comments are mandatory for all public APIs in FSharp.Core. -- Changes require an RFC for new API additions. -- Systematically apply `InlineIfLambda` to every inlined function taking a lambda applied only once. +- New API additions require an RFC. +- Apply `InlineIfLambda` to inlined functions taking a lambda applied only once. **Severity:** Binary compat break in FSharp.Core → **critical**. Missing tests → **high**. Missing XML docs → **medium**. @@ -100,13 +95,10 @@ Changes must not break existing compiled code or binary compatibility. **CHECK:** - Verify changes do not break existing compiled code or binary compatibility. 
-- Breaking changes should be gated as a strongly worded warning first, not a hard error. -- Do not assume the nullary union case comes first; search for it explicitly as the ordering may be relaxed in the future. -- Defer language changes that might conflict with future constrained extension syntax to avoid locking in behavior that blocks later evolution. -- Revert existing public API signatures and add new APIs alongside them rather than replacing. +- Breaking changes should be gated as a warning first, not a hard error. +- Add new APIs alongside existing ones rather than replacing signatures. - Codegen changes that depend on new FSharp.Core functions must guard against older FSharp.Core versions. -- Do not reduce information shown to users compared to previous VS versions. -- Question default value changes that alter existing IDE behavior. +- Consider forward compatibility — avoid locking in behavior that blocks future language evolution. **Severity:** Binary compat break → **critical**. Behavioral change without flag → **high**. Missing compat test → **high**. @@ -122,10 +114,8 @@ Major language changes require an RFC and design discussion before implementatio - Require an fslang suggestion and RFC for language and API additions. - Submit one consolidated PR per RFC rather than multiple partial PRs. - Update or create the RFC document when implementing a language or interop feature change. +- Keep design discussion in the RFC, not in PR comments. - Do not rush language changes into a release without proper design review. -- Approve design adjustments when they correct version-specific behavior that was not intended. -- Please only discuss the implementation in PR comments, not the design — design belongs in the RFC. -- Close PRs that haven't been touched for a very long time. **Severity:** Language change without RFC → **critical**. Missing RFC update → **high**. Design discussion in PR → **medium**. 
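The `InlineIfLambda` rule in the FSharp.Core checklist above is easiest to review with a concrete shape in mind. The sketch below is hypothetical code, not taken from the repository; it shows the pattern the rule targets: an `inline` function whose lambda parameter is applied exactly once.

```fsharp
// Hypothetical illustration of the InlineIfLambda guideline (not repository code).
// `action` is a lambda parameter applied exactly once inside an inline function,
// so marking it [<InlineIfLambda>] lets the compiler inline the lambda body at
// each call site instead of allocating an FSharpFunc closure.
let inline applyOnce ([<InlineIfLambda>] action: int -> int) (x: int) : int =
    action x

// A call such as `applyOnce (fun v -> v + 1) 41` can then compile down to the
// addition itself, with no closure allocation.
```

A reviewer applying this rule would flag an `inline` function like this one that takes a once-applied lambda but omits the attribute.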
@@ -139,15 +129,10 @@ Code generation must produce correct, verifiable IL. Wrong IL produces silent ru **CHECK:** - Ensure emitted IL is verifiable and matches expected instruction patterns. -- Verify no changes in tail-calling behavior from your change — check IL diffs. -- After stripping a lambda for a delegate, bind discarded unit parameters into expression bodies using `BindUnitVars`. -- Verify generated IL when changing null-check patterns; prefer fixing codegen for idiomatic patterns rather than changing the source pattern. -- Verify IL binary writing produces correct PDB and metadata table sizes. -- Strip debug points when matching on `Expr.Lambda` during code generation to prevent IL stack corruption. +- Verify no changes in tail-calling behavior — check IL diffs before and after. - Test code changes with optimizations both enabled and disabled. -- Check that return-a-tuple-and-match-on-it formulations produce code at least as good as before. -- Cross-reference `GenFieldInit` in `IlxGen.fs` when modifying field initialization in `infos.fs`. -- If a problem like debuggability or performance exists, solve it generally through techniques that also apply to user-written code. +- Solve debuggability or performance problems generally through techniques that also apply to user-written code, not special-cased optimizations. +- Strip debug points when matching on `Expr.Lambda` during code generation to prevent IL stack corruption. **Severity:** Incorrect IL → **critical**. Debug stepping regression → **high**. Missing IL test → **medium**. @@ -179,12 +164,9 @@ The FCS public API is a permanent commitment. Internal types must never leak. **CHECK:** - Keep internal implementation details out of the public FCS API. -- Create dedicated `FSharpChecker` instances for VS integration to allow adjusting parameters like project cache size independently. 
-- Place keyword metadata tables in the compiler layer (`lexhelp.fs`) rather than the editor layer so all FCS consumers benefit. -- Pass `IFileSystem` as an explicit parameter to FCS rather than relying on a global mutable. -- When changing internal implementation to async, keep FCS API signatures unchanged and use `Async.Return` internally. -- Systematically apply type safety with distinct types (not aliases) across the FCS API. - The FCS Symbol API must be thread-safe for concurrent access. +- When changing internal implementation to async, keep FCS API signatures unchanged. +- Apply type safety with distinct types (not aliases) across the FCS API. - Document the purpose of new public API arguments in XML docs. - Update exception XML docs in `.fsi` files when behavior changes. @@ -199,15 +181,11 @@ The FCS public API is a permanent commitment. Internal types must never leak. Type checking and inference must be sound. Subtle bugs in constraint solving, overload resolution, or scope handling produce incorrect programs. **CHECK:** -- Use `tryTcrefOfAppTy` or the `AppTy` active pattern instead of direct `TType` pattern matches. -- Always call `stripTyEqns`/`stripTyEqnsA` before pattern matching on types. -- Do not match using `TType_app` directly; use the `AppTy` active pattern. -- Raise internal compiler errors for `TType_ucase` and other unexpected type forms. -- Avoid matching on empty contents as a proxy for signature file presence — use explicit markers. -- Exclude current file signatures from `tcState` when processing implementation files. -- Add error recovery for namespace fragment combination differences between parallel and sequential paths. +- Always call `stripTyEqns`/`stripTyEqnsA` before pattern matching on types. Use `AppTy` active pattern, not `TType_app` directly. +- Use `tryTcrefOfAppTy` or typed active patterns instead of direct `TType` pattern matches. +- Raise internal compiler errors for unexpected type forms rather than returning defaults. 
- Use precise type predicates when matching on the typed tree. -- Avoid explicit pattern matching on `ILEvent`, `ILEventInfo`, `ILFieldInfo`, `ILTypeInfo`, `ILProp`, `ILMeth`; use their property accessors instead. +- Use property accessors on IL info types rather than explicit pattern matching. **Severity:** Type system unsoundness → **critical**. Incorrect inference in edge cases → **high**. @@ -221,12 +199,9 @@ Structs have value semantics that differ fundamentally from reference types. Inc **CHECK:** - Respect struct semantics: no unnecessary copies, proper byref handling. -- Recognize that C# treats `default(StructType)` and `new StructType()` as valid literal constants for optional parameters, emitting `.param = nullref` in IL. -- Before converting a type to struct, measure the impact — large structs lose sharing and can reduce throughput through data copying. -- Display struct types using the `[]` attribute syntax rather than the `struct` keyword for consistency with modern F# style. -- Investigate and fix incorrect behavior for struct discriminated unions rather than working around it. -- Use struct tuples instead of ref tuples for retained symbol use data to reduce heap allocations. -- Always add tests for struct variants when changing union type behavior. +- Before converting a type to struct, measure the impact — large structs lose sharing and can reduce throughput. +- Always add tests for struct variants when changing union or record type behavior. +- Investigate and fix incorrect behavior for struct types rather than working around it. **Severity:** Incorrect struct copy semantics → **critical**. Missing struct tests → **high**. Style → **low**. @@ -239,14 +214,10 @@ Structs have value semantics that differ fundamentally from reference types. Inc The compiler service must respond to editor keystrokes without unnecessary recomputation. 
**CHECK:** -- Remove unnecessary `Async.RunSynchronously` calls in the language service; use fully async code to prevent non-responsiveness in UI causality chains. -- Ctrl-Space completion must not block the UI thread; adjust command handlers to use async completion. -- Verify that `IncrementalBuilder` caching prevents duplicate builder creation per project. +- Use fully async code in the language service; avoid unnecessary `Async.RunSynchronously`. - Verify changes do not trigger endless project rechecks. -- Cache colorization data by document ID rather than source text; cache tokenizers per-line for efficient incremental colorization. -- Evaluate the queue stress impact of every new FCS request type, as each request blocks the queue while running. -- Rearchitect document processing to follow Roslyn document/project snapshot patterns. -- Signature help should show all parameters because F# users rely on this as their primary information source. +- Evaluate the queue stress impact of every new FCS request type. +- Caching must prevent duplicate work per project. - Test IDE changes on large solutions before merging. **Severity:** Endless recheck loop → **critical**. UI thread block → **high**. Missing trace verification → **medium**. @@ -262,11 +233,9 @@ Overload resolution is one of the most complex and specification-sensitive areas **CHECK:** - Ensure overload resolution follows the language specification precisely. - Verify that language features work correctly with truly overloaded method sets, not just single-overload defaults. -- Changes that loosen overload resolution rules constitute language changes and must be analyzed carefully. -- Use `ExcludeHiddenOfMethInfos` consistently to filter methods during SRTP resolution, not just normal name resolution. +- Changes that loosen overload resolution rules constitute language changes and need careful analysis. - Apply method hiding filters consistently in both normal resolution and SRTP constraint solving paths. 
-- When adding nullable parameter interop, add rules preferring `X` over `Nullable` and compare full argument lists as a last-resort tiebreaker. -- For complex SRTP corner cases, the implementation effectively serves as the specification; changes must pin existing behavior with tests. +- For complex SRTP corner cases, changes must pin existing behavior with tests. **Severity:** Overload resolution regression → **critical**. SRTP behavior change → **high**. Missing test → **medium**. @@ -300,18 +269,11 @@ Async and concurrent code must handle cancellation, thread safety, and exception **CHECK:** - Thread cancellation tokens through all async operations. All FCS requests must support cancellation. -- Thread `CancellationToken` through each step of the incremental build graph so FCS respects cancellation in a timely way. -- Use `CommonRoslynHelpers.StartAsyncAsTask` with the `cancellationToken` parameter instead of manual task creation. -- Time-sliced type checking must respect a `CancellationToken`; adjust `Eventually` combinators to support cancellation. -- Pass `CancellationToken` into type provider calls to allow cooperative cancellation. -- Ensure thread-safety for shared mutable state. Avoid global mutable state in FCS. -- Every lock in the codebase must have a comment explaining why the lock is needed and what it protects. -- Prefer `System.Collections.Immutable` collections over mutable collections used as read-only. -- TAST, AbstractIL, TcImports, and NameResolver data structures are not inherently thread-safe — ensure proper synchronization. +- Ensure thread-safety for shared mutable state. Avoid global mutable state. +- Every lock must have a comment explaining what it protects. - Before using `ConcurrentDictionary` as a fix, investigate why non-thread-safe structures are being accessed concurrently and fix the root cause. 
-- Use token passing to formalize concurrency assumptions: `CompilationThreadToken` for single-threaded compiler phases, lock tokens for lock-protected caches. - Do not swallow `OperationCanceledException` in catch-all handlers. -- Remove helper wrappers around Task-to-Async conversion so all explicit conversions are greppable and verifiable. +- Do not add catch-all exception handlers. **Severity:** Race condition or data corruption → **critical**. Swallowed cancellation → **high**. Missing async test → **medium**. @@ -324,12 +286,10 @@ Async and concurrent code must handle cancellation, thread safety, and exception Incremental checking must invalidate stale results correctly. Stale data causes timing-dependent glitches. **CHECK:** -- Replace old `IsResultObsolete` logic with proper cancellation token checking so stale requests are cancelled rather than run to completion and discarded. -- Avoid returning stale type checking results; prefer fresh results under a timeout or cancellation. -- Use existing `IncrementalBuilder` background project check events (`ProjectChecked`) to obtain full project results rather than triggering redundant checks. -- Time-sliced foreground checking must use a consistent error logger across all time slices to avoid missing errors. -- Verify that project setup handles the case of a clean solution or unrestored packages without silently dropping references. -- Ensure parse/check caches are properly populated with correct thread annotations. +- Avoid returning stale type checking results; prefer fresh results with cancellation support. +- Verify that caching prevents redundant checks and that cache invalidation is correct. +- Verify that project setup handles clean solutions or unrestored packages without silently dropping references. +- Ensure error loggers are consistent across all checking phases to avoid missing errors. **Severity:** Stale results causing glitches → **critical**. Missed invalidation → **high**. 
Missing cache verification → **medium**. @@ -342,14 +302,10 @@ Incremental checking must invalidate stale results correctly. Stale data causes AST nodes must accurately represent source code. Parser changes are high-risk because they affect every downstream phase. **CHECK:** -- Update all pattern matches on `SynBinding` in tree-walking code when modifying its shape. -- Remove default wildcard patterns in type walkers to catch missing cases at compile time. -- Handle all cases of discriminated unions including spreads in tree walkers. +- Update all pattern matches in tree-walking code when modifying AST node shapes. +- Remove default wildcard patterns in discriminated union walkers to catch missing cases at compile time. - Gate parser changes behind the appropriate language version. -- Add new parser cases to existing rules that share the same prefix. -- Visit spread types in `FileContentMapping` to build correct graph edges. -- Assess compound expression compatibility when introducing new syntax to avoid breaking existing code. -- Remove unused AST nodes created for experimental features that were not pursued. +- Assess expression compatibility when introducing new syntax to avoid breaking existing code. **Severity:** Incorrect AST node → **critical**. Missing walker case → **high**. Ungated parser change → **high**. @@ -381,11 +337,8 @@ Error and warning messages are the compiler's user interface. They must be preci **CHECK:** - Structure error messages as: error statement, then analysis, then actionable advice. -- Format suggestion messages as single-line when `--vserrors` is enabled. -- Emit a warning rather than silently ignoring unsupported default parameter values. -- Eagerly format diagnostics at production time to prevent parameter leakage in parallel checking. -- Only rename error identifiers to reflect actual error message content. -- Enable unused variable warnings by default in projects to catch common bugs. 
+- Emit a warning rather than silently ignoring unsupported values or options. +- Eagerly format diagnostics at production time to prevent parameter leakage across threads. **Severity:** Misleading diagnostic → **high**. Inconsistent format → **medium**. Wording improvement → **low**. @@ -399,9 +352,7 @@ Debug stepping, breakpoints, and locals display must work correctly. Debug exper **CHECK:** - Ensure debug points and sequence points enable correct stepping behavior. -- Manually verify debug stepping for loops, while loops, task code, list/array expressions, and sequence expressions. -- Ensure all required FSharp.Core target framework builds are produced to avoid `FileNotFoundException` in VS debugging. -- Plan ahead and make sure basic writers can emit debug information before plumbing it through from the type checker. +- Verify debug stepping for loops, task code, and sequence expressions when changing control flow codegen. - Solve debuggability problems generally through techniques that also apply to user-written code. **Severity:** Breakpoint regression → **critical**. Debug stepping regression → **high**. Missing manual verification → **medium**. @@ -416,13 +367,9 @@ New features must be gated behind language version checks. Breaking changes requ **CHECK:** - Gate new language features behind a `LanguageFeature` flag even if shipped as bug fixes. -- Ship experimental features off-by-default and discuss enablement strategy with stakeholders. +- Ship experimental features off-by-default. - Factor out cleanup changes separately from feature enablement. -- Assess compound expression compatibility when introducing new syntax. - Reject changes that alter the C#/.NET visible assembly surface as breaking changes. -- Create an RFC retroactively for breaking changes that were merged without one. -- Clarify whether IDE features are gated behind specific warning flags or are always active. 
-- Breaking behavior changes to resource naming must be gated behind an opt-in property. **Severity:** Ungated breaking change → **critical**. Missing RFC → **high**. Bundled cleanup+feature → **medium**. From be10cbecf99a0626bbbcc0db90331225dd9d86cc Mon Sep 17 00:00:00 2001 From: Tomas Grosup Date: Wed, 18 Mar 2026 15:02:59 +0100 Subject: [PATCH 10/18] Deduplicate across agent, skill, and instructions - Agent: Replace 15-row folder hotspot table with pointer to skill (skill already has the dimension selection routing table) - Skill: Trim self-review checklist from 15 to 7 items + pointer to agent Remove 'ordered by blocking frequency' provenance leak - Instructions: Remove Test Discipline section (copilot-instructions.md already covers testing), remove Code Generation section (overfitted BindUnitVars/Expr.Lambda rules from single PRs) 45 -> 32 lines Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --- .github/agents/expert-reviewer.md | 18 +------------- .../instructions/ExpertReview.instructions.md | 21 ++++------------ .github/skills/expert-review/SKILL.md | 24 +++++++------------ 3 files changed, 13 insertions(+), 50 deletions(-) diff --git a/.github/agents/expert-reviewer.md b/.github/agents/expert-reviewer.md index 05e503b2f0..0b0ffe8fbd 100644 --- a/.github/agents/expert-reviewer.md +++ b/.github/agents/expert-reviewer.md @@ -489,20 +489,4 @@ Execute review in five waves, each building on the previous. 
## Folder Hotspot Mapping -| Directory | Primary Dimensions | Secondary Dimensions | -|---|---|---| -| `src/Compiler/Checking/` | Type System (8), Overload Resolution (11), Struct Awareness (9) | Feature Gating (19), Diagnostics (17) | -| `src/Compiler/CodeGen/` | IL Emission (5), Debug Experience (18) | Struct Awareness (9), Test Coverage (1) | -| `src/Compiler/SyntaxTree/` | Parser Integrity (15), Typed Tree (8) | Feature Gating (19), Debug Experience (18) | -| `src/Compiler/TypedTree/` | Binary Compatibility (12), Type System (8), Typed Tree (8) | Backward Compat (3), Memory Footprint | -| `src/Compiler/Optimize/` | IL Emission (5), Optimization Correctness (6) | Test Coverage (1) | -| `src/Compiler/AbstractIL/` | IL Emission (5) | Backward Compat (3), Memory Footprint | -| `src/Compiler/Driver/` | Build Infrastructure, Incremental Checking (14) | Compiler Performance, Cancellation (13) | -| `src/Compiler/Service/` | FCS API Surface (7), IDE Performance (10), Concurrency (13) | Test Coverage (1), Incremental Checking (14) | -| `src/Compiler/Facilities/` | Feature Gating (19), Concurrency (13) | Naming | -| `src/Compiler/FSComp.txt` | Diagnostic Quality (17) | — | -| `src/FSharp.Core/` | FSharp.Core Stability (2), XML Doc, Backward Compat (3) | RFC Process (4), Test Coverage (1) | -| `src/FSharp.Compiler.LanguageServer/` | IDE Performance (10), Editor UX | Concurrency (13) | -| `vsintegration/` | IDE Responsiveness (10), Memory Footprint | Build Infrastructure, Cross-Platform | -| `tests/` | Test Coverage (1), Compiler Performance | Backward Compat (3), Cross-Platform | -| `eng/`, `setup/` | Build Infrastructure | Cross-Platform | +See the `reviewing-compiler-prs` skill for the dimension selection table mapping files → dimensions. 
diff --git a/.github/instructions/ExpertReview.instructions.md b/.github/instructions/ExpertReview.instructions.md index a7464b801f..94ad29f5db 100644 --- a/.github/instructions/ExpertReview.instructions.md +++ b/.github/instructions/ExpertReview.instructions.md @@ -5,18 +5,11 @@ applyTo: # Compiler Review Rules -## Test discipline - -- Add tests for every behavioral change before merging — missing coverage is the most common review blocker. -- Explain all new errors in test baselines and confirm each is expected behavior, not a regression. -- Test on IL-defined members, not just F#-defined ones. Check `UseNullAsTrueValue` edge cases. - ## Type safety -- Use `tryTcrefOfAppTy` or the `AppTy` active pattern instead of direct `TType` pattern matches — direct matches miss erased types. -- Always call `stripTyEqns` before pattern matching on types. Never match `TType_app` directly. +- Always call `stripTyEqns` before pattern matching on types. Use `AppTy` active pattern, not `TType_app` directly. - Remove default wildcard patterns in discriminated union matches so the compiler catches missing cases. -- Raise internal compiler errors for unexpected type forms (`TType_ucase`) rather than returning defaults. +- Raise internal compiler errors for unexpected type forms rather than returning defaults. ## Feature gating @@ -24,13 +17,7 @@ applyTo: - Factor cleanup changes into separate commits from feature enablement. - Major language changes require an RFC before implementation. -## Code generation - -- Strip debug points when matching on `Expr.Lambda` during code generation to prevent IL stack corruption. -- Verify no tail-call behavior changes — check IL diffs before and after. -- After stripping a lambda for a delegate, bind discarded unit parameters using `BindUnitVars`. - -## Backward compatibility +## Binary compatibility - Codegen changes that depend on new FSharp.Core functions must guard against older FSharp.Core versions. 
- Do not alter the C#/.NET visible assembly surface without treating it as a breaking change. @@ -42,4 +29,4 @@ applyTo: ## Performance -- Performance claims require `--times` output, benchmarks, or profiler evidence. Do not inline large functions without data. +- Performance claims require `--times` output, benchmarks, or profiler evidence. diff --git a/.github/skills/expert-review/SKILL.md b/.github/skills/expert-review/SKILL.md index 47ee40f9d6..8f2458b96a 100644 --- a/.github/skills/expert-review/SKILL.md +++ b/.github/skills/expert-review/SKILL.md @@ -55,20 +55,12 @@ Dispatch one agent per selected dimension. For high-confidence reviews, assess e ## Self-Review Checklist -Ordered by blocking frequency: - -1. [ ] Every behavioral change has a test — **most common review blocker** -2. [ ] FSharp.Core changes maintain strict binary compatibility -3. [ ] No unintended public API surface changes (FCS or FSharp.Core) +1. [ ] Every behavioral change has a test +2. [ ] FSharp.Core changes maintain binary compatibility +3. [ ] No unintended public API surface changes 4. [ ] New language features have a `LanguageFeature` guard and RFC -5. [ ] Overload resolution and type inference behavior preserved -6. [ ] No raw `TType_*` pattern matching without `stripTyEqns` -7. [ ] Cancellation tokens threaded through async operations -8. [ ] IL emission verified in both Debug and Release -9. [ ] Struct semantics respected (no unnecessary copies) -10. [ ] Cleanup changes separate from feature enablement -11. [ ] Test baselines updated with explanations -12. [ ] Tests pass in both Debug and Release -13. [ ] Error messages follow: statement → analysis → advice -14. [ ] No catch-all exception handlers; `OperationCanceledException` handled specially -15. [ ] Debug stepping manually verified for affected control flow +5. [ ] No raw `TType_*` matching without `stripTyEqns` +6. [ ] Cancellation tokens threaded through async operations +7. 
[ ] Cleanup changes separate from feature enablement + +Full dimension CHECK rules are in the `expert-reviewer` agent. From 8b8c0370d39c6fdb34ea206d850858d74d6ce2e8 Mon Sep 17 00:00:00 2001 From: Tomas Grosup Date: Wed, 18 Mar 2026 15:14:59 +0100 Subject: [PATCH 11/18] Improve clarity: add 'why' context, resolve contradictions, generalize rules MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Confusion audit found 68% A-grade (clear), 29% B (needs context), 3% C (overfitted). Applied fixes: - Add 'why' rationale to 15 items (stripTyEqns, InlineIfLambda, FCS dependency minimization, queue stress, method hiding) - Resolve Dim 16 contradiction: catch-all OK at API boundaries (log, don't silence), forbidden inside compiler - Generalize 3 C-grade items: Expr.Lambda→expression nodes, tuple-match →expression restructuring, anon records→any metadata emission - Rewrite 3 factoid items as actionable rules (CE semantics, performance comparison, type provider framework compat) - Expand jargon: FCS, stream B tag-byte, closure capturing Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --- .github/agents/expert-reviewer.md | 59 +++++++++++++++---------------- 1 file changed, 28 insertions(+), 31 deletions(-) diff --git a/.github/agents/expert-reviewer.md b/.github/agents/expert-reviewer.md index 0b0ffe8fbd..6cf3138fe1 100644 --- a/.github/agents/expert-reviewer.md +++ b/.github/agents/expert-reviewer.md @@ -27,17 +27,17 @@ Evaluates F# compiler changes across 19 dimensions. Use the `reviewing-compiler- Push back against these recurring patterns: 1. **Reformatting + logic in one PR** — Separate formatting into its own PR. Mixed diffs obscure logic changes and block review. -2. **Catch-all exception handlers** — Do not add catch-all exception handlers. Handle `OperationCanceledException` specially; never swallow it. -3. 
**Internal type leakage via InternalsVisibleTo** — Internal compiler data structures must not leak through the FCS API boundary. +2. **Catch-all exception handlers** — Do not add catch-all exception handlers. Handle `OperationCanceledException` specially; never swallow it. (Exception: top-level language service entry points should catch all to prevent IDE crashes — but log, don't silence.) +3. **Internal type leakage** — Internal compiler data structures must not leak through the FCS (F# Compiler Service) public API. Leakage creates permanent API commitments from implementation details. 4. **Performance claims without data** — Require benchmarks, `--times` output, or profiler evidence for any performance claim. -5. **Raw TType_* pattern matching** — Never match on `TType_*` without first calling `stripTyEqns`. Use `AppTy` active pattern, not `TType_app` directly. -6. **Verbose inline logging** — Use declarative tracing; inline ETW calls degrade code readability. +5. **Raw TType_* pattern matching** — Never match on `TType_*` without first calling `stripTyEqns` (which resolves type abbreviations and equations). Skipping it causes missed matches on aliased/abbreviated types. Use `AppTy` active pattern instead of `TType_app`. +6. **Verbose inline logging** — Prefer structured/declarative tracing over inline logging calls that clutter the code. 7. **Conditional serialization writes** — Writes gated on compiler flags (LangVersion, feature toggles, TFM) produce misaligned byte streams for cross-version metadata. The byte count must depend only on stream-encoded data. -8. **Stale type-checking results** — Avoid returning stale results; they cause timing-dependent IntelliSense glitches. Prefer fresh results under a timeout or cancellation. -9. **Global mutable state in FCS** — Pass dependencies explicitly as parameters to enable proper snapshot-based processing. +8. 
**Stale type-checking results** — Avoid returning stale results; they cause timing-dependent IntelliSense glitches. Prefer fresh results with cancellation support. +9. **Global mutable state** — Pass dependencies explicitly as parameters rather than using module-level mutable globals, to enable concurrent and snapshot-based processing. 10. **Missing XML doc comments** — Every new top-level function, module, and type definition must have a `///` comment. -11. **Shell script wrappers** — Adding batch files to encapsulate process invocations obscures things. Prefer MSBuild targets. -12. **Large closures capturing unnecessary data** — Verify that no additional data is being captured by long-lived objects. Use `ConditionalWeakTable` for caches keyed by GC-collected objects. +11. **Shell script wrappers** — Prefer MSBuild targets over batch/shell scripts — scripts obscure build logic and break cross-platform. +12. **Large closures capturing unnecessary data** — Verify that long-lived closures don't capture more data than needed, causing memory leaks. 13. **Returning `Unchecked.defaultof<_>` to swallow exceptions** — This hides the root cause. Investigate and fix exception propagation failures. 14. **Band-aid ad-hoc patches** — Flag `if condition then specialCase else normalPath` that patches a consumer rather than fixing the source. @@ -78,10 +78,10 @@ FSharp.Core is the one assembly every F# program references. Changes here have o - Maintain strict backward binary compatibility. No public API removals or signature changes. - Verify compilation order constraints — FSharp.Core has strict file ordering requirements. - Add unit tests to `FSharp.Core.Tests` for every new or changed function. -- Minimize FCS's FSharp.Core dependency — do not add new references from the compiler to FSharp.Core unnecessarily. +- Minimize FCS's FSharp.Core dependency — the compiler should be hostable with different FSharp.Core versions; unnecessary coupling breaks this. 
- XML doc comments are mandatory for all public APIs in FSharp.Core. - New API additions require an RFC. -- Apply `InlineIfLambda` to inlined functions taking a lambda applied only once. +- Apply `InlineIfLambda` to inlined functions taking a lambda applied only once — this enables the lambda to be inlined at the call site, eliminating closure allocation. **Severity:** Binary compat break in FSharp.Core → **critical**. Missing tests → **high**. Missing XML docs → **medium**. @@ -132,7 +132,7 @@ Code generation must produce correct, verifiable IL. Wrong IL produces silent ru - Verify no changes in tail-calling behavior — check IL diffs before and after. - Test code changes with optimizations both enabled and disabled. - Solve debuggability or performance problems generally through techniques that also apply to user-written code, not special-cased optimizations. -- Strip debug points when matching on `Expr.Lambda` during code generation to prevent IL stack corruption. +- When matching on expression nodes during codegen, handle debug-point wrapper nodes to prevent IL stack corruption. **Severity:** Incorrect IL → **critical**. Debug stepping regression → **high**. Missing IL test → **medium**. @@ -147,9 +147,8 @@ Optimizer changes must preserve program semantics. Inlining and tail-call change **CHECK:** - Verify optimizations preserve program semantics in all cases. - Tail call analysis must correctly handle all cases including mutual recursion, not just simple self-recursion. -- Prefer general approaches like improved inlining that cover many cases at once over hand-implementing function-by-function optimizations. -- Prefer a set of orthogonal decisions/optimizations that work for all code, including user-defined code, rather than just library functions. -- Verify that tuple-and-match reformulations produce code at least as good as before. 
+- Prefer general approaches (e.g., improved inlining) that cover many cases at once over hand-implementing function-by-function optimizations. +- Verify that expression restructuring optimizations don't regress code quality — compare IL before and after. - Require performance evidence for optimization changes. **Severity:** Semantic-altering optimization → **critical**. Tail-call regression → **high**. Missing evidence → **medium**. @@ -181,11 +180,9 @@ The FCS public API is a permanent commitment. Internal types must never leak. Type checking and inference must be sound. Subtle bugs in constraint solving, overload resolution, or scope handling produce incorrect programs. **CHECK:** -- Always call `stripTyEqns`/`stripTyEqnsA` before pattern matching on types. Use `AppTy` active pattern, not `TType_app` directly. -- Use `tryTcrefOfAppTy` or typed active patterns instead of direct `TType` pattern matches. -- Raise internal compiler errors for unexpected type forms rather than returning defaults. -- Use precise type predicates when matching on the typed tree. -- Use property accessors on IL info types rather than explicit pattern matching. +- Always call `stripTyEqns`/`stripTyEqnsA` before pattern matching on types — this resolves type abbreviations and inference equations. Without it, aliased types won't match and code silently takes the wrong branch. Use `AppTy` active pattern instead of matching `TType_app` directly. +- Raise internal compiler errors for unexpected type forms rather than returning defaults — silent defaults hide bugs. +- Use property accessors on IL metadata types (e.g., `ILMethodDef` properties) rather than deconstructing them directly — the internal representation may change. **Severity:** Type system unsoundness → **critical**. Incorrect inference in edge cases → **high**. 
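The stripping rule above can be illustrated with a toy sketch. This is a Python analogy only — `TTypeApp`, `TTypeAbbrev`, and `strip_abbrevs` are hypothetical stand-ins for the compiler's internal type representation and `stripTyEqns`, not real compiler APIs — but it shows why matching without stripping silently takes the wrong branch on aliased types:

```python
from dataclasses import dataclass

@dataclass
class TTypeApp:
    # A named type applied to arguments, e.g. list<int> (illustrative only)
    name: str
    args: list

@dataclass
class TTypeAbbrev:
    # A type abbreviation, e.g. `type MyList = list<int>` (illustrative only)
    name: str
    expansion: object

def strip_abbrevs(ty):
    """Resolve abbreviations until a concrete form is reached (analogy for stripTyEqns)."""
    while isinstance(ty, TTypeAbbrev):
        ty = ty.expansion
    return ty

def is_list_naive(ty):
    # BUG: matches the raw node, so an abbreviation like MyList falls through
    return isinstance(ty, TTypeApp) and ty.name == "list"

def is_list(ty):
    ty = strip_abbrevs(ty)  # always strip before matching
    return isinstance(ty, TTypeApp) and ty.name == "list"

concrete = TTypeApp("list", ["int"])
aliased = TTypeAbbrev("MyList", concrete)
```

The naive predicate answers `False` for the aliased type even though it denotes exactly the same list type — the failure mode the CHECK item warns about.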
@@ -216,7 +213,7 @@ The compiler service must respond to editor keystrokes without unnecessary recom **CHECK:** - Use fully async code in the language service; avoid unnecessary `Async.RunSynchronously`. - Verify changes do not trigger endless project rechecks. -- Evaluate the queue stress impact of every new FCS request type. +- Evaluate the queue stress impact of every new FCS request type — each request blocks the service queue while running, so expensive requests delay all other IDE features. - Caching must prevent duplicate work per project. - Test IDE changes on large solutions before merging. @@ -234,7 +231,7 @@ Overload resolution is one of the most complex and specification-sensitive areas - Ensure overload resolution follows the language specification precisely. - Verify that language features work correctly with truly overloaded method sets, not just single-overload defaults. - Changes that loosen overload resolution rules constitute language changes and need careful analysis. -- Apply method hiding filters consistently in both normal resolution and SRTP constraint solving paths. +- Apply method hiding filters (removing base-class methods overridden by derived-class methods) consistently in both normal resolution and SRTP constraint solving paths. - For complex SRTP corner cases, changes must pin existing behavior with tests. **Severity:** Overload resolution regression → **critical**. SRTP behavior change → **high**. Missing test → **medium**. @@ -249,11 +246,11 @@ The pickled metadata format is a cross-version contract. DLLs compiled with any **CHECK:** - Never remove, reorder, or reinterpret existing serialized data fields. -- Ensure new data is invisible to old readers (stream B with tag-byte detection). +- Ensure new data is invisible to old readers (added to stream B with tag-byte detection — old readers get default `0` past end-of-stream). - Exercise old-compiler-reads-new-output and new-compiler-reads-old-output for any metadata change. 
- Verify the byte count does not depend on compiler configuration (feature flags, LangVersion, TFM) — only on stream-encoded data. -- Add cross-version compatibility tests when changing anonymous record or constraint emission. -- Before modifying the TAST or pickle format for a new feature, confirm whether the attribute can be handled entirely via existing IL metadata without changing internal representations. +- Add cross-version compatibility tests for any change to metadata emission. +- Before modifying the typed tree or pickle format, check whether the feature can be expressed through existing IL metadata without changing internal representations. **Severity:** Any metadata format breakage → **critical**. Missing compat test → **high**. @@ -322,8 +319,8 @@ Exception handling must be precise. Catch-all handlers and swallowed exceptions - Never swallow exceptions silently; handle `OperationCanceledException` specially. - Do not suppress task completion by silently ignoring `TrySetException` failures. - Returning `Unchecked.defaultof<_>` to swallow exceptions is dangerous — investigate and fix the root cause. -- Protect language service entry points against all exceptions to prevent IDE crashes. -- Do not add catch-all exception handlers. +- At language service API boundaries (top-level entry points called by the IDE), catch all exceptions to prevent IDE crashes — but log them, don't silence them. +- Inside the compiler, do not add catch-all exception handlers — they hide bugs. **Severity:** Swallowed cancellation → **critical**. Catch-all handler → **high**. Missing error → **medium**. @@ -382,7 +379,7 @@ New features must be gated behind language version checks. Breaking changes requ #### Compiler Performance Measurement - Require `--times` output, benchmarks, or profiler data for performance claims. -- Acknowledge that F# compiler service operations are intrinsically more expensive than C# (larger work chunks, no parallelism, more expensive type inference). 
+- Do not compare F# build times directly to C# (Roslyn) — F# type inference and checking are structurally more expensive. Compare to previous F# baselines instead. - Measure and report build time impact. #### Memory Footprint Reduction @@ -404,13 +401,13 @@ New features must be gated behind language version checks. Breaking changes requ #### Computation Expression Semantics -- Tail call behavior in seq and async builders depends on builder implementation, not IL tail calls. -- Prefer orthogonal decisions that work for all code, including user-defined CEs. +- Test deep recursion in CEs (seq, async, task) — tail call behavior depends on the builder implementation, not IL tail call instructions. +- Prefer designs that work for all CEs including user-defined builders, not just built-in ones. #### Type Provider Robustness -- Handle type provider failures gracefully without crashing the compiler. -- Design-time hosting and framework compatibility must be considered. +- Handle type provider failures gracefully without crashing the compiler — type providers run user code at compile time. +- Test type provider scenarios across target frameworks (desktop vs CoreCLR). 
#### Signature File Discipline From 2c1af569cd5ae867b1432790a7b487e2f4dac26a Mon Sep 17 00:00:00 2001 From: Tomas Grosup Date: Wed, 18 Mar 2026 15:19:57 +0100 Subject: [PATCH 12/18] Update extraction pipeline with lessons from deep run MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Key improvements to the pipeline methodology: Data collection: - MUST collect ALL activity, not sample (sampling produced 17x less data) - Include closed/rejected PRs (strongest review opinions) - Include own PRs (PR descriptions = design intent) - Search both commenter: AND author: queries - Batch size 15 items per agent (not 200 — agents fail on large batches) Anti-overfitting: - Normalize by PR count not comment count (1 PR = 1 vote) - Classify into generalizable principles, not PR-specific advice - Cross-check rules against CI config — don't reproduce what CI enforces - Validate technical terms against current codebase (grep for 0 matches) Quality gates: - Add Phase 5.6: Confusion audit (grade A/B/C/D on every CHECK item) - Target >=80% grade A, 0% grade C/D - Re-dispatch failed multi-batch agents (validate file sizes >500 bytes) 7 new lessons learned (#8-14) from this session's experience. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --- .github/agents/extraction-pipeline.md | 82 +++++++++++++++++++++------ 1 file changed, 64 insertions(+), 18 deletions(-) diff --git a/.github/agents/extraction-pipeline.md b/.github/agents/extraction-pipeline.md index e8e06b23c2..dedd18c1ea 100644 --- a/.github/agents/extraction-pipeline.md +++ b/.github/agents/extraction-pipeline.md @@ -50,16 +50,24 @@ If `reference_repos` are specified and the pipeline needs to search their code ( ## Phase 1: Data Collection +**Completeness is critical.** Do not sample — collect ALL activity. A reviewer who leaves one precise comment on a well-written PR teaches as much as 50 comments on a messy one. 
Sampling biases toward noisy PRs and misses the signal in clean approvals. + ### 1.1 Index all activity -Search each repo for issues, PRs, and discussions where `username` participated. GitHub search returns max 1000 results per query — split by date ranges to capture everything. +Search each repo for issues, PRs, and discussions where `username` participated. **Include ALL states** — open, closed, merged, and rejected PRs all carry learning potential. Rejected PRs often contain the strongest review opinions. + +GitHub search returns max 1000 results per query — split by 1-year date ranges to capture everything. For high-volume users, split by 6 months. -For each repo: +For each repo, run FOUR searches (not two — capture both commenter and author roles): ``` -search_issues: commenter:{username} created:{year_start}..{year_end} search_pull_requests: commenter:{username} created:{year_start}..{year_end} +search_pull_requests: author:{username} created:{year_start}..{year_end} +search_issues: commenter:{username} created:{year_start}..{year_end} +search_issues: author:{username} created:{year_start}..{year_end} ``` +**Own PRs are first-class data.** When the user is the PR author, their PR description reveals design intent, priorities, and rationale that never appears in review comments. Tag each item with the user's role: `reviewer`, `author`, or `both`. + For discussions (if the repo uses them), use the GitHub GraphQL API: ```graphql query { @@ -69,22 +77,25 @@ query { } ``` -Store in SQLite (`gh_activity` table): repo, type (issue/pr/discussion), number, title, state, created_at, updated_at, labels, url, author. +Store in SQLite (`gh_activity` table): repo, type (issue/pr/discussion), number, title, state, created_at, updated_at, labels, url, author, user_role. -Parallelize across repos and date ranges. Use sub-agents for large volumes. +Parallelize across repos and date ranges. Use sub-agents for large volumes. Paginate ALL results — do not stop at page 1. 
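The four-search fan-out over date windows described above can be sketched as follows. The `index_queries` helper and its output strings are hypothetical illustrations of the query generation, not a real GitHub client API; only the `commenter:`/`author:`/`created:` qualifiers follow actual GitHub search syntax:

```python
def index_queries(repo, username, first_year, last_year):
    """Generate all Phase 1.1 searches: one window per year, four searches per window."""
    queries = []
    for year in range(first_year, last_year + 1):
        window = f"created:{year}-01-01..{year}-12-31"
        # Four searches per window: PRs and issues, as commenter and as author
        for kind in ("search_pull_requests", "search_issues"):
            for role in ("commenter", "author"):
                queries.append(f"{kind}: repo:{repo} {role}:{username} {window}")
    return queries

# Two years of history -> 2 windows x 4 searches = 8 queries
qs = index_queries("dotnet/fsharp", "example-user", 2016, 2017)
```

Each query stays well under the 1000-result cap as long as the window is small enough; if a window still overflows, halve it to six months as noted above.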
### 1.2 Collect actual comments -For each indexed item, fetch the user's actual comments: -- **Issues**: `issue_read` → `get_comments` → filter to username -- **PRs — general comments**: `pull_request_read` → `get_comments` → filter to username +For EVERY indexed item (not a sample), fetch the user's actual words. **All comment types matter:** + +- **PR descriptions** (when user is author): `pull_request_read` → `get` → save body. These reveal design intent and priorities — often the most valuable content. +- **PRs — general comments**: `pull_request_read` → `get_comments` → filter to username. This is the primary comment channel for many reviewers. - **PRs — review comments** (code-level, with file path + diff hunk): `pull_request_read` → `get_review_comments` → filter to username -- **PRs — reviews** (approval/request-changes with summary body): `pull_request_read` → `get_reviews` → filter to username. These carry the reviewer's top-level verdict and summary — often the most opinionated content. -- **Discussions**: Use GraphQL to fetch comment nodes filtered to username. Discussion comments often contain design rationale and architectural decisions. +- **PRs — reviews** (approval/request-changes with summary body): `pull_request_read` → `get_reviews` → filter to username. These carry the reviewer's top-level verdict and summary — often the most opinionated content. Skip reviews with empty bodies. +- **Issues — body** (when user is author): save the issue body as a comment. +- **Issues — comments**: `issue_read` → `get_comments` → filter to username +- **Discussions**: Use GraphQL to fetch comment nodes filtered to username. -Store in SQLite (`user_comments` table): comment_id, activity_id, repo, comment_type (issue_comment, review_comment, pr_comment, review_summary, discussion_comment), body, created_at, file_path, diff_hunk, url. 
+Store in SQLite (`user_comments` table): comment_id, activity_id, repo, comment_type (pr_description, issue_description, issue_comment, review_comment, pr_comment, review, discussion_comment), body, created_at, file_path, diff_hunk, url. -This is the most API-intensive phase. Batch into sub-agents by date range. Handle rate limits with retry. +This is the most API-intensive phase. Batch into sub-agents of ~15 items each — small batches ensure each agent completes reliably. Parallelize aggressively. Handle rate limits with retry. ### 1.3 Collect PR context @@ -127,22 +138,26 @@ Store as a feature area reference table in SQLite. For each collected comment, classify using a sub-agent (Opus). **Do not use a hardcoded category list** — derive categories from the data: -1. **Bootstrap pass**: Take a random sample of ~200 comments. Ask a sub-agent to read them and propose a category taxonomy that fits this specific reviewer and codebase. The agent should identify recurring themes, name them, and define each in one sentence. Expect 15–40 categories to emerge. +1. **Bootstrap pass**: Take a random sample of ~300 comments from diverse PRs (not just the PRs with the most comments). Ask a sub-agent to read them and propose a category taxonomy. The agent should identify recurring themes, name them, and define each in one sentence. Expect 15–40 categories to emerge. -2. **Classification pass**: Using the derived taxonomy, classify all comments in batches (~200 per sub-agent). For each comment extract: +2. **Classification pass**: Using the derived taxonomy, classify all comments in batches (~15 PR packets per sub-agent, where each packet includes all comments on that PR). 
For each comment extract: - **Categories** (one or more, from the derived taxonomy) - **Feature area**: map to the landing repo's code structure (from 2.1) - **File/folder**: which code path does this apply to - - **Sentiment**: approval, concern, suggestion, question, blocking - **Severity**: trivial, minor, moderate, major, critical - - **Focus point**: what specifically is being addressed - - **Derived rule**: actionable rule extracted from the comment + - **Derived rule**: actionable rule extracted from the comment, phrased as a generalizable principle — not tied to the specific PR 3. **Taxonomy refinement**: After the first full pass, review category distribution. Merge categories with <5 occurrences into broader ones. Split categories with >500 occurrences if they contain distinct sub-themes. Re-classify affected comments. +**Anti-overfitting rules for classification:** +- **Normalize by PR, not by comment.** A PR with 50 comments gets weight=1, same as a PR with 1 comment. Count how many PRs a rule appears in, not how many comments mention it. The reviewer saying something once on a clean PR means the same as repeating it 10 times on a messy one. +- **Generalize, don't transcribe.** The derived rule must be applicable to a future PR the classifier has never seen. "Always call stripTyEqns before matching types" is good. "Call stripTyEqns on line 47 of CheckPatterns.fs" is overfitted. +- **Distinguish reviewer opinion from CI enforcement.** If the reviewer says "please add tests", that's a review rule. If the reviewer says "run tests on Linux and Windows", that might just mean "CI should cover this" — not a rule for human reviewers. Check: does the repo's CI already do this? If yes, don't encode it as a review rule. +- **Distinguish design guidance from implementation instruction.** "Gate features behind LanguageFeature flags" is design guidance (always applicable). 
"Use BindUnitVars after stripping the lambda" is an implementation instruction for a specific code path (only applicable when touching that code). + Store in SQLite (`comment_analysis` table). -Process in batches. Use sub-agents — each handles ~200 comments. Run in parallel. +Process in batches. Use sub-agents — each handles ~15 PR packets with full context. Run in parallel. ### 2.3 Clustering @@ -364,6 +379,23 @@ Verify the three layers work together: - No body overlap between instructions and AGENTS.md/copilot-instructions.md - Agent doesn't repeat AGENTS.md content +### 5.6 Overfitting verification + +Run a confusion audit: give a fresh-context sub-agent (with NO knowledge of the source data) the generated agent file and ask it to grade every CHECK item on a confusion scale: + +- **A (Clear)**: Any competent developer could apply this rule to a random future PR. +- **B (Needs context)**: Rule is correct but needs a "why" or "when" — add one sentence of rationale. +- **C (Overfitted)**: Rule is too specific to one historical scenario — generalize or remove. +- **D (Confusing)**: Rule is ambiguous or contradictory — rewrite. + +Target: ≥80% grade A, 0% grade C/D. Fix all C/D items. Add "why" to all B items. + +Also check: +- No rules that reproduce what CI already enforces (e.g., "run tests on Linux" when CI covers all platforms) +- No rules referencing specific function names or line numbers unless those functions are long-lived stable APIs +- Every CHECK item is phrased as a generalizable principle, not a transcription of one PR's feedback +- Dimension frequency is counted by PRs, not by comments — a PR with 50 comments counts the same as one with 1 comment + --- ## Lessons Learned (encoded in this process) @@ -383,3 +415,17 @@ Failure modes observed during development. The process above accounts for them, 6. **Separate reference files get skipped**: Content in a separate file was not reliably read by the model. 
The fix: inline critical content in the agent file — it's loaded on invocation, guaranteed to be read. 7. **Description is everything for discovery**: The YAML `description` field is how the model picks from 100+ skills. Invest in keyword-rich, third-person, specific trigger descriptions. + +8. **Sampling bias toward noisy PRs**: The first run sampled ~300 PRs from ~5000 and got 469 comments — biased toward high-comment PRs. The deep run collected ALL ~8000 comments from ALL PRs. The difference was 17x more data and qualitatively different dimensions emerged. Never sample. + +9. **Comment-count overfitting**: A PR with 50 review comments dominated the taxonomy, producing CHECK items specific to that one PR. The fix: normalize by PR count, not comment count. One PR = one vote regardless of how many comments it has. + +10. **Reproducing CI as review rules**: The classifier extracted "run tests on Linux and Windows" as a review rule — but CI already does this. Reviewers mention CI coverage in passing; it's not a rule for future reviewers. The fix: cross-reference with the repo's CI config and exclude rules that CI already enforces. + +11. **Overfitted CHECK items from single PRs**: Rules like "cross-reference GenFieldInit in IlxGen.fs when modifying infos.fs" are useless for a reviewer who isn't modifying that specific file. The fix: the confusion audit (Phase 5.6) catches these. Every CHECK item must be applicable to a future PR about a feature that doesn't exist yet. + +12. **Obsolete terminology**: The classifier picked up historical terms ("reactor thread") that no longer exist in the codebase. The fix: cross-validate all technical terms against the current codebase with grep. Zero matches = remove the term. + +13. **Missing own-PR data**: The first run only searched `commenter:username`, missing PRs where the user was the author. PR descriptions by the target user are often the richest source of design philosophy. 
The fix: always search both `commenter:` AND `author:`. + +14. **Multi-batch agent failures**: Sub-agents given 9+ batches of 15 PRs often gave up partway or wrote placeholder files. The fix: keep batch assignments to 4-5 batches per agent, validate output file sizes (>500 bytes), and re-dispatch failures. From 76d1e16327d46bf58705a7ba04c0ec05b74013e6 Mon Sep 17 00:00:00 2001 From: Tomas Grosup Date: Wed, 18 Mar 2026 15:27:53 +0100 Subject: [PATCH 13/18] Inline all lessons learned into pipeline directives, remove appendix A fresh-context agent won't read an appendix. All 14 lessons are now direct instructions in the phase where they apply: - Scale Warning: batch size 15, validate output >500 bytes, re-dispatch - Phase 1.1: four searches (commenter+author), all PR states, paginate - Phase 1.2: PR descriptions as first-class data, 15 items per agent - Phase 1.4: grep technical terms against codebase, drop obsolete - Phase 2.2: normalize by PR not comment, generalize don't transcribe, distinguish CI enforcement from review rules - Wave 1: don't explain away issues, no hypotheticals, verify against PR branch not main - Phase 3.1: agent is single source of truth, inline critical content - Phase 3.4: skill points to agent, description is discovery key - Phase 5.6: confusion audit catches overfitted CHECK items 431 -> 404 lines. Appendix removed entirely. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --- .github/agents/extraction-pipeline.md | 61 ++++++++------------------- 1 file changed, 17 insertions(+), 44 deletions(-) diff --git a/.github/agents/extraction-pipeline.md b/.github/agents/extraction-pipeline.md index dedd18c1ea..71fb2b3d7e 100644 --- a/.github/agents/extraction-pipeline.md +++ b/.github/agents/extraction-pipeline.md @@ -13,8 +13,8 @@ This pipeline processes **thousands** of GitHub items (typically 3,000–10,000+ **Use sub-agents for everything.** The orchestrator manages SQLite state and dispatches work. 
Sub-agents do the heavy lifting: - **Collection**: one sub-agent per repo × date range chunk. Parallelize aggressively (6+ concurrent agents). -- **Comment fetching**: one sub-agent per ~200 items. This is the most API-call-intensive phase. -- **Semantic analysis**: one sub-agent per ~200 comments. Each classifies and extracts rules. +- **Comment fetching**: one sub-agent per ~15 items. Small batches ensure reliable completion — agents given 200+ items often give up partway or write placeholder files. +- **Semantic analysis**: one sub-agent per ~15 PR packets. Each classifies and extracts rules. - **Synthesis**: one sub-agent to read all analysis summaries (not raw data) and produce dimensions/principles. - **Artifact generation**: one sub-agent per artifact type (instructions, skills, agent). - **Validation/improvement**: one sub-agent per file or per concern. @@ -23,6 +23,8 @@ Store all intermediate results in **SQLite** (queryable) and **JSON backup files Use `model: "claude-opus-4.6"` for all sub-agents. Use background mode (`mode: "background"`) so agents run in parallel and the orchestrator is notified on completion. +**After each batch of sub-agents completes, validate output files.** Check that each file is >500 bytes and parseable JSON. Re-dispatch any that produced empty/placeholder files. Keep batch assignments to ≤5 batches per agent. + ## Inputs Collect these before starting. If any are missing, ask the user via the `ask_user` tool. @@ -107,14 +109,15 @@ This maps comments to code areas. ### 1.4 Cross-validate against current codebase -Collected data references files and folders as they existed at the time of the comment — migrations and refactorings happen. Before enrichment, reconcile all file paths: +Collected data references files, folders, and terminology as they existed at the time of the comment — migrations and refactorings happen. Reconcile before enrichment: +**File paths:** 1. 
Extract all unique file paths from collected comments (review comments have `file_path`, PR files have `path`). 2. For each path, check if it exists in the current repo (`Test-Path` or `glob`). 3. If missing, search for the filename in its current location (files get moved between folders). Update the path if found. 4. If the file was deleted entirely, keep the comment's essence (the rule it teaches) but drop the file pointer. The rule may still apply to successor code. -This prevents generating instructions that point at nonexistent files. +**Technical terms:** `grep` every technical term used in comments (function names, type names, concepts like "reactor thread") against the current codebase. Terms with zero matches are obsolete — do not use them in generated artifacts. ### 1.5 Backup @@ -201,9 +204,11 @@ Generate three artifact types: - Reference docs, don't reproduce them #### Review Agent (`.github/agents/{agent_name}.md`) -- Single source of truth for the review methodology -- Contains: overarching principles, all dimensions inline (with rules + CHECK flags), folder hotspot mapping, review workflow +- Single source of truth for the review methodology — all critical content must be inline in this file, not in separate reference files that might not be read +- Contains: overarching principles, all dimensions inline (with rules + CHECK flags), review workflow +- The folder→dimension routing table belongs in the skill (not duplicated here) - The review workflow is 5 waves (see below) +- Every CHECK item must be a generalizable principle applicable to future PRs about features that don't exist yet — not a transcription of one historical PR's feedback **Commit** after raw creation. 
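The path-reconciliation steps in §1.4 above can be sketched in Python. This is a rough illustration only: the `reconcile_paths` helper name and the exact way it touches the `user_comments` table are assumptions, not part of the pipeline spec.

```python
import sqlite3
from pathlib import Path

def reconcile_paths(db_path: str, repo_root: str) -> None:
    """Reconcile historical comment file paths against the current repo layout."""
    repo = Path(repo_root)
    # Index current files by filename so moved files can be relocated.
    by_name: dict[str, list[str]] = {}
    for p in repo.rglob("*"):
        if p.is_file():
            by_name.setdefault(p.name, []).append(p.relative_to(repo).as_posix())

    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        "SELECT DISTINCT file_path FROM user_comments WHERE file_path IS NOT NULL"
    ).fetchall()
    for (old_path,) in rows:
        if (repo / old_path).exists():
            continue  # step 2: path still valid, nothing to do
        candidates = by_name.get(Path(old_path).name, [])
        if len(candidates) == 1:
            # step 3: file moved, point at its current location
            conn.execute(
                "UPDATE user_comments SET file_path = ? WHERE file_path = ?",
                (candidates[0], old_path),
            )
        else:
            # step 4: deleted or ambiguous, keep the rule, drop the pointer
            conn.execute(
                "UPDATE user_comments SET file_path = NULL WHERE file_path = ?",
                (old_path,),
            )
    conn.commit()
    conn.close()
```

Deleted files keep their rows (the derived rule survives) but end up with a NULL pointer, matching step 4.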
@@ -230,10 +235,10 @@ Apply https://platform.claude.com/docs/en/agents-and-tools/agent-skills/best-pra Compare new artifacts against existing `.github/` content: - Check trigger overlap between new and existing skills -- Check body overlap (same howto in two places) -- Resolve: if complementary (how vs when), add cross-references. If duplicate, merge or delete. -- Instructions must not repeat AGENTS.md or copilot-instructions.md +- Check body overlap (same content in two places) — if the same concept appears in both agent and skill, keep it in the agent (source of truth) and have the skill point to it +- Instructions must not repeat AGENTS.md, copilot-instructions.md, or the agent's CHECK items verbatim — instructions are for concise auto-loaded reminders only - All doc links verified to exist on disk +- The YAML `description` field is how the model picks from 100+ skills — invest in keyword-rich, third-person, specific trigger descriptions **Commit** after deduplication. @@ -278,11 +283,11 @@ Launch **one sub-agent per dimension** (parallel batches of 6). Each evaluates e Sub-agent instructions: -> Report `$DimensionName — LGTM` when the dimension is genuinely clean. +> Report `$DimensionName — LGTM` when the dimension is genuinely clean. Do not explain away real issues to produce a clean result. > -> Report an ISSUE only when you can construct a **concrete failing scenario**: a specific input, a specific call sequence, a specific state that triggers the bug. No hypotheticals. +> Report an ISSUE only when you can construct a **concrete failing scenario**: a specific input, a specific call sequence, a specific state that triggers the bug. No hypotheticals — "this might be a problem in theory" is not a finding. > -> Read the **PR diff**, not main — new files only exist in the PR branch. +> Read the **PR diff**, not main — new files only exist in the PR branch. Never verify findings against `main`; the code you're reviewing only exists in `refs/pull/{pr}/head`. 
> > Include exact file path and line range. Verify by tracing actual code flow. @@ -396,36 +401,4 @@ Also check: - Every CHECK item is phrased as a generalizable principle, not a transcription of one PR's feedback - Dimension frequency is counted by PRs, not by comments — a PR with 50 comments counts the same as one with 1 comment ---- - -## Lessons Learned (encoded in this process) - -Failure modes observed during development. The process above accounts for them, but they're listed for awareness: - -1. **Nodder bias**: Telling sub-agents "LGTM is the best outcome" caused them to explain away real issues. The correct framing: "LGTM when genuinely clean. Do not explain away real issues." - -2. **Nitpicker bias**: Without LGTM guidance, sub-agents generated 25+ findings including hypotheticals. The fix: require concrete failing scenarios — no "maybe in theory." - -3. **Wrong branch verification**: Verification agents checked `main` instead of the PR branch, disputing real findings because new files didn't exist on `main`. The fix: always verify against PR diff or `refs/pull/{pr}/head`. - -4. **Static-only analysis**: Reading code without tracing execution missed whether issues were real. The fix: Wave 2 requires active validation — build, test, write PoCs, simulate execution traces. - -5. **Duplicate skills**: Two skills covering the same topic emerged. The fix: single source of truth in one file; others are slim pointers. - -6. **Separate reference files get skipped**: Content in a separate file was not reliably read by the model. The fix: inline critical content in the agent file — it's loaded on invocation, guaranteed to be read. - -7. **Description is everything for discovery**: The YAML `description` field is how the model picks from 100+ skills. Invest in keyword-rich, third-person, specific trigger descriptions. - -8. **Sampling bias toward noisy PRs**: The first run sampled ~300 PRs from ~5000 and got 469 comments — biased toward high-comment PRs. 
The deep run collected ALL ~8000 comments from ALL PRs. The difference was 17x more data and qualitatively different dimensions emerged. Never sample. - -9. **Comment-count overfitting**: A PR with 50 review comments dominated the taxonomy, producing CHECK items specific to that one PR. The fix: normalize by PR count, not comment count. One PR = one vote regardless of how many comments it has. - -10. **Reproducing CI as review rules**: The classifier extracted "run tests on Linux and Windows" as a review rule — but CI already does this. Reviewers mention CI coverage in passing; it's not a rule for future reviewers. The fix: cross-reference with the repo's CI config and exclude rules that CI already enforces. - -11. **Overfitted CHECK items from single PRs**: Rules like "cross-reference GenFieldInit in IlxGen.fs when modifying infos.fs" are useless for a reviewer who isn't modifying that specific file. The fix: the confusion audit (Phase 5.6) catches these. Every CHECK item must be applicable to a future PR about a feature that doesn't exist yet. - -12. **Obsolete terminology**: The classifier picked up historical terms ("reactor thread") that no longer exist in the codebase. The fix: cross-validate all technical terms against the current codebase with grep. Zero matches = remove the term. - -13. **Missing own-PR data**: The first run only searched `commenter:username`, missing PRs where the user was the author. PR descriptions by the target user are often the richest source of design philosophy. The fix: always search both `commenter:` AND `author:`. -14. **Multi-batch agent failures**: Sub-agents given 9+ batches of 15 PRs often gave up partway or wrote placeholder files. The fix: keep batch assignments to 4-5 batches per agent, validate output file sizes (>500 bytes), and re-dispatch failures. 
From 202ea2d7aec5e239558c49c00d228c4724021e3a Mon Sep 17 00:00:00 2001 From: Tomas Grosup Date: Wed, 18 Mar 2026 15:40:24 +0100 Subject: [PATCH 14/18] Final checklist: trim dimensions to 3-6 items, remove last reactor example - Test Coverage: 9 -> 5 items (merged test layer list, removed CI-like cross-TFM and Debug/Release rules) - FSharp.Core: 7 -> 6 items (merged XML doc + RFC into one bullet) - Extraction pipeline: remove 'reactor thread' from obsolete-term example - All dimensions now 3-6 CHECK items (sub-agent sized) Checklist verification: [x] reactor: zero mentions across all files [x] all technical terms verified against src/ (zero-match = removed) [x] no single-PR overfitting in CHECK items [x] parallel compilation generalized into Determinism principle [x] overfitting prevention inlined into extraction pipeline [x] all dimensions 3-6 items (sub-agent appropriate) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --- .github/agents/expert-reviewer.md | 21 +++++---------------- .github/agents/extraction-pipeline.md | 2 +- 2 files changed, 6 insertions(+), 17 deletions(-) diff --git a/.github/agents/expert-reviewer.md b/.github/agents/expert-reviewer.md index 6cf3138fe1..aeff688d6a 100644 --- a/.github/agents/expert-reviewer.md +++ b/.github/agents/expert-reviewer.md @@ -49,20 +49,10 @@ Every behavioral change, bug fix, and new feature requires corresponding tests b **CHECK:** - Verify that every code path added or modified has a test exercising it. -- Test the happy path, negative path (invalid input, error conditions), and feature interactions (how the change interacts with generics, constraints, computation expressions, etc.). +- Test happy path, negative path (invalid input, error conditions), and feature interactions (generics, constraints, computation expressions). - Tests must actually assert the claimed behavior — a test that calls a function without checking results is not a test. 
-- Confirm new tests cover behavior not already exercised by existing test suites. - Explain all new errors in test baselines and confirm they are expected. -- Ensure tests cover both Debug and Release codegen differences when relevant. -- Consider cross-TFM behavior (desktop .NET Framework vs CoreCLR) for platform-sensitive changes. -- Place tests in the appropriate layer based on what changed: - - Typecheck tests: type inference, constraint solving, overload resolution, expected warnings/errors - - SyntaxTreeTests: parser/syntax changes - - EmittedIL tests: codegen/IL shape changes - - compileAndRun tests: end-to-end behavioral correctness requiring .NET runtime execution - - Service.Tests: FCS API, editor features - - FSharp.Core.Tests: core library changes -- A PR can and often should have tests in multiple layers. +- Place tests in the appropriate layer: Typecheck (inference, overloads), SyntaxTreeTests (parser), EmittedIL (codegen), compileAndRun (runtime behavior), Service.Tests (FCS API), FSharp.Core.Tests (core library). A PR can span multiple layers. **Severity:** Missing tests for behavioral changes → **high**. Missing cross-TFM coverage → **medium**. @@ -78,10 +68,9 @@ FSharp.Core is the one assembly every F# program references. Changes here have o - Maintain strict backward binary compatibility. No public API removals or signature changes. - Verify compilation order constraints — FSharp.Core has strict file ordering requirements. - Add unit tests to `FSharp.Core.Tests` for every new or changed function. -- Minimize FCS's FSharp.Core dependency — the compiler should be hostable with different FSharp.Core versions; unnecessary coupling breaks this. -- XML doc comments are mandatory for all public APIs in FSharp.Core. -- New API additions require an RFC. -- Apply `InlineIfLambda` to inlined functions taking a lambda applied only once — this enables the lambda to be inlined at the call site, eliminating closure allocation. 
+- Minimize FCS's FSharp.Core dependency — the compiler should be hostable with different FSharp.Core versions. +- XML doc comments are mandatory for all public APIs. New API additions require an RFC. +- Apply `InlineIfLambda` to inlined functions taking a lambda applied only once — eliminates closure allocation at call sites. **Severity:** Binary compat break in FSharp.Core → **critical**. Missing tests → **high**. Missing XML docs → **medium**. diff --git a/.github/agents/extraction-pipeline.md b/.github/agents/extraction-pipeline.md index 71fb2b3d7e..5738695c9d 100644 --- a/.github/agents/extraction-pipeline.md +++ b/.github/agents/extraction-pipeline.md @@ -117,7 +117,7 @@ Collected data references files, folders, and terminology as they existed at the 3. If missing, search for the filename in its current location (files get moved between folders). Update the path if found. 4. If the file was deleted entirely, keep the comment's essence (the rule it teaches) but drop the file pointer. The rule may still apply to successor code. -**Technical terms:** `grep` every technical term used in comments (function names, type names, concepts like "reactor thread") against the current codebase. Terms with zero matches are obsolete — do not use them in generated artifacts. +**Technical terms:** `grep` every technical term used in comments (function names, type names, internal concepts) against the current codebase. Terms with zero matches are obsolete — do not use them in generated artifacts. 
### 1.5 Backup From 082f73d3964a15c3978462965dccaa404f65fc76 Mon Sep 17 00:00:00 2001 From: Tomas Grosup Date: Wed, 18 Mar 2026 15:42:50 +0100 Subject: [PATCH 15/18] Add Phase 5.6: per-dimension codebase verification step New dedicated step where a fresh-context sub-agent reads each dimension's CHECK items against the actual codebase and CI config to: - grep every technical term (zero matches = obsolete, remove) - cross-check rules against CI (CI already does it = not a review rule) - verify generalizability (applies to future unknown PRs?) - verify rationale clarity (why does violating this rule break things?) Grades each item A/B/C/D with targets >=80% A, 0% C/D. Previous overfitting check moved to 5.7 as a final pass. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --- .github/agents/extraction-pipeline.md | 32 ++++++++++++++++++--------- 1 file changed, 21 insertions(+), 11 deletions(-) diff --git a/.github/agents/extraction-pipeline.md b/.github/agents/extraction-pipeline.md index 5738695c9d..a204197429 100644 --- a/.github/agents/extraction-pipeline.md +++ b/.github/agents/extraction-pipeline.md @@ -384,21 +384,31 @@ Verify the three layers work together: - No body overlap between instructions and AGENTS.md/copilot-instructions.md - Agent doesn't repeat AGENTS.md content -### 5.6 Overfitting verification +### 5.6 Codebase verification -Run a confusion audit: give a fresh-context sub-agent (with NO knowledge of the source data) the generated agent file and ask it to grade every CHECK item on a confusion scale: +For each dimension in the generated agent, dispatch a fresh-context sub-agent that reads: +1. The dimension's CHECK items +2. The actual codebase (`src/`, `tests/`, `eng/`) +3. The repo's CI configuration (e.g., `azure-pipelines*.yml`, `.github/workflows/`) -- **A (Clear)**: Any competent developer could apply this rule to a random future PR. 
-- **B (Needs context)**: Rule is correct but needs a "why" or "when" — add one sentence of rationale. -- **C (Overfitted)**: Rule is too specific to one historical scenario — generalize or remove. -- **D (Confusing)**: Rule is ambiguous or contradictory — rewrite. +The sub-agent answers for every CHECK item: +- **Does this term exist in the codebase?** `grep` every function name, type name, and concept. Zero matches = obsolete, remove or replace. +- **Does CI already enforce this?** If the rule says "test on multiple platforms" and CI runs on 3 OSes, drop the rule — CI handles it. +- **Is this generalizable?** Could a reviewer apply this to a PR implementing a feature that doesn't exist yet? If it only makes sense for one specific code path, either generalize it or move it to a code comment. +- **Is the "why" clear?** Would a developer who has never seen this codebase understand what goes wrong if they violate this rule? If not, add a one-sentence rationale. -Target: ≥80% grade A, 0% grade C/D. Fix all C/D items. Add "why" to all B items. +Grade each item: **A** (clear, verified), **B** (needs rationale — add it), **C** (overfitted — generalize or remove), **D** (obsolete/contradictory — rewrite or remove). -Also check: -- No rules that reproduce what CI already enforces (e.g., "run tests on Linux" when CI covers all platforms) -- No rules referencing specific function names or line numbers unless those functions are long-lived stable APIs +**Targets:** ≥80% grade A, 0% grade C/D. Fix all B/C/D items before finalizing. + +**Commit** after verification fixes. 
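The term check in §5.6 ("grep every function name, type name, and concept") can be sketched as a plain scan. The helper name and the extension list are illustrative assumptions; a real run would match the repo's actual source extensions.

```python
from pathlib import Path

def find_obsolete_terms(terms: list[str], repo_root: str,
                        exts: tuple[str, ...] = (".fs", ".fsi", ".md")) -> set[str]:
    """Return the subset of `terms` that match nothing in the current codebase.

    Zero matches means the term is obsolete (a renamed or removed concept)
    and must not appear in generated artifacts.
    """
    remaining = set(terms)
    for path in Path(repo_root).rglob("*"):
        if not remaining:
            break  # every term already confirmed to exist somewhere
        if not path.is_file() or path.suffix not in exts:
            continue
        try:
            text = path.read_text(errors="ignore")
        except OSError:
            continue
        remaining -= {t for t in remaining if t in text}
    return remaining
```

Terms in the returned set correspond to grade-D items in the audit: remove or replace them before finalizing.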
+ +### 5.7 Overfitting verification + +Final check on the complete artifact set: +- No rules that reproduce what CI already enforces +- No rules referencing specific function names or line numbers unless those functions are long-lived stable APIs (verified by grep in 5.6) - Every CHECK item is phrased as a generalizable principle, not a transcription of one PR's feedback -- Dimension frequency is counted by PRs, not by comments — a PR with 50 comments counts the same as one with 1 comment +- Dimension frequency was counted by PRs, not by comments — a PR with 50 comments counts the same as one with 1 comment From c946e20d5d6e017721fe55c9d99a748b86a33c7d Mon Sep 17 00:00:00 2001 From: Tomas Grosup Date: Wed, 18 Mar 2026 15:52:18 +0100 Subject: [PATCH 16/18] Fix 4 critical pipeline issues from fresh Opus audit MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Critical fixes: 1. PR-normalization now mechanically enforced: new §2.2b creates pr_rule_votes table (DISTINCT activity_id per rule) before synthesis. Synthesis agent explicitly blocked from raw comment data. 2. Anti-overfitting rules now apply to artifact generation (§3.1), not just classification. Routing table placement clarified. 3. Sub-agent error handling: validate completeness (not just file size), retry up to 3 times, log and continue on persistent failure. 4. Feedback loop: §5.6 findings >20% C/D trigger re-run of Phase 2-3, not just artifact patching. 
Medium fixes: - CI analysis moved to §2.1 (input for all classifiers) - Bootstrap sample now stratified (by year, area, comment type) - Phase 4 renamed to clarify it's a spec, not a pipeline step - '~15 items' disambiguated to '~15 PRs' everywhere - Per-PR pagination explicitly required in §1.2 - Model specification made flexible (best available, not hardcoded) - Feature area table schema defined Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --- .github/agents/extraction-pipeline.md | 53 ++++++++++++++++++++------- 1 file changed, 39 insertions(+), 14 deletions(-) diff --git a/.github/agents/extraction-pipeline.md b/.github/agents/extraction-pipeline.md index a204197429..c8b6530d20 100644 --- a/.github/agents/extraction-pipeline.md +++ b/.github/agents/extraction-pipeline.md @@ -21,9 +21,9 @@ This pipeline processes **thousands** of GitHub items (typically 3,000–10,000+ Store all intermediate results in **SQLite** (queryable) and **JSON backup files** (recoverable). Sub-agents write results to files; the orchestrator imports into SQLite and dispatches next phase. Never rely on passing large datasets through context — use the filesystem. -Use `model: "claude-opus-4.6"` for all sub-agents. Use background mode (`mode: "background"`) so agents run in parallel and the orchestrator is notified on completion. +Use `model: "claude-opus-4.6"` or the best available reasoning model for sub-agents. Do not use fast/cheap models for classification or synthesis — they produce shallow rules. Use background mode (`mode: "background"`) so agents run in parallel and the orchestrator is notified on completion. -**After each batch of sub-agents completes, validate output files.** Check that each file is >500 bytes and parseable JSON. Re-dispatch any that produced empty/placeholder files. Keep batch assignments to ≤5 batches per agent. 
+**After each batch of sub-agents completes, validate output files.** Check that each file is >500 bytes, parseable JSON, and contains entries for all assigned items (not partial). Re-dispatch any that produced empty, placeholder, or incomplete files — retry up to 3 times. On 3rd failure, log and continue. Keep batch assignments to ≤5 batches per agent. ## Inputs @@ -97,7 +97,7 @@ For EVERY indexed item (not a sample), fetch the user's actual words. **All comm Store in SQLite (`user_comments` table): comment_id, activity_id, repo, comment_type (pr_description, issue_description, issue_comment, review_comment, pr_comment, review, discussion_comment), body, created_at, file_path, diff_hunk, url. -This is the most API-intensive phase. Batch into sub-agents of ~15 items each — small batches ensure each agent completes reliably. Parallelize aggressively. Handle rate limits with retry. +This is the most API-intensive phase. Batch into sub-agents of ~15 PRs each (not 15 comments — each agent handles 15 PRs and fetches all comment types for each). When fetching comments for a single PR, paginate through all pages (`get_review_comments` returns max 100 per page). Parallelize aggressively. Handle rate limits with retry and exponential backoff. ### 1.3 Collect PR context @@ -134,14 +134,15 @@ Before analyzing comments, understand the codebase: - Existing documentation (specs, wiki, guides) - Existing `.github/` artifacts (instructions, skills, agents, copilot-instructions.md, AGENTS.md) - Technology stack, conventions, key files +- **CI configuration** — analyze CI files (`azure-pipelines*.yml`, `.github/workflows/`) and produce a CI coverage summary: what CI already enforces (platform coverage, test suites, formatting, etc.). Provide this summary to every classification sub-agent in §2.2. -Store as a feature area reference table in SQLite. +Store feature areas in SQLite: `CREATE TABLE feature_areas (area_name TEXT, folder_glob TEXT, description TEXT)`. 
Store CI summary as a text file. ### 2.2 Semantic analysis For each collected comment, classify using a sub-agent (Opus). **Do not use a hardcoded category list** — derive categories from the data: -1. **Bootstrap pass**: Take a random sample of ~300 comments from diverse PRs (not just the PRs with the most comments). Ask a sub-agent to read them and propose a category taxonomy. The agent should identify recurring themes, name them, and define each in one sentence. Expect 15–40 categories to emerge. +1. **Bootstrap pass**: Take a stratified sample of ~300 comments: proportional by year, at least 5 per major feature area from §2.1, and at least 20 each of review_comments, pr_descriptions, and issue_comments. Ask a sub-agent to read them and propose a category taxonomy. The agent should identify recurring themes, name them, and define each in one sentence. Expect 15–40 categories to emerge. After deriving the taxonomy, cross-check it against the feature area table — if any area representing >10% of the codebase has zero categories, re-sample with enforced coverage. 2. **Classification pass**: Using the derived taxonomy, classify all comments in batches (~15 PR packets per sub-agent, where each packet includes all comments on that PR). For each comment extract: - **Categories** (one or more, from the derived taxonomy) @@ -162,11 +163,23 @@ Store in SQLite (`comment_analysis` table). Process in batches. Use sub-agents — each handles ~15 PR packets with full context. Run in parallel. +### 2.2b Deduplication (enforce PR-normalization) + +Before synthesis, collapse per-comment rows into per-PR votes: + +```sql +CREATE TABLE pr_rule_votes AS +SELECT DISTINCT activity_id, derived_rule, category, feature_area +FROM comment_analysis; +``` + +This ensures a PR with 50 comments gets weight=1, same as a PR with 1 comment. The synthesis agent in §2.3 reads `pr_rule_votes`, never raw `comment_analysis`. 
+ ### 2.3 Clustering -Aggregate analysis results to identify: +Aggregate the deduplicated `pr_rule_votes` to identify: -1. **Review dimensions**: Recurring themes across hundreds of comments. Each dimension should be specific enough to act on, broad enough to apply across many PRs. Target 8–24 dimensions. +1. **Review dimensions**: Recurring themes across many PRs. Each dimension should be specific enough to act on, broad enough to apply across many PRs. Target 8–24 dimensions. If any single dimension accounts for >40% of total PR-votes, flag it for splitting. 2. **Folder hotspots**: Which directories receive the most review feedback, and which dimensions apply there. @@ -174,11 +187,19 @@ Aggregate analysis results to identify: 4. **Repo-specific knowledge**: Rules that are unique to this codebase, not generic programming advice. -Use a synthesis sub-agent (Opus) that reads all analysis summaries and produces: -- Dimension list with rules, severity, evidence +The synthesis sub-agent receives: +- The taxonomy from §2.2 step 1 +- The `pr_rule_votes` table (deduplicated: one vote per rule per PR) +- The `feature_areas` table from §2.1 +- The CI coverage summary from §2.1 + +The synthesis agent MUST NOT access raw comment data (`user_comments` table or JSON backups). It works only with classified, deduplicated data. 
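The effect of the vote collapse can be verified with a small sketch (rule names and PR ids are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE comment_analysis "
    "(activity_id TEXT, derived_rule TEXT, category TEXT, feature_area TEXT)"
)
# PR-1 raises the same rule in 50 comments; PR-2 raises it once.
rows = [("PR-1", "require-tests", "testing", "core")] * 50
rows.append(("PR-2", "require-tests", "testing", "core"))
conn.executemany("INSERT INTO comment_analysis VALUES (?, ?, ?, ?)", rows)

# The dedup step from above, verbatim.
conn.execute(
    "CREATE TABLE pr_rule_votes AS "
    "SELECT DISTINCT activity_id, derived_rule, category, feature_area "
    "FROM comment_analysis"
)
votes = conn.execute(
    "SELECT COUNT(*) FROM pr_rule_votes WHERE derived_rule = 'require-tests'"
).fetchone()[0]
# Each PR contributes exactly one vote, regardless of comment count.
```

A noisy PR and a terse PR now carry equal weight in synthesis.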
+ +It produces: +- Dimension list with rules, severity, and PR-count evidence - Folder → dimension mapping - Principle list -- Knowledge area reference table +- A `dimension_evidence` table: `(dimension, pr_count, example_prs)` for verification in Phase 5 --- @@ -204,11 +225,11 @@ Generate three artifact types: - Reference docs, don't reproduce them #### Review Agent (`.github/agents/{agent_name}.md`) -- Single source of truth for the review methodology — all critical content must be inline in this file, not in separate reference files that might not be read +- Single source of truth for dimension definitions and review workflow — all CHECK rules must be inline - Contains: overarching principles, all dimensions inline (with rules + CHECK flags), review workflow -- The folder→dimension routing table belongs in the skill (not duplicated here) +- The folder→dimension routing table belongs in the skill (operational configuration, not methodology) — the agent references the skill during Wave 1 to select dimensions - The review workflow is 5 waves (see below) -- Every CHECK item must be a generalizable principle applicable to future PRs about features that don't exist yet — not a transcription of one historical PR's feedback +- The artifact-generation sub-agent must follow the same anti-overfitting rules from §2.2: every CHECK item must be a generalizable principle applicable to future PRs about features that don't exist yet **Commit** after raw creation. @@ -244,7 +265,9 @@ Compare new artifacts against existing `.github/` content: --- -## Phase 4: Review Workflow (embedded in the agent) +## Phase 4: Review Workflow Specification (embedded in the generated agent) + +This section defines the workflow that the generated review agent will follow at runtime. The pipeline does not execute this workflow — it embeds it as instructions in the agent artifact. 
The review agent runs a 5-wave process when invoked: @@ -401,6 +424,8 @@ Grade each item: **A** (clear, verified), **B** (needs rationale — add it), ** **Targets:** ≥80% grade A, 0% grade C/D. Fix all B/C/D items before finalizing. +**Feedback loop:** If >20% of items are grade C/D, the problem is in classification (Phase 2), not just in the artifact. Re-run §2.2 classification for the affected categories with strengthened anti-overfitting prompts, then re-run §2.3 synthesis and §3 artifact generation. Fixing artifacts alone treats symptoms. + **Commit** after verification fixes. ### 5.7 Overfitting verification From c939314304de21f0bdad3c5218b4f96ab81ef49f Mon Sep 17 00:00:00 2001 From: Tomas Grosup Date: Wed, 18 Mar 2026 16:04:00 +0100 Subject: [PATCH 17/18] Add pipeline overview diagram, I/O annotations, resumability MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Restructure pipeline for clarity without changing behavior: - Add ASCII flow diagram showing Phase 1→2→3→5 data flow with named intermediate artifacts at each stage - Add I/O block to every phase/step (45 annotations total): Sub-agents, Input, Output, Resume, Validation, Context constraints - Each step now explicitly states what tables/files it reads and writes - Resume guidance: skip phases whose output tables already exist - Context management: synthesis agent explicitly blocked from raw data - Phase 4 clearly marked as embedded spec, not pipeline step 439 -> 518 lines (80 lines of I/O metadata, 0 behavioral changes) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --- .github/agents/extraction-pipeline.md | 81 ++++++++++++++++++++++++++- 1 file changed, 80 insertions(+), 1 deletion(-) diff --git a/.github/agents/extraction-pipeline.md b/.github/agents/extraction-pipeline.md index c8b6530d20..2a1576b038 100644 --- a/.github/agents/extraction-pipeline.md +++ b/.github/agents/extraction-pipeline.md @@ -7,7 +7,27 @@ description: "Extracts 
review expertise from a GitHub user's history and generat Generate folder-scoped instructions, topic-scoped skills, and a multi-dimensional review agent from a GitHub user's public review history. Produces anonymized, deduplicated, Copilot-compatible `.github/` artifacts. -## Scale Warning +## Pipeline Overview + +``` +Phase 1: Collect Phase 2: Enrich Phase 3: Generate Phase 5: Verify +───────────────── ─────────────── ──────────────── ─────────────── +1.1 Index activity 2.1 Study repo 3.1 Raw creation 5.1-5.5 Quality checks + → gh_activity → feature_areas → agent.md (raw) 5.6 Codebase verify +1.2 Collect comments → ci_summary.txt → SKILL.md (raw) → agent.md (verified) + → user_comments 2.2 Classify → instructions 5.7 Overfitting check +1.3 Collect PR context → comment_analysis (raw) → final artifacts + → pr_contexts 2.2b Deduplicate 3.2 Anonymize +1.4 Reconcile paths → pr_rule_votes → *.md (anon) Phase 4 is NOT a pipeline + → user_comments 2.3 Synthesize 3.3 Anthropic guide step — it defines the + (paths updated) → dimensions.json → *.md (polished) review workflow EMBEDDED +1.5 Backup → principles.json 3.4 Deduplicate in the generated agent. + → JSON files → dim_evidence → *.md (deduped) +``` + +Each phase checks its output tables exist before running — skip completed phases on resume. + +## Scale This pipeline processes **thousands** of GitHub items (typically 3,000–10,000+ issues, PRs, discussions, and review comments spanning a decade). It will not fit in a single context window. @@ -56,6 +76,11 @@ If `reference_repos` are specified and the pipeline needs to search their code ( ### 1.1 Index all activity +> **Sub-agents:** 1 per repo × date-range chunk (parallelize 6+) +> **Input:** GitHub API search results +> **Output:** SQLite `gh_activity` table, JSON backup per chunk +> **Resume:** Skip if `gh_activity` has rows for this repo+date range + Search each repo for issues, PRs, and discussions where `username` participated. 
**Include ALL states** — open, closed, merged, and rejected PRs all carry learning potential; rejected PRs often contain the strongest review opinions. GitHub search returns at most 1,000 results per query, so split queries into 1-year date ranges to capture everything. For high-volume users, split into 6-month ranges.
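The date-range splitting can be sketched as below. The query string uses GitHub's `involves:` and `created:` qualifiers; treat the exact format as an assumption to adapt to your search client:

```python
from datetime import date

def search_chunks(repo, username, earliest_year, months_per_chunk=12):
    """Yield one GitHub search query per date-range chunk, so no
    chunk can exceed the 1,000-result search cap."""
    year, month = earliest_year, 1
    today = date.today()
    while date(year, month, 1) <= today:
        # Advance by the chunk size to find the end of this range.
        end_month = month + months_per_chunk
        end_year = year + (end_month - 1) // 12
        end_month = (end_month - 1) % 12 + 1
        start, end = date(year, month, 1), date(end_year, end_month, 1)
        # Ranges touch at their boundaries; the activity index
        # deduplicates by item id, so the overlap is harmless.
        yield (
            f"repo:{repo} involves:{username} "
            f"created:{start.isoformat()}..{end.isoformat()}"
        )
        year, month = end_year, end_month
```

For high-volume users, call with `months_per_chunk=6`.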
Reconcile before enrichment: **File paths:** @@ -129,6 +168,10 @@ Write all collected data as JSON to a backup directory (e.g., `{landing_repo}-an ### 2.1 Study the landing repo +> **Sub-agents:** 1 (explore agent) +> **Input:** Local repo checkout (`src/`, `tests/`, `eng/`, `.github/`, CI configs) +> **Output:** SQLite `feature_areas` table, `ci_summary.txt`, `existing_artifacts.txt` + Before analyzing comments, understand the codebase: - Directory structure → feature area mapping - Existing documentation (specs, wiki, guides) @@ -140,6 +183,11 @@ Store feature areas in SQLite: `CREATE TABLE feature_areas (area_name TEXT, fold ### 2.2 Semantic analysis +> **Sub-agents:** 1 for bootstrap, then ~N/15 for classification (where N = number of PRs with comments) +> **Input:** `user_comments` table, `feature_areas` table, `ci_summary.txt` +> **Output:** SQLite `comment_analysis` table, `taxonomy.json` +> **Context per sub-agent:** taxonomy + CI summary + 15 PR packets (all comments on each PR) + For each collected comment, classify using a sub-agent (Opus). **Do not use a hardcoded category list** — derive categories from the data: 1. **Bootstrap pass**: Take a stratified sample of ~300 comments: proportional by year, at least 5 per major feature area from §2.1, and at least 20 each of review_comments, pr_descriptions, and issue_comments. Ask a sub-agent to read them and propose a category taxonomy. The agent should identify recurring themes, name them, and define each in one sentence. Expect 15–40 categories to emerge. After deriving the taxonomy, cross-check it against the feature area table — if any area representing >10% of the codebase has zero categories, re-sample with enforced coverage. @@ -165,6 +213,10 @@ Process in batches. 
Use sub-agents — each handles ~15 PR packets with full con ### 2.2b Deduplication (enforce PR-normalization) +> **Sub-agents:** None (orchestrator runs SQL directly) +> **Input:** `comment_analysis` table +> **Output:** `pr_rule_votes` table (1 vote per rule per PR) + Before synthesis, collapse per-comment rows into per-PR votes: ```sql @@ -177,6 +229,11 @@ This ensures a PR with 50 comments gets weight=1, same as a PR with 1 comment. T ### 2.3 Clustering +> **Sub-agents:** 1 (Opus, synthesis) +> **Input:** `pr_rule_votes` table, `taxonomy.json`, `feature_areas` table, `ci_summary.txt` +> **NOT available:** raw `user_comments`, JSON backups — synthesis works only with classified, deduplicated data +> **Output:** `dimensions.json`, `principles.json`, `folder_hotspots.json`, SQLite `dimension_evidence` table + Aggregate the deduplicated `pr_rule_votes` to identify: 1. **Review dimensions**: Recurring themes across many PRs. Each dimension should be specific enough to act on, broad enough to apply across many PRs. Target 8–24 dimensions. If any single dimension accounts for >40% of total PR-votes, flag it for splitting. @@ -207,6 +264,11 @@ It produces: ### 3.1 Raw creation +> **Sub-agents:** 1 per artifact type (3 total) +> **Input:** `dimensions.json`, `principles.json`, `folder_hotspots.json`, existing `.github/` artifacts +> **Output:** `agent.md` (raw), `SKILL.md` (raw), `*.instructions.md` (raw) +> **Anti-overfitting:** generation sub-agents must follow the same rules from §2.2 + Generate three artifact types: #### Instructions (`.github/instructions/*.instructions.md`) @@ -235,12 +297,18 @@ Generate three artifact types: ### 3.2 Anonymize +> **Input:** `*.md` (raw) +> **Output:** `*.md` (anonymized) + Remove all personal names, comment counts, PR number references, evidence statistics, "distilled from" language. The artifacts should read as authoritative engineering guidance, not data analysis output. **Commit** after anonymization. 
### 3.3 Improve per Anthropic guide +> **Input:** `*.md` (anonymized) +> **Output:** `*.md` (polished) + Apply https://platform.claude.com/docs/en/agents-and-tools/agent-skills/best-practices: - `name`: gerund form, lowercase+hyphens - `description`: third person, specific triggers, ≤1024 chars @@ -254,6 +322,9 @@ Apply https://platform.claude.com/docs/en/agents-and-tools/agent-skills/best-pra ### 3.4 Deduplicate and cross-reference +> **Input:** `*.md` (polished), existing `.github/` content +> **Output:** `*.md` (deduplicated) — committed to repo + Compare new artifacts against existing `.github/` content: - Check trigger overlap between new and existing skills - Check body overlap (same content in two places) — if the same concept appears in both agent and skill, keep it in the agent (source of truth) and have the skill point to it @@ -409,6 +480,11 @@ Verify the three layers work together: ### 5.6 Codebase verification +> **Sub-agents:** 1 per dimension (parallelize) +> **Input:** `agent.md` (deduplicated) CHECK items, local repo (`src/`, `tests/`, `eng/`), CI config +> **Output:** `confusion_audit.json` (grade per item), `agent.md` (verified) +> **Feedback loop:** If >20% grade C/D → re-run Phase 2.2 + 2.3 + 3 with stronger anti-overfitting + For each dimension in the generated agent, dispatch a fresh-context sub-agent that reads: 1. The dimension's CHECK items 2. 
The actual codebase (`src/`, `tests/`, `eng/`) @@ -430,6 +506,9 @@ Grade each item: **A** (clear, verified), **B** (needs rationale — add it), ** ### 5.7 Overfitting verification +> **Input:** All `*.md` (verified), `dimension_evidence` table from §2.3 +> **Output:** Final artifacts — committed to repo + Final check on the complete artifact set: - No rules that reproduce what CI already enforces - No rules referencing specific function names or line numbers unless those functions are long-lived stable APIs (verified by grep in 5.6) From c25cdd30e6e8cdd5526ad95754de59ed28ba3e9b Mon Sep 17 00:00:00 2001 From: Tomas Grosup Date: Wed, 18 Mar 2026 16:06:10 +0100 Subject: [PATCH 18/18] Make pipeline fully tech-agnostic, tighten Scale section - Remove all F#/compiler-specific examples (stripTyEqns, LanguageFeature, BindUnitVars, dotnet/fsharp, azure-pipelines) - Replace with generic equivalents that work for any repo/language - Consolidate Scale section: remove sub-agent list that duplicated per-phase batch sizes, organize into Context/Model/Reliability - CI config references now generic (GitHub Actions, Azure Pipelines, Jenkins, etc.) 518 -> 512 lines. Zero tech-specific references remaining. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --- .github/agents/extraction-pipeline.md | 26 ++++++++++---------------- 1 file changed, 10 insertions(+), 16 deletions(-) diff --git a/.github/agents/extraction-pipeline.md b/.github/agents/extraction-pipeline.md index 2a1576b038..0daf23aad7 100644 --- a/.github/agents/extraction-pipeline.md +++ b/.github/agents/extraction-pipeline.md @@ -31,19 +31,13 @@ Each phase checks its output tables exist before running — skip completed phas This pipeline processes **thousands** of GitHub items (typically 3,000–10,000+ issues, PRs, discussions, and review comments spanning a decade). It will not fit in a single context window. -**Use sub-agents for everything.** The orchestrator manages SQLite state and dispatches work. 
Sub-agents do the heavy lifting: -- **Collection**: one sub-agent per repo × date range chunk. Parallelize aggressively (6+ concurrent agents). -- **Comment fetching**: one sub-agent per ~15 items. Small batches ensure reliable completion — agents given 200+ items often give up partway or write placeholder files. -- **Semantic analysis**: one sub-agent per ~15 PR packets. Each classifies and extracts rules. -- **Synthesis**: one sub-agent to read all analysis summaries (not raw data) and produce dimensions/principles. -- **Artifact generation**: one sub-agent per artifact type (instructions, skills, agent). -- **Validation/improvement**: one sub-agent per file or per concern. +**Use sub-agents for everything.** The orchestrator manages SQLite state and dispatches work. Sub-agents do the heavy lifting — see each phase for batch sizes and parallelism guidance. -Store all intermediate results in **SQLite** (queryable) and **JSON backup files** (recoverable). Sub-agents write results to files; the orchestrator imports into SQLite and dispatches next phase. Never rely on passing large datasets through context — use the filesystem. +**Context management:** Store all intermediate results in **SQLite** (queryable) and **JSON backup files** (recoverable). Sub-agents write results to files; the orchestrator imports into SQLite and dispatches the next phase. Never pass large datasets through agent context — use the filesystem. -Use `model: "claude-opus-4.6"` or the best available reasoning model for sub-agents. Do not use fast/cheap models for classification or synthesis — they produce shallow rules. Use background mode (`mode: "background"`) so agents run in parallel and the orchestrator is notified on completion. +**Model selection:** Use the best available reasoning model (e.g., `claude-opus-4.6`) for classification and synthesis sub-agents. Fast/cheap models produce shallow rules. Collection sub-agents can use standard models. 
Use background mode so agents run in parallel. -**After each batch of sub-agents completes, validate output files.** Check that each file is >500 bytes, parseable JSON, and contains entries for all assigned items (not partial). Re-dispatch any that produced empty, placeholder, or incomplete files — retry up to 3 times. On 3rd failure, log and continue. Keep batch assignments to ≤5 batches per agent. +**Reliability:** After each batch of sub-agents completes, validate output files: >500 bytes, parseable JSON, contains entries for all assigned items. Re-dispatch incomplete outputs up to 3 times. Keep batch assignments to ≤5 batches per agent — agents given too much work produce placeholders or give up. ## Inputs @@ -51,7 +45,7 @@ Collect these before starting. If any are missing, ask the user via the `ask_use | Parameter | Required | Example | |-----------|----------|---------| -| `landing_repo` | yes | `dotnet/fsharp` — the repo receiving the artifacts | +| `landing_repo` | yes | `owner/repo` — the repo receiving the artifacts | | `username` | yes | GitHub username whose review history to extract | | `earliest_year` | no (default: 10 years back) | `2015` | | `reference_repos` | no | Additional repos to search (e.g., `dotnet/sdk`, `dotnet/runtime`) | @@ -177,7 +171,7 @@ Before analyzing comments, understand the codebase: - Existing documentation (specs, wiki, guides) - Existing `.github/` artifacts (instructions, skills, agents, copilot-instructions.md, AGENTS.md) - Technology stack, conventions, key files -- **CI configuration** — analyze CI files (`azure-pipelines*.yml`, `.github/workflows/`) and produce a CI coverage summary: what CI already enforces (platform coverage, test suites, formatting, etc.). Provide this summary to every classification sub-agent in §2.2. +- **CI configuration** — analyze CI config files (GitHub Actions, Azure Pipelines, Jenkins, etc.) 
and produce a CI coverage summary: what CI already enforces (platform coverage, test suites, formatting, linting, etc.). Provide this summary to every classification sub-agent in §2.2. Store feature areas in SQLite: `CREATE TABLE feature_areas (area_name TEXT, folder_glob TEXT, description TEXT)`. Store CI summary as a text file. @@ -203,9 +197,9 @@ For each collected comment, classify using a sub-agent (Opus). **Do not use a ha **Anti-overfitting rules for classification:** - **Normalize by PR, not by comment.** A PR with 50 comments gets weight=1, same as a PR with 1 comment. Count how many PRs a rule appears in, not how many comments mention it. The reviewer saying something once on a clean PR means the same as repeating it 10 times on a messy one. -- **Generalize, don't transcribe.** The derived rule must be applicable to a future PR the classifier has never seen. "Always call stripTyEqns before matching types" is good. "Call stripTyEqns on line 47 of CheckPatterns.fs" is overfitted. -- **Distinguish reviewer opinion from CI enforcement.** If the reviewer says "please add tests", that's a review rule. If the reviewer says "run tests on Linux and Windows", that might just mean "CI should cover this" — not a rule for human reviewers. Check: does the repo's CI already do this? If yes, don't encode it as a review rule. -- **Distinguish design guidance from implementation instruction.** "Gate features behind LanguageFeature flags" is design guidance (always applicable). "Use BindUnitVars after stripping the lambda" is an implementation instruction for a specific code path (only applicable when touching that code). +- **Generalize, don't transcribe.** The derived rule must be applicable to a future PR the classifier has never seen. "Always normalize type representations before pattern matching" is good. "Call helper X on line 47 of file Y" is overfitted. 
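Locating the CI config files to analyze can be sketched generically; the pattern list covers the common systems named above and is meant to be extended:

```python
import glob
import os

# Default config locations for common CI systems; extend as needed.
CI_PATTERNS = [
    ".github/workflows/*.yml",
    ".github/workflows/*.yaml",
    "azure-pipelines*.yml",
    "Jenkinsfile",
    ".gitlab-ci.yml",
    ".circleci/config.yml",
]

def find_ci_configs(repo_root):
    """Return every CI configuration file found under the repo root.
    The study sub-agent reads these to build the coverage summary."""
    found = []
    for pattern in CI_PATTERNS:
        found.extend(glob.glob(os.path.join(repo_root, pattern)))
    return sorted(found)
```

The summary itself (what CI enforces) is written by the sub-agent after reading these files; this sketch only gathers its inputs.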
+- **Distinguish reviewer opinion from CI enforcement.** If the reviewer says "please add tests", that's a review rule. If the reviewer says "run tests on Linux and Windows", that might just mean "CI should cover this" — not a rule for human reviewers. Check the CI summary from §2.1: if CI already enforces it, don't encode it as a review rule. +- **Distinguish design guidance from implementation instruction.** "Gate features behind feature flags" is design guidance (always applicable). "Use helper X after calling Y" is an implementation instruction for a specific code path (only applicable when touching that code). Store in SQLite (`comment_analysis` table). @@ -488,7 +482,7 @@ Verify the three layers work together: For each dimension in the generated agent, dispatch a fresh-context sub-agent that reads: 1. The dimension's CHECK items 2. The actual codebase (`src/`, `tests/`, `eng/`) -3. The repo's CI configuration (e.g., `azure-pipelines*.yml`, `.github/workflows/`) +3. The repo's CI configuration The sub-agent answers for every CHECK item: - **Does this term exist in the codebase?** `grep` every function name, type name, and concept. Zero matches = obsolete, remove or replace.