diff --git a/.github/agents/expert-reviewer.md b/.github/agents/expert-reviewer.md index ef0e38688c..aeff688d6a 100644 --- a/.github/agents/expert-reviewer.md +++ b/.github/agents/expert-reviewer.md @@ -1,23 +1,45 @@ --- name: expert-reviewer -description: "Multi-dimensional code review agent for F# compiler PRs. Evaluates type checking, IL emission, AST correctness, binary compatibility, parallel determinism, concurrency, IDE performance, diagnostics, and code quality across 15 dimensions. Invoke when reviewing compiler changes, requesting expert feedback, or performing pre-merge quality checks." +description: "Multi-dimensional code review agent for F# compiler PRs. Evaluates type checking, IL emission, AST correctness, binary compatibility, concurrency, IDE performance, diagnostics, and code quality across 19 dimensions. Invoke when reviewing compiler changes, requesting expert feedback, or performing pre-merge quality checks." --- # Expert Reviewer -Evaluates F# compiler changes across 15 dimensions. Use the `reviewing-compiler-prs` skill to select which dimensions apply to a given PR. +Evaluates F# compiler changes across 19 dimensions. Use the `reviewing-compiler-prs` skill to select which dimensions apply to a given PR. **Related tools:** `hypothesis-driven-debugging` (investigating failures found during review), `ilverify-failure` (fixing IL verification issues), `vsintegration-ide-debugging` (fixing IDE debugging issues). ## Overarching Principles -- **Testing is the gating criterion.** No behavioral change merges without a test that exercises it. Missing tests are the single most common review blocker. -- **Binary compatibility is non-negotiable.** Any change to serialized metadata must preserve forward and backward compatibility across compiler versions. Treat pickled data like a wire protocol. +- **Testing is the gating criterion.** No behavioral change merges without a test that exercises it. Missing tests are the single most common review blocker. 
Do not submit features without updated tests; close and resubmit if tests are missing. +- **Binary compatibility is non-negotiable.** Any change to serialized metadata must preserve forward and backward compatibility across compiler versions. Treat pickled data like a wire protocol. Codegen changes that depend on new FSharp.Core functions must guard against older FSharp.Core versions being referenced. +- **FSharp.Core stability is sacrosanct.** Changes to FSharp.Core carry more risk than compiler changes because every F# program depends on it. Binary compatibility, compilation order, and API surface are all critical. Prefer consolidated changes through a single well-reviewed PR. - **Determinism is a correctness property.** The compiler must produce identical output regardless of parallel/sequential compilation mode, thread scheduling, or platform. -- **Feature gating protects users.** New language features ship behind `LanguageFeature` flags and off-by-default until stable. Breaking changes require an RFC. +- **Feature gating protects users.** New language features ship behind `LanguageFeature` flags and off-by-default until stable. Breaking changes require an RFC. Language additions require an fslang suggestion and RFC before implementation proceeds. - **Diagnostics are user-facing.** Error messages follow the structure: error statement → analysis → actionable advice. Wording changes need the same care as API changes. -- **IDE responsiveness is a feature.** Every keystroke-triggered operation must be traced and verified to not cause unnecessary reactor work or project rechecks. -- **Prefer general solutions over special cases.** Do not hardwire specific library optimizations into the compiler. Prefer inlining-based optimizations that apply broadly. +- **IDE responsiveness is a feature.** Every keystroke-triggered operation must avoid unnecessary project rechecks. Evaluate the queue stress impact of every new FCS request type. 
+- **Prefer general solutions over special cases.** Do not hardwire specific library optimizations into the compiler. Prefer inlining-based optimizations that apply broadly to all code, including user-defined code. +- **Evidence-based performance.** Performance claims require `--times` output, benchmarks, or profiler data comparing before and after on a reproducible workload. Do not inline large functions without performance evidence. +- **Guard the API surface.** Public API additions to FCS and FSharp.Core must be carefully controlled. Internal data structures must never leak. Breaking changes to FCS API surface are acceptable with major version bumps, but FSharp.Core must remain stable. + +## Anti-Patterns + +Push back against these recurring patterns: + +1. **Reformatting + logic in one PR** — Separate formatting into its own PR. Mixed diffs obscure logic changes and block review. +2. **Catch-all exception handlers** — Do not add catch-all exception handlers. Handle `OperationCanceledException` specially; never swallow it. (Exception: top-level language service entry points should catch all to prevent IDE crashes — but log, don't silence.) +3. **Internal type leakage** — Internal compiler data structures must not leak through the FCS (F# Compiler Service) public API. Leakage creates permanent API commitments from implementation details. +4. **Performance claims without data** — Require benchmarks, `--times` output, or profiler evidence for any performance claim. +5. **Raw TType_* pattern matching** — Never match on `TType_*` without first calling `stripTyEqns` (which resolves type abbreviations and equations). Skipping it causes missed matches on aliased/abbreviated types. Use `AppTy` active pattern instead of `TType_app`. +6. **Verbose inline logging** — Prefer structured/declarative tracing over inline logging calls that clutter the code. +7. 
**Conditional serialization writes** — Writes gated on compiler flags (LangVersion, feature toggles, TFM) produce misaligned byte streams for cross-version metadata. The byte count must depend only on stream-encoded data. +8. **Stale type-checking results** — Avoid returning stale results; they cause timing-dependent IntelliSense glitches. Prefer fresh results with cancellation support. +9. **Global mutable state** — Pass dependencies explicitly as parameters rather than using module-level mutable globals, to enable concurrent and snapshot-based processing. +10. **Missing XML doc comments** — Every new top-level function, module, and type definition must have a `///` comment. +11. **Shell script wrappers** — Prefer MSBuild targets over batch/shell scripts — scripts obscure build logic and break cross-platform builds. +12. **Large closures capturing unnecessary data** — Verify that long-lived closures don't capture more data than needed, causing memory leaks. +13. **Returning `Unchecked.defaultof<_>` to swallow exceptions** — This hides the root cause. Investigate and fix exception propagation failures. +14. **Band-aid ad-hoc patches** — Flag `if condition then specialCase else normalPath` that patches a consumer rather than fixing the source. ## Review Dimensions @@ -27,22 +49,10 @@ Every behavioral change, bug fix, and new feature requires corresponding tests b **CHECK:** - Verify that every code path added or modified has a test exercising it. -- Test the happy path, negative path (invalid input, error conditions), and feature interactions (how the change interacts with generics, constraints, computation expressions, etc.). +- Test happy path, negative path (invalid input, error conditions), and feature interactions (generics, constraints, computation expressions). - Tests must actually assert the claimed behavior — a test that calls a function without checking results is not a test. -- Confirm new tests cover behavior not already exercised by existing test suites.
- Explain all new errors in test baselines and confirm they are expected. -- Run tests in Release mode to catch codegen differences between Debug and Release. -- Run tests on both desktop (.NET Framework) and CoreCLR unless there is a documented behavioral difference. -- Place tests in the appropriate layer based on what changed: - - Typecheck tests: type inference, constraint solving, overload resolution, expected warnings/errors - - SyntaxTreeTests: parser/syntax changes - - EmittedIL tests: codegen/IL shape changes - - compileAndRun tests: end-to-end behavioral correctness requiring .NET runtime execution - - Service.Tests: FCS API, editor features - - FSharp.Core.Tests: core library changes -- A PR can and often should have tests in multiple layers. -- Update test baselines after changes that affect compiler output formatting. -- Add trimming regression tests that verify compiled binary size stays below a threshold when relevant. +- Place tests in the appropriate layer: Typecheck (inference, overloads), SyntaxTreeTests (parser), EmittedIL (codegen), compileAndRun (runtime behavior), Service.Tests (FCS API), FSharp.Core.Tests (core library). A PR can span multiple layers. **Severity:** Missing tests for behavioral changes → **high**. Missing cross-TFM coverage → **medium**. @@ -50,59 +60,68 @@ Every behavioral change, bug fix, and new feature requires corresponding tests b --- -### 2. Type System Correctness +### 2. FSharp.Core Stability -Type checking and inference must be sound. Subtle bugs in constraint solving, overload resolution, or scope handling can produce incorrect programs. +FSharp.Core is the one assembly every F# program references. Changes here have outsized blast radius. **CHECK:** -- Use `tryTcrefOfAppTy` or the `AppTy` active pattern instead of direct `TType` pattern matches. -- Avoid matching on empty contents as a proxy for signature file presence — use explicit markers. -- Verify nullness errors on return types are intentional and documented. 
- Exclude current file signatures from `tcState` when processing implementation files. -- Distinguish initialization-related bugs from recursive type checking bugs when triaging. -- Document that type inference requires analyzing implementation before resolving member types. -- Add error recovery for namespace fragment combination differences between parallel and sequential paths. +- Maintain strict backward binary compatibility. No public API removals or signature changes. +- Verify compilation order constraints — FSharp.Core has strict file ordering requirements. +- Add unit tests to `FSharp.Core.Tests` for every new or changed function. +- Minimize FCS's FSharp.Core dependency — the compiler should be hostable with different FSharp.Core versions. +- XML doc comments are mandatory for all public APIs. New API additions require an RFC. +- Apply `InlineIfLambda` to lambda parameters of `inline` functions when the lambda is applied exactly once — this eliminates closure allocation at call sites. -**Severity:** Type system unsoundness → **critical**. Incorrect inference in edge cases → **high**. +**Severity:** Binary compat break in FSharp.Core → **critical**. Missing tests → **high**. Missing XML docs → **medium**. -**Hotspots:** `src/Compiler/Checking/`, `src/Compiler/TypedTree/` +**Hotspots:** `src/FSharp.Core/` --- -### 3. Binary Compatibility & Metadata Safety +### 3. Backward Compatibility Vigilance -The pickled metadata format is a cross-version contract. DLLs compiled with any F# version must be consumable by any other version. +Changes must not break existing compiled code or binary compatibility. **CHECK:** -- Never remove, reorder, or reinterpret existing serialized data fields. -- Ensure new data is invisible to old readers (stream B with tag-byte detection). -- Exercise old-compiler-reads-new-output and new-compiler-reads-old-output for any metadata change.
-- Verify the byte count does not depend on compiler configuration (feature flags, LangVersion, TFM) — only on stream-encoded data. -- Add cross-version compatibility tests when changing anonymous record or constraint emission. -- Ensure new parameters are conditionally passed to preserve existing behavior for unchanged callers. +- Verify changes do not break existing compiled code or binary compatibility. +- Breaking changes should be gated as a warning first, not a hard error. +- Add new APIs alongside existing ones rather than replacing signatures. +- Codegen changes that depend on new FSharp.Core functions must guard against older FSharp.Core versions. +- Consider forward compatibility — avoid locking in behavior that blocks future language evolution. -**Severity:** Any metadata format breakage → **critical**. Missing compat test → **high**. +**Severity:** Binary compat break → **critical**. Behavioral change without flag → **high**. Missing compat test → **high**. -**Hotspots:** `src/Compiler/TypedTree/TypedTreePickle.fs`, `src/Compiler/Driver/CompilerImports.fs` +**Hotspots:** `src/Compiler/TypedTree/`, `src/Compiler/Driver/`, `src/FSharp.Core/` -*See also: `.github/instructions/TypedTreePickle.instructions.md` for detailed stream alignment rules.* +--- + +### 4. RFC Process & Language Design + +Major language changes require an RFC and design discussion before implementation. + +**CHECK:** +- Require an fslang suggestion and RFC for language and API additions. +- Submit one consolidated PR per RFC rather than multiple partial PRs. +- Update or create the RFC document when implementing a language or interop feature change. +- Keep design discussion in the RFC, not in PR comments. +- Do not rush language changes into a release without proper design review. + +**Severity:** Language change without RFC → **critical**. Missing RFC update → **high**. Design discussion in PR → **medium**. 
+ +**Hotspots:** `src/Compiler/Checking/`, `src/Compiler/SyntaxTree/`, `src/FSharp.Core/` --- -### 4. IL Emission & Debug Correctness +### 5. IL Codegen Correctness -Code generation must produce correct, verifiable IL. Debug sequence points must enable accurate stepping. +Code generation must produce correct, verifiable IL. Wrong IL produces silent runtime failures. **CHECK:** +- Ensure emitted IL is verifiable and matches expected instruction patterns. +- Verify no changes in tail-calling behavior — check IL diffs before and after. - Test code changes with optimizations both enabled and disabled. -- Strip debug points when matching on `Expr.Lambda` during code generation. -- Check for existing values on the IL stack when debugging codegen issues with debug points. -- Investigate IL diffs between parallel and sequential type checking to identify determinism root causes. -- Verify that newly added code paths are actually reachable before merging. -- Confirm that `Decimal` cases are not already eliminated by `TryEliminateDesugaredConstants` before adding them. -- Cross-reference `GenFieldInit` in `IlxGen.fs` when modifying field initialization in `infos.fs`. -- Manually verify debug stepping for loops, while loops, task code, list/array expressions, and sequence expressions. -- Perform end-to-end debug stepping verification for all affected control flow cases before merge. +- Solve debuggability or performance problems generally through techniques that also apply to user-written code, not special-cased optimizations. +- When matching on expression nodes during codegen, handle debug-point wrapper nodes to prevent IL stack corruption. **Severity:** Incorrect IL → **critical**. Debug stepping regression → **high**. Missing IL test → **medium**. @@ -110,116 +129,202 @@ Code generation must produce correct, verifiable IL. Debug sequence points must --- -### 5. Syntax Tree & Parser Integrity +### 6. Optimization Correctness -AST nodes must accurately represent source code. 
Parser changes are high-risk because they affect every downstream phase. +Optimizer changes must preserve program semantics. Inlining and tail-call changes are high-risk. **CHECK:** -- Update all pattern matches on `SynBinding` in tree-walking code when modifying its shape. -- Remove default wildcard patterns in type walkers to catch missing cases at compile time. -- Handle all cases of discriminated unions including spreads in tree walkers. -- Gate parser changes behind the appropriate language version. -- Add new parser cases to existing rules that share the same prefix. -- Visit spread types in `FileContentMapping` to build correct graph edges. -- Assess compound expression compatibility when introducing new syntax to avoid breaking existing code. -- Remove unused AST nodes created for experimental features that were not pursued. +- Verify optimizations preserve program semantics in all cases. +- Tail call analysis must correctly handle all cases including mutual recursion, not just simple self-recursion. +- Prefer general approaches (e.g., improved inlining) that cover many cases at once over hand-implementing function-by-function optimizations. +- Verify that expression restructuring optimizations don't regress code quality — compare IL before and after. +- Require performance evidence for optimization changes. -**Severity:** Incorrect AST node → **critical**. Missing walker case → **high**. Ungated parser change → **high**. +**Severity:** Semantic-altering optimization → **critical**. Tail-call regression → **high**. Missing evidence → **medium**. -**Hotspots:** `src/Compiler/SyntaxTree/`, `src/Compiler/pars.fsy` +**Hotspots:** `src/Compiler/Optimize/`, `src/Compiler/CodeGen/` + +--- + +### 7. FCS API Surface Control + +The FCS public API is a permanent commitment. Internal types must never leak. + +**CHECK:** +- Keep internal implementation details out of the public FCS API. +- The FCS Symbol API must be thread-safe for concurrent access. 
+- When changing internal implementation to async, keep FCS API signatures unchanged. +- Use distinct types rather than bare aliases across the FCS API so that misuse is caught at compile time. +- Document the purpose of new public API arguments in XML docs. +- Update exception XML docs in `.fsi` files when behavior changes. + +**Severity:** Unintended public API break → **critical**. Internal type leakage → **high**. Missing XML docs → **medium**. + +**Hotspots:** `src/Compiler/Service/`, `src/FSharp.Core/` + +--- + +### 8. Type System Correctness + +Type checking and inference must be sound. Subtle bugs in constraint solving, overload resolution, or scope handling produce incorrect programs. + +**CHECK:** +- Always call `stripTyEqns`/`stripTyEqnsA` before pattern matching on types — this resolves type abbreviations and inference equations. Without it, aliased types won't match and code silently takes the wrong branch. Use the `AppTy` active pattern instead of matching `TType_app` directly. +- Raise internal compiler errors for unexpected type forms rather than returning defaults — silent defaults hide bugs. +- Use property accessors on IL metadata types (e.g., `ILMethodDef` properties) rather than deconstructing them directly — the internal representation may change. + +**Severity:** Type system unsoundness → **critical**. Incorrect inference in edge cases → **high**. + +**Hotspots:** `src/Compiler/Checking/`, `src/Compiler/TypedTree/` + +--- + +### 9. Struct Type Awareness + +Structs have value semantics that differ fundamentally from reference types. Incorrect handling causes subtle bugs. + +**CHECK:** +- Respect struct semantics: no unnecessary copies, proper byref handling. +- Before converting a type to struct, measure the impact — large structs lose sharing and can reduce throughput. +- Always add tests for struct variants when changing union or record type behavior. +- Investigate and fix incorrect behavior for struct types rather than working around it.
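The copy-semantics point above can be sketched with a tiny struct record. The `Point` type and `demo` function are illustrative only, not compiler code:

```fsharp
// Struct records have value semantics: assignment copies, it never aliases.
[<Struct>]
type Point = { mutable X: int; Y: int }

let demo () =
    let p = { X = 1; Y = 2 }
    let mutable q = p        // q is an independent copy of p, not an alias
    q.X <- 99                // mutates the copy only
    p.X, q.X                 // evaluates to (1, 99): the original is untouched
```

A reference-type `Point` would return `(99, 99)` here, which is why union and record changes need dedicated tests for their struct variants.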
+ +**Severity:** Incorrect struct copy semantics → **critical**. Missing struct tests → **high**. Style → **low**. + +**Hotspots:** `src/Compiler/Checking/`, `src/Compiler/CodeGen/` + +--- + +### 10. IDE Responsiveness + +The compiler service must respond to editor keystrokes without unnecessary recomputation. + +**CHECK:** +- Use fully async code in the language service; avoid unnecessary `Async.RunSynchronously`. +- Verify changes do not trigger endless project rechecks. +- Evaluate the queue stress impact of every new FCS request type — each request blocks the service queue while running, so expensive requests delay all other IDE features. +- Caching must prevent duplicate work per project. +- Test IDE changes on large solutions before merging. + +**Severity:** Endless recheck loop → **critical**. UI thread block → **high**. Missing trace verification → **medium**. + +**Hotspots:** `src/Compiler/Service/`, `src/FSharp.Compiler.LanguageServer/`, `vsintegration/` --- -### 6. Parallel Compilation & Determinism +### 11. Overload Resolution Correctness -Parallel type checking must produce bit-identical output to sequential compilation. +Overload resolution is one of the most complex and specification-sensitive areas of the compiler. **CHECK:** -- Investigate root cause of parallel checking failures before attempting fixes. -- Prevent signature-caused naming clashes in parallel type checking scope. -- Ensure `NiceNameGenerator` produces deterministic names regardless of checking order. -- Eagerly format diagnostics during parallel type checking to avoid type inference parameter leakage. -- Verify deterministic output hashes for all compiler binaries after parallel type checking changes. -- Use ILDASM to diff binaries when investigating determinism bugs. -- Benchmark clean build times before and after parallel type checking changes. -- Resolve naming clashes between static members and top-level bindings in parallel type checking. 
+- Ensure overload resolution follows the language specification precisely. +- Verify that language features work correctly with truly overloaded method sets, not just single-overload defaults. +- Changes that loosen overload resolution rules constitute language changes and need careful analysis. +- Apply method hiding filters (removing base-class methods overridden by derived-class methods) consistently in both normal resolution and SRTP constraint solving paths. +- For complex SRTP corner cases, changes must pin existing behavior with tests. -**Severity:** Non-deterministic output → **critical**. Scoping error → **high**. Missing benchmark → **medium**. +**Severity:** Overload resolution regression → **critical**. SRTP behavior change → **high**. Missing test → **medium**. -**Hotspots:** `src/Compiler/Checking/`, `src/Compiler/Driver/` +**Hotspots:** `src/Compiler/Checking/ConstraintSolver.fs`, `src/Compiler/Checking/` --- -### 7. Concurrency & Cancellation Safety +### 12. Binary Compatibility & Metadata Safety + +The pickled metadata format is a cross-version contract. DLLs compiled with any F# version must be consumable by any other version. + +**CHECK:** +- Never remove, reorder, or reinterpret existing serialized data fields. +- Ensure new data is invisible to old readers (added to stream B with tag-byte detection — old readers get default `0` past end-of-stream). +- Exercise old-compiler-reads-new-output and new-compiler-reads-old-output for any metadata change. +- Verify the byte count does not depend on compiler configuration (feature flags, LangVersion, TFM) — only on stream-encoded data. +- Add cross-version compatibility tests for any change to metadata emission. +- Before modifying the typed tree or pickle format, check whether the feature can be expressed through existing IL metadata without changing internal representations. + +**Severity:** Any metadata format breakage → **critical**. Missing compat test → **high**. 
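The byte-count invariant above can be modeled outside the compiler. This is a minimal sketch using plain `BinaryWriter` (illustrative names, not the `TypedTreePickle` API): a write gated on a flag changes the primary stream's length, while the stream-B pattern keeps it fixed.

```fsharp
open System.IO

// BAD: the primary stream's byte count depends on a feature flag, so a
// reader built without the feature misinterprets every field written after it.
let writeFieldBad (primary: BinaryWriter) (featureOn: bool) (extra: byte) =
    if featureOn then primary.Write(extra)

// GOOD: the primary stream always has the same shape; optional data goes to a
// secondary stream behind a tag byte that old readers never consult.
let writeFieldGood (primary: BinaryWriter) (streamB: BinaryWriter) (featureOn: bool) (extra: byte) =
    primary.Write(0uy)                 // fixed-size placeholder, written unconditionally
    if featureOn then
        streamB.Write(1uy)             // tag byte: extra data present
        streamB.Write(extra)

// Primary-stream length for a given flag setting.
let primaryLength featureOn =
    use ms = new MemoryStream()
    use msB = new MemoryStream()
    use w = new BinaryWriter(ms)
    use wB = new BinaryWriter(msB)
    writeFieldGood w wB featureOn 42uy
    w.Flush()
    ms.Length
```

`primaryLength true` equals `primaryLength false`, which is exactly the invariant this dimension asks reviewers to verify for any metadata change.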
+ +**Hotspots:** `src/Compiler/TypedTree/TypedTreePickle.fs`, `src/Compiler/Driver/CompilerImports.fs` + +*See also: `.github/instructions/TypedTreePickle.instructions.md` for detailed stream alignment rules.* + +--- + +### 13. Concurrency & Cancellation Safety Async and concurrent code must handle cancellation, thread safety, and exception propagation correctly. **CHECK:** -- Add observable reproduction tests for async cancellation edge cases. -- Preserve stack traces when re-raising exceptions in async control flow. -- Add tests verifying stack trace preservation through async exception handling paths. -- Keep Task-to-Async conversions explicit and searchable in the codebase. -- Check cancellation token status before executing async operations. -- Verify `Cache` replacement preserves exactly-once initialization semantics of `Lazy` wrappers. -- Do not rely on specific thread pool thread assignments in tests. +- Thread cancellation tokens through all async operations. All FCS requests must support cancellation. +- Ensure thread-safety for shared mutable state. Avoid global mutable state. +- Every lock must have a comment explaining what it protects. +- Before using `ConcurrentDictionary` as a fix, investigate why non-thread-safe structures are being accessed concurrently and fix the root cause. - Do not swallow `OperationCanceledException` in catch-all handlers. +- Do not add catch-all exception handlers. **Severity:** Race condition or data corruption → **critical**. Swallowed cancellation → **high**. Missing async test → **medium**. -**Hotspots:** `src/Compiler/Service/`, `src/Compiler/Facilities/`, `src/FSharp.Core/` +**Hotspots:** `src/Compiler/Service/`, `src/Compiler/Facilities/`, `src/FSharp.Core/`, `vsintegration/` --- -### 8. IDE Performance & Reactor Efficiency +### 14. Incremental Checking Correctness -The compiler service must respond to editor keystrokes without unnecessary recomputation. +Incremental checking must invalidate stale results correctly. 
Stale data causes timing-dependent glitches. **CHECK:** -- Verify that `IncrementalBuilder` caching prevents duplicate builder creation per project. -- Verify that changes do not trigger endless project rechecks by examining reactor traces. -- Analyze reactor trace output to identify unnecessary work triggered by keystrokes. -- Avoid triggering brace matching and navigation bar updates on non-structural keystrokes. -- Verify each diagnostic service makes exactly one request per keystroke. -- Investigate all causes of needless project rebuilding in IDE, not just the first one found. -- Test IDE changes on large solutions before merging. +- Avoid returning stale type checking results; prefer fresh results with cancellation support. +- Verify that caching prevents redundant checks and that cache invalidation is correct. +- Verify that project setup handles clean solutions or unrestored packages without silently dropping references. +- Ensure error loggers are consistent across all checking phases to avoid missing errors. + +**Severity:** Stale results causing glitches → **critical**. Missed invalidation → **high**. Missing cache verification → **medium**. + +**Hotspots:** `src/Compiler/Service/`, `src/Compiler/Driver/` + +--- -**Severity:** Endless recheck loop → **critical**. Unnecessary rebuild → **high**. Missing trace verification → **medium**. +### 15. Syntax Tree & Parser Integrity + +AST nodes must accurately represent source code. Parser changes are high-risk because they affect every downstream phase. + +**CHECK:** +- Update all pattern matches in tree-walking code when modifying AST node shapes. +- Remove default wildcard patterns in discriminated union walkers to catch missing cases at compile time. +- Gate parser changes behind the appropriate language version. +- Assess expression compatibility when introducing new syntax to avoid breaking existing code. + +**Severity:** Incorrect AST node → **critical**. Missing walker case → **high**. 
Ungated parser change → **high**. -**Hotspots:** `src/Compiler/Service/`, `src/FSharp.Compiler.LanguageServer/` +**Hotspots:** `src/Compiler/SyntaxTree/`, `src/Compiler/pars.fsy` --- -### 9. Editor Integration & UX +### 16. Exception Handling Discipline -VS extension and editor features must be correct, consistent, and not regress existing workflows. +Exception handling must be precise. Catch-all handlers and swallowed exceptions hide bugs. **CHECK:** -- Separate legacy project and CPS project modalities clearly before reasoning about changes. -- Revert problematic editor features to preview status when they cause regressions. -- Ship new warnings off-by-default with tests and an IDE CodeFix. -- Use existing compiler helpers for parameter name matching in IDE CodeFixes. -- Fix inconsistent paste indentation behavior that varies by cursor column position. -- Make project cache size a user-configurable setting. -- Use Roslyn mechanisms for analyzer scheduling while tuning delays. +- Raise internal compiler errors for unexpected type forms (`TType_ucase`, etc.) rather than returning defaults. +- Never swallow exceptions silently; handle `OperationCanceledException` specially. +- Do not suppress task completion by silently ignoring `TrySetException` failures. +- Returning `Unchecked.defaultof<_>` to swallow exceptions is dangerous — investigate and fix the root cause. +- At language service API boundaries (top-level entry points called by the IDE), catch all exceptions to prevent IDE crashes — but log them, don't silence them. +- Inside the compiler, do not add catch-all exception handlers — they hide bugs. -**Severity:** Editor regression in common workflow → **critical**. Inconsistent behavior → **high**. Missing CodeFix → **medium**. +**Severity:** Swallowed cancellation → **critical**. Catch-all handler → **high**. Missing error → **medium**. 
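The boundary-versus-internals rule can be sketched as follows; `tryGetTooltip` and `log` are hypothetical stand-ins, not actual FCS entry points.

```fsharp
open System

// Hypothetical logging hook; a real host would supply its own tracing.
let log (e: exn) = eprintfn "language service error: %O" e

// Top-level language service boundary: catch everything EXCEPT cancellation,
// log it, and degrade gracefully instead of crashing the IDE.
let tryGetTooltip (compute: unit -> string) =
    try
        Some (compute ())
    with
    | :? OperationCanceledException -> reraise ()   // cancellation must reach the caller
    | e -> log e; None                              // log, don't silence
```

Inside the compiler proper, the handler would be narrower still: no catch-all at all, so that unexpected exceptions surface as bugs rather than vanish.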
-**Hotspots:** `vsintegration/`, `src/FSharp.Compiler.LanguageServer/` +**Hotspots:** `src/FSharp.Core/`, `src/Compiler/Service/`, `src/Compiler/Optimize/` --- -### 10. Diagnostic Quality +### 17. Diagnostic Quality Error and warning messages are the compiler's user interface. They must be precise, consistent, and actionable. **CHECK:** -- Structure error messages as: error statement, then analysis, then advice. -- Only rename error identifiers to reflect actual error message content. -- Format suggestion messages as single-line when `--vserrors` is enabled. -- Use clearer wording in quotation-related error messages. -- Emit a warning rather than silently ignoring unsupported default parameter values. -- Enable unused variable warnings by default in projects to catch common bugs. -- Eagerly format diagnostics at production time to prevent parameter leakage in parallel checking. +- Structure error messages as: error statement, then analysis, then actionable advice. +- Emit a warning rather than silently ignoring unsupported values or options. +- Eagerly format diagnostics at production time to prevent parameter leakage across threads. **Severity:** Misleading diagnostic → **high**. Inconsistent format → **medium**. Wording improvement → **low**. @@ -227,19 +332,30 @@ Error and warning messages are the compiler's user interface. They must be preci --- -### 11. Feature Gating & Compatibility +### 18. Debug Experience Quality + +Debug stepping, breakpoints, and locals display must work correctly. Debug experience regressions silently break developer workflows. + +**CHECK:** +- Ensure debug points and sequence points enable correct stepping behavior. +- Verify debug stepping for loops, task code, and sequence expressions when changing control flow codegen. +- Solve debuggability problems generally through techniques that also apply to user-written code. + +**Severity:** Breakpoint regression → **critical**. Debug stepping regression → **high**. 
Missing manual verification → **medium**. + +**Hotspots:** `src/Compiler/CodeGen/`, `src/Compiler/AbstractIL/` + +--- + +### 19. Feature Gating & Compatibility New features must be gated behind language version checks. Breaking changes require RFC process. **CHECK:** - Gate new language features behind a `LanguageFeature` flag even if shipped as bug fixes. -- Ship experimental features off-by-default and discuss enablement strategy with stakeholders. +- Ship experimental features off-by-default. - Factor out cleanup changes separately from feature enablement. -- Require an fslang suggestion and RFC for language additions. -- Assess compound expression compatibility when introducing new syntax. - Reject changes that alter the C#/.NET visible assembly surface as breaking changes. -- Create an RFC retroactively for breaking changes that were merged without one. -- Treat all parser changes as high-risk and review thoroughly. **Severity:** Ungated breaking change → **critical**. Missing RFC → **high**. Bundled cleanup+feature → **medium**. @@ -247,90 +363,69 @@ New features must be gated behind language version checks. Breaking changes requ --- -### 12. Idiomatic F# & Naming +### Additional Dimensions (Evaluate When Applicable) -Compiler code should follow F# idioms and use clear, consistent terminology for internal concepts. +#### Compiler Performance Measurement -**CHECK:** -- Use 4-space indent before pipe characters in compiler code. -- Use `let mutable` instead of `ref` cells for mutable state. -- Use `ResizeArray` type alias instead of `System.Collections.Generic.List<_>`. -- Use pipeline operators for clearer data flow in service code. -- Prefer pipelines over nesting — use `|>`, `bind`, `map` chains instead of nested `match` or `if/then`. -- Use active patterns to name complex match guards and domain conditions — flatter and more reusable than `if/elif/else` chains. 
-- Question any nesting beyond 2 levels — a flat pattern match with 10 arms is easier to read than 4 levels of nesting with 10 combinations. Prefer wide over deep. -- Shadow variables when rebinding to improve code clarity. -- Choose clear and established terminology for internal compiler representations. -- Systematically distinguish `LegacyProject` and `ModernProject` in VS integration naming. +- Require `--times` output, benchmarks, or profiler data for performance claims. +- Do not compare F# build times directly to C# (Roslyn) — F# type inference and checking are structurally more expensive. Compare to previous F# baselines instead. +- Measure and report build time impact. -**Severity:** Deprecated construct → **medium**. Naming confusion → **medium**. Deep nesting → **medium**. Style deviation → **low**. +#### Memory Footprint Reduction -**Hotspots:** All `src/Compiler/` directories, `vsintegration/` +- Minimize heap allocations and GC pressure in hot paths. +- Consider the 2GB threshold for 32-bit VS processes. +- Use weak references for long-lived data. Use `ConditionalWeakTable` for caches keyed by GC-collected objects. +- Use struct tuples for retained data to reduce allocation overhead. ---- +#### C# Interop Fidelity -### 13. Code Structure & Technical Debt +- Ensure F# types and APIs are usable from C# without friction. +- Wait until the final shape of a C# feature is committed before matching it. -Keep the codebase clean. Extract helpers, remove dead code, and avoid ad-hoc patches. +#### Cross-Platform Correctness -**CHECK:** -- Flag ad-hoc `if condition then specialCase else normalPath` that looks like a band-aid rather than a systematic fix — the fix should be at the source, not patched at a consumer. -- Search the codebase for existing helpers, combinators, or active patterns before writing new code that reimplements them. 
-- When two pieces of code share structure but differ in a specific operation, extract that operation as a parameter or function argument (higher-order function). -- Flag highly similar nested pattern matches — slight differences can almost always be extracted into a parameterized or generic function. -- Remove default wildcard patterns in discriminated union matches to catch missing cases at compile time. -- Extract duplicated logic into a shared function with input arguments. -- Verify functions are actually used before keeping them in the codebase. -- Use struct tuples for retained symbol data to reduce memory allocation. -- Use `ConditionalWeakTable` for caches keyed by GC-collected objects. -- Fix compiler warnings like unused value bindings before merging. -- Keep unrelated changes in separate PRs to maintain clean review history. -- Follow existing abstraction patterns (e.g., `HasFSharpAttribute`) instead of ad-hoc checks. -- Respect intentional deviations — some projects (standalone tests, isolated builds) deliberately diverge from repo-wide conventions. Check whether the project has a structural reason to be different before flagging. - -**Severity:** Unreachable code path → **high**. Band-aid patch → **high**. Duplication → **medium**. Missing helper extraction → **low**. +- Test on all supported platforms; avoid platform-specific assumptions. +- Consider Mono, Linux, macOS when touching paths, resources, or runtime features. -**Hotspots:** `src/Compiler/Checking/`, `src/Compiler/Optimize/`, `src/Compiler/Service/` - ---- +#### Computation Expression Semantics -### 14. API Surface & Public Contracts +- Test deep recursion in CEs (seq, async, task) — tail call behavior depends on the builder implementation, not IL tail call instructions. +- Prefer designs that work for all CEs including user-defined builders, not just built-in ones. -FSharp.Core and FCS public APIs are permanent commitments. Changes must be deliberate. 
+#### Type Provider Robustness -**CHECK:** -- Reject changes that alter the C#/.NET visible assembly surface as breaking changes. -- Verify accuracy of IL-to-expression inversion and add systematic tests for quoted expression decompilation. -- Ensure decompiled expression trees would pass type checking when reconverted to code. -- Fix the root cause in `ConvExprPrim` rather than patching individual expression decompilation cases. -- Document the purpose of new public API arguments in XML docs. -- Update exception XML docs in `.fsi` files when behavior changes. -- Name potentially long-running FCS API methods with descriptive prefixes like `GetOptimized`. -- Systematically apply `InlineIfLambda` to every inlined function taking a lambda applied only once. +- Handle type provider failures gracefully without crashing the compiler — type providers run user code at compile time. +- Test type provider scenarios across target frameworks (desktop vs CoreCLR). -**Severity:** Unintended public API break → **critical**. Missing XML docs → **medium**. Naming convention → **low**. +#### Signature File Discipline -**Hotspots:** `src/FSharp.Core/`, `src/Compiler/Service/` +- Keep signature files in sync and use them to control API surface. +- `.fsi` files define the public contract; implementation files must match. ---- +#### Build Infrastructure -### 15. Build & Packaging Integrity +- Keep build scripts simple and cross-platform compatible. +- Prefer MSBuild targets over shell script wrappers. +- Eliminate unnecessary build dependencies. -Build configuration and NuGet packaging must be correct, reproducible, and not introduce unnecessary dependencies. +#### Code Structure & Technical Debt -**CHECK:** -- Avoid packaging internal files as content files in NuGet packages. -- Place non-library binaries in `tools/` directory rather than `lib/` in NuGet packages. -- Remove incorrect package dependencies from FSharp.Core nuspec. 
-- Be mindful of package size impact when adding dependencies to core assemblies. -- Run `build proto` or `git clean -xfd` when encountering stale build state issues. -- Preserve parallel build settings unless there is a documented reason to remove them. -- Verify nuspec comments match the correct package version. -- Fix dynamic loader to handle transitive native references for NuGet packages in FSI. +- Search the codebase for existing helpers, combinators, or active patterns before writing new code. +- When two pieces of code share structure but differ in a specific operation, extract that operation as a parameter (higher-order function). +- Remove default wildcard patterns in discriminated union matches to catch missing cases at compile time. +- Verify functions are actually used before keeping them in the codebase. +- Keep unrelated changes in separate PRs. +- Follow existing abstraction patterns (e.g., `HasFSharpAttribute`) instead of ad-hoc checks. +- Respect intentional deviations — some projects deliberately diverge from repo-wide conventions for structural reasons. -**Severity:** Wrong package layout → **high**. Stale build state → **medium**. Missing comment → **low**. +#### Naming & Formatting -**Hotspots:** `eng/`, `setup/`, `src/Compiler/Driver/` +- Choose precise, descriptive names that reflect actual semantics. +- Document or name tuple fields in AST union cases to clarify their meaning rather than leaving them as anonymous positional values. +- Use 4-space indent before pipe characters. Use `let mutable` instead of `ref` cells. Use `ResizeArray` type alias. +- Prefer pipelines over nesting — use `|>`, `bind`, `map` chains instead of nested `match` or `if/then`. +- Question any nesting beyond 2 levels — prefer wide over deep. --- @@ -348,47 +443,36 @@ Execute review in five waves, each building on the previous. ### Wave 1: Structural Scan 1. Verify every behavioral change has a corresponding test (Dimension 1). -2. 
Check feature gating — new features must have `LanguageFeature` guards (Dimension 11). -3. Verify no unintended public API changes (Dimension 14). -4. Check for binary compatibility concerns in pickle/import code (Dimension 3). +2. Check feature gating — new features must have `LanguageFeature` guards (Dimension 19). +3. Verify no unintended public API changes (Dimension 7). +4. Check for binary compatibility concerns in pickle/import code (Dimension 12). +5. If FSharp.Core is touched, apply FSharp.Core Stability checks (Dimension 2). ### Wave 2: Correctness Deep-Dive -1. Trace type checking changes through constraint solving and inference (Dimension 2). -2. Verify IL emission correctness with both Debug and Release optimizations (Dimension 4). -3. Validate AST node accuracy against source syntax (Dimension 5). -4. Check parallel determinism if checking/name-generation code is touched (Dimension 6). +1. Trace type checking changes through constraint solving and inference (Dimension 8). +2. Verify IL emission correctness with both Debug and Release optimizations (Dimension 5). +3. Validate AST node accuracy against source syntax (Dimension 15). +4. Check parallel determinism if checking/name-generation code is touched. +5. Verify optimization correctness — no semantic-altering transforms (Dimension 6). +6. Verify struct semantics if value types are involved (Dimension 9). ### Wave 3: Runtime & Integration -1. Verify concurrency safety — no races, proper cancellation, stack traces preserved (Dimension 7). -2. Check IDE reactor impact — no unnecessary rechecks or keystroke-triggered rebuilds (Dimension 8). -3. Validate editor feature behavior across project types (Dimension 9). -4. Review diagnostic message quality (Dimension 10). +1. Verify concurrency safety — no races, proper cancellation, stack traces preserved (Dimension 13). +2. Check IDE impact — no unnecessary rechecks or keystroke-triggered rebuilds (Dimension 10). +3. 
Verify overload resolution correctness if constraint solving changes (Dimension 11). +4. Check incremental checking correctness — no stale results (Dimension 14). +5. Review diagnostic message quality (Dimension 17). +6. Verify debug experience — stepping, breakpoints, locals (Dimension 18). ### Wave 4: Quality & Polish -1. Check F# idiom adherence and naming consistency (Dimension 12). -2. Look for code structure improvements — dead code, duplication, missing abstractions (Dimension 13). -3. Verify build and packaging correctness (Dimension 15). +1. Check code structure — dead code, duplication, missing abstractions. +2. Verify naming consistency and F# idiom adherence. +3. Verify build and packaging correctness. 4. Confirm all test baselines are updated and explained. ## Folder Hotspot Mapping -| Directory | Primary Dimensions | Secondary Dimensions | -|---|---|---| -| `src/Compiler/Checking/` | Type System (2), Parallel Compilation (6), Feature Gating (11) | Diagnostics (10), Code Structure (13) | -| `src/Compiler/CodeGen/` | IL Emission (4) | Test Coverage (1), Debug Correctness (4) | -| `src/Compiler/SyntaxTree/` | Parser Integrity (5) | Feature Gating (11), Code Structure (13) | -| `src/Compiler/TypedTree/` | Binary Compatibility (3), Type System (2) | Parallel Compilation (6) | -| `src/Compiler/Optimize/` | IL Emission (4), Code Structure (13) | Test Coverage (1) | -| `src/Compiler/AbstractIL/` | IL Emission (4) | Debug Correctness (4) | -| `src/Compiler/Driver/` | Binary Compatibility (3), Build (15) | Parallel Compilation (6) | -| `src/Compiler/Service/` | IDE Performance (8), Concurrency (7) | API Surface (14) | -| `src/Compiler/Facilities/` | Feature Gating (11), Concurrency (7) | Naming (12) | -| `src/Compiler/FSComp.txt` | Diagnostic Quality (10) | — | -| `src/FSharp.Core/` | API Surface (14), Concurrency (7) | Idiomatic F# (12) | -| `src/FSharp.Compiler.LanguageServer/` | IDE Performance (8), Editor UX (9) | Concurrency (7) | -| `vsintegration/` | 
Editor Integration (9) | IDE Performance (8), Naming (12) | -| `tests/` | Test Coverage (1) | All dimensions | -| `eng/`, `setup/` | Build & Packaging (15) | — | +See the `reviewing-compiler-prs` skill for the dimension selection table mapping files → dimensions. diff --git a/.github/agents/extraction-pipeline.md b/.github/agents/extraction-pipeline.md index e8e06b23c2..0daf23aad7 100644 --- a/.github/agents/extraction-pipeline.md +++ b/.github/agents/extraction-pipeline.md @@ -7,21 +7,37 @@ description: "Extracts review expertise from a GitHub user's history and generat Generate folder-scoped instructions, topic-scoped skills, and a multi-dimensional review agent from a GitHub user's public review history. Produces anonymized, deduplicated, Copilot-compatible `.github/` artifacts. -## Scale Warning +## Pipeline Overview + +``` +Phase 1: Collect Phase 2: Enrich Phase 3: Generate Phase 5: Verify +───────────────── ─────────────── ──────────────── ─────────────── +1.1 Index activity 2.1 Study repo 3.1 Raw creation 5.1-5.5 Quality checks + → gh_activity → feature_areas → agent.md (raw) 5.6 Codebase verify +1.2 Collect comments → ci_summary.txt → SKILL.md (raw) → agent.md (verified) + → user_comments 2.2 Classify → instructions 5.7 Overfitting check +1.3 Collect PR context → comment_analysis (raw) → final artifacts + → pr_contexts 2.2b Deduplicate 3.2 Anonymize +1.4 Reconcile paths → pr_rule_votes → *.md (anon) Phase 4 is NOT a pipeline + → user_comments 2.3 Synthesize 3.3 Anthropic guide step — it defines the + (paths updated) → dimensions.json → *.md (polished) review workflow EMBEDDED +1.5 Backup → principles.json 3.4 Deduplicate in the generated agent. + → JSON files → dim_evidence → *.md (deduped) +``` + +Each phase checks its output tables exist before running — skip completed phases on resume. + +## Scale This pipeline processes **thousands** of GitHub items (typically 3,000–10,000+ issues, PRs, discussions, and review comments spanning a decade). 
It will not fit in a single context window. -**Use sub-agents for everything.** The orchestrator manages SQLite state and dispatches work. Sub-agents do the heavy lifting: -- **Collection**: one sub-agent per repo × date range chunk. Parallelize aggressively (6+ concurrent agents). -- **Comment fetching**: one sub-agent per ~200 items. This is the most API-call-intensive phase. -- **Semantic analysis**: one sub-agent per ~200 comments. Each classifies and extracts rules. -- **Synthesis**: one sub-agent to read all analysis summaries (not raw data) and produce dimensions/principles. -- **Artifact generation**: one sub-agent per artifact type (instructions, skills, agent). -- **Validation/improvement**: one sub-agent per file or per concern. +**Use sub-agents for everything.** The orchestrator manages SQLite state and dispatches work. Sub-agents do the heavy lifting — see each phase for batch sizes and parallelism guidance. -Store all intermediate results in **SQLite** (queryable) and **JSON backup files** (recoverable). Sub-agents write results to files; the orchestrator imports into SQLite and dispatches next phase. Never rely on passing large datasets through context — use the filesystem. +**Context management:** Store all intermediate results in **SQLite** (queryable) and **JSON backup files** (recoverable). Sub-agents write results to files; the orchestrator imports into SQLite and dispatches the next phase. Never pass large datasets through agent context — use the filesystem. -Use `model: "claude-opus-4.6"` for all sub-agents. Use background mode (`mode: "background"`) so agents run in parallel and the orchestrator is notified on completion. +**Model selection:** Use the best available reasoning model (e.g., `claude-opus-4.6`) for classification and synthesis sub-agents. Fast/cheap models produce shallow rules. Collection sub-agents can use standard models. Use background mode so agents run in parallel. 
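As a sketch, the orchestrator's import-time acceptance check for a sub-agent's output file might look like this (the JSON schema, field names, and size threshold are illustrative assumptions, not the pipeline's actual contract):

```python
import json
from pathlib import Path

def accept_batch_output(path, assigned_ids, min_bytes=500):
    # Reject placeholder or truncated outputs: the file must exist, be
    # non-trivial in size, parse as JSON, and contain an entry for every
    # item the sub-agent was assigned.
    p = Path(path)
    if not p.exists() or p.stat().st_size < min_bytes:
        return False
    try:
        entries = json.loads(p.read_text())
    except ValueError:
        return False
    seen = {entry["id"] for entry in entries}
    return all(item_id in seen for item_id in assigned_ids)
```

Outputs that fail this check get re-dispatched rather than imported into SQLite.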
+
+**Reliability:** After each batch of sub-agents completes, validate output files: >500 bytes, parseable JSON, contains entries for all assigned items. Re-dispatch incomplete outputs up to 3 times. Assign no more than five batches to any one agent — agents given too much work produce placeholders or give up.
 
 ## Inputs
 
@@ -29,7 +45,7 @@ Collect these before starting. If any are missing, ask the user via the `ask_use
 
 | Parameter | Required | Example |
 |-----------|----------|---------|
-| `landing_repo` | yes | `dotnet/fsharp` — the repo receiving the artifacts |
+| `landing_repo` | yes | `owner/repo` — the repo receiving the artifacts |
 | `username` | yes | GitHub username whose review history to extract |
 | `earliest_year` | no (default: 10 years back) | `2015` |
 | `reference_repos` | no | Additional repos to search (e.g., `dotnet/sdk`, `dotnet/runtime`) |
@@ -50,16 +66,29 @@ If `reference_repos` are specified and the pipeline needs to search their code (
 
 ## Phase 1: Data Collection
 
+**Completeness is critical.** Do not sample — collect ALL activity. A reviewer who leaves one precise comment on a well-written PR teaches as much as 50 comments on a messy one. Sampling biases toward noisy PRs and misses the signal in clean approvals.
+
 ### 1.1 Index all activity
 
-Search each repo for issues, PRs, and discussions where `username` participated. GitHub search returns max 1000 results per query — split by date ranges to capture everything.
+> **Sub-agents:** 1 per repo × date-range chunk (parallelize 6+)
+> **Input:** GitHub API search results
+> **Output:** SQLite `gh_activity` table, JSON backup per chunk
+> **Resume:** Skip if `gh_activity` has rows for this repo+date range
+
+Search each repo for issues, PRs, and discussions where `username` participated. **Include ALL states** — open, closed, merged, and rejected PRs all carry learning potential. Rejected PRs often contain the strongest review opinions.
-For each repo: +GitHub search returns max 1000 results per query — split by 1-year date ranges to capture everything. For high-volume users, split by 6 months. + +For each repo, run FOUR searches (not two — capture both commenter and author roles): ``` -search_issues: commenter:{username} created:{year_start}..{year_end} search_pull_requests: commenter:{username} created:{year_start}..{year_end} +search_pull_requests: author:{username} created:{year_start}..{year_end} +search_issues: commenter:{username} created:{year_start}..{year_end} +search_issues: author:{username} created:{year_start}..{year_end} ``` +**Own PRs are first-class data.** When the user is the PR author, their PR description reveals design intent, priorities, and rationale that never appears in review comments. Tag each item with the user's role: `reviewer`, `author`, or `both`. + For discussions (if the repo uses them), use the GitHub GraphQL API: ```graphql query { @@ -69,25 +98,38 @@ query { } ``` -Store in SQLite (`gh_activity` table): repo, type (issue/pr/discussion), number, title, state, created_at, updated_at, labels, url, author. +Store in SQLite (`gh_activity` table): repo, type (issue/pr/discussion), number, title, state, created_at, updated_at, labels, url, author, user_role. -Parallelize across repos and date ranges. Use sub-agents for large volumes. +Parallelize across repos and date ranges. Use sub-agents for large volumes. Paginate ALL results — do not stop at page 1. 
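The date-range splitting and four-query fan-out above can be sketched as pure helpers (the real pipeline issues these queries through the GitHub search tools; the function names and query syntax here are illustrative):

```python
from datetime import date

def search_windows(earliest_year, latest_year, high_volume=False):
    # Split the collection span into windows small enough to stay under
    # GitHub's 1000-results-per-query cap: one year, or six months for
    # high-volume users.
    windows = []
    for year in range(earliest_year, latest_year + 1):
        if high_volume:
            windows.append((date(year, 1, 1), date(year, 6, 30)))
            windows.append((date(year, 7, 1), date(year, 12, 31)))
        else:
            windows.append((date(year, 1, 1), date(year, 12, 31)))
    return windows

def window_queries(username, start, end):
    # The FOUR searches per window: commenter and author roles, PRs and issues.
    created = f"created:{start.isoformat()}..{end.isoformat()}"
    return [
        f"type:pr commenter:{username} {created}",
        f"type:pr author:{username} {created}",
        f"type:issue commenter:{username} {created}",
        f"type:issue author:{username} {created}",
    ]
```

Each (repo, window) pair becomes one collection sub-agent's assignment.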
### 1.2 Collect actual comments -For each indexed item, fetch the user's actual comments: -- **Issues**: `issue_read` → `get_comments` → filter to username -- **PRs — general comments**: `pull_request_read` → `get_comments` → filter to username +> **Sub-agents:** 1 per ~15 PRs (parallelize aggressively) +> **Input:** `gh_activity` table (PR/issue numbers to fetch) +> **Output:** SQLite `user_comments` table, JSON backup per batch +> **Resume:** Skip PRs already in `user_comments` +> **Validation:** Each output file >500 bytes, contains entries for all 15 assigned PRs + +For EVERY indexed item (not a sample), fetch the user's actual words. **All comment types matter:** + +- **PR descriptions** (when user is author): `pull_request_read` → `get` → save body. These reveal design intent and priorities — often the most valuable content. +- **PRs — general comments**: `pull_request_read` → `get_comments` → filter to username. This is the primary comment channel for many reviewers. - **PRs — review comments** (code-level, with file path + diff hunk): `pull_request_read` → `get_review_comments` → filter to username -- **PRs — reviews** (approval/request-changes with summary body): `pull_request_read` → `get_reviews` → filter to username. These carry the reviewer's top-level verdict and summary — often the most opinionated content. -- **Discussions**: Use GraphQL to fetch comment nodes filtered to username. Discussion comments often contain design rationale and architectural decisions. +- **PRs — reviews** (approval/request-changes with summary body): `pull_request_read` → `get_reviews` → filter to username. These carry the reviewer's top-level verdict and summary — often the most opinionated content. Skip reviews with empty bodies. +- **Issues — body** (when user is author): save the issue body as a comment. +- **Issues — comments**: `issue_read` → `get_comments` → filter to username +- **Discussions**: Use GraphQL to fetch comment nodes filtered to username. 
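The per-endpoint pagination-plus-retry loop can be sketched as follows (the page-fetching callable stands in for the real `get_review_comments`-style tool calls; the backoff parameters are illustrative):

```python
import time

def fetch_all_pages(fetch_page, per_page=100, retries=3, base_delay=1.0):
    # Drain a paginated endpoint (review comments cap at 100 per page),
    # retrying each page with exponential backoff on transient failures.
    items, page = [], 1
    while True:
        for attempt in range(retries):
            try:
                batch = fetch_page(page=page, per_page=per_page)
                break
            except IOError:
                time.sleep(base_delay * (2 ** attempt))  # back off, then retry
        else:
            raise RuntimeError(f"page {page} still failing after {retries} tries")
        items.extend(batch)
        if len(batch) < per_page:  # a short page means the endpoint is drained
            return items
        page += 1
```

A sub-agent runs this loop once per comment type per PR in its batch.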
-Store in SQLite (`user_comments` table): comment_id, activity_id, repo, comment_type (issue_comment, review_comment, pr_comment, review_summary, discussion_comment), body, created_at, file_path, diff_hunk, url. +Store in SQLite (`user_comments` table): comment_id, activity_id, repo, comment_type (pr_description, issue_description, issue_comment, review_comment, pr_comment, review, discussion_comment), body, created_at, file_path, diff_hunk, url. -This is the most API-intensive phase. Batch into sub-agents by date range. Handle rate limits with retry. +This is the most API-intensive phase. Batch into sub-agents of ~15 PRs each (not 15 comments — each agent handles 15 PRs and fetches all comment types for each). When fetching comments for a single PR, paginate through all pages (`get_review_comments` returns max 100 per page). Parallelize aggressively. Handle rate limits with retry and exponential backoff. ### 1.3 Collect PR context +> **Sub-agents:** 1 per ~15 PRs (can share with 1.2 agents) +> **Input:** `gh_activity` table (PRs with review comments) +> **Output:** SQLite `pr_contexts` table (files_changed, labels, description per PR) + For PRs with review comments, also collect: - Files changed (`get_files`): path, additions, deletions, status - PR labels and description @@ -96,14 +138,19 @@ This maps comments to code areas. ### 1.4 Cross-validate against current codebase -Collected data references files and folders as they existed at the time of the comment — migrations and refactorings happen. Before enrichment, reconcile all file paths: +> **Sub-agents:** 1 (or orchestrator directly) +> **Input:** `user_comments` table, local repo checkout +> **Output:** `user_comments` table (paths updated in-place), `path_mapping` table, `obsolete_terms` list +Collected data references files, folders, and terminology as they existed at the time of the comment — migrations and refactorings happen. Reconcile before enrichment: + +**File paths:** 1. 
Extract all unique file paths from collected comments (review comments have `file_path`, PR files have `path`). 2. For each path, check if it exists in the current repo (`Test-Path` or `glob`). 3. If missing, search for the filename in its current location (files get moved between folders). Update the path if found. 4. If the file was deleted entirely, keep the comment's essence (the rule it teaches) but drop the file pointer. The rule may still apply to successor code. -This prevents generating instructions that point at nonexistent files. +**Technical terms:** `grep` every technical term used in comments (function names, type names, internal concepts) against the current codebase. Terms with zero matches are obsolete — do not use them in generated artifacts. ### 1.5 Backup @@ -115,40 +162,75 @@ Write all collected data as JSON to a backup directory (e.g., `{landing_repo}-an ### 2.1 Study the landing repo +> **Sub-agents:** 1 (explore agent) +> **Input:** Local repo checkout (`src/`, `tests/`, `eng/`, `.github/`, CI configs) +> **Output:** SQLite `feature_areas` table, `ci_summary.txt`, `existing_artifacts.txt` + Before analyzing comments, understand the codebase: - Directory structure → feature area mapping - Existing documentation (specs, wiki, guides) - Existing `.github/` artifacts (instructions, skills, agents, copilot-instructions.md, AGENTS.md) - Technology stack, conventions, key files +- **CI configuration** — analyze CI config files (GitHub Actions, Azure Pipelines, Jenkins, etc.) and produce a CI coverage summary: what CI already enforces (platform coverage, test suites, formatting, linting, etc.). Provide this summary to every classification sub-agent in §2.2. -Store as a feature area reference table in SQLite. +Store feature areas in SQLite: `CREATE TABLE feature_areas (area_name TEXT, folder_glob TEXT, description TEXT)`. Store CI summary as a text file. 
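Populating the feature-area table can be sketched like this (the globs, area names, and database path are illustrative — the explore agent derives the real rows from the checkout):

```python
import sqlite3

def store_feature_areas(db_path, areas):
    # areas: (area_name, folder_glob, description) rows derived from
    # exploring the checkout, e.g. ("codegen", "src/Compiler/CodeGen/**", ...).
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS feature_areas "
        "(area_name TEXT, folder_glob TEXT, description TEXT)"
    )
    conn.executemany("INSERT INTO feature_areas VALUES (?, ?, ?)", areas)
    conn.commit()
    conn.close()
```

Classification sub-agents later join comment file paths against `folder_glob` to assign feature areas.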
### 2.2 Semantic analysis +> **Sub-agents:** 1 for bootstrap, then ~N/15 for classification (where N = number of PRs with comments) +> **Input:** `user_comments` table, `feature_areas` table, `ci_summary.txt` +> **Output:** SQLite `comment_analysis` table, `taxonomy.json` +> **Context per sub-agent:** taxonomy + CI summary + 15 PR packets (all comments on each PR) + For each collected comment, classify using a sub-agent (Opus). **Do not use a hardcoded category list** — derive categories from the data: -1. **Bootstrap pass**: Take a random sample of ~200 comments. Ask a sub-agent to read them and propose a category taxonomy that fits this specific reviewer and codebase. The agent should identify recurring themes, name them, and define each in one sentence. Expect 15–40 categories to emerge. +1. **Bootstrap pass**: Take a stratified sample of ~300 comments: proportional by year, at least 5 per major feature area from §2.1, and at least 20 each of review_comments, pr_descriptions, and issue_comments. Ask a sub-agent to read them and propose a category taxonomy. The agent should identify recurring themes, name them, and define each in one sentence. Expect 15–40 categories to emerge. After deriving the taxonomy, cross-check it against the feature area table — if any area representing >10% of the codebase has zero categories, re-sample with enforced coverage. -2. **Classification pass**: Using the derived taxonomy, classify all comments in batches (~200 per sub-agent). For each comment extract: +2. **Classification pass**: Using the derived taxonomy, classify all comments in batches (~15 PR packets per sub-agent, where each packet includes all comments on that PR). 
For each comment extract: - **Categories** (one or more, from the derived taxonomy) - **Feature area**: map to the landing repo's code structure (from 2.1) - **File/folder**: which code path does this apply to - - **Sentiment**: approval, concern, suggestion, question, blocking - **Severity**: trivial, minor, moderate, major, critical - - **Focus point**: what specifically is being addressed - - **Derived rule**: actionable rule extracted from the comment + - **Derived rule**: actionable rule extracted from the comment, phrased as a generalizable principle — not tied to the specific PR 3. **Taxonomy refinement**: After the first full pass, review category distribution. Merge categories with <5 occurrences into broader ones. Split categories with >500 occurrences if they contain distinct sub-themes. Re-classify affected comments. +**Anti-overfitting rules for classification:** +- **Normalize by PR, not by comment.** A PR with 50 comments gets weight=1, same as a PR with 1 comment. Count how many PRs a rule appears in, not how many comments mention it. The reviewer saying something once on a clean PR means the same as repeating it 10 times on a messy one. +- **Generalize, don't transcribe.** The derived rule must be applicable to a future PR the classifier has never seen. "Always normalize type representations before pattern matching" is good. "Call helper X on line 47 of file Y" is overfitted. +- **Distinguish reviewer opinion from CI enforcement.** If the reviewer says "please add tests", that's a review rule. If the reviewer says "run tests on Linux and Windows", that might just mean "CI should cover this" — not a rule for human reviewers. Check the CI summary from §2.1: if CI already enforces it, don't encode it as a review rule. +- **Distinguish design guidance from implementation instruction.** "Gate features behind feature flags" is design guidance (always applicable). 
"Use helper X after calling Y" is an implementation instruction for a specific code path (only applicable when touching that code). + Store in SQLite (`comment_analysis` table). -Process in batches. Use sub-agents — each handles ~200 comments. Run in parallel. +Process in batches. Use sub-agents — each handles ~15 PR packets with full context. Run in parallel. + +### 2.2b Deduplication (enforce PR-normalization) + +> **Sub-agents:** None (orchestrator runs SQL directly) +> **Input:** `comment_analysis` table +> **Output:** `pr_rule_votes` table (1 vote per rule per PR) + +Before synthesis, collapse per-comment rows into per-PR votes: + +```sql +CREATE TABLE pr_rule_votes AS +SELECT DISTINCT activity_id, derived_rule, category, feature_area +FROM comment_analysis; +``` + +This ensures a PR with 50 comments gets weight=1, same as a PR with 1 comment. The synthesis agent in §2.3 reads `pr_rule_votes`, never raw `comment_analysis`. ### 2.3 Clustering -Aggregate analysis results to identify: +> **Sub-agents:** 1 (Opus, synthesis) +> **Input:** `pr_rule_votes` table, `taxonomy.json`, `feature_areas` table, `ci_summary.txt` +> **NOT available:** raw `user_comments`, JSON backups — synthesis works only with classified, deduplicated data +> **Output:** `dimensions.json`, `principles.json`, `folder_hotspots.json`, SQLite `dimension_evidence` table + +Aggregate the deduplicated `pr_rule_votes` to identify: -1. **Review dimensions**: Recurring themes across hundreds of comments. Each dimension should be specific enough to act on, broad enough to apply across many PRs. Target 8–24 dimensions. +1. **Review dimensions**: Recurring themes across many PRs. Each dimension should be specific enough to act on, broad enough to apply across many PRs. Target 8–24 dimensions. If any single dimension accounts for >40% of total PR-votes, flag it for splitting. 2. **Folder hotspots**: Which directories receive the most review feedback, and which dimensions apply there. 
@@ -156,11 +238,19 @@ Aggregate analysis results to identify: 4. **Repo-specific knowledge**: Rules that are unique to this codebase, not generic programming advice. -Use a synthesis sub-agent (Opus) that reads all analysis summaries and produces: -- Dimension list with rules, severity, evidence +The synthesis sub-agent receives: +- The taxonomy from §2.2 step 1 +- The `pr_rule_votes` table (deduplicated: one vote per rule per PR) +- The `feature_areas` table from §2.1 +- The CI coverage summary from §2.1 + +The synthesis agent MUST NOT access raw comment data (`user_comments` table or JSON backups). It works only with classified, deduplicated data. + +It produces: +- Dimension list with rules, severity, and PR-count evidence - Folder → dimension mapping - Principle list -- Knowledge area reference table +- A `dimension_evidence` table: `(dimension, pr_count, example_prs)` for verification in Phase 5 --- @@ -168,6 +258,11 @@ Use a synthesis sub-agent (Opus) that reads all analysis summaries and produces: ### 3.1 Raw creation +> **Sub-agents:** 1 per artifact type (3 total) +> **Input:** `dimensions.json`, `principles.json`, `folder_hotspots.json`, existing `.github/` artifacts +> **Output:** `agent.md` (raw), `SKILL.md` (raw), `*.instructions.md` (raw) +> **Anti-overfitting:** generation sub-agents must follow the same rules from §2.2 + Generate three artifact types: #### Instructions (`.github/instructions/*.instructions.md`) @@ -186,20 +281,28 @@ Generate three artifact types: - Reference docs, don't reproduce them #### Review Agent (`.github/agents/{agent_name}.md`) -- Single source of truth for the review methodology -- Contains: overarching principles, all dimensions inline (with rules + CHECK flags), folder hotspot mapping, review workflow +- Single source of truth for dimension definitions and review workflow — all CHECK rules must be inline +- Contains: overarching principles, all dimensions inline (with rules + CHECK flags), review workflow +- The 
folder→dimension routing table belongs in the skill (operational configuration, not methodology) — the agent references the skill during Wave 1 to select dimensions - The review workflow is 5 waves (see below) +- The artifact-generation sub-agent must follow the same anti-overfitting rules from §2.2: every CHECK item must be a generalizable principle applicable to future PRs about features that don't exist yet **Commit** after raw creation. ### 3.2 Anonymize +> **Input:** `*.md` (raw) +> **Output:** `*.md` (anonymized) + Remove all personal names, comment counts, PR number references, evidence statistics, "distilled from" language. The artifacts should read as authoritative engineering guidance, not data analysis output. **Commit** after anonymization. ### 3.3 Improve per Anthropic guide +> **Input:** `*.md` (anonymized) +> **Output:** `*.md` (polished) + Apply https://platform.claude.com/docs/en/agents-and-tools/agent-skills/best-practices: - `name`: gerund form, lowercase+hyphens - `description`: third person, specific triggers, ≤1024 chars @@ -213,18 +316,23 @@ Apply https://platform.claude.com/docs/en/agents-and-tools/agent-skills/best-pra ### 3.4 Deduplicate and cross-reference +> **Input:** `*.md` (polished), existing `.github/` content +> **Output:** `*.md` (deduplicated) — committed to repo + Compare new artifacts against existing `.github/` content: - Check trigger overlap between new and existing skills -- Check body overlap (same howto in two places) -- Resolve: if complementary (how vs when), add cross-references. If duplicate, merge or delete. 
-- Instructions must not repeat AGENTS.md or copilot-instructions.md +- Check body overlap (same content in two places) — if the same concept appears in both agent and skill, keep it in the agent (source of truth) and have the skill point to it +- Instructions must not repeat AGENTS.md, copilot-instructions.md, or the agent's CHECK items verbatim — instructions are for concise auto-loaded reminders only - All doc links verified to exist on disk +- The YAML `description` field is how the model picks from 100+ skills — invest in keyword-rich, third-person, specific trigger descriptions **Commit** after deduplication. --- -## Phase 4: Review Workflow (embedded in the agent) +## Phase 4: Review Workflow Specification (embedded in the generated agent) + +This section defines the workflow that the generated review agent will follow at runtime. The pipeline does not execute this workflow — it embeds it as instructions in the agent artifact. The review agent runs a 5-wave process when invoked: @@ -263,11 +371,11 @@ Launch **one sub-agent per dimension** (parallel batches of 6). Each evaluates e Sub-agent instructions: -> Report `$DimensionName — LGTM` when the dimension is genuinely clean. +> Report `$DimensionName — LGTM` when the dimension is genuinely clean. Do not explain away real issues to produce a clean result. > -> Report an ISSUE only when you can construct a **concrete failing scenario**: a specific input, a specific call sequence, a specific state that triggers the bug. No hypotheticals. +> Report an ISSUE only when you can construct a **concrete failing scenario**: a specific input, a specific call sequence, a specific state that triggers the bug. No hypotheticals — "this might be a problem in theory" is not a finding. > -> Read the **PR diff**, not main — new files only exist in the PR branch. +> Read the **PR diff**, not main — new files only exist in the PR branch. 
Never verify findings against `main`; the code you're reviewing only exists in `refs/pull/{pr}/head`. > > Include exact file path and line range. Verify by tracing actual code flow. @@ -364,22 +472,41 @@ Verify the three layers work together: - No body overlap between instructions and AGENTS.md/copilot-instructions.md - Agent doesn't repeat AGENTS.md content ---- +### 5.6 Codebase verification + +> **Sub-agents:** 1 per dimension (parallelize) +> **Input:** `agent.md` (deduplicated) CHECK items, local repo (`src/`, `tests/`, `eng/`), CI config +> **Output:** `confusion_audit.json` (grade per item), `agent.md` (verified) +> **Feedback loop:** If >20% grade C/D → re-run Phase 2.2 + 2.3 + 3 with stronger anti-overfitting + +For each dimension in the generated agent, dispatch a fresh-context sub-agent that reads: +1. The dimension's CHECK items +2. The actual codebase (`src/`, `tests/`, `eng/`) +3. The repo's CI configuration + +The sub-agent answers for every CHECK item: +- **Does this term exist in the codebase?** `grep` every function name, type name, and concept. Zero matches = obsolete, remove or replace. +- **Does CI already enforce this?** If the rule says "test on multiple platforms" and CI runs on 3 OSes, drop the rule — CI handles it. +- **Is this generalizable?** Could a reviewer apply this to a PR implementing a feature that doesn't exist yet? If it only makes sense for one specific code path, either generalize it or move it to a code comment. +- **Is the "why" clear?** Would a developer who has never seen this codebase understand what goes wrong if they violate this rule? If not, add a one-sentence rationale. -## Lessons Learned (encoded in this process) +Grade each item: **A** (clear, verified), **B** (needs rationale — add it), **C** (overfitted — generalize or remove), **D** (obsolete/contradictory — rewrite or remove). -Failure modes observed during development. 
The process above accounts for them, but they're listed for awareness: +**Targets:** ≥80% grade A, 0% grade C/D. Fix all B/C/D items before finalizing. -1. **Nodder bias**: Telling sub-agents "LGTM is the best outcome" caused them to explain away real issues. The correct framing: "LGTM when genuinely clean. Do not explain away real issues." +**Feedback loop:** If >20% of items are grade C/D, the problem is in classification (Phase 2), not just in the artifact. Re-run §2.2 classification for the affected categories with strengthened anti-overfitting prompts, then re-run §2.3 synthesis and §3 artifact generation. Fixing artifacts alone treats symptoms. -2. **Nitpicker bias**: Without LGTM guidance, sub-agents generated 25+ findings including hypotheticals. The fix: require concrete failing scenarios — no "maybe in theory." +**Commit** after verification fixes. -3. **Wrong branch verification**: Verification agents checked `main` instead of the PR branch, disputing real findings because new files didn't exist on `main`. The fix: always verify against PR diff or `refs/pull/{pr}/head`. +### 5.7 Overfitting verification -4. **Static-only analysis**: Reading code without tracing execution missed whether issues were real. The fix: Wave 2 requires active validation — build, test, write PoCs, simulate execution traces. +> **Input:** All `*.md` (verified), `dimension_evidence` table from §2.3 +> **Output:** Final artifacts — committed to repo -5. **Duplicate skills**: Two skills covering the same topic emerged. The fix: single source of truth in one file; others are slim pointers. 
+Final check on the complete artifact set: +- No rules that reproduce what CI already enforces +- No rules referencing specific function names or line numbers unless those functions are long-lived stable APIs (verified by grep in 5.6) +- Every CHECK item is phrased as a generalizable principle, not a transcription of one PR's feedback +- Dimension frequency was counted by PRs, not by comments — a PR with 50 comments counts the same as one with 1 comment -6. **Separate reference files get skipped**: Content in a separate file was not reliably read by the model. The fix: inline critical content in the agent file — it's loaded on invocation, guaranteed to be read. -7. **Description is everything for discovery**: The YAML `description` field is how the model picks from 100+ skills. Invest in keyword-rich, third-person, specific trigger descriptions. diff --git a/.github/instructions/ExpertReview.instructions.md b/.github/instructions/ExpertReview.instructions.md index 67dbb5d73e..94ad29f5db 100644 --- a/.github/instructions/ExpertReview.instructions.md +++ b/.github/instructions/ExpertReview.instructions.md @@ -5,21 +5,28 @@ applyTo: # Compiler Review Rules -## Test discipline - -- Add tests for every behavioral change before merging — missing coverage is the most common review blocker. -- Explain all new errors in test baselines and confirm each is expected behavior, not a regression. - ## Type safety -- Use `tryTcrefOfAppTy` or the `AppTy` active pattern instead of direct `TType` pattern matches — direct matches miss erased types. +- Always call `stripTyEqns` before pattern matching on types. Use `AppTy` active pattern, not `TType_app` directly. - Remove default wildcard patterns in discriminated union matches so the compiler catches missing cases. +- Raise internal compiler errors for unexpected type forms rather than returning defaults. ## Feature gating - Gate every new language feature behind a `LanguageFeature` flag and ship off-by-default until stable. 
- Factor cleanup changes into separate commits from feature enablement. +- Major language changes require an RFC before implementation. + +## Binary compatibility + +- Codegen changes that depend on new FSharp.Core functions must guard against older FSharp.Core versions. +- Do not alter the C#/.NET visible assembly surface without treating it as a breaking change. + +## Concurrency + +- Thread cancellation tokens through all async operations; uncancellable long-running operations are blocking bugs. +- Do not add catch-all exception handlers. Never swallow `OperationCanceledException`. -## Code generation +## Performance -- Strip debug points when matching on `Expr.Lambda` during code generation to prevent IL stack corruption. +- Performance claims require `--times` output, benchmarks, or profiler evidence. diff --git a/.github/skills/expert-review/SKILL.md b/.github/skills/expert-review/SKILL.md index dc14c8185e..8f2458b96a 100644 --- a/.github/skills/expert-review/SKILL.md +++ b/.github/skills/expert-review/SKILL.md @@ -1,6 +1,6 @@ --- name: reviewing-compiler-prs -description: "Performs multi-agent, multi-model code review of F# compiler PRs across 15 dimensions including type checking, IL emission, binary compatibility, parallel determinism, and IDE performance. Dispatches parallel assessment agents per dimension, consolidates with cross-model agreement scoring, and filters false positives. Invoke when reviewing compiler changes, requesting expert feedback, or performing pre-merge quality checks." +description: "Performs multi-agent, multi-model code review of F# compiler PRs across 19 dimensions including type checking, IL emission, binary compatibility, and IDE performance. Dispatches parallel assessment agents per dimension, consolidates with cross-model agreement scoring, and filters false positives. Invoke when reviewing compiler changes, requesting expert feedback, or performing pre-merge quality checks." 
--- # Reviewing Compiler PRs @@ -10,22 +10,28 @@ Full dimension definitions and CHECK rules live in the `expert-reviewer` agent. ## When to Invoke - PR touches `src/Compiler/` — invoke the `expert-reviewer` agent -- PR touches `src/FSharp.Core/` — focus on API Surface & Concurrency -- PR touches `vsintegration/` or `LanguageServer/` — focus on IDE Performance & Editor UX -- PR touches `tests/` only — quick check: baselines explained? Cross-TFM coverage? +- PR touches `src/FSharp.Core/` — focus on FSharp.Core Stability, API Surface, Backward Compat, XML Docs +- PR touches `vsintegration/` or `LanguageServer/` — focus on IDE Responsiveness, Concurrency, Memory +- PR touches `tests/` only — quick check: baselines explained? Cross-TFM coverage? Tests actually assert? +- PR touches `eng/` or build scripts — focus on Build Infrastructure, Cross-Platform ## Dimension Selection | Files Changed | Focus Dimensions | |---|---| -| `Checking/`, `TypedTree/` | Type System, Parallel Compilation, Feature Gating | -| `CodeGen/`, `AbstractIL/`, `Optimize/` | IL Emission, Debug Correctness, Test Coverage | -| `SyntaxTree/`, `pars.fsy` | Parser Integrity, Feature Gating | +| `Checking/`, `TypedTree/` | Type System, Overload Resolution, Struct Awareness, Feature Gating | +| `CodeGen/`, `AbstractIL/` | IL Emission, Debug Experience, Test Coverage | +| `Optimize/` | Optimization Correctness, IL Emission, Test Coverage | +| `SyntaxTree/`, `pars.fsy` | Parser Integrity, Feature Gating, Typed Tree Discipline | | `TypedTreePickle.*`, `CompilerImports.*` | Binary Compatibility (highest priority) | -| `Service/`, `LanguageServer/` | IDE Performance, Concurrency | +| `Service/` | FCS API Surface, IDE Responsiveness, Concurrency, Incremental Checking | +| `LanguageServer/` | IDE Responsiveness, Concurrency | +| `Driver/` | Build Infrastructure, Incremental Checking, Cancellation | +| `Facilities/` | Feature Gating, Concurrency | | `FSComp.txt` | Diagnostic Quality | -| `FSharp.Core/` | API 
Surface, Concurrency | -| `vsintegration/` | Editor Integration | +| `FSharp.Core/` | FSharp.Core Stability, Backward Compat, XML Docs, RFC Process | +| `vsintegration/` | IDE Responsiveness, Memory Footprint, Cross-Platform | +| `eng/`, `setup/`, build scripts | Build Infrastructure, Cross-Platform | ## Multi-Model Dispatch @@ -50,9 +56,11 @@ Dispatch one agent per selected dimension. For high-confidence reviews, assess e ## Self-Review Checklist 1. [ ] Every behavioral change has a test -2. [ ] Test baselines updated with explanations -3. [ ] New language features have a `LanguageFeature` guard -4. [ ] No unintended public API surface changes -5. [ ] Cleanup changes separate from feature enablement -6. [ ] Tests pass in both Debug and Release -7. [ ] Error messages follow: statement → analysis → advice +2. [ ] FSharp.Core changes maintain binary compatibility +3. [ ] No unintended public API surface changes +4. [ ] New language features have a `LanguageFeature` guard and RFC +5. [ ] No raw `TType_*` matching without `stripTyEqns` +6. [ ] Cancellation tokens threaded through async operations +7. [ ] Cleanup changes separate from feature enablement + +Full dimension CHECK rules are in the `expert-reviewer` agent.
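The folder→dimension routing in the table above is plain data, so the skill's dimension-selection step amounts to a prefix lookup over a PR's changed paths. A Python sketch under stated assumptions (the mapping is a hypothetical excerpt of the table, not the full routing, and `select_dimensions` is an illustrative name rather than an existing helper):

```python
# Hypothetical excerpt of the folder -> focus-dimension routing table above.
ROUTING = {
    "src/Compiler/Checking/":       ["Type System", "Overload Resolution", "Feature Gating"],
    "src/Compiler/TypedTreePickle": ["Binary Compatibility"],
    "src/Compiler/Service/":        ["FCS API Surface", "IDE Responsiveness", "Concurrency"],
    "src/FSharp.Core/":             ["FSharp.Core Stability", "Backward Compat", "XML Docs"],
}

def select_dimensions(changed_files):
    """Union of focus dimensions for a PR's changed files.

    Prefixes in this excerpt don't nest, so plain startswith matching is enough.
    """
    selected = []
    for path in changed_files:
        for prefix, dims in ROUTING.items():
            if path.startswith(prefix):
                selected.extend(d for d in dims if d not in selected)
    return selected

dims = select_dimensions([
    "src/Compiler/Service/FSharpCheckerResults.fs",
    "src/FSharp.Core/async.fs",
])
```

A PR touching both `Service/` and `FSharp.Core/` would dispatch the union of both rows, while untouched areas such as `Checking/` contribute nothing.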