
Code Agent Platform

Implementation Plan --- C# and TypeScript/React --- Phase 1

Table of Contents

1. Overview

1.5. Application Context

1.5.1 Technology Stack

1.5.2 Privacy Boundary

1.5.3 Offline Operation

2. Goals

3. Technology Stack

3.1 Local Embedding Model Selection

4. Architecture Layers

4.5. Concurrency Architecture

4.5.1 Thread and Task Map

4.5.2 WAL Concurrency Strategy

4.5.3 Child Process Management and IPC Protocol

4.5.4 Rust Crate Dependencies

4.5.5 Tauri Command Boundary

5. Graph Schema (SQLite)

5.1 Node Types

5.2 Node Fields

5.3 Edge Types

5.4 File Nodes

5.5 Language-Specific Considerations

6. Ingest Pipeline

6.1 Language Detection

6.2 Graph Construction (Eager)

6.3 Eager Embedding (At Index Time)

6.4 Incremental Updates

6.5 Edge Invalidation Decision Matrix

7. Storage Layout

7.1 Required PRAGMAs

7.2 FTS5 Configuration

7.3 Vector Index Strategy

7.4 Index Integrity and Recovery

7.5 Write Queue and Backpressure

7.6 Database Size and Distribution

7.7 Configuration

7.8 Schema Migrations

8. Retrieval Pipeline

8.1 Query Analysis

8.2 Parallel Retrieval

8.3 Simplified Reranker

8.4 Context Assembly

8.5 Cross-Cutting Query Support

9. RLM Orchestration Layer

9.1 Root LM Responsibilities

9.2 Sub-LM Responsibilities

9.3 Tool Set (Root LM)

9.4 Cost Hierarchy

9.5 Root LM System Prompt

10. Language Adapters

11. Implementation Phases

12. Key Risks and Mitigations

13. Third-Party Licenses

14. Out of Scope for Phase 1

15. Testing Strategy

1. Overview

This plan specifies a code agent indexing and retrieval platform for C# and TypeScript/React codebases. It combines syntactic parsing (tree-sitter), compiler-grade semantic analysis (Roslyn / TS Language Service), and hybrid retrieval (vector + BM25 + qualified-name). The engine is exposed as an MCP (Model Context Protocol) server, allowing any MCP-compatible client — LLM agents, desktop apps, IDE extensions — to navigate and reason over large codebases without loading them into context.

Core design principles:

  • The codebase is an external environment navigated through tools, never loaded wholesale into a model context window.

  • Semantic search works from day one via eager embeddings (from names + signatures).

  • Vector search provides semantic entry points; graph traversal provides exact relational navigation. Both are required.

  • All persistent index state is a single SQLite file: plain tables for the graph, sqlite-vec for vectors, FTS5 for BM25 text search. The application additionally ships with non-index assets: the ONNX embedding model file and tokenizer configuration. These are read-only application assets, not user data.

1.5. Application Context

The Code Agent is a desktop application for agentic coding with LLM assistance. All source code is local only --- private code never leaves the user's machine for indexing, parsing, or retrieval operations.

1.5.1 Technology Stack

  • Desktop shell: Tauri (Rust core + system webview). The Rust process owns all local computation: file watching, parsing, indexing, embedding, and retrieval.

  • Frontend: Vite + React.js, rendered in the Tauri webview. Communicates with the Rust backend via Tauri's IPC command system.

Frontend scope note: This plan covers the Rust backend (indexing, retrieval) and the MCP server boundary. Frontend implementation (React components, UI layout, state management) is a separate workstream. Where this plan references specific UI behaviors (progress indicators, safe-mode warnings, query input, result display), these are requirements on the MCP tool API --- the backend must expose the data; the frontend must render it. Frontend deliverables, timelines, and tests are tracked separately.

  • Remote backend: Not required by the index engine. The engine operates entirely locally. LLM integration, if needed, is the responsibility of the MCP client (e.g., an LLM agent that connects to the MCP server).

1.5.2 Privacy Boundary

The index engine operates entirely locally. The SQLite index, embeddings, full symbol graph, and all retrieval results never leave the user's machine. The engine is exposed via an MCP server that runs locally; MCP clients connect to it to query the index and navigate the codebase.

Graph construction, embedding generation, retrieval, file system access, and all tool execution run in the Rust process on the user's machine. The engine makes no network calls.

1.5.3 Offline Operation

All engine capabilities are fully local and require no network access:

  • Graph construction (tree-sitter + Roslyn/TS LS) --- fully local

  • Embedding generation --- local ONNX model

  • All three retrieval channels (vector, BM25, qualified-name) --- local SQLite

  • Graph traversal and navigation --- local SQLite

  • File system access --- scoped to repo root

2. Goals


| Goal | Success Criterion |
| --- | --- |
| Accurate symbol resolution | Call sites, definitions, references resolved with compiler-grade precision via Roslyn / TS Language Service |
| High recall for vague queries | Semantic vector search surfaces relevant nodes for imprecise natural-language prompts |
| Precise filtering for exact queries | "Find all callers of AuthService.Authenticate" returns complete, correct results |
| Interactive performance | Sub-second to low-second responses for common queries (search + traversal + context assembly) |
| Incremental & cost-efficient | Re-index only changed nodes; cache embeddings by chunk hash |
| Privacy / compliance | The index, embeddings, full graph, and all retrieval results never leave the local machine; the engine makes no network calls (see §1.5.2) |
| Language extensibility | Adapter pattern: adding a language requires only a new adapter, not core changes |


3. Technology Stack

Using SQLite exclusively eliminates external service dependencies. The entire index is a single portable file stored locally in .codeagent/ (gitignored by default). For team sharing or CI scenarios, the index can be exported as a distributable artifact or rebuilt from source.


| Layer | Technology | Rationale |
| --- | --- | --- |
| Syntactic parsing | tree-sitter (C# + TS grammars) | Fast, language-agnostic, consistent ASTs. First-class Rust bindings (tree-sitter crate). |
| C# semantic analysis | Roslyn (Microsoft.CodeAnalysis) | Compiler-grade symbol resolution, call graphs, overrides, type inference |
| TS semantic analysis | TypeScript Language Service | Full type resolution, call sites, module resolution, JSX support |
| Graph storage | SQLite (plain tables) | Adjacency lists, symbol tables, metadata in a single portable file |
| Vector storage | sqlite-vec | Embedding search co-located with graph; no separate vector DB |
| Full-text / BM25 | SQLite FTS5 | Exact token search built into SQLite |
| Embedding generation | all-MiniLM-L6-v2 via ONNX Runtime (local) | 384-dim, ~80 MB, ~5--10 ms per input on modern CPU. Runs entirely in-process via the ort crate (Rust ONNX bindings). No network calls, no Python dependency. Tokenizer via the tokenizers crate (pure Rust). |
| API surface | MCP server (stdio + SSE) | Universal tool interface for LLMs, apps, and IDE extensions |


3.1 Local Embedding Model Selection

The default embedding model is all-MiniLM-L6-v2 (384 dimensions, Apache 2.0 license, ~80 MB ONNX export). Selection criteria:

  • CPU performance: ~5--10 ms per input on any CPU less than 5 years old. For 100K nodes at batch size 64, initial indexing completes in 1--2 minutes.

  • Code + NL quality: Strong retrieval performance on mixed code/natural-language queries for its size class.

  • No network dependency: The ONNX model file ships with the application binary. Embedding generation is fully offline.

  • Dimensionality: 384 dimensions keeps sqlite-vec index size manageable (~150 MB for 100K nodes).

Alternatives for future evaluation: nomic-embed-text-v1.5 (768d, better quality, ~2× slower) or bge-small-en-v1.5 (384d, comparable). The plan's model-agnostic design (model identifier and dimensionality in _metadata table, model change triggers re-embedding) supports swapping without structural changes.
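The model-swap trigger can be sketched language-agnostically. The following Python sketch uses the stdlib sqlite3 module to illustrate the _metadata check (the metadata key names and the helper name are illustrative assumptions; the plan only specifies that the model identifier and dimensionality live in the _metadata table):

```python
import sqlite3

# Illustrative sketch: model identifier and dimensionality recorded in
# _metadata at index time. Key names are assumptions.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE _metadata(key TEXT PRIMARY KEY, value TEXT)")
db.executemany("INSERT INTO _metadata VALUES (?, ?)", [
    ("embedding_model", "all-MiniLM-L6-v2"),
    ("embedding_dim", "384"),
])

def needs_reembed(db, model: str, dim: int) -> bool:
    """True when the configured model differs from the indexed one,
    which must trigger a full re-embedding pass."""
    stored = dict(db.execute("SELECT key, value FROM _metadata"))
    return (stored.get("embedding_model") != model
            or stored.get("embedding_dim") != str(dim))

print(needs_reembed(db, "all-MiniLM-L6-v2", 384))       # False: no re-embed
print(needs_reembed(db, "nomic-embed-text-v1.5", 768))  # True: re-embed
```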

Implementation: The ort crate loads the ONNX model at startup into a shared OrtSession. The tokenizers crate (HuggingFace, pure Rust) handles tokenization. Both are kept warm in memory (~100--150 MB RSS). Inference runs on a dedicated rayon thread pool, with batching at the ONNX level (batch size 32--64 is the sweet spot for CPU).

4. Architecture Layers

  • Layer 1 --- Ingest & Parse: Detect language, run tree-sitter, build the complete graph (nodes + edges) eagerly. No LLM involved.

  • Layer 2 --- Semantic Enrichment: Run Roslyn or TS Language Service over ASTs to produce compiler-grade call edges, type info, override relationships, and reference counts.

  • Layer 3 --- Embedding: Generates eager embeddings at index time from names, signatures, and doc comments using a local ONNX model. Populates sqlite-vec for semantic search from day one.

  • Layer 4 --- Retrieval Pipeline: Accepts a query, produces candidates via three parallel channels (vector, BM25, qualified-name), merges/reranks, assembles bounded context.

  • Layer 5 --- MCP Server: Exposes the index engine and file system as MCP tools. Any MCP-compatible client (LLM agents, desktop apps, IDE extensions) connects to the same server. See MCP_SERVER_SPEC.md for the full tool specification.

4.5. Concurrency Architecture

The application is a single Rust process running on tokio. All local computation --- parsing, indexing, embedding, and retrieval --- executes within this process. The MCP server runs in the same process, exposing tools to external clients. The engine makes no network calls.

4.5.1 Thread and Task Map

The Rust process maps components to the following execution contexts:

  • Main thread: Reserved for Tauri's IPC with the webview. Never blocks. All Tauri commands return quickly or spawn work onto other executors.

  • File watcher (tokio task): Uses the notify crate, which emits events into a tokio mpsc channel. A dedicated task acts as the write coalescer: collects events during the debounce window, deduplicates by path, and emits ChangeBatch to the ingest pipeline. Implements burst recovery (>100 files in 30s → extend collection window).

  • Ingest pipeline (rayon thread pool): Receives ChangeBatch from the coalescer. Tree-sitter parsing and embedding generation are CPU-bound and run on a rayon pool (or tokio::task::spawn_blocking). After parsing, semantic enrichment requests are sent to child processes via async channels.

  • SQLite writer (dedicated single thread): Owns an exclusive rusqlite::Connection. Receives write batches via a bounded mpsc channel --- this is the write queue with backpressure. When the channel is full (default depth 10 batches), the ingest pipeline blocks, propagating backpressure to the file watcher. The single-writer thread eliminates SQLite-level write contention entirely.

  • SQLite reader pool: A connection pool (r2d2 + rusqlite or deadpool) of read-only connections. WAL mode allows concurrent reads alongside the single writer. The retrieval pipeline and MCP tool calls all read simultaneously.

  • Embedding module (in-process, rayon): ONNX Runtime session loaded at startup, shared across rayon threads. Batches of (text, node_id) tuples are processed on the rayon pool. No IPC or network overhead.

  • MCP server (tokio task): Handles incoming MCP tool calls from connected clients. Read-only tools execute directly against the SQLite reader pool. Write operations (e.g., index_files) are routed through the writer channel.

  • Child process supervisors (tokio tasks): One per language service (Roslyn, TS LS). Each supervisor owns spawning, health checks (periodic RSS via /proc/[pid]/status or platform equivalent), restart-on-crash, and idle timeout. Communication is async JSON-RPC over stdio.
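The single-writer queue with backpressure is language-agnostic; a minimal Python sketch (stdlib queue standing in for the bounded tokio mpsc channel; names illustrative):

```python
import queue
import threading

# Bounded queue: put() blocks when the queue already holds `maxsize`
# pending batches, propagating backpressure to the producer (the ingest
# pipeline in the real system). Depth 10 mirrors the plan's default.
write_queue = queue.Queue(maxsize=10)
applied = []

def writer_thread():
    # Sole owner of the (simulated) write connection: drains batches one
    # at a time, eliminating write-side contention by construction.
    while True:
        batch = write_queue.get()
        if batch is None:          # shutdown sentinel
            break
        applied.append(batch)      # real system: one SQLite transaction

t = threading.Thread(target=writer_thread)
t.start()
for batch in (["node_a"], ["node_b"], ["node_c"]):
    write_queue.put(batch)         # blocks here when the queue is full
write_queue.put(None)
t.join()
print(applied)                     # batches applied in submission order
```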

4.5.2 WAL Concurrency Strategy

In WAL mode, readers do not block writers and writers do not block readers. The single-writer thread (Section 4.5.1) eliminates write-side contention entirely. Read-write concurrency is managed through transaction discipline and checkpoint control rather than a custom scheduling mechanism:

  • Keep write transactions small and frequent (bounded batch size, default 500 nodes per transaction). This minimizes WAL file growth and ensures checkpoints complete quickly.

  • Control checkpoints explicitly: set PRAGMA wal_autocheckpoint to a higher threshold (default 10000 pages) to avoid mid-query checkpoint stalls, and schedule explicit PRAGMA wal_checkpoint(PASSIVE) during idle periods.

  • Measure p95 query latency under write load before introducing any custom scheduling. If metrics later show contention, introduce targeted solutions (checkpoint policy tuning, transaction sizing, or statement/index optimization) rather than a bespoke pause/resume mechanism.
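These settings can be exercised directly (Python stdlib sqlite3 for illustration; the Rust writer applies the same PRAGMAs through rusqlite):

```python
import os
import sqlite3
import tempfile

# WAL requires a file-backed database (":memory:" has no WAL).
path = os.path.join(tempfile.mkdtemp(), "index.db")
conn = sqlite3.connect(path)

# Enter WAL mode: readers no longer block the writer and vice versa.
mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]

# Raise the autocheckpoint threshold (plan default: 10000 pages) so
# automatic checkpoints do not stall mid-query.
conn.execute("PRAGMA wal_autocheckpoint=10000")

# ...then, during an idle period, checkpoint without blocking readers:
conn.execute("PRAGMA wal_checkpoint(PASSIVE)")
print(mode)  # "wal"
```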

4.5.3 Child Process Management and IPC Protocol

Roslyn (.NET CLI tool) and TS Language Service (Node.js) run as isolated child processes managed by the Rust core. They communicate via JSON-RPC over stdio (tokio::process::Command with piped stdio + a JSON-RPC codec). Key lifecycle rules:

  • Load on demand: only for languages present in the repository.

  • Keep warm after initial index. Idle timeout (default 15--30 min, configurable) unloads to free memory.

  • Memory watchdog: configurable threshold (default 2 GB). If a child's RSS exceeds the threshold, the supervisor forcefully restarts it. Critical: pass --max-old-space-size=2048 to the TS LS child process.

  • Crash isolation: if a child crashes, the supervisor restarts it. The parent Rust process is never affected.

  • Most read-only queries never require a language service --- the graph and indexes are fully queryable without them.

IPC Protocol Contract: The JSON-RPC protocol between the Rust core and language service child processes must be explicitly versioned and support cancellation and backpressure:

  • Handshake: On startup, the child process and Rust core exchange protocol_version, capabilities (e.g., supported edge types, incremental update support), and extractor_version. Version mismatch triggers a controlled restart or fallback to syntactic-only mode.

  • Request lifecycle: Every request carries a unique request_id, a deadline (configurable, default 30s for per-file analysis, 120s for solution/project loading), and supports cancellation. The Rust side may cancel in-flight requests when a newer file change supersedes them.

  • Backpressure: When the Rust side's ingest pipeline is saturated (write queue at capacity), it stops sending new enrichment requests to the child process. The child process must handle a bounded request queue (default 64 pending requests) and reject with a backpressure error code when full.

  • Structured error taxonomy: Errors use a defined code set: semantic_unavailable (project failed to load), project_load_failed (MSBuild/tsconfig error, with diagnostic details), plugin_blocked (safe mode prevented plugin loading), timeout (deadline exceeded), oom_restart (child restarted by the memory watchdog). These codes drive parse_status assignment, user-facing warnings, and structured logging.
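The envelopes can be illustrated concretely (Python for brevity; the method names and the numeric error code are illustrative assumptions, while protocol_version, capabilities, extractor_version, request_id, the deadlines, and the error kinds come from the contract above):

```python
import json

# Handshake sent on child startup (method name "handshake" assumed).
handshake = {
    "jsonrpc": "2.0", "id": 0, "method": "handshake",
    "params": {
        "protocol_version": 1,
        "capabilities": ["call_edges", "incremental_update"],
        "extractor_version": "0.1.0",
    },
}

# Per-file enrichment request: unique request_id, 30s default deadline,
# cancellable when a newer file change supersedes it.
request = {
    "jsonrpc": "2.0", "id": "req-42", "method": "enrich_file",
    "params": {"request_id": "req-42", "path": "src/AuthService.cs",
               "deadline_ms": 30_000},
}

# Structured error using the taxonomy above (numeric code assumed;
# "kind" carries one of the defined error names).
error = {
    "jsonrpc": "2.0", "id": "req-42",
    "error": {"code": -32001,
              "data": {"kind": "project_load_failed",
                       "detail": "MSBuild: missing SDK"}},
}

wire = json.dumps(request)               # one JSON frame over stdio
decoded = json.loads(wire)
print(decoded["params"]["deadline_ms"])  # 30000
```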

4.5.4 Rust Crate Dependencies

Key crate choices for the Rust process:


| Crate | Purpose |
| --- | --- |
| rusqlite (bundled feature) | SQLite access. Bundled avoids system SQLite version issues. |
| notify | Cross-platform file system watcher. |
| tree-sitter + language crates | Syntactic parsing with first-class Rust bindings. |
| tokio | Async runtime for tasks, channels, child-process IPC, and MCP transport I/O. |
| rayon | CPU-bound parallelism for tree-sitter batch parsing and embedding. |
| ort | ONNX Runtime bindings for local embedding inference. |
| tokenizers (HuggingFace) | Pure-Rust tokenizer for the embedding model. |
| reqwest | HTTP client. Not required by the index engine, which makes no network calls (§1.5.2); retained only for optional future integrations. |
| r2d2 / deadpool | Connection pool for read-only SQLite connections. Ensure the selected pool crate supports !Send connections correctly (e.g., deadpool-sqlite or r2d2_sqlite). Apply required PRAGMAs on connection creation/acquire. |


4.5.5 Tauri Command Boundary

Every UI-initiated action (search query, node inspection, manual re-index) enters as a Tauri command, acquires a read connection from the pool, and executes. The frontend never talks to SQLite directly. Request cancellation: if the user types a new search query, the previous task is dropped via tokio's cooperative cancellation.

5. Graph Schema (SQLite)

The graph is the backbone. Built eagerly and completely at index time. Embeddings are generated eagerly at index time (from names + signatures) so that semantic search works from day one (see §6.3).

5.1 Node Types


| Node Type | C# Examples | TypeScript Examples |
| --- | --- | --- |
| module | Namespace, Assembly | Barrel index.ts (re-exports only) |
| project | .csproj, assembly | package.json workspace, Nx/Turborepo project |
| class | class, record, struct | class, abstract class |
| interface | interface | interface |
| method | method, constructor, getter/setter | method, function, arrow fn, constructor |
| type | enum, delegate, type alias | type alias, enum |
| component | --- | React functional/class component |
| file | .cs source file | .ts, .tsx source file |
| property | property declaration | class field, getter/setter |
| constructor | class/static constructor | class constructor |


The project node type represents a logical project or package boundary within a repository. In C#, it corresponds to a .csproj file and its contained source files. In TypeScript, it corresponds to a package.json workspace or a monorepo project (Nx, Turborepo, Lerna). Project nodes sit between the repository root and module/file nodes in the contains hierarchy, enabling queries like "show me all external API calls in the OrderService project." Project detection is heuristic: scan for .csproj files (C#) or package.json files with a workspaces field or under a configured monorepo tool's project directory pattern.
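The detection heuristic can be sketched as follows (Python for illustration; the helper name and return shape are hypothetical, and the configured monorepo-tool directory patterns are omitted for brevity):

```python
import json
import pathlib

def detect_projects(root: str) -> list[tuple[str, str]]:
    """Heuristic project scan per the plan: .csproj files mark C#
    projects; package.json files with a "workspaces" field mark TS
    workspace roots. Nx/Turborepo project-directory patterns would
    extend this; omitted here."""
    projects = []
    for path in sorted(pathlib.Path(root).rglob("*")):
        if path.suffix == ".csproj":
            projects.append(("csharp", path.name))
        elif path.name == "package.json":
            try:
                manifest = json.loads(path.read_text())
            except ValueError:
                # Malformed project file: symbols fall back to the
                # synthetic "repo-root" project (see project_id below).
                continue
            if "workspaces" in manifest:
                projects.append(("typescript", path.name))
    return projects
```

A package.json without a workspaces field is deliberately ignored: only workspace roots (or configured monorepo projects) become project nodes.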

5.2 Node Fields

(eager) = populated at index time from parser/compiler.

Identity

  • node_id (eager): UUID v4 surrogate primary key across all three SQLite indexes. Internal-only; used for joins and FK references. Not relied upon for logical identity matching across refactors. Storage optimization: Store node_id, file_id, and edge endpoint columns as BLOB(16) instead of TEXT UUIDs for DB size and join performance. At 100K nodes with millions of edges, this materially shrinks edge indexes and improves cache locality. Keep UI/API representations as canonical UUID strings; convert at the boundary. This is a low-risk optimization best applied early before shipping migrations to users.

  • symbol_key (eager): Deterministic logical identity key, separate from node_id. Used as the primary matching key during incremental updates to preserve cached embeddings across moves and refactors where the symbol identity is unchanged (same name, same signature). Not stable across renames --- since both C# and TS key derivations include the symbol name, a rename changes the symbol_key. Rename preservation is deferred to Phase 2b (similarity heuristics; see below). In Phase 1, a rename produces a new node (old node hard-deleted; snapshot written to deletion journal for Phase 2b rename matching). Derived per language from compiler symbol identity (best-effort):

    • C#: Fully qualified metadata name + symbol kind + containing type + parameter types + generic arity. Derived from Roslyn's ISymbol.ToDisplayString(SymbolDisplayFormat.FullyQualifiedFormat) with parameter type list appended for methods.

      Phase 1 fallback (tree-sitter only, no Roslyn): qualified_name + symbol kind + parameter count + parameter types-as-written (extracted from tree-sitter parameter nodes) + generic arity. This ensures overloaded methods produce distinct identity keys even without Roslyn type resolution. The types-as-written may not match Roslyn's fully-resolved types, so when Roslyn becomes available (Phase 2a), the system must reconcile identities.

      Identity reconciliation (Phase 2a): When the semantic enrichment pass produces a Roslyn-derived symbol_key that differs from the existing tree-sitter-derived key for the same symbol, the system MUST mutate symbol_key in-place on the existing node_id (preferred for cache stability). This preserves cached embeddings and all edges that reference the node_id. The reconciliation runs within the semantic enrichment write transaction: (1) match the Roslyn symbol to the existing node by file location + symbol kind + name; (2) update symbol_key and symbol_disambiguator on the existing node; (3) verify no UNIQUE constraint violation with the new key. If a UNIQUE conflict occurs (another node already claims the new key), log a warning and fall back to delete+create with a node_identity_map entry. The same in-place mutation mechanism applies to TS export_scope upgrades (module→package) in Phase 2a.
    • TypeScript: For exported symbols, symbol_key MUST be stable across file moves when the symbol identity is unchanged. For non-exported file-scoped symbols, symbol_key includes file_id and is not expected to be stable across moves (see exported vs non-exported rules below).

      Export scope classification: Each TS symbol is classified into one of three export_scope values that determine its identity derivation:
      • export_scope = package: The symbol is reachable from a stable package entrypoint or barrel export surface (i.e., it can be traced through re-exports to a package.json main/exports field or a barrel index.ts). These symbols have fully move-stable identity.
      • export_scope = module: The symbol has an export keyword (or is a default export) but is NOT reachable from the package entrypoint. Ordinary module-level exports fall here. These are common in large codebases (e.g., export default, export function parse() in internal utility files). Identity includes the file-level discriminator (file_id) because two sibling modules within the same package can legitimately export the same name and signature.
      • export_scope = file (non-exported): The symbol has no export keyword and is not re-exported. File-scoped only.

      Identity derivation per scope:
      • package scope: symbol_key = package_id + package-level export path (the stable path through the export graph from the package entrypoint to the symbol, e.g., @mylib/utils.parse or @mylib.default) + symbol kind + normalized signature (parameter count and, when available, normalized parameter types). The package-level export path is the key discriminator: it is stable across file moves (because re-exports are updated to follow the symbol) and unique within the package (because two symbols cannot occupy the same export path from the same entrypoint). export default is represented as the export name default qualified by its package-level path, ensuring that default exports from different modules resolve to different symbol_key values when they have different package-level export paths, and to the same symbol_key when they share the same path (which is correct — they are then the same logical export). Move-stable; file path MUST NOT appear in symbol_key.
      • module scope: symbol_key = package_id + file_id + local export name + symbol kind + normalized signature. Includes file_id because module exports are not package-unique (two files can both export function parse() with identical signatures). NOT move-stable — a file move changes file_id and therefore symbol_key. In Phase 1, treated as delete+create on file move; Phase 2b rename detection can recover node_id.
      • file scope (non-exported): symbol_key = package_id + file_id + local name + symbol kind + normalized signature. Same as module scope --- includes file_id, not move-stable.

      Phase 1 vs Phase 2a scope classification: In Phase 1 (tree-sitter only), the system cannot trace the export graph through barrels and re-exports. All symbols with an export keyword are classified as export_scope = module (conservative: includes file_id). In Phase 2a, the TS Language Service resolves re-exports, and symbols reachable from package entrypoints are upgraded to export_scope = package (identity key mutated in-place on the existing node_id to preserve cached data --- see the C# reconciliation note above for the general mechanism). This means Phase 1 does not achieve move-stability for any TS symbols, but no correctness invariant is violated (symbols are merely treated as delete+create on move, with cache loss). Phase 2a upgrades the identity keys and enables move-stability for package-exported symbols. Store decl_file_path as metadata (non-identity) to support navigation and diagnostics. Falls back to file-relative qualified name from tree-sitter.

      Collision handling: If collisions still occur among package-scoped exported symbols (possible with declaration merging and overloaded function signatures --- rare, but possible in large monorepos), assign a symbol_disambiguator field. The disambiguator uses a declaration-header hash (hash of the symbol's declaration line: name, parameter list shape, and return type, but not the body) so that routine body edits do not change the identity key. The declaration-header hash is the sole disambiguation scheme for package-scoped exported symbols; file path MUST NOT be used as a secondary disambiguator for these symbols because it would break move-stability. The disambiguator participates in the uniqueness constraint and in incremental identity matching via the full key (language, project_id, symbol_key, symbol_disambiguator).
    • Invariant: For C# symbols and TS package-scoped exported symbols, a pure file move/rename MUST NOT change symbol_key. Cache retention across moves relies on symbol_key matching first; fingerprint similarity remains fallback. For TS module-scoped and file-scoped symbols, a file move changes symbol_key (because file_id is part of the key); these are treated as delete+create in Phase 1, with rename detection available in Phase 2b.
    • Secondary unique index on (language, project_id, symbol_key, symbol_disambiguator). The project_id discriminator prevents collisions in monorepos; the symbol_disambiguator (TEXT NOT NULL DEFAULT '' --- empty string when no collision exists, never NULL) resolves remaining ambiguities from declaration merging, re-exports, and overloaded signatures without breaking the uniqueness constraint. Using NOT NULL DEFAULT '' instead of a nullable column is critical because SQLite treats each NULL as distinct in UNIQUE indexes, which would silently allow duplicate identity keys.
    • When symbol_key matching fails (e.g., rename, or simultaneous rename + signature change), Phase 1 treats this as a new symbol (old node hard-deleted with journal entry, new node created). Phase 2b adds two-tier heuristic rename detection: (1) git diff --find-renames between last indexed commit and HEAD; (2) chunk_fingerprint similarity (≥80% Jaccard). If a match is found, the existing node_id is reused, preserving cached data. See Phase 2b deliverables for full specification.
  • display_name (eager): Short unqualified name (e.g., Authenticate).

  • qualified_name (eager): Fully qualified name (e.g., MyApp.Auth.AuthService.Authenticate).

  • node_type (eager): Enum: file | module | project | class | interface | method | property | constructor | type | component.

  • language (eager): csharp | typescript.

  • project_id (eager): FK to the containing project node. Part of the uniqueness constraint (language, project_id, symbol_key, symbol_disambiguator). In monorepos, prevents collisions between identically-named symbols in different packages. Non-null: during initial indexing before project detection completes, nodes are assigned to a synthetic "repo-root" project per language (created automatically). This avoids NULL in the UNIQUE constraint, which would otherwise allow duplicate symbol_keys (SQLite treats each NULL as distinct in UNIQUE indexes). Assignment sequencing: Project detection (scanning for .csproj, package.json workspaces, etc.) MUST complete before symbol node indexing begins, so that nodes are assigned to their correct project_id from the start. If project detection cannot complete (e.g., malformed project files), nodes are assigned to the synthetic "repo-root" project and a warning is emitted. Late reassignment of project_id after initial indexing (e.g., when a project file is fixed) requires re-indexing the affected symbols to update their identity keys, since project_id participates in the UNIQUE constraint.
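The NULL-distinctness pitfall motivating NOT NULL DEFAULT '' (and the synthetic repo-root project) is easy to demonstrate (Python stdlib sqlite3; table names and the sample symbol_key format are illustrative):

```python
import sqlite3

db = sqlite3.connect(":memory:")
key = "MyApp.Auth.AuthService.Authenticate|method|1"  # illustrative format

# With a nullable disambiguator, SQLite's UNIQUE index treats every
# NULL as distinct, silently admitting duplicate identity keys:
db.execute("""CREATE TABLE nodes_nullable(
    symbol_key TEXT, symbol_disambiguator TEXT,
    UNIQUE(symbol_key, symbol_disambiguator))""")
db.execute("INSERT INTO nodes_nullable VALUES (?, NULL)", (key,))
db.execute("INSERT INTO nodes_nullable VALUES (?, NULL)", (key,))
dupes = db.execute("SELECT COUNT(*) FROM nodes_nullable").fetchone()[0]
print(dupes)  # 2 -- the duplicate slipped through

# With NOT NULL DEFAULT '' (the plan's choice), it is rejected:
db.execute("""CREATE TABLE nodes_safe(
    symbol_key TEXT, symbol_disambiguator TEXT NOT NULL DEFAULT '',
    UNIQUE(symbol_key, symbol_disambiguator))""")
db.execute("INSERT INTO nodes_safe(symbol_key) VALUES (?)", (key,))
try:
    db.execute("INSERT INTO nodes_safe(symbol_key) VALUES (?)", (key,))
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print(rejected)  # True
```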

Location

  • file_id (eager): FK to the primary file node, ON DELETE SET NULL. When a file is deleted, the writer logic deletes all node_spans rows for that file; if a node has no remaining spans, the node itself is deleted explicitly (which triggers ON DELETE CASCADE for edges). This avoids cascade-deleting multi-span nodes (e.g., C# partial classes) when only one of their files is removed. For single-file symbols, the writer deletes the node when its sole span is removed. For multi-span nodes (partial classes, TS declaration merging), this references the first declaration file. Transactional requirement: The SET NULL on file_id, node_spans cleanup, conditional node deletion, and cascading edge deletion must all execute within a single write transaction to prevent observable intermediate states (e.g., a node with NULL file_id and no spans that still has edges).

  • file_path (eager): Relative path from repo root for the primary file. Denormalized for query convenience.

  • line_start / line_end (eager): 1-based line numbers for the primary span. line_count = line_end - line_start + 1.

  • Multi-span support: Nodes that span multiple files or discontinuous ranges (C# partial classes, TS declaration merging) are represented via a normalized node_spans table:

  • node_spans(span_id INTEGER PRIMARY KEY, node_id BLOB(16) FK, file_id BLOB(16) FK ON DELETE CASCADE, file_path TEXT, line_start INTEGER, line_end INTEGER, span_hash BLOB(32), is_primary BOOLEAN). The ON DELETE CASCADE on file_id ensures that when a file node is deleted, all node_spans rows referencing that file are automatically removed. This works in concert with the explicit deletion algorithm (see §6.5) which processes spans and conditional node deletion before deleting the file node, but the CASCADE acts as a safety net if the file node is deleted directly. Exactly one primary span per node is enforced by a partial unique index: CREATE UNIQUE INDEX idx_node_spans_one_primary ON node_spans(node_id) WHERE is_primary = 1.
  • The nodes table retains file_id, file_path, line_start, line_end as convenience fields pointing to the primary span (is_primary = true). Source retrieval and invalidation logic must consult node_spans for multi-span nodes. get_source(node_id) supports three modes: "primary span" (default), "all spans", and "specific span_id". Invalidation checks span_hash per span --- only changed spans trigger re-processing.
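The one-primary-span invariant and the CASCADE safety net can be exercised directly (Python stdlib sqlite3 sketch of the schema above; the files table is a simplified stand-in for file nodes):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("PRAGMA foreign_keys = ON")  # the Rust side must enable this too
db.executescript("""
CREATE TABLE files(file_id BLOB PRIMARY KEY);
CREATE TABLE node_spans(
    span_id INTEGER PRIMARY KEY,
    node_id BLOB,
    file_id BLOB REFERENCES files(file_id) ON DELETE CASCADE,
    file_path TEXT, line_start INTEGER, line_end INTEGER,
    span_hash BLOB, is_primary BOOLEAN);
CREATE UNIQUE INDEX idx_node_spans_one_primary
    ON node_spans(node_id) WHERE is_primary = 1;
""")

node = b"\x01" * 16                      # BLOB(16) ids per the storage note
f1, f2 = b"\x0a" * 16, b"\x0b" * 16
db.executemany("INSERT INTO files VALUES (?)", [(f1,), (f2,)])
ins = ("INSERT INTO node_spans(node_id, file_id, file_path,"
       " line_start, line_end, is_primary) VALUES (?, ?, ?, ?, ?, ?)")
# A C# partial class: two spans, exactly one primary.
db.execute(ins, (node, f1, "Auth.Part1.cs", 1, 40, 1))
db.execute(ins, (node, f2, "Auth.Part2.cs", 1, 25, 0))
try:  # a second primary span violates the partial unique index
    db.execute(ins, (node, f2, "Auth.Part3.cs", 1, 10, 1))
    second_primary_ok = True
except sqlite3.IntegrityError:
    second_primary_ok = False
print(second_primary_ok)  # False

# CASCADE safety net: deleting a file removes its spans automatically.
db.execute("DELETE FROM files WHERE file_id = ?", (f2,))
remaining = db.execute("SELECT COUNT(*) FROM node_spans WHERE node_id = ?",
                       (node,)).fetchone()[0]
print(remaining)  # 1 -- only the primary span in Part1 survives
```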

Structure & Modifiers

  • access_modifier (eager): public | private | protected | internal | protected_internal (C#); exported | unexported (TS).

  • is_public_api (eager): Computed boolean indicating whether the symbol is part of the public API surface. Derived per language:

    • C#: public or protected access modifier, respecting assembly-level InternalsVisibleTo (configurable).
    • TypeScript: Phase 1 (tree-sitter): the symbol has an export keyword (direct exports only; re-exports via barrel files are not resolved). Phase 2a: upgraded so a symbol also counts as public when it is re-exported via barrel files resolved during barrel file detection. Used by the reranker as a relevance boost for public-facing symbols.
  • is_static, is_abstract, is_async, is_override, is_deprecated, has_doc_comment (eager): Boolean flags.

  • is_external_api (deferred): True if the symbol invokes HTTP clients, DB contexts, message queues, or FS I/O. In Phase 1 this field is always false. Deferred rationale: the reranker gives this signal only +0.05 additive boost (the smallest in the system), the heuristic inference from callee signatures requires maintaining a pattern list with false positives/negatives, and the root LM can identify external API calls from source and summaries when it inspects nodes. If the eval harness later identifies external API identification as a significant retrieval gap, add it then with better data on which patterns matter. The schema column is reserved (nullable boolean) so no migration is needed when implemented. User-configurable patterns under external_apis in .codeagent/config.json also deferred.

  • parse_status (eager): Enum: full | syntactic_only | failed. full = tree-sitter + semantic pass succeeded. syntactic_only = tree-sitter only, edges approximate. failed = unparseable, file node exists with no child symbols. Inherited by contained symbols for retrieval weighting.

  • generated (eager): True if file matches a generated-code glob pattern (configurable). Inherited by contained symbols. ×0.3 retrieval penalty.

Signature (methods/functions/constructors)

  • return_type (eager): String representation. Enables queries like "find all methods returning IEnumerable".

  • parameter_count (eager): For overload disambiguation.

  • parameter_signature (eager): Compact param types string. Truncated at 500 chars.

Graph Metrics

  • reference_count (eager): Locations referencing this symbol. Phase 1 (tree-sitter only): populated from inbound calls + references edge count (approximate, may undercount). Phase 2a (semantic enrichment): upgraded to precise count from Roslyn/TS findReferences. Kept as a stored field because it is used in the reranker for scoring and only changes when the referring file is re-enriched (already in the write path). Computing on demand would add latency to retrieval.

  • caller_count (dynamic): Distinct symbols with a calls edge to this node. Computed at query time via: SELECT COUNT(DISTINCT source_id) FROM edges WHERE target_id = ? AND type = 'calls'. With index on (target_id, type), resolves in low milliseconds. Not stored --- eliminates write amplification during incremental updates.

  • callee_count (dynamic): Distinct symbols this node calls. Computed at query time via: SELECT COUNT(DISTINCT target_id) FROM edges WHERE source_id = ? AND type = 'calls'. Not stored.

  • subclass_count (dynamic): Nodes with inherits/implements edges to this node. Computed at query time. Not stored.

  • calling_file_count (dynamic): Distinct files containing at least one call to this node. Not stored --- computed at query time via: SELECT COUNT(DISTINCT file_id) FROM edges JOIN nodes ON edges.source_id = nodes.node_id WHERE edges.target_id = ? AND edges.type = 'calls'. Eliminates write amplification during incremental updates. With index on (target_id, type), resolves in low milliseconds.
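
Each dynamic metric reduces to a single indexed aggregate query. A minimal sketch of the caller_count and calling_file_count queries (Python's sqlite3 as a stand-in for the Rust engine; schema trimmed to the relevant columns, index name illustrative):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE nodes (node_id INTEGER PRIMARY KEY, file_id INTEGER);
CREATE TABLE edges (source_id INTEGER, target_id INTEGER, type TEXT);
CREATE INDEX idx_edges_target ON edges(target_id, type);
""")
# Hypothetical data: nodes 1 and 2 live in file 1, node 3 in file 2;
# all three call node 100 (node 1 calls it twice).
con.executemany("INSERT INTO nodes VALUES (?, ?)", [(1, 1), (2, 1), (3, 2), (100, 3)])
con.executemany("INSERT INTO edges VALUES (?, ?, 'calls')",
                [(1, 100), (1, 100), (2, 100), (3, 100)])

caller_count = con.execute(
    "SELECT COUNT(DISTINCT source_id) FROM edges WHERE target_id = ? AND type = 'calls'",
    (100,)).fetchone()[0]
calling_file_count = con.execute(
    """SELECT COUNT(DISTINCT file_id)
       FROM edges JOIN nodes ON edges.source_id = nodes.node_id
       WHERE edges.target_id = ? AND edges.type = 'calls'""",
    (100,)).fetchone()[0]
print(caller_count, calling_file_count)  # 3 2
```

DISTINCT makes the duplicate callsite from node 1 count once, which is exactly why no stored counter (and no write amplification) is needed.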

Provenance

  • commit_hash (eager, nullable): Git commit at last index. Null if the repo is not a git repo or git is unavailable.

  • chunk_hash (eager): Strong cryptographic hash (SHA-256) of raw source span for single-span nodes. For multi-span nodes, this field stores a composite content_hash: SHA-256 of the concatenation of all span hashes (from node_spans, ordered by span_hash bytes ascending --- not by span_id, which is an insertion-order surrogate and not stable across reparse). This ordering is deterministic, path-independent, and stable across delete/reinsert cycles. chunk_hash is purely content-derived. Used exclusively for exact source-change detection (changed yes/no). Not suitable for similarity comparison.

  • chunk_fingerprint (deferred to Phase 2b): Locality-sensitive fingerprint of normalized source tokens for approximate similarity comparison. Required for rename/move detection. Not implemented in Phase 1 --- renames are treated as delete + create (cache miss, not correctness failure). See Phase 2b for full specification.

  • last_modified_at (deferred to Phase 2b): Timestamp of last commit touching this file. Exists primarily to support git-based rename detection. Nullable. Not required for Phase 1 correctness.

  • embedding (eager): Embedding from qualified_name + signature. Never null after initial index. Model identifier and dimensionality recorded in _metadata table; model change triggers full re-embedding.

5.3 Edge Types


| Edge Type | Meaning | Source |
| --- | --- | --- |
| calls | A invokes B | Roslyn/TS LS (precise) + tree-sitter (approx fallback) |
| inherits | A extends B | tree-sitter (both languages) |
| implements | A implements interface B | tree-sitter + Roslyn/TS for resolution |
| imports | File/module A imports B | tree-sitter import/using statements |
| overrides | A overrides virtual B | Roslyn / TS Language Service |
| references | A reads field/property of B | Roslyn / TS Language Service |
| contains | Module contains class, class contains method | tree-sitter scope hierarchy |
| accepts | React component A accepts props interface B | TS adapter component detection |
| extends | C# extension method A extends type B | Roslyn (C# adapter) |


Edge confidence: All edges carry a confidence field (exact | probable | approximate). Roslyn/TS LS-resolved = exact. Tree-sitter fallback = approximate. Structural typing matches = probable. The reranker penalizes traversal paths with approximate links.

Edge evidence: Phase 1 simplification: edges do not carry a weight (callsite count) field. An edge exists or it doesn't. The node-level reference_count already captures usage frequency for reranking. Deferred: edge weight (integer, callsite count per source-target pair) can be added in a later phase if "sort callers by usage" UX is needed. The migration is trivial: add a nullable integer column. Provenance is tracked via extractor_version (string, e.g., "roslyn-1.0" or "treesitter-0.22") to identify edges that need recomputation after adapter upgrades.

5.4 File Nodes

File nodes are first-class participants in the graph, vector index, and FTS5. Every symbol node holds a file_id FK to its file node. File-specific fields: node_id, node_type (always file), file_path, file_name (bare filename for fast FTS5 search), language, line_count, commit_hash, chunk_hash, last_modified_at, embedding. Note: chunk_fingerprint added in Phase 2b when rename detection is implemented. Connected to contained symbols via contains edges.

5.5 Language-Specific Considerations

C#: Partial classes merged into single node with multiple entries in the node_spans table (one span per file fragment). The nodes table file_id / file_path / line_start / line_end fields reference the primary declaration (first fragment by file path sort order). Invalidation checks each span's span_hash independently --- changing one fragment triggers re-processing of that span only. Getter/setter modeled as separate method nodes. Extension methods have an extra 'extends' edge. Async flagged with is_async; no structural difference.

TypeScript/React: In TypeScript, every source file has a file node (same as C#). The module node type is used exclusively for barrel files (index.ts with re-exports only), which are modeled as pass-through module nodes; edges bypass barrel modules to reference actual definitions. Regular TS source files do not get a separate module node --- the file node serves as the container for contains edges to its symbols. React components identified heuristically (returns JSX, capital-letter name, or React.FC annotation); tagged node_type = component with 'accepts' edge to props interface. Dynamic dispatch edges marked with confidence levels. TSX handled natively by tree-sitter-typescript. Declaration merging cases (e.g., interface augmentation across files) use the node_spans table, following the same pattern as C# partial classes. Phase 1 documents this as a known limitation for uncommon application-level declaration merging; library typing augmentations are excluded from indexing by default.

6. Ingest Pipeline

6.1 Language Detection

Classification by extension: .cs = C#; .ts/.tsx = TypeScript. Unknown extensions skipped. Glob-based exclusions configurable (e.g., *.generated.cs, node_modules/).

Symlink and junction guard: File discovery and the file watcher must not follow symbolic links or NTFS junctions into directories outside the repository root. Every discovered path is validated by resolving its real path (canonicalize) and confirming it is a descendant of the repo root. Paths that resolve outside the repo root are silently skipped and logged at debug level. This prevents both privacy issues (indexing unrelated directories) and performance issues (following symlinks into large dependency trees or circular link structures). An optional configuration key indexing.follow_symlinks (boolean, default false) allows users to explicitly opt in to indexing symlinked content within the repo.
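
The descendant check described above can be sketched as follows (a minimal illustration; the function name is hypothetical, and on Windows `resolve()` also expands junctions):

```python
import tempfile
from pathlib import Path

def is_safely_inside(repo_root: Path, candidate: Path) -> bool:
    # Resolve symlinks to the real path, then require the result to be a
    # descendant of the (also resolved) repo root.
    real_root = repo_root.resolve()
    real_path = candidate.resolve()
    return real_path == real_root or real_root in real_path.parents

tmp = Path(tempfile.mkdtemp())
root = tmp / "repo"
(root / "src").mkdir(parents=True)
(tmp / "outside").mkdir()
(root / "vendored").symlink_to(tmp / "outside")  # link escaping the repo

inside_ok = is_safely_inside(root, root / "src")            # real descendant
escape_blocked = is_safely_inside(root, root / "vendored")  # resolves outside
print(inside_ok, escape_blocked)  # True False
```

The link lives inside the repo tree but resolves outside it, so it is skipped — the case that protects against circular links and accidental dependency-tree indexing.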

Generated code and partial classes: Files matching generated-code glob patterns (e.g., *.g.cs, *.Designer.cs) are flagged with generated = true. When a generated file contains a partial class fragment that merges with an authored partial class, the authored node retains generated = false but the generated span in node_spans is tagged. The retrieval penalty (×0.3) applies only to purely generated nodes, not to authored nodes that happen to have generated partial fragments.

6.2 Graph Construction (Eager)

For each file: run tree-sitter → walk AST with language adapter to emit nodes/edges → (Phase 2a+) run semantic pass (Roslyn/TS LS) to enrich edges → normalize all paths to POSIX format before any hashing or storage → upsert nodes/edges. In Phase 1, the pipeline stops after tree-sitter parsing (no semantic enrichment); the result is a complete syntactic graph with approximate edges and parse_status = syntactic_only. When chunk_hash is unchanged, skip expensive recompute (re-embedding, semantic re-enrichment) but still update location and provenance fields (file_id, file_path, line_start, line_end, node_spans file associations). This ensures that pure file moves update navigation metadata without discarding cached embeddings. Eager embeddings are produced immediately after graph construction (see §6.3). The whole pass is fast --- no LLM calls, no network I/O.

Path normalization (critical on Windows). All paths stored in the database are repo-relative POSIX paths (forward slashes, no drive letter, no repo root prefix). A single normalization function normalize_path(absolute_path, repo_root) -> String strips the repo root prefix, converts backslash to forward slash, and returns the relative POSIX path. This function is called at every system boundary where paths enter: file watcher events (notify crate), tree-sitter adapter output, IPC responses from child processes (Roslyn/TS LS), and user input (debug CLI, config globs). Paths are NEVER normalized mid-pipeline --- only at the boundary. All internal lookups, hash computations, and comparisons operate on the already-normalized repo-relative POSIX form. This eliminates the class of bugs where a path comparison fails because one side has Windows separators.
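
A sketch of normalize_path using Python's pure path classes (a stand-in for the Rust implementation; PureWindowsPath accepts both separator styles, so one code path handles both; drive-letter case differences are elided):

```python
from pathlib import PureWindowsPath, PurePosixPath

def normalize_path(absolute_path: str, repo_root: str) -> str:
    # Accept either separator style, strip the repo root prefix, and return
    # a repo-relative POSIX path (forward slashes, no drive letter).
    abs_parts = PureWindowsPath(absolute_path).parts
    root_parts = PureWindowsPath(repo_root).parts
    if abs_parts[:len(root_parts)] != root_parts:
        raise ValueError(f"{absolute_path} is not under {repo_root}")
    return PurePosixPath(*abs_parts[len(root_parts):]).as_posix()

print(normalize_path(r"C:\work\repo\src\Auth\AuthService.cs", r"C:\work\repo"))
# src/Auth/AuthService.cs
print(normalize_path("/home/me/repo/src/index.ts", "/home/me/repo"))
# src/index.ts
```

Both platforms produce the identical stored form, so hash computations and path comparisons never diverge by separator.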

Semantic pass architecture (Phase 2a+): The semantic pass runs against long-lived project contexts, not stateless per-file analysis:

  • C# child process: When safe mode is disabled and MSBuild evaluation is permitted, loads the solution via MSBuildWorkspace (or project files explicitly) and maintains an incremental Solution object in memory. On file change: applies a text update to the document via Solution.WithDocumentText(), then re-runs symbol queries for affected documents/projects. This avoids reloading the entire compilation on each edit. When safe mode is enabled (default), C# semantic enrichment falls back to syntactic_only (tree-sitter) as specified in Safe Indexing Mode.

  • TS child process: Uses the Language Service with script snapshots. Maintains per-package tsconfig.json programs for monorepos (one TS program per workspace package). On file change: updates the script snapshot, which triggers incremental re-checking of affected files only.

Partially broken project graph: When the repository does not build (common during active development), the semantic pass operates in degraded mode:

  • C# (Roslyn): Compilation diagnostics are inspected. If errors are limited to specific files/projects, semantic enrichment proceeds for unaffected projects. Affected files fall back to syntactic_only with tree-sitter edges. If the entire solution fails to load (e.g., missing SDK, broken project references), all files are indexed as syntactic_only and a user-visible warning is emitted with diagnostic details.

  • TS (Language Service): Per-package programs isolate failures. A broken tsconfig.json in one package does not prevent semantic enrichment of other packages. Files in broken packages fall back to syntactic_only.

Graceful degradation: If semantic pass fails, tree-sitter graph preserved with lower-confidence edges. parse_status set accordingly. Retrieval prefers fully-parsed nodes when relevance scores are close.

Process isolation: Roslyn and TS Language Service run in isolated child processes. See Section 4.5.3 for full lifecycle rules (memory watchdog, crash isolation, idle timeout, IPC protocol). Critical: pass --max-old-space-size=2048 to the TS LS child process to prevent V8 from exhausting system memory before the watchdog intervenes.

Safe Indexing Mode (security-critical): Semantic enrichment can unintentionally execute arbitrary code from the repository or its dependencies. This is the single biggest platform risk in the indexing pipeline because it turns "indexing" into "running code." Two specific vectors must be mitigated:

  • C# / Roslyn via MSBuildWorkspace: Design-time builds and MSBuild evaluation can run custom tasks/targets. Opening an untrusted .csproj can trigger NuGet restores and build logic that executes arbitrary code. Mitigations:

    • C# Safe Mode (default enabled): When indexing.safe_mode = true, the system MUST NOT load projects via MSBuildWorkspace or any mechanism that evaluates MSBuild targets/props. In safe mode, C# semantic enrichment runs in one of the following explicitly supported modes (configurable, default syntactic_only): (1) syntactic_only: tree-sitter indexing only; parse_status = syntactic_only. (2) restricted_semantics (optional future enhancement): limited Roslyn analysis without MSBuild evaluation, with clearly documented gaps (incomplete references/type binding); parse_status = syntactic_only unless coverage thresholds are met. Do not claim "compiler-grade precision" in safe mode unless MSBuild evaluation is enabled and successful.

    • Unsafe Mode (user opt-in): When indexing.safe_mode = false, C# semantic enrichment MAY use MSBuildWorkspace. The UI MUST display a blocking warning: "Indexing this repository may execute build logic from project files (MSBuild). Proceed only if you trust this repository." Users may optionally "trust" a repo (allowlist by repo fingerprint) to avoid repeated prompts.

    • NuGet restore gating: Even in unsafe mode, NuGet package restore (which downloads and executes arbitrary packages) SHOULD be gated separately from project evaluation. MSBuildWorkspace.OpenSolutionAsync triggers a restore by default; the system should support a mode that allows MSBuild evaluation of local project files while blocking network restore, providing a useful middle ground between full safe mode and fully unrestricted loading. Configuration: indexing.allow_nuget_restore (boolean, default false when safe mode is off).

  • TypeScript Language Service plugins: tsserver can load language service plugins specified in tsconfig.json, and those plugins are arbitrary Node modules (code execution). Mitigation: do not use tsserver directly. Instead, instantiate the TypeScript Language Service API programmatically with a sanitized configuration that strips/ignores the plugins section from tsconfig.json. This eliminates the plugin execution vector while preserving full type resolution, call graph analysis, and module resolution.

Safe mode is a first-class requirement, not a future enhancement. The safe-mode toggle, plugin stripping logic, and user warning UI must be designed during Phase 1 (as part of the IPC protocol contract) and implemented in Phase 2a alongside semantic enrichment. Configuration: indexing.safe_mode (boolean, default true) in .codeagent/config.json.

Memory footprint: Roslyn: 200 MB--2 GB depending on solution size. For solutions >500K LOC, support project-level partitioning as fallback. TS LS: 100 MB--1 GB; per-package loading recommended for large monorepos. Both subject to a configurable memory watchdog (default 2 GB threshold) that forcefully restarts the child process.

Initial index at scale: Tree-sitter parses 1M LOC in seconds. Roslyn/TS LS semantic model: 30--60s. Combined with eager embedding for 50--100K nodes, total initial index for 1M LOC = minutes. A progress indicator (current phase + % complete) is required.

6.3 Eager Embedding (At Index Time)

After each node is added, generate a lightweight embedding using maximally discriminative input. Eager embedding input MUST avoid constant boilerplate that collapses vector space similarity. Recommended eager embedding text (versioned): qualified_name + node_type + normalized signature (parameter_signature, return_type) + doc comment (if present, truncated) + optional containing type/module name (one hop). For file nodes (which have no signature or containing type): file_path + file_name + language + import/export summary (top-N imported/exported symbol names, extracted from tree-sitter) + doc comment from file header (if present). This produces discriminative vectors that distinguish files by their role and dependency surface rather than collapsing all files into a single region of vector space. Because no LLM inference is required, these vectors exist from day one. Uses the local ONNX embedding model (see Section 3.1 for model selection rationale and performance characteristics) and populates sqlite-vec immediately, so semantic search works from the initial index.

Cold-start limitation: Eager embeddings capture structural identity (names, signatures, types) but not behavioral intent. For behavioral queries (e.g., "what handles rate limiting"), vector recall will be lower; these rely primarily on BM25 (which matches behavioral terms in doc comments and identifiers) and graph traversal.

Critical: embeddings MUST use a consistent model and normalization strategy. Model identifier and dimensionality recorded in _metadata table. Model change triggers background bulk re-embedding.

Batching: Configurable batch size (default 64, optimized for CPU ONNX inference). Process in batches during initial indexing. Report progress per-batch.

6.5 Incremental Updates

  • Deleted file: The following steps execute within a single write transaction, in this exact order:

    1. Collect affected nodes: Query node_spans for all node_id values referencing the deleted file_id.
    2. Delete spans: Delete all node_spans rows referencing that file_id.
    3. Process affected nodes: For each affected node_id:
      • If the node has zero remaining spans (single-span node, or last file of a multi-span node): write a snapshot to the deletion journal, then hard-delete the node (ON DELETE CASCADE removes edges, vec_nodes, fts_nodes).
      • If the node still has spans in other files (e.g., a C# partial class): reassign the primary span (is_primary = true) to the earliest remaining span by file path sort order, and update the node's convenience fields (file_id, file_path, line_start, line_end) to match the new primary span.
    4. Journal and delete the file node: Write a snapshot of the file node to the deletion journal, then hard-delete the file node.

    Do not delete the file node first --- nodes.file_id uses ON DELETE SET NULL, which would null-out convenience fields on all contained nodes before the writer has a chance to reassign or delete them, complicating "which nodes were affected?" logic. Do not use blanket file_path matching to delete nodes --- this would incorrectly delete multi-span nodes that still have active spans in other files.
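
    The four-step sequence above can be sketched end-to-end (Python sqlite3 as a stand-in; schema trimmed to the relevant tables, edges/FTS/vec omitted, deletion_log reduced to a single text column; data models a two-file partial class plus a single-span node):

```python
import sqlite3, json

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")
con.executescript("""
CREATE TABLE files (file_id INTEGER PRIMARY KEY, file_path TEXT);
CREATE TABLE nodes (node_id INTEGER PRIMARY KEY, name TEXT,
    file_id INTEGER REFERENCES files(file_id) ON DELETE SET NULL, file_path TEXT);
CREATE TABLE node_spans (span_id INTEGER PRIMARY KEY,
    node_id INTEGER REFERENCES nodes(node_id) ON DELETE CASCADE,
    file_id INTEGER, file_path TEXT, is_primary INTEGER);
CREATE TABLE deletion_log (snapshot TEXT);
""")
con.execute("INSERT INTO files VALUES (1, 'a.cs'), (2, 'b.cs')")
con.execute("INSERT INTO nodes VALUES (10, 'Partial', 1, 'a.cs'), (11, 'Single', 1, 'a.cs')")
con.execute("""INSERT INTO node_spans VALUES
    (1, 10, 1, 'a.cs', 1), (2, 10, 2, 'b.cs', 0), (3, 11, 1, 'a.cs', 1)""")

def delete_file(con, file_id):
    with con:  # single write transaction
        # 1) collect affected nodes, 2) delete their spans in this file
        affected = [r[0] for r in con.execute(
            "SELECT DISTINCT node_id FROM node_spans WHERE file_id = ?", (file_id,))]
        con.execute("DELETE FROM node_spans WHERE file_id = ?", (file_id,))
        # 3) journal + hard-delete span-less nodes; reassign primary otherwise
        for node_id in affected:
            span = con.execute(
                """SELECT span_id, file_id, file_path FROM node_spans
                   WHERE node_id = ? ORDER BY file_path LIMIT 1""", (node_id,)).fetchone()
            if span is None:
                snap = con.execute("SELECT * FROM nodes WHERE node_id = ?", (node_id,)).fetchone()
                con.execute("INSERT INTO deletion_log VALUES (?)", (json.dumps(snap),))
                con.execute("DELETE FROM nodes WHERE node_id = ?", (node_id,))
            else:
                con.execute("UPDATE node_spans SET is_primary = 1 WHERE span_id = ?", (span[0],))
                con.execute("UPDATE nodes SET file_id = ?, file_path = ? WHERE node_id = ?",
                            (span[1], span[2], node_id))
        # 4) journal, then delete the file node last
        con.execute("INSERT INTO deletion_log SELECT 'file:' || file_path "
                    "FROM files WHERE file_id = ?", (file_id,))
        con.execute("DELETE FROM files WHERE file_id = ?", (file_id,))

delete_file(con, 1)
# Partial class survives with its b.cs span promoted to primary;
# the single-span node was journaled and removed.
print(con.execute("SELECT file_path FROM nodes WHERE node_id = 10").fetchone()[0])  # b.cs
print(con.execute("SELECT COUNT(*) FROM nodes WHERE node_id = 11").fetchone()[0])   # 0
```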

  • Renamed/moved file (Phase 1): Within a ChangeBatch, creates and modifications are processed before deletes. The ingest pipeline first parses the new file, matches nodes by symbol_key via upsert on (language, project_id, symbol_key, symbol_disambiguator), and updates node_spans to reference the new file_id and file_path. Only then are deletes processed (remove orphaned spans; delete nodes only if no spans remain). This preserves node_id and cached embeddings for pure file moves without requiring Phase 2b rename detection. Phase 2b adds rename detection: git diff --find-renames (opportunistic) and chunk_fingerprint similarity ≥0.80 (fallback) will preserve node_id and cached data across moves. Git-based detection is opportunistic --- if git is unavailable, the repo is not a git repo, the clone is shallow, or the command fails, the system falls back to fingerprint similarity. Fields commit_hash and last_modified_at are nullable and not required for correctness.

  • Renamed symbol (Phase 1): Treated as old symbol deleted + new symbol created. Cached data on the old node is lost. Phase 2b adds similarity-based rename detection: same container, same node kind, same parameter arity/types, high body similarity (chunk_fingerprint ≥0.80), overlapping source span. When a match is found, the existing node_id is reused.

  • Uncommitted rename detection (deferred to Phase 2b): The file watcher will implement a short-lived rename correlation window in Phase 2b alongside chunk_fingerprint support. In Phase 1, IDE-driven renames that fire as delete + create are handled as two separate events (old path deleted, new path created).

  • Modified file: Reparse with tree-sitter + semantic pass. Compare chunk_hash per node. Changed nodes: update, regenerate eager embedding. Edges from changed nodes recomputed. Unchanged neighbors keep their caches.

  • New file: Full parse with no existing state.
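
The symbol_key-based upsert that lets a renamed/moved file keep its node_id (and therefore its edges and cached embedding) can be sketched as follows (illustrative values; the real nodes table has many more columns):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
CREATE TABLE nodes (
    node_id INTEGER PRIMARY KEY,
    language TEXT, project_id TEXT, symbol_key TEXT, symbol_disambiguator TEXT,
    file_path TEXT, embedding BLOB,
    UNIQUE (language, project_id, symbol_key, symbol_disambiguator))
""")
con.execute("""INSERT INTO nodes VALUES
    (42, 'cs', 'p1', 'MyApp.Auth.AuthService', '', 'src/Auth.cs', x'AB')""")

# Re-ingesting the same symbol from its new location (create processed
# before delete) updates the path but leaves node_id and embedding intact.
con.execute("""
INSERT INTO nodes (language, project_id, symbol_key, symbol_disambiguator, file_path)
VALUES ('cs', 'p1', 'MyApp.Auth.AuthService', '', 'src/Services/Auth.cs')
ON CONFLICT (language, project_id, symbol_key, symbol_disambiguator)
DO UPDATE SET file_path = excluded.file_path
""")
row = con.execute("SELECT node_id, file_path, embedding FROM nodes").fetchone()
print(row)  # (42, 'src/Services/Auth.cs', b'\xab')
```

The subsequent delete then finds no orphaned spans for this node and leaves it alone, which is why create-before-delete ordering matters.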

Semantic-context invalidation (project/program changes). Certain changes can invalidate semantic edges without any source node body change (binding changes). The ingest pipeline MUST detect changes to semantic context files and trigger re-enrichment:

  • C# context files (minimum): *.sln, *.csproj, Directory.Build.props, Directory.Build.targets, global.json, nuget.config, packages.lock.json, Directory.Packages.props (if present).
  • TypeScript context files (minimum): tsconfig*.json, package.json, lockfiles (package-lock.json, yarn.lock, pnpm-lock.yaml), workspace tool configs (Nx/Turborepo/etc).

Action: Re-run semantic enrichment for the affected project/package scope (at least all files in that project/package). During re-enrichment, edges from prior semantic runs for that scope MUST be replaced atomically (delete prior semantic edges for affected scope + insert new edges in a single write transaction) to avoid mixed-staleness. This is the conservative approach; future optimization may narrow the impact set using the language service's dependency tracking (e.g., TS Language Service can identify which files are affected by a config change).

Note: This is a distinct invalidation class from source-driven invalidation. Both are required for correctness. The InvalidationPlanner (see below) must handle both classes.

Edge invalidation is source-driven and type-aware (not blanket-conservative). Behavioral edges are recomputed only when the source node changes, not when the target changes. Implement as a standalone InvalidationPlanner (expanded from EdgeInvalidationRuleEngine) that outputs a structured invalidation plan covering:

  • File-level reparse decisions
  • Semantic scope re-enrichment (project/package)
  • FTS/vector row rebuilds for affected nodes
  • Stale edge deletion for affected scope

The planner accepts ChangeType + EdgeType + optional SemanticContextChange and returns the invalidation action + resulting confidence. This centralizes correctness logic and makes it testable. Do not scatter if/else checks through the pipeline. Must be unit-testable against the decision matrix (Section 6.6).
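
A table-driven planner keeps this logic in one testable place. A minimal sketch (rule keys and change-type names are illustrative; entries mirror a subset of the Section 6.6 matrix, and the ChangeBatch plumbing is elided):

```python
from dataclasses import dataclass

# (edge_type, change_type) -> (action, resulting confidence).
# "semantic_or_ts" = exact under the semantic pass, approximate under
# the tree-sitter fallback.
RULES = {
    ("calls", "source_body_changed"): ("recompute_semantic", "semantic_or_ts"),
    ("calls", "target_body_changed"): ("no_action", "unchanged"),
    ("references", "source_body_changed"): ("recompute_semantic", "semantic_or_ts"),
    ("references", "target_body_changed"): ("no_action", "unchanged"),
    ("contains", "child_symbol_changed"): ("recompute_from_ast", "exact"),
    ("contains", "method_body_changed"): ("no_action", "unchanged"),
    ("imports", "import_statements_changed"): ("recompute_from_ast", "exact"),
}

@dataclass
class InvalidationDecision:
    action: str
    confidence: str

def plan(edge_type: str, change_type: str,
         semantic_context_changed: bool = False) -> InvalidationDecision:
    # Semantic-context changes override per-edge rules for semantic edge
    # types: the whole scope is atomically replaced (Section 6.5).
    if semantic_context_changed and edge_type in {"calls", "references", "overrides", "extends"}:
        return InvalidationDecision("atomic_scope_replace", "semantic_or_ts")
    action, conf = RULES.get((edge_type, change_type), ("no_action", "unchanged"))
    return InvalidationDecision(action, conf)

print(plan("calls", "target_body_changed").action)        # no_action
print(plan("calls", "target_body_changed", True).action)  # atomic_scope_replace
```

Because the matrix is data, the unit tests reduce to asserting the table against the spec row by row.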

Write coalescing: File watcher collects changes with a debounce window (default 2s, configurable). Batch processed in chunks of at most 500 nodes per SQLite write transaction (consistent with §4.5.2 bounded batch size), ensuring WAL growth stays bounded even for large batches. Burst recovery mode: >100 files in 30s window → collect for 30s more, process in chunked transactions, then revert to normal debounce.
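
The debounce behavior can be sketched with an injected clock (class and method names are hypothetical; burst recovery mode is elided):

```python
class ChangeCoalescer:
    """Debounce file events: flush only after `window` seconds of quiet."""

    def __init__(self, window=2.0):
        self.window = window
        self.pending = {}        # path -> latest change kind
        self.last_event_at = 0.0

    def on_event(self, path, kind, now):
        self.pending[path] = kind  # later events for a path supersede earlier
        self.last_event_at = now

    def poll(self, now):
        """Return a batch once the debounce window elapses with no events."""
        if self.pending and now - self.last_event_at >= self.window:
            batch, self.pending = self.pending, {}
            return batch
        return None

c = ChangeCoalescer(window=2.0)
c.on_event("src/a.cs", "modified", now=0.0)
c.on_event("src/a.cs", "modified", now=1.5)  # re-arms the window
assert c.poll(now=3.0) is None               # only 1.5s of quiet so far
batch = c.poll(now=3.6)                      # 2.1s of quiet -> flush
print(batch)  # {'src/a.cs': 'modified'}
```

Passing `now` explicitly makes the window logic unit-testable without sleeping, which is how the Rust side would test it as well.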

Batch updates at scale: Large refactoring commits are processed in chunked transactions of at most 500 nodes each (not per-file, but also not as a single unbounded transaction). The exception is atomic semantic edge replace (§6.5), which uses one transaction per affected project/package scope. calling_file_count is dynamic, so no cascading metadata updates on called nodes. Scaling note for large scopes: For very large projects/packages (10K+ files), a single atomic edge replace transaction can produce substantial WAL growth and long write-lock duration. If this becomes a bottleneck, consider a staged replacement strategy: (1) write new edges to a staging table, (2) swap old edges for new edges in a single short transaction (DELETE old + INSERT FROM staging), (3) drop the staging table. This bounds the critical section to two bulk DML statements rather than thousands of individual inserts.
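
The staged replacement strategy can be sketched as follows (the `scope` column is a simplification — real scoping is derived from project/package membership via joins):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE edges (source_id INTEGER, target_id INTEGER, type TEXT, scope TEXT);
CREATE TABLE edges_staging (source_id INTEGER, target_id INTEGER, type TEXT, scope TEXT);
""")
con.execute("INSERT INTO edges VALUES (1, 2, 'calls', 'pkg-a'), (9, 9, 'calls', 'pkg-b')")

# 1) Stream newly computed edges into the staging table outside the critical
#    section; this can take as long as enrichment needs.
con.executemany("INSERT INTO edges_staging VALUES (?, ?, 'calls', 'pkg-a')",
                [(1, 3), (4, 5)])

# 2) Swap in one short transaction: the critical section is two bulk DML
#    statements, not thousands of individual inserts.
with con:
    con.execute("DELETE FROM edges WHERE scope = 'pkg-a'")
    con.execute("INSERT INTO edges SELECT * FROM edges_staging")
con.execute("DELETE FROM edges_staging")  # 3) clear (or drop) staging

print(sorted(con.execute("SELECT source_id, target_id FROM edges").fetchall()))
# [(1, 3), (4, 5), (9, 9)]
```

Edges outside the affected scope (pkg-b) are untouched, and readers never observe a mixed old/new state for pkg-a.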

6.6 Edge Invalidation Decision Matrix

Edge invalidation is source-driven: behavioral edges (calls, references) are determined by the source (caller) node's body. A change to the callee's implementation does not require recomputation of edges originating from callers. This prevents "recomputation storms" during large refactors where many callees change but callers remain stable.


| Edge Type | Change Type | Invalidation Action | Resulting Confidence |
| --- | --- | --- | --- |
| calls | Source node's chunk_hash changes | Recompute via semantic pass | exact if semantic pass; approximate if tree-sitter |
| calls | Target node's body-only change | No action (callee body doesn't affect call edge) | Unchanged |
| calls | Target node_id preserved (rename detected) | No action (edges reference node_id, which is stable) | Unchanged |
| calls | Target node_id replaced (delete+create) | ON DELETE CASCADE removes stale edges; callers re-establish on next semantic pass | exact if semantic; approximate if tree-sitter |
| references | Source node's chunk_hash changes | Recompute via semantic pass | exact if semantic pass; approximate if tree-sitter |
| references | Target node's body-only change | No action | Unchanged |
| contains | Child symbol added/removed/moved | Recompute from AST | Always exact |
| contains | Method body change only | No action | Unchanged |
| inherits | Subtype declaration header changes | Recompute from declaration | exact if semantic; approximate if tree-sitter |
| inherits | Base type header or body change | No action (base changes don't affect subtype edges) | Unchanged |
| implements | Implementing type's declaration header changes | Recompute from declaration | exact if semantic; approximate if tree-sitter |
| implements | Interface header or body change | No action | Unchanged |
| overrides | Overriding method's declaration header changes | Recompute via semantic pass | exact if Roslyn/TS; approximate otherwise |
| overrides | Base method change only | No action | Unchanged |
| accepts | Component or props interface changes | Recompute via TS adapter | exact if TS LS; probable if structural match |
| extends | Extension method sig or target type | Recompute via Roslyn | exact if Roslyn; edge removed if fails |
| imports | Import/using statements change | Recompute from AST | Always exact |


Additional recomputation triggers (node_id lifecycle): When a target node's identity changes (rename or move), the outcome depends on whether node_id is preserved. If node_id is preserved (Phase 2b rename detection), edges are automatically valid. If node_id is replaced (Phase 1 delete+create), ON DELETE CASCADE removes stale edges; callers are re-enriched on their next semantic pass to establish edges to the new node_id.

Phase 1 behavior: When a symbol is renamed, the old node is hard-deleted (a snapshot is written to the deletion journal first; ON DELETE CASCADE removes all its edges), a new node is created, and caller edges pointing to the old target are recreated when callers are next re-enriched by the semantic pass. No edge remapping machinery exists in Phase 1.

Phase 2b behavior (edge remapping on identity changes): The system MUST NOT rely on ad-hoc remapping by symbol_key unless it has an explicit rename mapping. The preferred approach is node_id preservation: when a rename/move is detected (by git rename detection or fingerprint similarity), reuse the existing node_id, which automatically preserves all edges without remapping. When node_id cannot be preserved, use a rename mapping table: node_identity_map(language TEXT NOT NULL, project_id BLOB(16) NOT NULL, old_symbol_key TEXT NOT NULL, old_symbol_disambiguator TEXT NOT NULL, new_node_id BLOB(16) NOT NULL, new_symbol_key TEXT NOT NULL, new_symbol_disambiguator TEXT NOT NULL, detected_at TIMESTAMP NOT NULL, reason TEXT NOT NULL) and use it to update edges.target_id / edges.source_id where applicable. If no mapping exists, stale edges MUST be dropped on next semantic recomputation for the affected scope or during GC, and MUST NOT be silently "guessed" to a new target.

Semantic-context global trigger: If semantic context changes for a project/package (see Section 6.5, Semantic-context invalidation), recompute all semantic edges (calls, references, overrides, extends (C#), TS resolution-driven edges) within that scope, even if individual source chunk_hash values did not change. Prior semantic edges for the affected scope MUST be deleted and replaced atomically (single write transaction).

parse_status degradation rule: When a file degrades from full to syntactic_only, all exact edges from that file produced by the semantic pass are downgraded to approximate. When failed, all edges from contained symbols are removed (symbols themselves removed). Scope safety: Semantic edge updates (including degradation) are applied as: delete prior semantic edges for affected scope + insert new edges in one transaction. When the semantic child process restarts and returns partial results, the system must not mix old exact edges with new approximate ones --- the atomic replace applies to the entire affected scope.

7. Storage Layout

Single SQLite database file (.codeagent/index.db). Three extensions:

  • Plain SQLite tables: nodes, edges (source_id, target_id, type, confidence, extractor_version; both FKs with ON DELETE CASCADE; no weight field in Phase 1), node_spans (for multi-file/multi-span nodes; see Section 5.2), deletion_log (node snapshots written before hard-delete; queried by Phase 2b rename detection; swept periodically). The node_identity_map table is added in Phase 2b when rename detection is implemented. The nodes.file_id FK uses ON DELETE SET NULL (not CASCADE) to prevent multi-span node data loss --- see Section 5.2 for the explicit deletion logic. Multi-hop traversals via WITH RECURSIVE CTEs.

  • sqlite-vec: vec_nodes (embedding vectors). Approximate nearest-neighbor search.

  • FTS5: fts_nodes (name, qualified_name, parameter_signature, return_type). BM25 full-text search.

All three share node_id as join key. Retrieval can join across all in a single SQLite session. Graph traversals execute in <5ms at 100K nodes with indexes on (source_id, type) and (target_id, type).
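
Multi-hop traversal over the plain edges table can be sketched with a recursive CTE (Python sqlite3 stand-in; a bounded-depth transitive-callees query, index name illustrative):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE edges (source_id INTEGER, target_id INTEGER, type TEXT);
CREATE INDEX idx_edges_source ON edges(source_id, type);
""")
# Call chain 1 -> 2 -> 3, plus 1 -> 4.
con.execute("INSERT INTO edges VALUES (1, 2, 'calls'), (2, 3, 'calls'), (1, 4, 'calls')")

rows = con.execute("""
WITH RECURSIVE reachable(node_id, depth) AS (
    SELECT target_id, 1 FROM edges WHERE source_id = ? AND type = 'calls'
    UNION
    SELECT e.target_id, r.depth + 1
    FROM edges e JOIN reachable r ON e.source_id = r.node_id
    WHERE e.type = 'calls' AND r.depth < 3   -- bounded hop count
)
SELECT node_id, MIN(depth) FROM reachable GROUP BY node_id ORDER BY 2, 1
""", (1,)).fetchall()
print(rows)  # [(2, 1), (4, 1), (3, 2)]
```

UNION (rather than UNION ALL) deduplicates visited nodes, which also terminates traversal over cyclic call graphs.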

7.1 Required PRAGMAs

Set on every connection (readers and writer):

  • PRAGMA foreign_keys = ON; --- without this, ON DELETE CASCADE silently fails.

  • PRAGMA mmap_size = <proportional to DB size>; --- default 256 MB, increase up to 2 GB for large codebases. Configurable under indexing.mmap_size.

  • PRAGMA busy_timeout = 5000; --- graceful wait on write contention.

Set by the writer thread (or DB creation path) only:

  • PRAGMA journal_mode = WAL; --- concurrent reads during writes. Read-only connections inherit WAL mode automatically once set.

  • PRAGMA synchronous = NORMAL; --- write speed boost, corruption-safe for a recoverable index.

  • PRAGMA wal_autocheckpoint = 10000; --- raise from default (1000 pages) to avoid mid-query checkpoint stalls. Explicit PRAGMA wal_checkpoint(PASSIVE) scheduled during idle periods (see Section 4.5.2).
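
The reader/writer split can be sketched as follows (Python sqlite3 stand-in for the rusqlite connections; note that journal_mode = WAL is a database-level property, so the reader observes it without setting it):

```python
import sqlite3, tempfile, os

READER_PRAGMAS = [
    "PRAGMA foreign_keys = ON",      # required, or ON DELETE CASCADE silently fails
    "PRAGMA mmap_size = 268435456",  # 256 MB default, configurable
    "PRAGMA busy_timeout = 5000",
]
WRITER_PRAGMAS = [
    "PRAGMA journal_mode = WAL",
    "PRAGMA synchronous = NORMAL",
    "PRAGMA wal_autocheckpoint = 10000",
]

db = os.path.join(tempfile.mkdtemp(), "index.db")
writer = sqlite3.connect(db)
for p in READER_PRAGMAS + WRITER_PRAGMAS:
    writer.execute(p)

reader = sqlite3.connect(db)
for p in READER_PRAGMAS:
    reader.execute(p)

# The reader inherits WAL mode set by the writer.
print(reader.execute("PRAGMA journal_mode").fetchone()[0])  # wal
```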

7.2 FTS5 Configuration

Configure unicode61 with expanded tokenchars to preserve code identifiers. The correctly quoted SQL definition is: CREATE VIRTUAL TABLE fts_nodes USING fts5(name, qualified_name, parameter_signature, return_type, tokenize = 'unicode61 tokenchars ''_.$:@''', prefix = '2 3 4'). Without expanded tokenchars, AuthService.Authenticate is tokenized as three words, destroying exact-match precision. The $ character preserves TS/JS identifiers, : preserves C# and C++ namespace separators (e.g., C# global::MyApp.Core or C++ std::vector), and @ preserves C# verbatim identifiers. The prefix index (lengths 2, 3, 4) accelerates the wildcard queries used in query analysis (e.g., auth* OR service*); without it, prefix queries trigger a full FTS scan. Prefix indexes increase DB size --- measure the trade-off on a representative codebase, but for code search they are typically worth it. Phase 1 must include a test that asserts tokenization outcomes for representative identifiers (e.g., AuthService.Authenticate, $variable, global::MyApp.Core, @event). Note: ' and " are deliberately excluded from tokenchars because including them as token characters causes query parsing edge cases and escaping burden with little benefit.

Characters deliberately excluded from tokenchars: <, >, and , are not included because they would cause generic type signatures (e.g., IEnumerable<string>) to tokenize as single massive tokens, hurting BM25 recall for partial matches. Generic signatures are handled by the qualified-name channel instead.

Exact symbol lookup: FTS5 is treated as "best effort" for fuzzy text search. Exact symbol resolution uses the nodes table unique index on (language, project_id, symbol_key, symbol_disambiguator) --- this is the canonical internal identity path, not FTS5. For user-facing queries that provide a qualified_name rather than a full symbol_key, a secondary index on qualified_name supports lookup with overload disambiguation handled by the retrieval pipeline.
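A minimal sketch of the tokenchars behavior described above, assuming an FTS5-enabled SQLite build (Python's sqlite3 is used here for illustration; the match helper is hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Expanded tokenchars keep dotted, $-prefixed, and ::-qualified identifiers
# as single tokens; the prefix index accelerates wildcard queries.
conn.execute("""
    CREATE VIRTUAL TABLE fts_nodes USING fts5(
        name, qualified_name, parameter_signature, return_type,
        tokenize = 'unicode61 tokenchars ''_.$:@''',
        prefix = '2 3 4'
    )
""")
conn.execute(
    "INSERT INTO fts_nodes VALUES (?, ?, ?, ?)",
    ("Authenticate", "AuthService.Authenticate", "(string user, string pw)", "bool"),
)

def match(query):
    """Hypothetical helper: qualified names matching an FTS5 query string."""
    return [r[0] for r in conn.execute(
        "SELECT qualified_name FROM fts_nodes WHERE fts_nodes MATCH ?", (query,))]
```

With tokenchars in place, "AuthService.Authenticate" survives as a single token, so both the exact string query and the Auth* prefix query hit the row; without tokenchars the dotted name would split into separate tokens.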

7.3 Vector Index Strategy

Flat (exact) search when node count <50,000. Auto-build HNSW index when exceeding threshold (configurable). At 1M LOC (50--100K+ nodes), HNSW is a practical Phase 1 requirement.

Gating validation spike (Phase 1, hard prerequisite for Phase 4a): this spike is a gating deliverable that blocks Phase 4 retrieval if ANN is not viable. Verify that sqlite-vec supports HNSW creation and query with the Rust-bundled SQLite binary (rusqlite with the bundled feature) across all target OSes. Acceptance criteria (must be defined before the spike begins): p95 query latency, insert throughput, index build time at 100K and 500K vectors, and memory overhead. If sqlite-vec cannot meet the defined acceptance criteria, Phase 4 MUST use the fallback embedded Rust HNSW library (e.g., hnsw_rs or instant-distance) with a mapping table (node_id → vector_offset/version) in SQLite, preserving the single-file architecture, with a clear durability/rebuild strategy. This decision must be made before Phase 4a retrieval work begins.
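Below the threshold, flat search is simply exact brute-force similarity. A pure-Python sketch of the semantics (production uses sqlite-vec; names and data shapes here are illustrative only):

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def flat_top_k(query_vec, vectors, k):
    """Exact top-K by cosine similarity over (node_id, vector) pairs.
    Used when node count is below the HNSW threshold (default 50,000)."""
    scored = [(node_id, cosine(query_vec, v)) for node_id, v in vectors]
    scored.sort(key=lambda t: t[1], reverse=True)
    return scored[:k]
```

The trade-off the spike must quantify is exactly this: flat search is O(N) per query but has perfect recall; HNSW bounds query latency at scale in exchange for build cost and approximate recall.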

7.4 Index Integrity and Recovery

On startup, run PRAGMA quick_check (fast, validates structural integrity without full B-tree verification). If it reports corruption (power loss mid-write, disk full, etc.), delete the index file and trigger a full rebuild from source. Full PRAGMA integrity_check is reserved for crash recovery or user-triggered diagnostics, as it can take several seconds on 500 MB+ databases. Since the index is fully derivable from the repository's source files, no data is lost --- only cached embeddings require regeneration. Display an estimated rebuild time to the user based on repository size (heuristic: ~1 min per 100K LOC for tree-sitter + semantic + eager embedding). Log the corruption event for diagnostics.
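The startup check and rebuild-on-corruption flow might look like the following sketch, where rebuild is a hypothetical callback standing in for the full re-index from source:

```python
import os
import sqlite3

def ensure_healthy_index(path, rebuild):
    """Run PRAGMA quick_check on startup; on corruption, delete the index
    file and trigger a full rebuild (the index is fully derivable from
    source). `rebuild` is a hypothetical callback; production code would
    also log the corruption event and show an estimated rebuild time."""
    if not os.path.exists(path):
        rebuild(path)
        return "rebuilt"
    conn = None
    try:
        conn = sqlite3.connect(path)
        row = conn.execute("PRAGMA quick_check").fetchone()
        if row and row[0] == "ok":
            return "ok"
    except sqlite3.DatabaseError:
        pass  # e.g., "file is not a database" after a partial write
    finally:
        if conn is not None:
            conn.close()
    os.remove(path)  # corrupt: discard and rebuild from source
    rebuild(path)
    return "rebuilt"
```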

Database maintenance: Run PRAGMA optimize; only after schema migrations, after a configurable number of writes, or during idle maintenance windows --- not unconditionally on every startup, which can add significant startup cost on large (500 MB+) databases. Set auto_vacuum=INCREMENTAL at database creation time and run periodic PRAGMA incremental_vacuum(N) steps (e.g., N=100 pages every 5 minutes of idle time) to reclaim index fragmentation. Avoid routine VACUUM --- it requires an exclusive lock and rewrites the entire database, which is slow and disruptive on a 500 MB+ index. If a full VACUUM is needed (e.g., after major schema changes or bulk deletions), schedule it explicitly as a user-initiated action with a progress indicator and estimated duration.

Node deletion (hard-delete with deletion journal). When a node is removed (file deleted, symbol removed), the system first writes a snapshot to the deletion_log table (node_id, symbol_key, file_path, chunk_hash, deleted_at; chunk_fingerprint added in Phase 2b), then hard-deletes the node. This applies to all node types including file nodes — file nodes are journaled before deletion so that Phase 2b file-level rename correlation can query the journal. ON DELETE CASCADE handles edge cleanup. The corresponding vec_nodes and fts_nodes rows are deleted in the same write transaction. This keeps all read paths clean --- no is_deleted filter needed anywhere.

Deletion journal purpose: In Phase 1, the journal is written but not read (cheap insurance). In Phase 2b, the rename detector queries it during the debounce window: "was a node with a similar fingerprint deleted in the last N seconds?" If a match is found, the new node reuses the old node_id, preserving edges and cached data automatically.

Deletion journal schema:

deletion_log(
  node_id             BLOB(16),
  symbol_key          TEXT NOT NULL,
  file_path           TEXT NOT NULL,
  chunk_hash          BLOB(32) NOT NULL,
  chunk_fingerprint   BLOB,        -- nullable; populated from Phase 2b
  deleted_at          TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
)

Journal sweep: Rows older than the retention period (default 1 hour) are deleted periodically during idle time. Single flat-table DELETE --- trivial compared to multi-table GC.
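A toy sketch of the journaled hard-delete in a single write transaction (schema reduced to a few columns; fts_nodes/vec_nodes cleanup, which shares the same transaction in production, is omitted):

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # required for ON DELETE CASCADE
conn.executescript("""
    CREATE TABLE nodes(node_id TEXT PRIMARY KEY, symbol_key TEXT,
                       file_path TEXT, chunk_hash TEXT);
    CREATE TABLE edges(source_id TEXT,
                       target_id TEXT REFERENCES nodes(node_id) ON DELETE CASCADE);
    CREATE TABLE deletion_log(node_id TEXT, symbol_key TEXT, file_path TEXT,
                              chunk_hash TEXT, chunk_fingerprint BLOB,
                              deleted_at REAL NOT NULL);
""")

def delete_node(node_id):
    """Snapshot the node into deletion_log, then hard-delete it.
    Edge cleanup is handled by ON DELETE CASCADE in the same transaction."""
    with conn:  # one write transaction
        conn.execute("""
            INSERT INTO deletion_log(node_id, symbol_key, file_path, chunk_hash, deleted_at)
            SELECT node_id, symbol_key, file_path, chunk_hash, ?
            FROM nodes WHERE node_id = ?
        """, (time.time(), node_id))
        conn.execute("DELETE FROM nodes WHERE node_id = ?", (node_id,))
```

Because the journal row is written before the delete inside the same transaction, a crash either leaves both or neither, so the Phase 2b rename detector can always trust the journal.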

Embedding storage: Embeddings are stored as a single row per node in sqlite-vec. When a node's embedding is updated (e.g., on re-embedding), the row is overwritten in the write transaction. WAL mode ensures readers see a consistent snapshot. No append-only versioning, no version column, no pruning logic needed.

7.5 Write Queue and Backpressure

Single-threaded write queue (see Section 4.5.1). If depth exceeds threshold (default 10 batches, configurable), file watcher enters backpressure state. WAL mode ensures concurrent reads are not blocked by writes (see Section 4.5.2).

7.6 Database Size and Distribution

At 100K nodes, expect 500 MB--1 GB fully populated. The .codeagent/ directory is gitignored by default (a .gitignore file is created at .codeagent/.gitignore on first index). A 500 MB--1 GB SQLite file in version control causes merge conflicts, clone bloat, and CI churn. For CI scenarios, rebuilding from source is preferred (estimated rebuild time: ~1 min per 100K LOC) since CI environments typically have fresh checkouts. Deferred: An "Export Index" feature for team sharing (optionally stripping summaries and/or vectors, distributable via Git LFS or CI artifact stores) will be designed when team sharing becomes a priority. By that point there will be better data on what users actually need.

7.7 Configuration

All settings in .codeagent/config.json with namespaced sections: retrieval (weights, max_output_tokens, penalties), indexing (hnsw_threshold, maintenance, idle_timeout, watchdog_threshold, write_debounce_ms, write_queue_depth, max_traversal_depth, max_signature_length, generated_file_patterns, mmap_size, safe_mode, allow_nuget_restore, follow_symlinks), embedding (model_name, dimensionality), orchestration (rate_limits: tpm_limit/rpm_limit/max_concurrent_sub_lm, model_context_limit). Anticipated future config sections (not parsed in Phase 1): external_apis, indexing.export_strip_summaries, indexing.export_strip_vectors, indexing.fingerprint_similarity_threshold, indexing.prewarm (Phase 5), orchestration.model_profile.

Environment variable overrides: Support environment variable overrides for all config keys using the pattern CODEAGENT_<SECTION>_<KEY> (e.g., CODEAGENT_INDEXING_WRITE_DEBOUNCE_MS=5000). This enables CI/CD scenarios where configuration changes should not be committed to the repository.
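A sketch of the override pattern, assuming section names are single words so the first underscore after the prefix splits section from key (the helper name and the string-valued overlay are illustrative; type coercion is left to each consumer):

```python
import os

def apply_env_overrides(config, environ=None):
    """Overlay CODEAGENT_<SECTION>_<KEY> environment variables onto a
    nested config dict. Sections (retrieval, indexing, embedding,
    orchestration) are single words, so the first underscore after the
    prefix separates section from key; keys may contain underscores."""
    environ = os.environ if environ is None else environ
    for name, value in environ.items():
        if not name.startswith("CODEAGENT_"):
            continue
        section, _, key = name[len("CODEAGENT_"):].partition("_")
        if not key:
            continue  # malformed: no key part
        config.setdefault(section.lower(), {})[key.lower()] = value
    return config
```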

7.8 Schema Migrations

The _metadata table stores a schema_version integer (starting at 1) alongside the existing model identifier and dimensionality fields. On startup, the application compares the on-disk schema_version against the expected version compiled into the binary:

  • Version match: Proceed normally.

  • Older version on disk: Run forward migrations sequentially. Migrations are defined as numbered SQL scripts (e.g., migrations/002_add_node_spans.sql) and executed within a single transaction. Each migration updates schema_version on completion.

  • Newer version on disk (downgrade): Refuse to open the index and display an error directing the user to upgrade the application. Downgrades are not supported.

  • Missing _metadata table: Treat as version 0 (pre-migration schema). Run all migrations from the beginning.

Migration scripts are embedded in the application binary (Rust include_str!) and are not user-editable. Migrations must be idempotent where possible (use IF NOT EXISTS, IF EXISTS). A _migration_log table records each migration's version, timestamp, and duration for diagnostics.
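The version-check-and-forward-migrate loop can be sketched as follows. The migration SQL here is placeholder content, and each migration runs in its own transaction in this sketch; in the Rust implementation the scripts would be embedded via include_str!:

```python
import sqlite3

# Hypothetical embedded migration scripts, numbered from 1.
MIGRATIONS = {
    1: "CREATE TABLE IF NOT EXISTS nodes(node_id TEXT PRIMARY KEY)",
    2: "CREATE TABLE IF NOT EXISTS node_spans(node_id TEXT, start_line INT, end_line INT)",
}
EXPECTED_VERSION = max(MIGRATIONS)

def migrate(conn):
    """Compare on-disk schema_version against the expected version and
    apply numbered scripts in order. Missing metadata row = version 0;
    a newer on-disk version is a refused downgrade."""
    conn.execute("CREATE TABLE IF NOT EXISTS _metadata(schema_version INTEGER)")
    row = conn.execute("SELECT schema_version FROM _metadata").fetchone()
    current = row[0] if row else 0
    if current > EXPECTED_VERSION:
        raise RuntimeError("index created by a newer version; downgrades unsupported")
    for version in range(current + 1, EXPECTED_VERSION + 1):
        with conn:  # each migration commits atomically with its version bump
            conn.execute(MIGRATIONS[version])
            conn.execute("DELETE FROM _metadata")
            conn.execute("INSERT INTO _metadata(schema_version) VALUES (?)", (version,))
    return EXPECTED_VERSION
```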

8. Retrieval Pipeline

Accepts a user query string. Returns a ranked, token-bounded set of nodes with metadata and relevant source spans.

8.1 Query Analysis

  • Detect qualified symbol names (e.g., AuthService.Authenticate) → resolve via two-step lookup: (1) query nodes by qualified_name (indexed), which may return multiple overloads; (2) if ambiguous, let the retrieval pipeline rank candidates by parameter signature match and context. The (language, project_id, symbol_key, symbol_disambiguator) unique index is the canonical internal identity, but user queries provide qualified_name, not full symbol_key. No embedding needed for this channel.

  • Produce query embedding for semantic search (locally via the ONNX model).

  • Extract BM25 tokens. Pre-process for FTS5: if no qualified symbol, append wildcards and join with OR (e.g., "auth service" → "auth* OR service*"). FTS5 handles exact/prefix lookup; vector channel bridges the conceptual gap for natural-language queries.
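The qualified-symbol detection and wildcard preprocessing described above, as a small sketch (the regex is an illustrative approximation of "dot-separated identifier," not the spec):

```python
import re

# Hypothetical pattern: one or more dot-separated identifier segments,
# allowing $ and @ to match the FTS5 tokenchars set.
QUALIFIED = re.compile(r"[A-Za-z_$@][\w$@]*(\.[A-Za-z_$@][\w$@]*)+")

def looks_qualified(text):
    """True for a single qualified symbol like AuthService.Authenticate,
    which is routed to the qualified-name channel instead of FTS5."""
    return bool(QUALIFIED.fullmatch(text.strip()))

def to_fts_query(text):
    """Free text -> FTS5 wildcard OR-query: "auth service" -> "auth* OR service*"."""
    tokens = [t for t in text.split() if t]
    return " OR ".join(f"{t}*" for t in tokens)
```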

Query intent classifier: Categorize as targeted (specific symbol/path → depth-first), cross-cutting (pattern across modules → maximize breadth), or exploratory (broad → diverse entry points).

Phase 4 heuristic rules (keyword-based): (1) Targeted: query contains a qualified symbol name (dot-separated identifier), a file path, or references a single specific entity. (2) Cross-cutting: query contains quantifiers ("all," "every," "across") combined with plural nouns, or references a pattern/trait rather than a specific symbol (e.g., "all external API calls," "every method that throws"). (3) Exploratory: query is a natural-language question without symbol references or quantifiers (e.g., "how does authentication work"). These rules are upgradeable to LLM-based classification in a later phase.
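The Phase 4 heuristic rules, sketched with illustrative regexes (real plural-noun detection and file-path handling would be more careful; this only shows the precedence order):

```python
import re

QUANTIFIERS = re.compile(r"\b(all|every|across)\b", re.IGNORECASE)
QUALIFIED = re.compile(r"\b[A-Za-z_]\w*(\.[A-Za-z_]\w*)+\b")   # dot-separated identifier
PATHLIKE = re.compile(r"[\w./\\-]+\.(cs|ts|tsx)\b")            # crude file-path detector

def classify_intent(query):
    """Keyword-based intent classification: targeted (specific symbol or
    path), cross-cutting (quantifier over a pattern), else exploratory."""
    if QUALIFIED.search(query) or PATHLIKE.search(query):
        return "targeted"
    if QUANTIFIERS.search(query):
        return "cross-cutting"
    return "exploratory"
```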

8.2 Parallel Retrieval

  • Vector channel: sqlite-vec ANN search. Top-K by cosine similarity.

  • BM25 channel: FTS5 full-text search over names and identifiers. Top-K by BM25 score.

  • Qualified-name channel: exact lookup via the qualified_name index on the nodes table. May return multiple candidates when overloads exist (e.g., methods with the same name but different parameter signatures). When a single match is found, it is returned with exact precision. When multiple candidates match, the retrieval pipeline disambiguates by parameter signature match and context, or returns all candidates ranked.

8.3 Simplified Reranker

Scoring uses Reciprocal Rank Fusion (RRF) to merge results from the three retrieval channels. Raw BM25 scores and cosine similarity values are on different scales and distributions; combining them via linear weighted sum produces unstable rankings across corpora and query types. RRF is robust, requires no per-corpus normalization, and is simple to implement:

base_rrf_score = sum over channels c of: 1 / (k + rank_c(d))

where k = 60 (standard constant), and rank_c(d) is the rank of document d in channel c's result list (absent documents receive rank = infinity, contributing 0). Channels: vector, BM25, qualified-name.

final_score = base_rrf_score × penalty_multipliers + metadata_boosts

Metadata boosts (additive, applied after RRF):

  • normalized_calling_file_count: +0.1 × (node's count / max count in candidate set, range 0--1).

  • is_public_api = true: +0.05.

Hard rule: qualified-name match always ranked first regardless of score.

Penalty multipliers (fixed policy, not tunable): generated = true → ×0.3. parse_status = syntactic_only → ×0.8. parse_status = failed → excluded.

Upgrade path: Additional boosts (centrality, recency, etc.) deferred. May only enter when eval harness identifies a specific failure pattern, manual inspection confirms which signal would fix it, and harness shows improvement without regression. Speculative additions prohibited.
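Putting Section 8.3 together, a sketch of RRF fusion with penalty multipliers, metadata boosts, and the qualified-name hard rule (the input shapes and channel names are assumptions for illustration):

```python
def rrf_rerank(channel_results, penalties=None, boosts=None, k=60):
    """Reciprocal Rank Fusion over retrieval channels.
    channel_results: {channel_name: [doc_id, ...]} ranked best-first;
    absent documents contribute 0. penalties maps doc_id -> multiplier
    (e.g., 0.3 for generated code); boosts maps doc_id -> additive bonus."""
    penalties = penalties or {}
    boosts = boosts or {}
    scores = {}
    for ranking in channel_results.values():
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    final = {
        doc: base * penalties.get(doc, 1.0) + boosts.get(doc, 0.0)
        for doc, base in scores.items()
    }
    ordered = sorted(final, key=lambda d: final[d], reverse=True)
    # Hard rule: a qualified-name match is always ranked first regardless of score.
    qn = channel_results.get("qualified_name", [])
    if qn and qn[0] in final:
        ordered.remove(qn[0])
        ordered.insert(0, qn[0])
    return ordered
```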

8.4 Context Assembly

Per selected node: qualified name, file path + span, node type, key metadata (caller count, is_public_api), immediate graph neighbors (names only, no source). Source fetched on demand by the MCP client.

Context budget: The retrieval pipeline has a hard output cap: retrieval.max_output_tokens (default 16,384). The pipeline does not know about conversation history, system prompts, or model context limits --- that is the MCP client's responsibility. The client passes the desired limit when invoking retrieval tools. This separation ensures the retrieval pipeline can be tested independently (Phase 4a/4b) without client-side state.

8.5 Cross-Cutting Query Support

When intent = cross-cutting: increase K per channel by 2×, apply diversity penalty (penalize same-file nodes after standard rerank), group results by containing module (via contains edge traversal) and present module-by-module.

9. MCP Server

The index engine is exposed as an MCP (Model Context Protocol) server. Any MCP-compatible client connects to the same server using the same tools. See MCP_SERVER_SPEC.md for the full tool specification, including file system tools (list_directory, read_file, get_directory_tree), search and discovery tools (search_symbols, lookup_symbol, find_similar), inspection and navigation tools (get_symbol, get_source_spans, get_file_outline, get_callers, get_callees, get_implementations, get_references, get_dependencies, get_dependents), and engine management tools (index_files, get_status).

Key design principles:

  • The MCP server is a thin wrapper. All indexing logic remains in codeagent-core. File system operations use std::fs scoped to the repository root.

  • All tools are purely local — no network calls, no LLM involvement.

  • Read-only tools execute against the SQLite reader pool. Write operations (e.g., index_files) are routed through the writer channel.

  • Orchestration (deciding what tools to call, managing conversation state, planning) is the MCP client's responsibility, not the engine's.

10. Language Adapters

Each adapter walks a tree-sitter AST and compiler semantic model, emitting nodes/edges in the universal schema. Everything above the adapter is language-agnostic.

C# Adapter: tree-sitter-c-sharp + Roslyn (CSharpCompilation, SemanticModel). Child process architecture per Section 4.5.3; semantic pass behavior per Section 6.2. Adapter-specific details:

  • Node mapping: Namespace → module. Class, record, struct → class. Enum, delegate, type alias → type. Getter/setter → separate method nodes. Extension methods → method + extends edge.
  • Partial class merging: Same qualified_name across files → single node with multiple node_spans entries.
  • symbol_key: Roslyn's ISymbol.ToDisplayString() + parameter types. Phase 1 fallback (tree-sitter only): qualified_name + symbol kind + parameter count + parameter types-as-written + generic arity (overload-safe without Roslyn).
  • Edge extraction: LINQ query syntax desugared by Roslyn before call-edge extraction. Async flagged but no structural difference.

TypeScript/React Adapter: tree-sitter-typescript + TS Language Service. Child process architecture per Section 4.5.3; semantic pass behavior and safe mode per Section 6.2. Adapter-specific details:

  • Node mapping: Barrel index.ts files (with re-exports only) → module. Regular .ts/.tsx source files → file node (no separate module node); the file node serves as the container for contains edges to its symbols. React components identified heuristically (returns JSX, capital-letter name, or React.FC annotation) → node_type = component with accepts edge to props interface.
  • symbol_key: For exported symbols: package_id + export-qualified name + symbol kind + normalized signature (parameter count + types). Declaring file path stored as metadata only (not part of exported symbol_key). For non-exported file-scoped symbols: file-scoped name + file_id + symbol kind + normalized signature. symbol_disambiguator (declaration-header hash) for collision resolution among exported symbols.
  • Barrel files: Detected and short-circuited --- edges bypass to actual definitions.
  • Dynamic dispatch: Confidence-tagged edges (exact for resolved calls, probable for structural typing matches).

11. Implementation Phases

Phase 1 --- Foundation

  • SQLite setup: plain tables (nodes, edges, node_spans), sqlite-vec, FTS5, WITH RECURSIVE CTEs with depth limit (default 20 hops).

  • Universal node/edge schema definition (including project node type, symbol_key deterministic identity, is_public_api computed field, edge extractor_version field).

  • Schema migration infrastructure: _metadata.schema_version, numbered migration scripts, startup version check with forward migration.

  • File watcher + incremental change detection (chunk_hash comparison) + symlink/junction guard (realpath validation against repo root) + semantic-context file monitoring (detect and log changes to *.csproj, tsconfig*.json, package.json, lockfiles, and other project config files as semantic-context changes distinct from source changes; invalidation actions are deferred to Phase 2a when the semantic pass exists --- Phase 1 detects and records these events but has no semantic enrichment to re-trigger). Note: chunk_fingerprint computation, rename/move detection, and uncommitted rename correlation are deferred to Phase 2b.

  • Tree-sitter parsing harness for both languages (no semantic enrichment yet).

  • Write coalescing: debounce window, write queue with backpressure, WAL concurrency strategy (Section 4.5.2), burst recovery mode.

  • InvalidationPlanner (initial version): centralized component that accepts change events and outputs structured invalidation plans. Phase 1 scope: source-driven invalidation only (chunk_hash comparison → reparse/re-embed decisions, FTS/vector row rebuilds for changed nodes, stale edge deletion for changed scope). Unit-testable against the Section 6.5 decision matrix. Extended in Phase 2a (semantic-context invalidation, parse_status degradation rules) and Phase 2b (rename detection integration).

  • Local embedding module: ONNX Runtime integration (ort crate), model loading, batch inference pipeline.

  • Index integrity check on startup (PRAGMA quick_check; full integrity_check on crash recovery or user-triggered diagnostic) with full rebuild on corruption.

  • Basic structured logging: tool call name, latency, token count. Sufficient for Phase 4b retrieval tuning.

  • Health-check command (e.g., codeagent health) verifies SQLite integrity, migration status, and basic query functionality.

  • Verify graph construction on sample C# and TS projects.

  • Debug CLI: thin wrapper over graph query modules, outputting JSON to stdout. Supports get_node, get_neighbors, get_source, get_outline, filter_nodes, and basic qualified-name lookup. No LLM involvement, no retrieval pipeline. Primary validation and debugging tool for Phases 1 through 4a. Add search command after Phase 4b when hybrid retrieval is available.

  • Safe Indexing Mode design decision: define the safe-mode toggle, TS plugin stripping approach, MSBuild execution gating (with syntactic_only as the default safe-mode behavior for C#), NuGet restore gating, and user warning UI. Design the IPC protocol contract (handshake, cancellation, backpressure, error taxonomy) for Phase 2a implementation.

  • sqlite-vec validation spike: verify HNSW support with Rust-bundled SQLite, measure insert/update costs and query latency at 100K--500K vectors. Document fallback plan if HNSW is insufficient.

  • Tool latency budget definition: p50/p95 targets per tool type (simple lookup, graph traversal, FTS5, vector search). Include perf tests for representative query patterns at target scale.

Phase 2a --- Semantic Enrichment

  • Roslyn integration: call edges, type hierarchies, overrides, partial class merging.

  • TS Language Service integration: call edges, module boundaries, component detection.

  • Edge confidence scoring. Barrel file short-circuiting.

  • Project node detection: scan for .csproj files (C#) and package.json workspaces / monorepo project directories (TypeScript). Phase 1 limitation: until project detection completes in Phase 2a, all nodes are assigned to a synthetic "repo-root" project per language. In multi-project repositories where identically-named symbols exist across projects (e.g., multiple .csproj files with internal classes sharing names), this may produce symbol_key collisions. Phase 1 MUST detect and log these collisions with an actionable warning rather than silently producing corrupt data. Phase 2a resolves this by assigning nodes to their correct project nodes.

  • IPC protocol implementation: versioned handshake (protocol_version, capabilities, extractor_version), per-request cancellation and deadlines, backpressure rules, structured error taxonomy (semantic_unavailable, project_load_failed, plugin_blocked, timeout, oom_restart).

  • Safe Indexing Mode implementation: TS Language Service instantiated via programmatic API with tsconfig.json plugin section stripped/ignored. MSBuildWorkspace loading gated behind indexing.safe_mode toggle (default true); when safe mode is active, C# semantic enrichment runs as syntactic_only (tree-sitter). NuGet restore gated separately via indexing.allow_nuget_restore (default false). User-facing warning UI when safe mode is disabled.

Phase 2b --- Rename Detection

Depends on Phase 2a being stable (rename detection preserves edges produced by the semantic pass; testing requires those edges to exist).

  • Rename/move detection: chunk_fingerprint computation (token winnowing over identifier-normalized tokens, Jaccard similarity ≥0.80 threshold). Git-based rename detection (opportunistic, git diff --find-renames). Uncommitted rename correlation in file watcher debounce window. Symbol-level rename detection via two-tier similarity heuristic (same container, same kind, same arity, high body similarity, overlapping span). Node_id preservation on detected renames. node_identity_map table (schema migration) for cases where node_id cannot be preserved. The deletion journal (written since Phase 1) provides the matching window --- the rename detector queries recent journal entries to correlate deleted nodes with newly-created ones. Populate chunk_fingerprint column in the journal (nullable since Phase 1, now filled). Full chunk_fingerprint specification: two fingerprints are considered similar when their Jaccard similarity over winnowed shingles exceeds a configurable threshold (default 0.80). For file-level rename detection, compare file node's fingerprint. For symbol-level, compare body fingerprint excluding the declaration line.
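A simplified sketch of the fingerprint idea above: identifier-normalized token shingles compared by Jaccard similarity. Winnowing is omitted and every identifier collapses to a single placeholder, which over-normalizes relative to the real design; it only illustrates why renames do not change the fingerprint:

```python
import re

def shingles(source, k=5):
    """Token k-shingles with identifiers normalized to a placeholder,
    so renaming symbols leaves the fingerprint unchanged. The real
    implementation winnows (keeps minimum hashes per window); this
    sketch keeps every shingle."""
    tokens = [
        "ID" if re.fullmatch(r"[A-Za-z_]\w*", t) else t
        for t in re.findall(r"[A-Za-z_]\w*|[^\sA-Za-z_]", source)
    ]
    return {tuple(tokens[i:i + k]) for i in range(max(0, len(tokens) - k + 1))}

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 1.0

def is_probable_rename(old_src, new_src, threshold=0.80):
    """Two chunks are rename candidates when shingle Jaccard >= threshold."""
    return jaccard(shingles(old_src), shingles(new_src)) >= threshold
```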

Phase 3 --- Embedding

  • EmbeddingProvider trait and local ONNX model integration: HashEmbeddingProvider test stub (SHA-256, L2-normalised, 384-dim); production provider wraps the ONNX Runtime session.

  • vec_nodes SQLite table for embedding storage: ensure_vec_nodes_table, insert_or_replace_embedding, delete_embedding, handle_model_change (full wipe on model change).

  • load_sqlite_vec(): registers sqlite-vec as a global auto-extension before any connection is opened.

Phase 4a --- Retrieval Channels

  • Implement three channels independently (vector, BM25, qualified-name). Raw results, no reranker.

  • Context assembly with token budget enforcement (retrieval.max_output_tokens default 16,384).

  • Benchmark retrieval latency per-channel on large real-world codebase.

  • Integration test: verify sqlite-vec virtual table does not interfere with standard indexes on (target_id, type) used by calling_file_count computation.

Phase 4b --- Merge, Rerank, and Eval

  • Prerequisite: build precision/recall eval harness first. 100 curated queries across 5 archetypes (qualified-name 20, NL behavioral 25, impact analysis 20, cross-module 20, vague/broad 15). 70/30 train/holdout split. Primary metric: NDCG@10. Secondary: Precision@5, Recall@20, MRR, success@1, success@5. All per-archetype. Reference codebases: At least one substantial C# codebase and one TS/React codebase, each >10K nodes (roughly >50K LOC), with hand-labeled relevance judgments per query. Selection criteria: real-world complexity (multiple projects/packages, non-trivial dependency graphs, mix of public and internal APIs), open-source or licensed for internal use. Curation of the query set and relevance labels is significant prep work --- begin during Phase 3 so it is ready when Phase 4b starts. Candidate sources: well-known OSS projects (e.g., a mid-size ASP.NET application for C#, a React + Node monorepo for TS), or internal codebases if available.

  • Latency metrics: Track p50 and p95 retrieval latency per archetype. Measure cold-cache (first query after index) vs warm-cache (subsequent queries) behavior separately. Target: sub-second p95 for warm cache on codebases up to 100K nodes.

  • Negative query coverage: Include at least 10 queries where the expected result is that ubiquitous utilities (e.g., ToString(), console.log) do NOT dominate the result set. Verify diversity penalty suppresses these.

  • Implement merge/dedup and RRF-based reranker (Section 8.3).

  • Validate RRF baseline with k=60. If NDCG@10 <0.6, evaluate alternative k values (20, 40, 80) and metadata boost weights. Evaluate holdout once.

  • Query intent classifier (keyword-based heuristic rules per Section 8.1) + breadth-first retrieval for cross-cutting queries.

  • NDCG@10 regression gate in CI.
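For reference, the primary metric behind the regression gate, NDCG@10 for a single query, can be computed as follows (a standard formulation with graded relevance judgments; the data shapes are assumptions):

```python
import math

def dcg(relevances):
    """Discounted cumulative gain: gains discounted by log2 of position."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg_at_k(ranked_ids, judgments, k=10):
    """NDCG@k for one query. ranked_ids is the system output (best first);
    judgments maps doc_id -> graded relevance (unlabeled docs count as 0).
    The harness would average this across the query set per archetype."""
    gains = [judgments.get(d, 0) for d in ranked_ids[:k]]
    ideal = sorted(judgments.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0
```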

Phase 5 --- MCP Server

  • Implement codeagent-mcp crate as a new workspace member.

  • File system tools: list_directory, read_file, get_directory_tree — sandboxed to repo root, respects .gitignore.

  • Index tools wrapping codeagent-core query functions: search_symbols, lookup_symbol, get_symbol, get_source_spans, get_file_outline, get_callers, get_callees, get_implementations, get_references, get_dependencies, get_dependents.

  • Engine management tools: index_files, get_status.

  • stdio and SSE transport support.

  • End-to-end testing with MCP client test harness.

  • See MCP_SERVER_SPEC.md for the full specification.

Phase 6 --- Hardening & Observability

Phase 6 expands the basic structured logging introduced in Phase 1 into full observability and hardens the system for production use:

  • Performance profiling and hot-path optimization.

  • Documentation and contributor guide for new language adapters.

  • Large-repo stress testing and adversarial input handling.

12. Key Risks and Mitigations


| Risk | Mitigation |
| --- | --- |
| Roslyn / TS LS startup and memory | Load on demand, idle timeout (15--30 min), memory watchdog (2 GB default). Project-level partitioning for large C# solutions. Isolated child processes --- a crash doesn't affect the main agent. |
| Arbitrary code execution during indexing (MSBuild targets, TS plugins) | Safe Indexing Mode (Section 6.2): TS Language Service instantiated via programmatic API with plugins stripped from tsconfig.json. MSBuild execution gated behind safe-mode toggle (default: enabled/safe). NuGet restore gated separately. User warning when safe mode disabled. Child processes are isolated by process boundary. Filesystem sandboxing is best-effort and OS-dependent; the primary mitigation is safe mode, preventing MSBuild evaluation and TS plugin execution by default. If OS-level sandboxing is implemented (future), it will be documented per platform. |
| TS dynamic dispatch producing incorrect edges | Mark with confidence; root LM treats approximate edges as hints, not facts. |
| sqlite-vec recall degrading at scale | Auto HNSW index above 50K nodes. Tune index parameters. |
| Graph becoming stale | Filesystem watcher + incremental pipeline. Uncommitted rename correlation added in Phase 2b. CI re-index on merge. |
| Index corruption | Detected on startup via PRAGMA quick_check (fast). Full integrity_check on crash recovery or user-triggered diagnostic. Recovery: delete and full rebuild from source. Index is fully derivable. Display estimated rebuild time. |
| MCP tool misuse by clients | All tools are read-only except index_files. File system access is sandboxed to the repo root. No graph mutation tools exposed. Limits on result sizes prevent unbounded responses. |


13. Third-Party Licenses

All components permit commercial use without a commercial license.


| Component | License | Notes |
| --- | --- | --- |
| SQLite + FTS5 | Public Domain | No restrictions |
| sqlite-vec | Apache 2.0 / MIT | Dual-licensed |
| tree-sitter | MIT | Core parser library |
| tree-sitter-c-sharp | MIT | C# grammar |
| tree-sitter-typescript | MIT | TypeScript/TSX grammar |
| Roslyn (Microsoft.CodeAnalysis) | MIT | C# compiler platform |
| TypeScript Language Service | Apache 2.0 | Part of TypeScript compiler |
| all-MiniLM-L6-v2 | Apache 2.0 | Local embedding model (ONNX export) |
| ONNX Runtime (ort crate) | MIT | Rust bindings for ONNX Runtime |
| tokenizers (HuggingFace) | Apache 2.0 | Pure-Rust tokenizer |


MIT and Apache 2.0 require attribution when distributing. Public Domain has no requirements.

14. Out of Scope for Phase 1

  • Additional languages (Python, Go, Java, Rust) --- adapter pattern designed for this, deferred.

  • Cross-language edges (e.g., C# backend → TS API via OpenAPI schema).

  • Code modification / diff application --- agent can read/reason; write actions separate.

  • Multi-repository indexing --- single repo scope.

  • Cloud-hosted index --- local SQLite only.

  • Oversized node chunking --- automatic subdivision of large classes/components (>2K lines). The read_file and get_source_spans MCP tools support line ranges for targeted access. Adversarial test cases for 5K+ line files included in Phase 1 and Phase 6 test suites.

15. Testing Strategy

See TESTS_IMPLEMENTATION_PLAN.md for the full test inventory (per-test IDs, descriptions, and status tracking across all phases).