Replace line-based parser with tokenizer and walker#22
Replace line-based parser with tokenizer and walker#22JesseHerrick wants to merge 23 commits intomainfrom
Conversation
Swap ParseText from regex-over-joined-lines to a single-pass token walker that consumes []Token from a new Elixir tokenizer. The tokenizer handles strings, heredocs, sigils, and interpolation as single tokens, eliminating the need for line joining, StripCommentsAndStrings per line, and multi-line sigil/heredoc state tracking. Results on real-world .ex files (geomean across 5 files): - 2.7x faster (993µs → 362µs per file) - Throughput 84 MB/s → 231 MB/s Also improves correctness: the old line-based approach missed refs in multi-line bracket expressions and produced wrong line numbers for refs inside joined lines. Bump IndexVersion 10 → 11 (parse output differs in edge cases).
… following Replace line-based regex parsing with tokenizer-based token walking in parseUsingBody, parseHelperQuoteBlock, extractAliasesFromLines, ExtractImports, ExtractUses, and ExtractUsesWithOpts. This eliminates 15 compiled regexes, ~250 lines of heredoc/line-joining state machine, and fixes a regression where bracketDepth treated # in heredoc markdown links as comments — cascading into file-wide line merges that swallowed defmacro __using__ bodies (broke args_schema use-chain resolution). Make LookupFollowDelegate recursive (depth limit 5) so multi-hop delegate chains like PaymentsHub → FundFlowExecution → Handlers resolve to the actual implementation instead of stopping at the intermediate defdelegate. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Advance inline-def parsing past parameter lists and def bodies so nested statements are not treated as top-level __using__/quote statements, and keep defmodule scope tracking active when `do` appears on the next line to avoid misattributed aliases.
…form processModuleDef stopped scanning at TokEOL, so `defmodule Foo\ndo` left TokDo to the main loop (double-counting depth) and misattributed functions after the inner module's end. Now scans past EOL with statement-boundary guards to avoid stealing a later module's TokDo. Also always emits the module Definition even for `, do:` one-liners so they are tracked in the store. No frame is pushed for inline modules since there is no do..end scope.
When collectModuleName encounters a non-TokModule token inside a
multi-alias brace block (e.g. an atom or number), it returns without
advancing the position. The three brace-scanning loops in
parseTextFromTokens, parseHelperQuoteBlock, and parseUsingBody were
missing the forward-progress guard that extractAliasesFromText already
had, causing them to spin forever. Add the same `if nk == k { k++ }`
guard to all three sites.
The `require Module, as: Name` syntax was not being parsed, so modules aliased via require couldn't be resolved for go-to-definition. Updated the parser and LSP alias extraction to handle this pattern.
The tokenizer emits `do:` as `TokIdent("do") + TokColon` via isKeywordKey,
never as `TokDo + TokColon`. Only block-opening `do` (without trailing
colon) produces TokDo, so checking if TokDo is followed by TokColon is
unreachable.
Made-with: Cursor
Backslash-escaped newlines (\\\n) inside strings, heredocs, sigils, interpolations, and char literals were skipped with i += 2 without incrementing the line counter. This caused all subsequent tokens to report line numbers that were too low, producing wrong go-to-definition targets (e.g. landing on line 588 instead of 594 in ecto/schema.ex). Fixed all 7 affected scan sites: scanStringContent, scanHeredocContent, scanInterpolation (2 sites), scanSigilContent (2 branches), and the main-loop char literal path. Added regression tests for each.
Centralize block/alias token scans across parser and LSP to prevent drift, and move hover doc extraction to tokenized paths with added regression tests.
This restores hover docs for non-quoted sigil forms and avoids false go-to-definition hits on attribute reference sites.
Prevent signature-help call detection from treating keywords like `if` as callable expressions in no-paren contexts, and add a regression test to ensure keyword forms do not produce false call contexts.
| } | ||
| case parser.TokDefmodule: | ||
| i = processModuleDef(i+1) - 1 // -1: loop post-increment will advance to the returned position | ||
| continue |
There was a problem hiding this comment.
Enclosing module extraction ignores defprotocol and defimpl
Low Severity
extractEnclosingModuleFromTokens only handles TokDefmodule in its switch, while the closely related extractAliasesFromTokens handles TokDefmodule, TokDefprotocol, and TokDefimpl. This means __MODULE__ resolution via ResolveModuleExpr and ExtractAliasBlockParent will return the wrong enclosing module (or empty string) when the cursor is inside a defprotocol or defimpl block, since those module-defining constructs are not tracked on the scope stack.
Additional Locations (1)
Reviewed by Cursor Bugbot for commit 3eb476d. Configure here.
| i += 2 | ||
| } | ||
| return strings.Join(parts, "."), i | ||
| } |
There was a problem hiding this comment.
Duplicated local helpers in parseUsingBody risk divergence
Low Severity
parseUsingBody defines local nextSig and collectModuleName closures that are functionally identical to parser.NextSigToken and parser.CollectModuleName. The same file already creates package-level aliases for these (tokNextSig, tokCollectModuleName) and uses them in every other function. Having a separate local copy creates a maintenance risk where a fix to one copy won't propagate to the other.
Reviewed by Cursor Bugbot for commit 3eb476d. Configure here.
- ExtractAliasBlockParent: assert parent on both Accounts and blank lines; assert defmodule line is not inside the block. - ExtractAliasesInScope: cover require ... as pairs on the same line as alias; document nextPos/for-loop regression. - parseUsingBody: add quote-body case for two semicolon-separated alias as forms. Made-with: Cursor
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 2 potential issues.
There are 4 total unresolved issues (including 2 from previous reviews).
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, have a team admin enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit ce1086b. Configure here.
| aliases[parser.AliasShortName(childName)] = parent + "." + childName | ||
| } | ||
| i = nextPos - 1 | ||
| continue |
There was a problem hiding this comment.
Off-by-one in while-style loop skips backward
Medium Severity
In parseUsingBody, the main walk loop at line 1425 is a while-style for i < n && depth > 0 with no automatic post-increment — each branch manually advances i. The alias handling uses i = nextPos - 1; continue (lines 1496 and 1505), copied from the extractAliasesFromTokens standard for i := 0; i < n; i++ loop where - 1 compensates for the auto-increment. In the while-style loop, this causes i to point one token before the intended position, re-processing an already-handled token and potentially causing incorrect behavior or infinite loops.
Additional Locations (1)
Reviewed by Cursor Bugbot for commit ce1086b. Configure here.
| var ( | ||
| tokNextSig = parser.NextSigToken | ||
| tokCollectModuleName = parser.CollectModuleName | ||
| tokText = parser.TokenText |
There was a problem hiding this comment.
Unused exported variable tokText is dead code
Low Severity
The package-level variable tokText is assigned as an alias for parser.TokenText but is only used once (line 1208). Meanwhile, parser.TokenText is called directly in many other places in the same file. Having both tokText and direct parser.TokenText calls is inconsistent and the alias adds unnecessary indirection.
Reviewed by Cursor Bugbot for commit ce1086b. Configure here.


Replaces the regex-over-joined-lines approach to Elixir parsing with a new single-pass tokenizer and token walker. This is a correctness and performance overhaul — same parse output contract, better results.
What changed
New tokenizer (
internal/parser/tokenizer.go)A hand-written Elixir lexer that produces a flat
[]Tokenstream. It handles strings, heredocs, sigils, and interpolation as atomic tokens, which eliminates the need for per-line comment stripping, string blanking, and multi-line join state tracking that the old parser required.Rewritten parser (
internal/parser/parser_tokenized.go)A token walker replaces the old regex loop. Because the tokenizer handles quoting correctly, the walker never needs to guess whether it's inside a string — it just skips those token kinds.
LSP extraction functions (
internal/lsp/elixir.go)parseUsingBody,parseHelperQuoteBlock,extractAliasesFromLines,ExtractImports,ExtractUses, andExtractUsesWithOptsall ported to the tokenizer. Removes 15 compiled regexes and ~250 lines of heredoc/line-joining state machine.Multi-line alias block support
Added
ExtractAliasBlockParentand wired it into definition/hover/references/completion so modules insidealias Parent.{ Child, Other }blocks resolve correctly.Recursive delegate chain following
LookupFollowDelegateis now recursive (depth limit 5), so multi-hop chains likePayments → Billing → Handlersresolve to the actual implementation rather than stopping at the firstdefdelegate.Bug fixes
#inside heredoc markdown links was misread as a comment, which cascaded into line merges that swallowed entiredefmacro __using__bodies (broke use-chain resolution in some modules)require Module, as: Namedidn't register aliases for go-to-definitiondefdelegatechains (A → B → C) stopped at the intermediate delegate instead of resolving to the final targetalias Parent.{ Child }) — go-to-definition, hover, references, and completion didn't resolve child modulesuse Module, optsspanning multiple lines didn't parse the opts correctlyPerformance
On real-world
.exfiles (geomean across 5 files):2.7x faster, with better correctness.
Notes
IndexVersionbumped10 → 11(parse output differs in edge cases; existing indexes will be rebuilt on next startup)Note
Medium Risk
Replaces multiple LSP parsing paths with a new tokenizer/token-walker approach for expression, alias/use/import, using parsing, and hover doc extraction; correctness is improved but the changes touch core navigation/signature/hover behavior across the server.
Overview
Switches LSP-side Elixir parsing from line/regex heuristics to tokenizer-backed lookups, including cursor expression extraction (
ExpressionAtCursor/CallContextAtCursor), alias/import/use parsing (now supports multi-line forms), and__using__body analysis (dynamic opt bindings, helperquote dodelegation, and heredoc-safe scanning).Adds caching of token streams in
DocumentStoreand introducesTokenizedFileas the shared representation for multi-operation queries, then migrates hover doc/moduledoc extraction into a newelixir_docs.goimplementation.Improves module resolution edge cases (multi-line
alias Parent.{...}viaExtractAliasBlockParent, ignores strings/comments/heredocs, avoids hangs on unexpected tokens) and updates/expands tests to cover the new token-aware behavior and regressions.Reviewed by Cursor Bugbot for commit ce1086b. Bugbot is set up for automated code reviews on this repo. Configure here.