Replace line-based parser with tokenizer and walker #22

Open
JesseHerrick wants to merge 23 commits into main from feat-tokenizer

JesseHerrick commented Apr 13, 2026

Replaces the regex-over-joined-lines approach to Elixir parsing with a new single-pass tokenizer and token walker. This is a correctness and performance overhaul: the parse output contract is unchanged, but results are more accurate and parsing is faster.

What changed

New tokenizer (internal/parser/tokenizer.go)

A hand-written Elixir lexer that produces a flat []Token stream. It handles strings, heredocs, sigils, and interpolation as atomic tokens, which eliminates the need for per-line comment stripping, string blanking, and multi-line join state tracking that the old parser required.
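
As a rough illustration of why atomic string tokens remove the need for comment stripping, here is a minimal sketch of a lexer in the same spirit. The `Token` fields, the `TokenKind` values, and the `Tokenize` function are assumptions made for this example, not the PR's actual definitions, and the sketch covers only double-quoted strings, `#` comments, and identifiers.

```go
package main

import "fmt"

// TokenKind and Token are hypothetical stand-ins for the PR's token types.
type TokenKind int

const (
	TokIdent TokenKind = iota
	TokString
	TokComment
	TokOther
)

type Token struct {
	Kind       TokenKind
	Start, End int // byte offsets into the source
	Line       int // 1-based line of the token's first byte
}

func isIdentByte(c byte) bool {
	return c == '_' || (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9')
}

// Tokenize is a deliberately tiny lexer: a whole string becomes one token,
// so a '#' inside it can never start a comment.
func Tokenize(src string) []Token {
	var toks []Token
	line := 1
	for i := 0; i < len(src); {
		c := src[i]
		switch {
		case c == '\n':
			line++
			i++
		case c == ' ' || c == '\t' || c == '\r':
			i++
		case c == '"':
			start, startLine := i, line
			i++
			for i < len(src) && src[i] != '"' {
				if src[i] == '\\' && i+1 < len(src) {
					i++ // step onto the escaped byte
				}
				if src[i] == '\n' {
					line++
				}
				i++
			}
			if i < len(src) {
				i++ // closing quote
			}
			toks = append(toks, Token{TokString, start, i, startLine})
		case c == '#':
			start := i
			for i < len(src) && src[i] != '\n' {
				i++
			}
			toks = append(toks, Token{TokComment, start, i, line})
		case isIdentByte(c):
			start := i
			for i < len(src) && isIdentByte(src[i]) {
				i++
			}
			toks = append(toks, Token{TokIdent, start, i, line})
		default:
			toks = append(toks, Token{TokOther, i, i + 1, line})
			i++
		}
	}
	return toks
}

func main() {
	src := `x = "a # not a comment" # real comment`
	for _, t := range Tokenize(src) {
		fmt.Printf("kind=%d text=%q\n", t.Kind, src[t.Start:t.End])
	}
}
```

The first `#` ends up inside the string token and only the second one produces a comment token.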

Rewritten parser (internal/parser/parser_tokenized.go)

A token walker replaces the old regex loop. Because the tokenizer handles quoting correctly, the walker never needs to guess whether it's inside a string — it just skips those token kinds.
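
A minimal sketch of that idea, assuming a simplified token shape (`Tok`, `Kind`, and `CollectRefs` are names invented for this example): the walker switches on the token kind and skips string-like tokens outright, so it keeps no "am I inside a string?" state.

```go
package main

import "fmt"

// Kind and Tok are hypothetical; the real token type lives in internal/parser.
type Kind int

const (
	KModule  Kind = iota // e.g. MyApp.Repo
	KString              // a whole string, heredoc, or sigil body
	KComment
	KOther
)

type Tok struct {
	Kind Kind
	Text string
	Line int
}

type Ref struct {
	Module string
	Line   int
}

// CollectRefs walks the stream once and records module references.
// Text inside strings and comments is never inspected.
func CollectRefs(toks []Tok) []Ref {
	var refs []Ref
	for _, t := range toks {
		switch t.Kind {
		case KString, KComment:
			continue
		case KModule:
			refs = append(refs, Ref{Module: t.Text, Line: t.Line})
		}
	}
	return refs
}

func main() {
	toks := []Tok{
		{KModule, "MyApp.Repo", 1},
		{KString, `"see Other.Mod"`, 2},
		{KModule, "Logger", 3},
	}
	fmt.Println(CollectRefs(toks))
}
```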

LSP extraction functions (internal/lsp/elixir.go)

parseUsingBody, parseHelperQuoteBlock, extractAliasesFromLines, ExtractImports, ExtractUses, and ExtractUsesWithOpts are all ported to the tokenizer. This removes 15 compiled regexes and ~250 lines of heredoc/line-joining state machine.

Multi-line alias block support

Added ExtractAliasBlockParent and wired it into definition/hover/references/completion so modules inside alias Parent.{ Child, Other } blocks resolve correctly.
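
For readers unfamiliar with the multi-alias form, the sketch below shows the mapping such a block is expected to produce: each child's short name (its last segment) maps to `parent + "." + child`. `ExpandAliasBlock` is a hypothetical helper written for this example; it takes the parent and children already split out, whereas the PR works on tokens.

```go
package main

import (
	"fmt"
	"strings"
)

// ExpandAliasBlock maps each child's short name to its fully qualified name,
// mirroring what `alias Parent.{Child, Other}` means in Elixir.
func ExpandAliasBlock(parent string, children []string) map[string]string {
	aliases := make(map[string]string, len(children))
	for _, child := range children {
		child = strings.TrimSpace(child)
		if child == "" {
			continue
		}
		short := child
		if i := strings.LastIndex(child, "."); i >= 0 {
			short = child[i+1:] // alias name is the last segment
		}
		aliases[short] = parent + "." + child
	}
	return aliases
}

func main() {
	// alias MyApp.Accounts.{
	//   User,
	//   Billing.Invoice
	// }
	fmt.Println(ExpandAliasBlock("MyApp.Accounts", []string{"User", "Billing.Invoice"}))
}
```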

Recursive delegate chain following

LookupFollowDelegate is now recursive (depth limit 5), so multi-hop chains like Payments → Billing → Handlers resolve to the actual implementation rather than stopping at the first defdelegate.
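
The depth-limited recursion can be sketched as follows. The `Store` type, its `delegates` map, and the `ResolveDelegate` name are assumptions for illustration only; the limit of 5 is the one stated above, and it also guarantees termination if two delegates point at each other.

```go
package main

import "fmt"

const maxDelegateDepth = 5

// Store is a hypothetical index: each key is a delegating function and the
// value is the function it delegates to.
type Store struct {
	delegates map[string]string
}

// ResolveDelegate follows defdelegate hops until it reaches a function that
// is not itself a delegate, or until the depth limit is hit.
func (s *Store) ResolveDelegate(name string) string {
	return s.follow(name, 0)
}

func (s *Store) follow(name string, depth int) string {
	target, ok := s.delegates[name]
	if !ok || depth >= maxDelegateDepth {
		return name
	}
	return s.follow(target, depth+1)
}

func main() {
	s := &Store{delegates: map[string]string{
		"Payments.charge": "Billing.charge",
		"Billing.charge":  "Handlers.charge",
	}}
	fmt.Println(s.ResolveDelegate("Payments.charge"))
}
```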

Bug fixes

  • # inside heredoc markdown links was misread as a comment, which cascaded into line merges that swallowed entire defmacro __using__ bodies (broke use-chain resolution in some modules)
  • Multi-line bracket expressions missed refs, and refs inside joined lines got incorrect line numbers
  • require Module, as: Name didn't register aliases for go-to-definition
  • Multi-hop defdelegate chains (A → B → C) stopped at the intermediate delegate instead of resolving to the final target
  • Multi-line alias blocks (alias Parent.{ Child }) — go-to-definition, hover, references, and completion didn't resolve child modules
  • use Module, opts spanning multiple lines didn't parse the opts correctly

Performance

On real-world .ex files (geomean across 5 files):

| | Before | After |
|---|---|---|
| Time per file | 993 µs | 362 µs |
| Throughput | 84 MB/s | 231 MB/s |

2.7x faster, with better correctness.

Notes

  • IndexVersion bumped 10 → 11 (parse output differs in edge cases; existing indexes will be rebuilt on next startup)

Note

Medium Risk
Replaces multiple LSP parsing paths with a new tokenizer/token-walker approach covering cursor expression extraction, alias/use/import parsing, __using__ body parsing, and hover doc extraction. Correctness improves, but the changes touch core navigation, signature-help, and hover behavior across the server.

Overview
Switches LSP-side Elixir parsing from line/regex heuristics to tokenizer-backed lookups, including cursor expression extraction (ExpressionAtCursor/CallContextAtCursor), alias/import/use parsing (now supports multi-line forms), and __using__ body analysis (dynamic opt bindings, helper quote do delegation, and heredoc-safe scanning).

Adds caching of token streams in DocumentStore and introduces TokenizedFile as the shared representation for multi-operation queries, then migrates hover doc/moduledoc extraction into a new elixir_docs.go implementation.
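
One plausible shape for such a cache is sketched below. Only the names `DocumentStore` and `TokenizedFile` come from the PR; the fields, the version-keyed invalidation, and the whitespace `tokenize` stand-in are assumptions made for this example.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

type Token struct{ Text string }

// TokenizedFile pairs a token stream with the document version it came from.
type TokenizedFile struct {
	Version int
	Tokens  []Token
}

type DocumentStore struct {
	mu    sync.Mutex
	cache map[string]*TokenizedFile
	lexes int // number of tokenizer runs, kept only for the demo
}

func NewDocumentStore() *DocumentStore {
	return &DocumentStore{cache: make(map[string]*TokenizedFile)}
}

// tokenize is a stand-in for the real lexer: it just splits on whitespace.
func tokenize(text string) []Token {
	var toks []Token
	for _, f := range strings.Fields(text) {
		toks = append(toks, Token{Text: f})
	}
	return toks
}

// Tokenized returns the cached stream for uri if the version matches,
// otherwise it re-tokenizes and replaces the cache entry.
func (s *DocumentStore) Tokenized(uri string, version int, text string) *TokenizedFile {
	s.mu.Lock()
	defer s.mu.Unlock()
	if tf, ok := s.cache[uri]; ok && tf.Version == version {
		return tf
	}
	s.lexes++
	tf := &TokenizedFile{Version: version, Tokens: tokenize(text)}
	s.cache[uri] = tf
	return tf
}

func main() {
	s := NewDocumentStore()
	s.Tokenized("file:///a.ex", 1, "defmodule A do end")
	s.Tokenized("file:///a.ex", 1, "defmodule A do end") // served from cache
	s.Tokenized("file:///a.ex", 2, "defmodule B do end") // edited: re-tokenized
	fmt.Println("tokenizer runs:", s.lexes)
}
```

With a cache like this, hover, definition, and completion requests against the same document version would share one tokenizer run.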

Improves module resolution edge cases (multi-line alias Parent.{...} via ExtractAliasBlockParent, ignores strings/comments/heredocs, avoids hangs on unexpected tokens) and updates/expands tests to cover the new token-aware behavior and regressions.

Reviewed by Cursor Bugbot for commit ce1086b.

Swap ParseText from regex-over-joined-lines to a single-pass token walker
that consumes []Token from a new Elixir tokenizer. The tokenizer handles
strings, heredocs, sigils, and interpolation as single tokens, eliminating
the need for line joining, StripCommentsAndStrings per line, and multi-line
sigil/heredoc state tracking.

Results on real-world .ex files (geomean across 5 files):
- 2.7x faster (993µs → 362µs per file)
- Throughput 84 MB/s → 231 MB/s

Also improves correctness: the old line-based approach missed refs in
multi-line bracket expressions and produced wrong line numbers for refs
inside joined lines.

Bump IndexVersion 10 → 11 (parse output differs in edge cases).
JesseHerrick changed the base branch from fix/parser-edge-cases to main on April 13, 2026 03:05
… following

Replace line-based regex parsing with tokenizer-based token walking in
parseUsingBody, parseHelperQuoteBlock, extractAliasesFromLines,
ExtractImports, ExtractUses, and ExtractUsesWithOpts. This eliminates
15 compiled regexes, ~250 lines of heredoc/line-joining state machine,
and fixes a regression where bracketDepth treated # in heredoc markdown
links as comments — cascading into file-wide line merges that swallowed
defmacro __using__ bodies (broke args_schema use-chain resolution).

Make LookupFollowDelegate recursive (depth limit 5) so multi-hop
delegate chains like PaymentsHub → FundFlowExecution → Handlers resolve
to the actual implementation instead of stopping at the intermediate
defdelegate.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Advance inline-def parsing past parameter lists and def bodies so nested statements are not treated as top-level __using__/quote statements, and keep defmodule scope tracking active when `do` appears on the next line to avoid misattributed aliases.
…form

processModuleDef stopped scanning at TokEOL, so `defmodule Foo\ndo` left
TokDo to the main loop (double-counting depth) and misattributed functions
after the inner module's end. Now scans past EOL with statement-boundary
guards to avoid stealing a later module's TokDo.

Also always emits the module Definition even for `, do:` one-liners so
they are tracked in the store. No frame is pushed for inline modules
since there is no do..end scope.
When collectModuleName encounters a non-TokModule token inside a
multi-alias brace block (e.g. an atom or number), it returns without
advancing the position. The three brace-scanning loops in
parseTextFromTokens, parseHelperQuoteBlock, and parseUsingBody were
missing the forward-progress guard that extractAliasesFromText already
had, causing them to spin forever. Add the same `if nk == k { k++ }`
guard to all three sites.
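
A minimal sketch of the guard, using simplified stand-ins for the token types and a `collectModuleName` that, like the one described above, returns its input position unchanged when the token is not a module name. `scanBraceBlock` is a hypothetical name for one of the brace-scanning loops.

```go
package main

import "fmt"

type Kind int

const (
	TokModule Kind = iota
	TokComma
	TokAtom
	TokCloseBrace
)

type Token struct {
	Kind Kind
	Text string
}

// collectModuleName returns the module name at k and the position after it.
// On a non-module token it returns "" and k itself, i.e. no progress.
func collectModuleName(toks []Token, k int) (string, int) {
	if k < len(toks) && toks[k].Kind == TokModule {
		return toks[k].Text, k + 1
	}
	return "", k
}

// scanBraceBlock collects child names up to the closing brace and returns
// the index of that brace (or len(toks) if it is missing).
func scanBraceBlock(toks []Token, k int) ([]string, int) {
	var children []string
	for k < len(toks) && toks[k].Kind != TokCloseBrace {
		if toks[k].Kind == TokComma {
			k++
			continue
		}
		name, nk := collectModuleName(toks, k)
		if name != "" {
			children = append(children, name)
		}
		if nk == k {
			k++ // forward-progress guard: without it this loop never ends
			continue
		}
		k = nk
	}
	return children, k
}

func main() {
	// Body of: alias Parent.{Child, :oops, Other}
	toks := []Token{
		{TokModule, "Child"}, {TokComma, ","},
		{TokAtom, ":oops"}, {TokComma, ","},
		{TokModule, "Other"}, {TokCloseBrace, "}"},
	}
	fmt.Println(scanBraceBlock(toks, 0))
}
```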
The `require Module, as: Name` syntax was not being parsed, so modules
aliased via require couldn't be resolved for go-to-definition. Updated
the parser and LSP alias extraction to handle this pattern.
The tokenizer emits `do:` as `TokIdent("do") + TokColon` via isKeywordKey,
never as `TokDo + TokColon`. Only block-opening `do` (without trailing
colon) produces TokDo, so checking if TokDo is followed by TokColon is
unreachable.

Made-with: Cursor
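
To make the `do` versus `do:` distinction concrete, here is a sketch of a keyword-key check. Only the name `isKeywordKey` comes from the commit message; the signature and the exact rule (a colon immediately after the word, followed by whitespace or end of input) are assumptions for this example.

```go
package main

import (
	"fmt"
	"strings"
)

// isKeywordKey reports whether the word ending at byte offset end is a
// keyword-list key such as `do:`. A following "::" does not qualify.
func isKeywordKey(src string, end int) bool {
	if end >= len(src) || src[end] != ':' {
		return false
	}
	if end+1 >= len(src) {
		return true
	}
	next := src[end+1]
	return next == ' ' || next == '\t' || next == '\n' || next == '\r'
}

// classifyDo names the token kinds a lexer following this rule would emit
// for the word "do" that starts at byte offset start.
func classifyDo(src string, start int) string {
	if isKeywordKey(src, start+len("do")) {
		return "TokIdent+TokColon"
	}
	return "TokDo"
}

func main() {
	for _, src := range []string{"def f, do: 1", "def f do"} {
		fmt.Printf("%-14q => %s\n", src, classifyDo(src, strings.Index(src, "do")))
	}
}
```

Because the decision is made when the word is lexed, a later `TokDo` can never be followed by `TokColon`, which is why that check was unreachable.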
Backslash-escaped newlines (\\\n) inside strings, heredocs, sigils,
interpolations, and char literals were skipped with i += 2 without
incrementing the line counter. This caused all subsequent tokens to
report line numbers that were too low, producing wrong go-to-definition
targets (e.g. landing on line 588 instead of 594 in ecto/schema.ex).

Fixed all 7 affected scan sites: scanStringContent, scanHeredocContent,
scanInterpolation (2 sites), scanSigilContent (2 branches), and the
main-loop char literal path. Added regression tests for each.
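
The fix pattern can be sketched as below: when skipping a backslash escape, check whether the escaped byte is a newline before advancing by two. `scanString` is a hypothetical, simplified scanner for double-quoted strings only, not one of the seven real scan sites.

```go
package main

import "fmt"

// scanString starts at the opening quote src[i] and returns the index just
// past the closing quote plus the updated line counter.
func scanString(src string, i, line int) (next, newLine int) {
	i++ // opening quote
	for i < len(src) {
		switch src[i] {
		case '\\':
			if i+1 < len(src) && src[i+1] == '\n' {
				line++ // the escaped byte is a newline: still a new line
			}
			i += 2
		case '\n':
			line++
			i++
		case '"':
			return i + 1, line
		default:
			i++
		}
	}
	return len(src), line // unterminated string
}

func main() {
	// A string whose first line ends with a backslash-escaped newline.
	src := "\"a\\\nb\nc\" x"
	next, line := scanString(src, 0, 1)
	fmt.Println(next, line)
}
```

Dropping the `line++` in the backslash branch reproduces the bug: every token after the string reports a line number one too low per escaped newline.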
Centralize block/alias token scans across parser and LSP to prevent drift, and move hover doc extraction to tokenized paths with added regression tests.
JesseHerrick self-assigned this Apr 14, 2026
This restores hover docs for non-quoted sigil forms and avoids false go-to-definition hits on attribute reference sites.
Prevent signature-help call detection from treating keywords like `if`
as callable expressions in no-paren contexts, and add a regression test
to ensure keyword forms do not produce false call contexts.
}
case parser.TokDefmodule:
i = processModuleDef(i+1) - 1 // -1: loop post-increment will advance to the returned position
continue

Enclosing module extraction ignores defprotocol and defimpl

Low Severity

extractEnclosingModuleFromTokens only handles TokDefmodule in its switch, while the closely related extractAliasesFromTokens handles TokDefmodule, TokDefprotocol, and TokDefimpl. This means __MODULE__ resolution via ResolveModuleExpr and ExtractAliasBlockParent will return the wrong enclosing module (or empty string) when the cursor is inside a defprotocol or defimpl block, since those module-defining constructs are not tracked on the scope stack.
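
One way the suggested fix could look is sketched here: a scope-stack walk that treats all three module-defining tokens alike. The token shape and `enclosingModule` are assumptions for this example, and for `defimpl` the sketch simply records the first module name rather than building the real implementation module name.

```go
package main

import "fmt"

type Kind int

const (
	TokDefmodule Kind = iota
	TokDefprotocol
	TokDefimpl
	TokModule
	TokDo
	TokEnd
	TokOther
)

type Token struct {
	Kind Kind
	Text string
	Line int
}

// enclosingModule returns the innermost module-like scope open at line.
func enclosingModule(toks []Token, line int) string {
	type frame struct {
		name  string
		depth int // block depth once this scope's `do` has been seen
	}
	var stack []frame
	depth := 0
	for i := 0; i < len(toks); i++ {
		t := toks[i]
		if t.Line > line {
			break
		}
		switch t.Kind {
		case TokDefmodule, TokDefprotocol, TokDefimpl:
			if i+1 < len(toks) && toks[i+1].Kind == TokModule {
				stack = append(stack, frame{toks[i+1].Text, depth + 1})
				i++ // consume the name
			}
		case TokDo:
			depth++
		case TokEnd:
			if len(stack) > 0 && stack[len(stack)-1].depth == depth {
				stack = stack[:len(stack)-1]
			}
			if depth > 0 {
				depth--
			}
		}
	}
	if len(stack) == 0 {
		return ""
	}
	return stack[len(stack)-1].name
}

func main() {
	toks := []Token{
		{TokDefprotocol, "defprotocol", 1}, {TokModule, "Sizeable", 1}, {TokDo, "do", 1},
		{TokOther, "def", 2},
		{TokEnd, "end", 3},
		{TokDefmodule, "defmodule", 4}, {TokModule, "A", 4}, {TokDo, "do", 4},
		{TokOther, "x", 5},
		{TokEnd, "end", 6},
	}
	fmt.Println(enclosingModule(toks, 2), enclosingModule(toks, 5))
}
```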

Reviewed by Cursor Bugbot for commit 3eb476d.

i += 2
}
return strings.Join(parts, "."), i
}

Duplicated local helpers in parseUsingBody risk divergence

Low Severity

parseUsingBody defines local nextSig and collectModuleName closures that are functionally identical to parser.NextSigToken and parser.CollectModuleName. The same file already creates package-level aliases for these (tokNextSig, tokCollectModuleName) and uses them in every other function. Having a separate local copy creates a maintenance risk where a fix to one copy won't propagate to the other.

Reviewed by Cursor Bugbot for commit 3eb476d.

- ExtractAliasBlockParent: assert parent on both Accounts and blank lines;
  assert defmodule line is not inside the block.
- ExtractAliasesInScope: cover require ... as pairs on the same line as alias;
  document nextPos/for-loop regression.
- parseUsingBody: add quote-body case for two semicolon-separated alias as forms.

Made-with: Cursor

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.

There are 4 total unresolved issues (including 2 from previous reviews).

Reviewed by Cursor Bugbot for commit ce1086b.

aliases[parser.AliasShortName(childName)] = parent + "." + childName
}
i = nextPos - 1
continue

Off-by-one in while-style loop skips backward

Medium Severity

In parseUsingBody, the main walk loop at line 1425 is a while-style for i < n && depth > 0 with no automatic post-increment — each branch manually advances i. The alias handling uses i = nextPos - 1; continue (lines 1496 and 1505), copied from the extractAliasesFromTokens standard for i := 0; i < n; i++ loop where - 1 compensates for the auto-increment. In the while-style loop, this causes i to point one token before the intended position, re-processing an already-handled token and potentially causing incorrect behavior or infinite loops.
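
The difference between the two loop styles can be shown with a small sketch. `consumeAlias` and both walkers are hypothetical; tokens are plain strings to keep the example short. In the counted loop, `nextPos - 1` is correct because the post-increment adds one back; in the while-style loop, the position must be assigned directly.

```go
package main

import "fmt"

// consumeAlias pretends to handle `alias Name` starting at i and returns the
// position of the first token it did not consume.
func consumeAlias(toks []string, i int) int {
	return i + 2
}

// walkCounted uses a standard for loop with an automatic i++.
func walkCounted(toks []string) []int {
	var visited []int
	for i := 0; i < len(toks); i++ {
		visited = append(visited, i)
		if toks[i] == "alias" {
			i = consumeAlias(toks, i) - 1 // i++ then lands on nextPos
			continue
		}
	}
	return visited
}

// walkWhile advances i manually in every branch, so no -1 is needed.
func walkWhile(toks []string) []int {
	var visited []int
	i := 0
	for i < len(toks) {
		visited = append(visited, i)
		if toks[i] == "alias" {
			i = consumeAlias(toks, i) // assigning nextPos-1 would revisit a token
			continue
		}
		i++
	}
	return visited
}

func main() {
	toks := []string{"alias", "Foo", "def", "alias", "Bar"}
	fmt.Println(walkCounted(toks), walkWhile(toks))
}
```

Both walkers visit the same positions. If the while-style version kept the `- 1`, it would step back onto `Foo`, and a handler that consumes a single token would leave `i` unchanged and loop forever.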

Reviewed by Cursor Bugbot for commit ce1086b.

var (
tokNextSig = parser.NextSigToken
tokCollectModuleName = parser.CollectModuleName
tokText = parser.TokenText

Package-level alias tokText is redundant

Low Severity

The package-level variable tokText is assigned as an alias for parser.TokenText but is only used once (line 1208). Meanwhile, parser.TokenText is called directly in many other places in the same file. Having both tokText and direct parser.TokenText calls is inconsistent and the alias adds unnecessary indirection.

Reviewed by Cursor Bugbot for commit ce1086b.
