Replace line-based parser with tokenizer and walker by JesseHerrick · Pull Request #22 · remoteoss/dexter

JesseHerrick · 2026-04-13T02:49:28Z

Replaces the regex-over-joined-lines approach to Elixir parsing with a new single-pass tokenizer and token walker. This is a correctness and performance overhaul: the parse output contract is unchanged, but results are more accurate and parsing is faster.

What changed

New tokenizer (internal/parser/tokenizer.go)

A hand-written Elixir lexer that produces a flat []Token stream. It handles strings, heredocs, sigils, and interpolation as atomic tokens, which eliminates the need for per-line comment stripping, string blanking, and multi-line join state tracking that the old parser required.
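To illustrate why atomic string tokens remove the need for comment stripping, here is a minimal sketch of a single-pass lexer. It is not the code in internal/parser/tokenizer.go: the `Kind` constants, the `Token` fields, and `Tokenize` are hypothetical names, and heredocs, sigils, and interpolation are left out for brevity.

```go
package main

import "fmt"

// Kind and Token are hypothetical stand-ins for the PR's token types.
type Kind int

const (
	KindIdent Kind = iota
	KindString
	KindComment
	KindOther
)

type Token struct {
	Kind Kind
	Text string
	Line int
}

func isIdentByte(c byte) bool {
	return c == '_' || (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9')
}

// Tokenize is a toy single-pass lexer. A double-quoted string becomes one
// token, so a '#' inside it can never be mistaken for a comment.
func Tokenize(src string) []Token {
	var toks []Token
	line := 1
	for i := 0; i < len(src); {
		c := src[i]
		switch {
		case c == '\n':
			line++
			i++
		case c == ' ' || c == '\t' || c == '\r':
			i++
		case c == '"':
			start, startLine := i, line
			i++
			for i < len(src) && src[i] != '"' {
				if src[i] == '\\' && i+1 < len(src) {
					i++ // step onto the escaped byte so it is consumed below
				}
				if src[i] == '\n' {
					line++
				}
				i++
			}
			if i < len(src) {
				i++ // consume the closing quote
			}
			toks = append(toks, Token{KindString, src[start:i], startLine})
		case c == '#':
			start := i
			for i < len(src) && src[i] != '\n' {
				i++
			}
			toks = append(toks, Token{KindComment, src[start:i], line})
		case isIdentByte(c):
			start := i
			for i < len(src) && isIdentByte(src[i]) {
				i++
			}
			toks = append(toks, Token{KindIdent, src[start:i], line})
		default:
			toks = append(toks, Token{KindOther, src[i : i+1], line})
			i++
		}
	}
	return toks
}

func main() {
	src := "x = \"a # b\" # real comment\ny"
	for _, t := range Tokenize(src) {
		fmt.Printf("line %d kind %d %q\n", t.Line, t.Kind, t.Text)
	}
}
```

A walker over this stream can skip `KindString` and `KindComment` tokens outright instead of blanking them line by line.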

Rewritten parser (internal/parser/parser_tokenized.go)

A token walker replaces the old regex loop. Because the tokenizer handles quoting correctly, the walker never needs to guess whether it's inside a string — it just skips those token kinds.

LSP extraction functions (internal/lsp/elixir.go)

parseUsingBody, parseHelperQuoteBlock, extractAliasesFromLines, ExtractImports, ExtractUses, and ExtractUsesWithOpts are all ported to the tokenizer. This removes 15 compiled regexes and ~250 lines of heredoc/line-joining state machine.

Multi-line alias block support

Added ExtractAliasBlockParent and wired it into definition/hover/references/completion so modules inside alias Parent.{ Child, Other } blocks resolve correctly.
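As a rough sketch of what resolving a multi-alias block involves, the snippet below expands `alias Parent.{ Child, Other }` into short-name to full-name pairs. `expandAliasBlock` and `aliasShortName` are hypothetical helpers written for illustration; they only mirror the `parent + "." + childName` shape, not the actual ExtractAliasBlockParent implementation.

```go
package main

import (
	"fmt"
	"strings"
)

// aliasShortName returns the last segment of a dotted module name.
// Hypothetical helper, named after the idea rather than the PR's code.
func aliasShortName(name string) string {
	if i := strings.LastIndex(name, "."); i >= 0 {
		return name[i+1:]
	}
	return name
}

// expandAliasBlock maps each child's short name to its fully qualified
// module name under parent.
func expandAliasBlock(parent string, children []string) map[string]string {
	aliases := make(map[string]string, len(children))
	for _, child := range children {
		aliases[aliasShortName(child)] = parent + "." + child
	}
	return aliases
}

func main() {
	// Corresponds to: alias MyApp.Accounts.{ User, Admin.Role }
	aliases := expandAliasBlock("MyApp.Accounts", []string{"User", "Admin.Role"})
	fmt.Println(aliases["User"]) // MyApp.Accounts.User
	fmt.Println(aliases["Role"]) // MyApp.Accounts.Admin.Role
}
```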

Recursive delegate chain following

LookupFollowDelegate is now recursive (depth limit 5), so multi-hop chains like Payments → Billing → Handlers resolve to the actual implementation rather than stopping at the first defdelegate.
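The depth-limited recursion can be sketched as follows. This assumes delegates are stored as a simple map from one target to the next; the `Target` type, the map, and `followDelegate` are illustrative only, and how the real LookupFollowDelegate behaves when the limit is hit is not stated here.

```go
package main

import "fmt"

// Target is a hypothetical module/function pair.
type Target struct {
	Module, Func string
}

// maxDelegateDepth matches the depth limit of 5 stated in the description.
const maxDelegateDepth = 5

// followDelegate walks defdelegate hops until it reaches a target with no
// further delegate, or until the depth limit stops the recursion (which
// also guards against cycles).
func followDelegate(delegates map[Target]Target, t Target, depth int) Target {
	if depth >= maxDelegateDepth {
		return t
	}
	next, ok := delegates[t]
	if !ok {
		return t // not a delegate: this is the implementation
	}
	return followDelegate(delegates, next, depth+1)
}

func main() {
	delegates := map[Target]Target{
		{"Payments", "charge"}: {"Billing", "charge"},
		{"Billing", "charge"}:  {"Handlers", "charge"},
	}
	fmt.Println(followDelegate(delegates, Target{"Payments", "charge"}, 0))
}
```

With a single-hop lookup the same call would stop at `Billing.charge`; the recursion carries it through to `Handlers.charge`.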

Bug fixes

# inside heredoc markdown links was misread as a comment, which cascaded into line merges that swallowed entire defmacro __using__ bodies (broke use-chain resolution in some modules)
Multi-line bracket expressions: missed refs and produced incorrect line numbers
require Module, as: Name didn't register aliases for go-to-definition
Multi-hop defdelegate chains (A → B → C) stopped at the intermediate delegate instead of resolving to the final target
Multi-line alias blocks (alias Parent.{ Child }) — go-to-definition, hover, references, and completion didn't resolve child modules
use Module, opts spanning multiple lines didn't parse the opts correctly

Performance

On real-world .ex files (geomean across 5 files):

	Before	After
Time per file	993 µs	362 µs
Throughput	84 MB/s	231 MB/s

2.7x faster, with better correctness.

Notes

IndexVersion bumped 10 → 11 (parse output differs in edge cases; existing indexes will be rebuilt on next startup)

Note

Medium Risk
Replaces multiple LSP parsing paths with a new tokenizer/token-walker approach covering cursor expressions, alias/use/import extraction, __using__ parsing, and hover doc extraction. Correctness improves, but the changes touch core navigation, signature, and hover behavior across the server.

Overview
Switches LSP-side Elixir parsing from line/regex heuristics to tokenizer-backed lookups, including cursor expression extraction (ExpressionAtCursor/CallContextAtCursor), alias/import/use parsing (now supports multi-line forms), and __using__ body analysis (dynamic opt bindings, helper quote do delegation, and heredoc-safe scanning).

Adds caching of token streams in DocumentStore and introduces TokenizedFile as the shared representation for multi-operation queries, then migrates hover doc/moduledoc extraction into a new elixir_docs.go implementation.

Improves module resolution edge cases (multi-line alias Parent.{...} via ExtractAliasBlockParent, ignores strings/comments/heredocs, avoids hangs on unexpected tokens) and updates/expands tests to cover the new token-aware behavior and regressions.

Reviewed by Cursor Bugbot for commit ce1086b.

Swap ParseText from regex-over-joined-lines to a single-pass token walker that consumes []Token from a new Elixir tokenizer. The tokenizer handles strings, heredocs, sigils, and interpolation as single tokens, eliminating the need for line joining, StripCommentsAndStrings per line, and multi-line sigil/heredoc state tracking.

Results on real-world .ex files (geomean across 5 files):
- 2.7x faster (993µs → 362µs per file)
- Throughput 84 MB/s → 231 MB/s

Also improves correctness: the old line-based approach missed refs in multi-line bracket expressions and produced wrong line numbers for refs inside joined lines.

Bump IndexVersion 10 → 11 (parse output differs in edge cases).

internal/parser/parser_tokenized.go

internal/parser/tokenizer.go

internal/lsp/elixir.go

… following

Replace line-based regex parsing with tokenizer-based token walking in parseUsingBody, parseHelperQuoteBlock, extractAliasesFromLines, ExtractImports, ExtractUses, and ExtractUsesWithOpts. This eliminates 15 compiled regexes and ~250 lines of heredoc/line-joining state machine, and fixes a regression where bracketDepth treated # in heredoc markdown links as comments, which cascaded into file-wide line merges that swallowed defmacro __using__ bodies (broke args_schema use-chain resolution).

Make LookupFollowDelegate recursive (depth limit 5) so multi-hop delegate chains like PaymentsHub → FundFlowExecution → Handlers resolve to the actual implementation instead of stopping at the intermediate defdelegate.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

internal/lsp/elixir.go

Advance inline-def parsing past parameter lists and def bodies so nested statements are not treated as top-level __using__/quote statements, and keep defmodule scope tracking active when `do` appears on the next line to avoid misattributed aliases.

internal/parser/parser_tokenized.go

…form processModuleDef stopped scanning at TokEOL, so `defmodule Foo\ndo` left TokDo to the main loop (double-counting depth) and misattributed functions after the inner module's end. Now scans past EOL with statement-boundary guards to avoid stealing a later module's TokDo. Also always emits the module Definition even for `, do:` one-liners so they are tracked in the store. No frame is pushed for inline modules since there is no do..end scope.

internal/parser/parser_tokenized.go

When collectModuleName encounters a non-TokModule token inside a multi-alias brace block (e.g. an atom or number), it returns without advancing the position. The three brace-scanning loops in parseTextFromTokens, parseHelperQuoteBlock, and parseUsingBody were missing the forward-progress guard that extractAliasesFromText already had, causing them to spin forever. Add the same `if nk == k { k++ }` guard to all three sites.
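A minimal sketch of the forward-progress guard, assuming a simplified token stream of plain strings. `collectName` and `scanBrace` are hypothetical stand-ins for collectModuleName and the brace-scanning loops; only the `nk == k` check is taken from the commit message.

```go
package main

import "fmt"

// isModule is a toy check: module names start with an uppercase letter.
func isModule(tok string) bool {
	return len(tok) > 0 && tok[0] >= 'A' && tok[0] <= 'Z'
}

// collectName mimics the behavior described above: on a non-module token
// it returns without advancing the position.
func collectName(toks []string, k int) (string, int) {
	if k < len(toks) && isModule(toks[k]) {
		return toks[k], k + 1
	}
	return "", k
}

// scanBrace collects module names until the closing brace. Without the
// nk == k guard, an atom or number inside the braces would leave k
// unchanged and the loop would never terminate.
func scanBrace(toks []string, k int) ([]string, int) {
	var names []string
	for k < len(toks) && toks[k] != "}" {
		name, nk := collectName(toks, k)
		if nk == k {
			k++ // forward-progress guard: skip the token we cannot consume
			continue
		}
		names = append(names, name)
		k = nk
	}
	return names, k
}

func main() {
	// Tokens for: { Child, :atom, Other }
	toks := []string{"{", "Child", ",", ":atom", ",", "Other", "}"}
	names, end := scanBrace(toks, 1)
	fmt.Println(names, end)
}
```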

The `require Module, as: Name` syntax was not being parsed, so modules aliased via require couldn't be resolved for go-to-definition. Updated the parser and LSP alias extraction to handle this pattern.

internal/lsp/elixir.go

The tokenizer emits `do:` as `TokIdent("do") + TokColon` via isKeywordKey, never as `TokDo + TokColon`. Only block-opening `do` (without trailing colon) produces TokDo, so checking if TokDo is followed by TokColon is unreachable.

Made-with: Cursor
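The distinction can be sketched as below. This `isKeywordKey` is a guess at the rule, written for illustration (an identifier immediately followed by a single colon); the real function's signature and edge cases are not shown in this thread, and `kindsForDo` is a hypothetical helper that returns kind names as strings.

```go
package main

import (
	"fmt"
	"strings"
)

// isKeywordKey reports whether the identifier ending just before src[end]
// is written as a keyword key such as `do:`. Hypothetical re-creation.
func isKeywordKey(src string, end int) bool {
	if end >= len(src) || src[end] != ':' {
		return false
	}
	// `::` is the type operator, so `name::` is not a keyword key.
	return end+1 >= len(src) || src[end+1] != ':'
}

// kindsForDo returns the token kinds a lexer following the rule above
// would emit for the `do` that starts at src[start].
func kindsForDo(src string, start int) []string {
	if isKeywordKey(src, start+len("do")) {
		return []string{"TokIdent", "TokColon"}
	}
	return []string{"TokDo"}
}

func main() {
	for _, src := range []string{"def foo, do: :ok", "def foo do"} {
		i := strings.Index(src, "do")
		fmt.Println(src, "=>", kindsForDo(src, i))
	}
}
```

Under this rule a `TokDo` is never followed by the colon of its own keyword form, which is why the removed check could not fire.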

internal/parser/parser.go

internal/lsp/elixir.go

Backslash-escaped newlines (\\\n) inside strings, heredocs, sigils, interpolations, and char literals were skipped with i += 2 without incrementing the line counter. This caused all subsequent tokens to report line numbers that were too low, producing wrong go-to-definition targets (e.g. landing on line 588 instead of 594 in ecto/schema.ex). Fixed all 7 affected scan sites: scanStringContent, scanHeredocContent, scanInterpolation (2 sites), scanSigilContent (2 branches), and the main-loop char literal path. Added regression tests for each.
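The shape of the fix, reduced to a single scan site: when an escape is skipped with `i += 2`, the skipped byte still has to be checked for a newline. `scanString` below is a hypothetical simplification of the scan functions named above, not their actual code.

```go
package main

import "fmt"

// scanString scans a double-quoted string that starts at src[start] and
// returns the index just past the closing quote together with the number
// of newlines consumed, so the caller can advance its line counter.
func scanString(src string, start int) (end, newlines int) {
	i := start + 1
	for i < len(src) {
		switch {
		case src[i] == '\\' && i+1 < len(src):
			if src[i+1] == '\n' {
				newlines++ // the fix: an escaped newline still ends a line
			}
			i += 2
		case src[i] == '"':
			return i + 1, newlines
		default:
			if src[i] == '\n' {
				newlines++
			}
			i++
		}
	}
	return i, newlines // unterminated string: stop at end of input
}

func main() {
	// The string body contains one escaped newline and one plain newline.
	src := "\"a\\\nb\nc\" x"
	end, n := scanString(src, 0)
	fmt.Println(end, n)
}
```

Dropping the `newlines++` in the escape branch reproduces the bug: every token after the string would report a line number one too low per escaped newline.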

internal/lsp/elixir.go

Centralize block/alias token scans across parser and LSP to prevent drift, and move hover doc extraction to tokenized paths with added regression tests.

internal/lsp/elixir_docs.go

internal/lsp/elixir.go

This restores hover docs for non-quoted sigil forms and avoids false go-to-definition hits on attribute reference sites.

internal/lsp/elixir.go

Prevent signature-help call detection from treating keywords like `if` as callable expressions in no-paren contexts, and add a regression test to ensure keyword forms do not produce false call contexts.

cursor · 2026-04-14T16:34:34Z

internal/lsp/elixir.go

+			}
+		case parser.TokDefmodule:
+			i = processModuleDef(i+1) - 1 // -1: loop post-increment will advance to the returned position
+			continue


Enclosing module extraction ignores defprotocol and defimpl

Low Severity

extractEnclosingModuleFromTokens only handles TokDefmodule in its switch, while the closely related extractAliasesFromTokens handles TokDefmodule, TokDefprotocol, and TokDefimpl. This means __MODULE__ resolution via ResolveModuleExpr and ExtractAliasBlockParent will return the wrong enclosing module (or empty string) when the cursor is inside a defprotocol or defimpl block, since those module-defining constructs are not tracked on the scope stack.

Additional Locations (1)

internal/lsp/elixir.go#L841-L843

Reviewed by Cursor Bugbot for commit 3eb476d.

cursor · 2026-04-14T16:34:34Z

internal/lsp/elixir.go

+			i += 2
+		}
+		return strings.Join(parts, "."), i
+	}


Duplicated local helpers in parseUsingBody risk divergence

Low Severity

parseUsingBody defines local nextSig and collectModuleName closures that are functionally identical to parser.NextSigToken and parser.CollectModuleName. The same file already creates package-level aliases for these (tokNextSig, tokCollectModuleName) and uses them in every other function. Having a separate local copy creates a maintenance risk where a fix to one copy won't propagate to the other.


- ExtractAliasBlockParent: assert parent on both Accounts and blank lines; assert defmodule line is not inside the block.
- ExtractAliasesInScope: cover require ... as pairs on the same line as alias; document nextPos/for-loop regression.
- parseUsingBody: add quote-body case for two semicolon-separated alias as forms.

Made-with: Cursor

cursor

Cursor Bugbot has reviewed your changes and found 2 potential issues.

There are 4 total unresolved issues (including 2 from previous reviews).


Reviewed by Cursor Bugbot for commit ce1086b.

cursor · 2026-04-14T19:54:19Z

internal/lsp/elixir.go

+					aliases[parser.AliasShortName(childName)] = parent + "." + childName
+				}
+				i = nextPos - 1
+				continue


Off-by-one in while-style loop skips backward

Medium Severity

In parseUsingBody, the main walk loop at line 1425 is a while-style for i < n && depth > 0 with no automatic post-increment — each branch manually advances i. The alias handling uses i = nextPos - 1; continue (lines 1496 and 1505), copied from the extractAliasesFromTokens standard for i := 0; i < n; i++ loop where - 1 compensates for the auto-increment. In the while-style loop, this causes i to point one token before the intended position, re-processing an already-handled token and potentially causing incorrect behavior or infinite loops.

Additional Locations (1)

internal/lsp/elixir.go#L1504-L1506
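The difference between the two loop styles can be reproduced in isolation. In this sketch `consume` is a hypothetical handler that returns the position just past a three-token construct; the functions record which indices each loop visits.

```go
package main

import "fmt"

// consume pretends to handle a 3-token construct starting at i and
// returns the position just past it (hypothetical handler).
func consume(i int) int { return i + 3 }

// walkFor uses a standard for loop, so "- 1" correctly compensates for
// the i++ that still runs after continue.
func walkFor(toks []string) []int {
	var visited []int
	for i := 0; i < len(toks); i++ {
		visited = append(visited, i)
		if toks[i] == "alias" {
			i = consume(i) - 1
			continue
		}
	}
	return visited
}

// walkWhile advances i manually. With compensate set it repeats the
// copied "- 1" and lands one token early.
func walkWhile(toks []string, compensate bool) []int {
	var visited []int
	i := 0
	for i < len(toks) {
		visited = append(visited, i)
		if toks[i] == "alias" {
			next := consume(i)
			if compensate {
				next-- // the off-by-one: nothing increments i afterwards
			}
			i = next
			continue
		}
		i++
	}
	return visited
}

func main() {
	toks := []string{"alias", "A", "B", "x", "y"}
	fmt.Println(walkFor(toks))          // [0 3 4]
	fmt.Println(walkWhile(toks, false)) // [0 3 4]
	fmt.Println(walkWhile(toks, true))  // [0 2 3 4]: token 2 is re-processed
}
```

If a handler ever consumed exactly one token, `next - 1` would equal `i` in the while-style loop and it would never advance, which is the infinite-loop risk the comment describes.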


cursor · 2026-04-14T19:54:19Z

internal/lsp/elixir.go

+var (
+	tokNextSig           = parser.NextSigToken
+	tokCollectModuleName = parser.CollectModuleName
+	tokText              = parser.TokenText


Package-level alias tokText is redundant

Low Severity

The package-level variable tokText is assigned as an alias for parser.TokenText but is only used once (line 1208). Meanwhile, parser.TokenText is called directly in many other places in the same file. Having both tokText and direct parser.TokenText calls is inconsistent and the alias adds unnecessary indirection.


JesseHerrick added 3 commits April 12, 2026 19:50

Fix parser: handle multi-line and edge case parsing

03cc708

Handle more edge-cases

1ecda73