Skip to content

Fix ref count cache inconsistencies#3176

Open
andrewbranch wants to merge 15 commits intomicrosoft:mainfrom
andrewbranch:copilot/general-herring
Open

Fix ref count cache inconsistencies#3176
andrewbranch wants to merge 15 commits intomicrosoft:mainfrom
andrewbranch:copilot/general-herring

Conversation

@andrewbranch
Copy link
Member

Fixes #3086

Unfortunately, the combination of simplifications in #2483 and #3062 interacted badly and introduced two possible ways of releasing source files from the cache too early.

Bug 1: building auto-imports while building a program

#2483 made the LoadOrStore operation of the ref counting cache not actually increment the ref count of the entry so we could remove some special handling of source file ref counts around program cloning. Instead, we would ref everything in a dedicated pass at the end of snapshot cloning. That was intended as a simplification to make the code easier to reason about, but having any period of time where entries in the cache have refCount == 0 but can't be deleted was a bad idea. The failure mode:

  • Request A causes a snapshot clone that builds a new program. Execution advances past program creation such that at least one source file is present in the parse cache with refCount == 0; the parseCache.Ref() call hasn't happened yet.
  • Request B meanwhile triggers an auto-import registry build. It gets the same source file created by the program in Request A, increments its ref count to 1, finishes work, and decrements its ref count to 0, causing it to be deleted from the cache.
  • Request A finally tries to ref every file included in the program it built, which panics because that file has been deleted from the cache in the meantime.

The fix here is to revert the simplification from #2483 and have parse cache entries begin life with refCount == 1.

The actual bug fixed in #2483 was not about parse cache entries at all, but about extended config cache entries, which used the same underlying ref counting cache infrastructure. The issue with extended config entries was their natural pattern of being pulled on by the config parser was extremely hard to follow, such that it was incredibly difficult to impossible to ensure that every snapshot refs every extended config exactly one time if incrementing the counter happened as a side effect of config parsing—that was where a single dedicated loop of reffing instead of reffing on the fly really paid off. As an alternative, I've switched extended config caches to store a set of owners instead of a count. This way, config parsing can immediately and atomically add the snapshot as an owner of anything it pulls on, but we can also do a pass at the end of snapshot cloning to ref every extended config without being concerned about double-reffing.

These two changes are in 55c29ba.

Note: I considered using the an owner cache for the parse cache too (owner key for that being program ID, not snapshot ID), but because there are typically orders of magnitude more source files than extended config files, I wanted to avoid the overhead of allocating a map on every parse cache entry.

Bug 2: building auto-imports by cloning a disposed snapshot

#3062 was another simplification that asserted that we don't need ref counting on snapshots, because disposing a snapshot doesn't make it unclonable—at worst, it throws away parse cache entries that could have been reused, but they're reconstructable, and we don't have any code paths that would benefit from keeping the cache entries around.

That was true, but the implementation created an asymmetry when cloning a snapshot that's already been disposed:

  • Request A asks for completions, obtains the current snapshot, and discovers that it needs to clone it to add auto-imports.
  • Request B meanwhile processes a document change and produces a new snapshot, calling Dispose() on the previous one, still held by Request A. The disposal drops the ref count for the old program to zero, which triggers the release of some files from the parse cache.
  • Request A clones the disposed snapshot to add auto-import info to it. No programs are updated, so parse cache entries are not incremented, but the old program still in the snapshot gets reffed in the program counter, recreating a new entry with refCount == 1 for it, an operation that can normally only happen in conjunction with reffing all its files in the parse cache.
  • Request A finishes, and since the session state has already moved on, the snapshot with added auto-imports gets disposed, not adopted by the session. Disposal derefs the program and its ref count reaches zero for the second time, triggering an extra, unbalanced deref of all its source files.
  • Request C processes a single-file change. Each file in the new program is reffed, but any that reached a ref count of zero in the previous step are missing, triggering the panic.

The fix: for newly created programs, unconditionally add them to the program counter. Otherwise, increment their ref count only if they still have an entry in the counter. If the entry is gone, they've been previously disposed, and re-adding them would cause an eventual double-dispose. 07af679

Alternatives considered: many. Most tempting is to delete the program counter entirely and unconditionally ref every source file across every program in every snapshot clone exactly once per occurrence, and do the same exact loop in snapshot disposal. This was in an earlier commit of #3062, but I benchmarked it as adding about 10ms per 100k files. 100k files is a lot and 10ms is a little, but at the same time, snapshot clones are very hot (typically one per keystroke plus change), so at the time I decided it was worth the added mental load to keep it as cheap as possible. I could now be convinced to take on that cost, but, I think the fix here is correct, and actually not too much code. But I am getting tired of this stuff being a problem, so simplifying is tempting.

andrewbranch and others added 2 commits March 20, 2026 09:07
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@andrewbranch andrewbranch changed the title Copilot/general herring Fix ref count cache inconsistencies Mar 20, 2026
mu sync.Mutex
value V
refCount int
deleted bool
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No longer needed now that new entries always start with refCount == 1, at least that's nice

@andrewbranch andrewbranch marked this pull request as ready for review March 20, 2026 19:50
Copilot AI review requested due to automatic review settings March 20, 2026 19:50
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses cache lifetime/ref-counting races in the project system (parse cache + extended tsconfig cache) that can lead to premature cache entry deletion and panics during concurrent snapshot cloning and auto-import preparation.

Changes:

  • Restore “acquire-on-load” semantics for the parse cache so new source file cache entries start with refCount == 1, and adjust program cloning to ref unchanged files when UpdateProgram reuses them.
  • Replace extended tsconfig ref-counting with snapshot-ID ownership tracking (OwnerCache) and update clone/dispose paths accordingly.
  • Add/adjust tests covering cloning disposed snapshots and correct cache entry release behavior.

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
internal/project/snapshot.go Updates program counter semantics for reused vs new programs; switches extended-config tracking to snapshot ownership acquire/release.
internal/project/session.go Updates cache stats logging to reflect OwnerCache entry type.
internal/project/refcountcache.go Reintroduces Acquire semantics for ref-counted cache entries (used by parse cache).
internal/project/parsecache.go Updates parse cache to use the new RefCountCache.Acquire-only API.
internal/project/compilerhost.go Ensures program construction acquires parse cache entries immediately via Acquire.
internal/project/project.go When UpdateProgram clones, refs unchanged files since only the changed file is acquired during update.
internal/project/programcounter.go Splits “new program” (Add) vs “ref existing only” (Ref returning bool) to avoid double-dispose patterns.
internal/project/autoimport.go Uses Acquire instead of Load+Ref for auto-import registry source file access and balances via Deref on dispose.
internal/project/ownercache.go Introduces OwnerCache for snapshot ownership tracking (used by extended config cache).
internal/project/extendedconfigcache.go Switches extended config caching from ref-counting to OwnerCache with expiration logic.
internal/project/configfileregistrybuilder.go Acquires extended configs using the current snapshot ID during config parsing.
internal/project/projectcollectionbuilder.go Passes snapshot ID into config file registry builder to enable per-snapshot ownership.
internal/project/refcountcache_test.go Expands tests for parse cache lifetime and cloning disposed snapshots.
internal/project/extendedconfigcache_test.go Updates test expectations from ref counts to owner counts and adds ownership scenarios.

Copy link
Member

@jakebailey jakebailey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I definitely like the invariant of ensuring we start with refcount at 1

@jakebailey
Copy link
Member

Then again, the copilot comments do sound plausible

@andrewbranch
Copy link
Member Author

andrewbranch commented Mar 20, 2026

Yes, they're correct. The last commit is a solution but I don't like it. It's kind of pointing towards "no, snapshots really should have a ref count," but we saw in #2905 that it's pretty confusing trying to track where to release snapshots since they're sometimes used explicitly and sometimes used implicitly by the language service. Language service objects would have be disposed too, or always come with a snapshot release func, or be used inside a closure, or whatever. It's not impossible to implement carefully and correctly... maybe that's better? Ugh

type OwnerCacheOptions struct {
DisableDeletion bool
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add documentation on what this is for?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It wasn't used for anything, it was just copied from the ref count cache without its doc comment. Deleted.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant on owner cache in general: what an owner cache is for. Also, I think we could probably merge the parse cache and this if we abstract the whole owner/ref counting thing into an interface or something, but not sure if it's worth it instead of maybe just adding a comment saying if you update one, you should update the other.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I thought about sharing some code with an interface, but didn't want to introduce any unnecessary overhead into the parse cache. I can add a doc comment.

@jakebailey
Copy link
Member

Language service objects would have be disposed too, or always come with a snapshot release func, or be used inside a closure, or whatever. It's not impossible to implement carefully and correctly... maybe that's better? Ugh

That was where my PR was headed; I just never committed the wrapper closure code since we opted to not go that way. I don't really mind so long as things are working 😄

@gabritto
Copy link
Member

General question: I thought the reason the parse cache didn't make entries start with ref count 1 was that we could somehow parse a file that was temporary and didn't make it into a program. That's not possible, is it?

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@andrewbranch
Copy link
Member Author

@gabritto not exactly, you might be thinking of the awkwardness around updating a program:

  • if multiple files have changed, we don't try to do a cheap clone and every file in the new program gets acquired once
  • if one file has changed, UpdateProgram acquires it to determine if it allows for a cheap clone, then:
    • if cloning is allowed, all other source files get copied in without being acquired, so we need to ref every file except that changed one afterwards
    • if cloning is not allowed, all files including the changed one get acquired again, resulting in the changed one being reffed twice, so we need to deref it afterwards. This was missing until the last commit, including prior to Fix extended config ref counting #2483—it was an undiscovered memory leak. That's a great example of why building programs without reffing, and then just reffing whatever ends up in the program, is easier 😞

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 15 out of 15 changed files in this pull request and generated 1 comment.

@gabritto
Copy link
Member

@gabritto not exactly, you might be thinking of the awkwardness around updating a program:

  • if multiple files have changed, we don't try to do a cheap clone and every file in the new program gets acquired once

  • if one file has changed, UpdateProgram acquires it to determine if it allows for a cheap clone, then:

    • if cloning is allowed, all other source files get copied in without being acquired, so we need to ref every file except that changed one afterwards
    • if cloning is not allowed, all files including the changed one get acquired again, resulting in the changed one being reffed twice, so we need to deref it afterwards. This was missing until the last commit, including prior to Fix extended config ref counting #2483—it was an undiscovered memory leak. That's a great example of why building programs without reffing, and then just reffing whatever ends up in the program, is easier 😞

Ah yes, I think that was it. Thanks for the explanation. Seems there's no ideal solution 😅

andrewbranch and others added 3 commits March 20, 2026 15:41
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 17 out of 17 changed files in this pull request and generated no new comments.

@andrewbranch
Copy link
Member Author

I hate it all, but I don't have better ideas. We might consider going back to snapshot ref counts, but that would only eliminate the program counter's Add/Ref split and the extended config cache's TryAcquire. The parse cache complications would remain.


var missingFiles []string
files := make([]*ast.SourceFile, 0, totalFileCount-libFileCount)
duplicateSourceFiles := make([]*ast.SourceFile, 0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might as well just use var duplicateSourceFiles []*ast.SourceFile, otherwise it allocates an empty slice that's immediately overwritten on append.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

panic: cache entry not found

4 participants