[clr-interp] Implement cached virtual/interface dispatch #123815
Conversation
We create a simple hashtable (`InterpDispatchCache`) that maps a DispatchToken plus target MethodTable to the target method to be called. The cache is similar to the `DispatchCache` used by VSD. It holds a single mapping per index; when a collision happens, the existing entry is replaced with the new one. Replaced entries are freed during GC. The expectation is that there will be few collisions, given that only a subset of methods are being interpreted.
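The scheme described above can be sketched roughly as follows. This is an illustrative stand-in, not the PR's actual code: `std::atomic` takes the place of the runtime's interlocked/volatile primitives, and the type names, hash function, and cache size constants are assumptions modeled on the description (4096 entries, single mapping per index, collision replaces the old entry, displaced entries chained onto a dead list for reclamation at GC time).

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-ins for the runtime types referenced in the PR.
struct MethodTable {};
struct MethodDesc {};

constexpr int    kCacheBits = 12;                      // 4096-entry cache, per the PR
constexpr size_t kCacheSize = size_t(1) << kCacheBits;
constexpr size_t kCacheMask = kCacheSize - 1;

struct CacheEntry
{
    size_t       dispatchToken;
    MethodTable* pMT;
    MethodDesc*  pTargetMD;
    CacheEntry*  pNextDead; // linked into a dead list when displaced
};

class InterpDispatchCacheSketch
{
    std::atomic<CacheEntry*> m_cache[kCacheSize] {};   // zero-initialized slots
    std::atomic<CacheEntry*> m_pDeadList { nullptr };

    static size_t Hash(size_t token, MethodTable* pMT)
    {
        // Illustrative hash only; the real code reuses DispatchToken::GetHash().
        return (token ^ (reinterpret_cast<size_t>(pMT) >> 4)) & kCacheMask;
    }

public:
    MethodDesc* Lookup(size_t token, MethodTable* pMT)
    {
        // Acquire load so the immutable entry fields are visible after the
        // pointer is observed (the runtime would use a volatile load here).
        CacheEntry* e = m_cache[Hash(token, pMT)].load(std::memory_order_acquire);
        if (e != nullptr && e->pMT == pMT && e->dispatchToken == token)
            return e->pTargetMD;
        return nullptr; // miss: caller resolves via the slow path, then calls Insert
    }

    void Insert(size_t token, MethodTable* pMT, MethodDesc* pTarget)
    {
        CacheEntry* e = new CacheEntry { token, pMT, pTarget, nullptr };
        // Single mapping per index: a collision simply displaces the old entry.
        CacheEntry* old = m_cache[Hash(token, pMT)].exchange(e, std::memory_order_acq_rel);
        if (old != nullptr)
        {
            // Displaced entries cannot be freed immediately (a racing reader
            // may still hold the pointer); push them onto a dead list that is
            // reclaimed at a GC sync point when no mutators are running.
            old->pNextDead = m_pDeadList.load(std::memory_order_relaxed);
            while (!m_pDeadList.compare_exchange_weak(old->pNextDead, old)) {}
        }
    }
};
```

A lookup miss falls back to the normal dispatch resolution and then populates the cache, so the steady-state cost of a repeated virtual call is one hash plus one compare.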
Tagging subscribers to this area: @BrzVlad, @janvorli, @kg
```cpp
AddIns(tailcall ? INTOP_CALLVIRT_TAIL : INTOP_CALLVIRT);
m_pLastNewIns->data[0] = GetDataItemIndex(callInfo.hMethod);
// Reserved slot for caching DispatchToken
m_pLastNewIns->data[1] = GetNewDataItemIndex(nullptr);
```
Computing the DispatchToken here didn't seem to provide much benefit, and it would require adding a lot of code to the EE interface only for it to be used in the interpreter. Unless there is some existing API that would make it easy to compute something useful for the dispatch cache.
```cpp
{
    LIMITED_METHOD_CONTRACT;

    uint16_t dispatchTokenHash = DispatchToken::From_SIZE_T(dispatchToken).GetHash();
```
This is rather expensive, would be great to use something else.
At the very least we could store the precomputed dispatch token hash into the instruction.
Pull request overview
This pull request implements cached virtual and interface method dispatch for the CoreCLR interpreter to improve performance of virtual method calls. The implementation adds a simple hash table (InterpDispatchCache) that caches the mapping from (DispatchToken, MethodTable*) to the resolved MethodDesc*, similar to the DispatchCache used by Virtual Stub Dispatch (VSD).
Changes:
- Implements a new InterpDispatchCache structure that holds 4096 entries (12-bit cache) with collision replacement strategy
- Refactors DispatchToken::GetHash() to be shared between VSD and interpreter caches
- Integrates cache cleanup into GC sync points
- Adds dispatch token caching slots to INTOP_CALLVIRT and INTOP_CALLVIRT_TAIL instructions
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| src/coreclr/vm/interpexec.cpp | Adds InterpDispatchCache implementation with lookup/insert/reclaim logic, CreateDispatchTokenForMethod helper, and integrates cache into virtual call execution path |
| src/coreclr/vm/virtualcallstub.cpp | Refactors hash function by moving token hashing logic into DispatchToken::GetHash() method to enable sharing with interpreter cache |
| src/coreclr/vm/contractimpl.h | Adds GetHash() method declaration to DispatchToken struct for computing 12-bit cache hash |
| src/coreclr/vm/syncclean.cpp | Adds call to InterpDispatchCache_ReclaimAll() during GC sync point to clean up dead cache entries |
| src/coreclr/interpreter/inc/intops.def | Changes instruction size from 4 to 5 bytes for INTOP_CALLVIRT and INTOP_CALLVIRT_TAIL to accommodate dispatch token cache slot |
| src/coreclr/interpreter/compiler.h | Adds GetNewDataItemIndex() helper method to allocate non-shared data item slots for runtime caching |
| src/coreclr/interpreter/compiler.cpp | Updates EmitCall to allocate dispatch token cache slots for virtual call instructions |
```cpp
size_t dispatchToken = (size_t)pMethod->pDataItems[ip[4]];
if (dispatchToken == 0)
{
    dispatchToken = CreateDispatchTokenForMethod(pMD);
    pMethod->pDataItems[ip[4]] = (void*)dispatchToken;
```
Copilot AI · Jan 30, 2026
The dispatch token is cached by writing to pMethod->pDataItems[ip[4]] without synchronization. If multiple threads execute the same callvirt instruction simultaneously, they might race to initialize the same dispatch token slot. While writing the same value should be benign, consider using InterlockedCompareExchangeT with nullptr as the expected value to ensure only one thread performs the initialization, similar to common double-checked locking patterns. This would make the intent clearer and avoid potential issues with torn writes on some architectures.
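The double-checked pattern Copilot suggests can be sketched with portable primitives. Here `std::atomic` stands in for the runtime's `VolatileLoad`/`InterlockedCompareExchangeT`, and `ComputeDispatchToken` is a hypothetical stand-in for `CreateDispatchTokenForMethod`; the point is that racing threads may each compute the token, but only one publishes it, and the slot can never be observed torn or half-initialized.

```cpp
#include <atomic>
#include <cstddef>

// Hypothetical stand-in for CreateDispatchTokenForMethod.
static size_t ComputeDispatchToken() { return 0xBEEF; }

size_t GetOrInitToken(std::atomic<void*>& slot)
{
    // First check: fast path once the slot is initialized.
    void* cached = slot.load(std::memory_order_acquire);
    if (cached == nullptr)
    {
        void* fresh = reinterpret_cast<void*>(ComputeDispatchToken());
        // Publish only if the slot is still empty; if another thread won the
        // race, `cached` is updated to its value and we adopt that instead.
        if (!slot.compare_exchange_strong(cached, fresh, std::memory_order_acq_rel))
            fresh = cached;
        return reinterpret_cast<size_t>(fresh);
    }
    return reinterpret_cast<size_t>(cached);
}
```

Since the token is deterministic for a given call site, the losing thread's redundant computation is wasted work but not a correctness problem.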
```cpp
#define DISPATCH_CACHE_BITS 12
#define DISPATCH_CACHE_SIZE (1 << DISPATCH_CACHE_BITS)
#define DISPATCH_CACHE_MASK (DISPATCH_CACHE_SIZE - 1)
```
Copilot AI · Jan 30, 2026
Consider adding a static assertion to verify that DISPATCH_CACHE_BITS equals CALL_STUB_CACHE_NUM_BITS, since the hash function in DispatchToken::GetHash assumes a 12-bit cache. This ensures consistency between the interpreter dispatch cache and the VSD cache hash function, and would catch any future changes that inadvertently modify only one of these constants.
Suggested change:

```cpp
static_assert(DISPATCH_CACHE_BITS == CALL_STUB_CACHE_NUM_BITS, "Interpreter dispatch cache size must match VSD dispatch cache size");
```
```cpp
#include "threadsuspend.h"

#ifdef FEATURE_INTERPRETER
void InterpDispatchCache_ReclaimAll();
```
Copilot AI · Jan 30, 2026
The forward declaration for InterpDispatchCache_ReclaimAll could be moved to a header file (e.g., interpexec.h) to make it more maintainable and discoverable. This would avoid duplication if other parts of the codebase need to reference this function in the future, and follows the general pattern of declaring public functions in headers rather than using forward declarations at call sites.
Suggested change:

```diff
- void InterpDispatchCache_ReclaimAll();
+ #include "interpexec.h"
```
```cpp
};

// Global interpreter dispatch cache instance
static InterpDispatchCache g_InterpDispatchCache;
```
Copilot AI · Jan 30, 2026
The global InterpDispatchCache variable g_InterpDispatchCache lacks explicit initialization. While C++ zero-initializes global variables (so m_cache will be all nullptr and m_pDeadList will be nullptr), it would be more explicit and safer to add a constructor to InterpDispatchCache that initializes these members, similar to how DispatchCache has an explicit constructor. This makes the initialization contract clear and prevents potential issues if the struct is used in other contexts.
```cpp
InterpDispatchCacheEntry* pDeadList = m_pDeadList;

while (pDeadList != nullptr)
{
    InterpDispatchCacheEntry* pNext = pDeadList->pNextDead;
    delete pDeadList;
    pDeadList = pNext;
}

m_pDeadList = nullptr;
```
Copilot AI · Jan 30, 2026
The ReclaimDeadEntries method reads m_pDeadList without synchronization before clearing it. While this is called during GC sync point when no other threads are running (as documented), the assignment m_pDeadList = nullptr on line 97 should use a volatile write or InterlockedExchangeT for consistency with how m_pDeadList is updated elsewhere via InterlockedCompareExchangeT. This ensures proper memory ordering and makes the code more robust if calling contexts change in the future.
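The hardened variant the comment suggests can be sketched with portable primitives (`std::atomic` again stands in for the runtime's interlocked operations, and `DeadEntry` is a hypothetical stand-in for `InterpDispatchCacheEntry`): detaching the whole list with a single atomic exchange makes the reclaim correct even if it were ever to run concurrently with an `Insert` pushing onto the dead list.

```cpp
#include <atomic>

// Hypothetical stand-in for InterpDispatchCacheEntry.
struct DeadEntry { DeadEntry* pNextDead; };

void ReclaimDeadEntries(std::atomic<DeadEntry*>& deadList)
{
    // Atomically take ownership of the entire list and reset the head, so a
    // concurrent push can never be lost between the read and the clear.
    DeadEntry* p = deadList.exchange(nullptr, std::memory_order_acq_rel);
    while (p != nullptr)
    {
        DeadEntry* next = p->pNextDead;
        delete p;
        p = next;
    }
}
```

At a GC sync point with all mutators suspended this is equivalent to the plain read-and-clear, so the exchange costs essentially nothing while removing the assumption about the calling context.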
```cpp
InterpDispatchCacheEntry* pEntry = m_cache[idx];
// Data dependency ensures field reads are ordered after loading of `pEntry`
// The entry struct is immutable once created, so these reads are safe
if (pEntry != nullptr && pEntry->pMT == pMT && pEntry->dispatchToken == dispatchToken)
    return pEntry->pTargetMD;
```
Copilot AI · Jan 30, 2026
The Lookup method reads m_cache[idx] without acquire semantics, but the Insert method publishes entries using InterlockedExchangeT which has release semantics. To ensure proper memory ordering and guarantee that the fields of pEntry are visible after the pointer is loaded, consider using VolatileLoad to read the cache entry, similar to how VSD's DispatchCache::GetCacheEntry uses VolatileLoad. This ensures the data dependency chain is correctly established across all compiler optimizations and architectures.
@jkotas @davidwrighton Simplistic implementation from the other day's discussion. Let me know if this approach seems OK.
davidwrighton left a comment:
I want to look at this more on Monday, but this is about what I thought we should build here.
```cpp
size_t idx = Hash(dispatchToken, pMT);

InterpDispatchCacheEntry* pEntry = m_cache[idx];
```
This needs to be a VolatileLoadWithoutBarrier or the compiler can do bad things like make this not a single load.
```cpp
ip += 4;
// Obtain the cached dispatch token or initialize it
size_t dispatchToken = (size_t)pMethod->pDataItems[ip[4]];
```
Since this is a read of something lazy initialized, this needs to be a VolatileLoadWithoutBarrier to prevent compiler shenanigans.
This makes a microbenchmark from the suite 4x faster.