
vm: opcode-scoped intern cache on CallContext to eliminate duplicate unique.Make()#20552

Open
AskAlexSharov wants to merge 9 commits into main from alex/intern_35

Conversation

Collaborator

@AskAlexSharov AskAlexSharov commented Apr 14, 2026

Problem

Every EVM opcode that reads or writes storage (SLOAD, SSTORE,
BALANCE, EXTCODESIZE, etc.) calls accounts.InternKey or
accounts.InternAddress to convert a raw stack value into a
unique.Handle. For opcodes like SLOAD and SSTORE, this conversion
happens twice per dispatch: once in the gas function (e.g.
gasSLoadEIP2929) and once in the execute function (e.g. opSload).
These two phases are separated by the dispatch table and cannot
directly share a local variable.

unique.Make is not free: it traverses a global lock-free hash-trie
(canonMap) with several atomic loads per call. Under parallel
execution with many goroutines, cache misses on the global map become
a measurable bottleneck. Profiling BenchmarkSLOADWarm shows
canonMap.Load consuming ~16% of execution time.

Solution

Add a generation-counter cache to CallContext: cacheGen is incremented once at the top of the interpreter loop, and the peekStorageKey()/peekAddress() helpers re-intern only when their stored generation differs from cacheGen:

cacheGen      uint64              // incremented once per opcode dispatch
cachedKeyGen  uint64              // generation at which cachedKey was populated
cachedAddrGen uint64              // generation at which cachedAddr was populated
cachedKey     accounts.StorageKey // valid only while cachedKeyGen == cacheGen
cachedAddr    accounts.Address    // valid only while cachedAddrGen == cacheGen

                                                                                                                      
goos: linux
  goarch: amd64                                                                                                        
  pkg: github.com/erigontech/erigon/execution/vm/benchmark                                                             
  cpu: AMD EPYC 4344P 8-Core Processor                                                                                 
                                         │    main     │           alex/intern2_35            │                        
                                         │   sec/op    │   sec/op     vs base                 │                        
  NestedStaticCalls/depth-2-16                666.1m ±  1%   665.4m ± 1%        ~ (p=0.631 n=10)                       
  NestedStaticCalls/depth-4-16                725.5m ±  1%   726.3m ± 1%        ~ (p=0.912 n=10)                       
  NestedStaticCalls/depth-8-16                906.0m ±  0%   895.6m ± 0%   -1.16% (p=0.000 n=10)                       
  NestedStaticCalls/depth-16-16               916.8m ±  0%   901.5m ± 0%   -1.67% (p=0.000 n=10)                       
  DelegateCallProxy/1-layers-16               434.9m ±  1%   443.3m ± 2%   +1.92% (p=0.000 n=10)                       
  DelegateCallProxy/2-layers-16               468.5m ±  2%   476.7m ± 1%   +1.74% (p=0.015 n=10)                       
  DelegateCallProxy/4-layers-16               506.9m ±  1%   506.4m ± 2%        ~ (p=0.393 n=10)                       
  CallWithValue/no-value-16                   669.7m ±  1%   681.9m ± 0%   +1.82% (p=0.000 n=10)                       
  CallWithValue/with-value-16                  32.53m ±  1%    33.01m ± 1%   +1.46% (p=0.002 n=10)                     
  DeFiSwapChain/swap/100M-16                  214.9m ±  1%   205.6m ± 1%   -4.32% (p=0.000 n=10)                       
  PureArithmetic/add/1M-16                    1.641m ±  0%   1.648m ± 0%   +0.43% (p=0.035 n=10)                       
  PureArithmetic/add/10M-16                   16.37m ±  0%   16.47m ± 1%   +0.56% (p=0.001 n=10)                       
  PureArithmetic/add/100M-16                  164.6m ±  0%   165.4m ± 1%   +0.45% (p=0.004 n=10)                       
  PureArithmetic/mul/100M-16                  157.8m ±  0%   162.8m ± 0%   +3.16% (p=0.000 n=10)                       
  StackOps/dup-swap/100M-16                   180.9m ± 12%   185.0m ± 7%        ~ (p=0.796 n=10)                       
  MemoryOps/mstore-mload/100M-16              222.4m ±  0%   225.3m ± 1%   +1.34% (p=0.000 n=10)                       
  MemoryOps/mstore-growing/10M-16             3.609m ±  3%   3.604m ± 1%        ~ (p=0.123 n=10)                       
  Keccak256/32B/100M-16                       564.6m ±  0%   557.8m ± 0%   -1.20% (p=0.000 n=10)                       
  Keccak256/256B/100M-16                      604.4m ±  0%   604.2m ± 0%        ~ (p=0.529 n=10)                       
  Keccak256/4KB/100M-16                       812.2m ±  0%   808.8m ± 0%   -0.42% (p=0.002 n=10)                       
  MixedCompute/mixed/100M-16                  187.4m ±  5%   188.8m ± 4%        ~ (p=0.089 n=10)                       
  SLOADCold/10slots-16                        3.479µ ±  2%   3.210µ ± 2%   -7.72% (p=0.000 n=10)                       
  SLOADCold/50slots-16                        12.50µ ±  2%   10.96µ ± 1%  -12.31% (p=0.000 n=10)                       
  SLOADCold/100slots-16                       24.13µ ±  1%   21.55µ ± 1%  -10.68% (p=0.000 n=10)                       
  SLOADCold/500slots-16                       123.0µ ±  3%   110.7µ ± 1%   -9.97% (p=0.000 n=10)                       
  SLOADWarm/10slots-16                        150.6m ±  1%   125.7m ± 1%  -16.53% (p=0.000 n=10)                       
  SLOADWarm/50slots-16                        156.4m ±  1%   128.8m ± 2%  -17.65% (p=0.000 n=10)                       
  SLOADWarm/100slots-16                       159.7m ±  2%   133.9m ± 1%  -16.16% (p=0.000 n=10)                       
  SLOADWarm/500slots-16                       168.9m ±  1%   145.0m ± 1%  -14.12% (p=0.000 n=10)                       
  SSTORE/zero-to-nonzero-16                   148.3µ ±  3%   138.7µ ± 2%   -6.50% (p=0.000 n=10)                       
  SSTORE/nonzero-to-nonzero-16               153.8µ ±  4%   138.7µ ± 2%   -9.81% (p=0.000 n=10)                        
  SSTORE/nonzero-to-zero-16                   152.3µ ±  2%   142.1µ ± 2%   -6.71% (p=0.000 n=10)                       
  TransientStorage/10slots-16                 56.92m ±  1%   57.11m ± 1%        ~ (p=0.123 n=10)                       
  TransientStorage/100slots-16                59.49m ±  1%   59.11m ± 1%        ~ (p=0.218 n=10)                       
  TransientStorage/500slots-16                67.86m ±  1%   68.18m ± 0%        ~ (p=0.436 n=10)                       
  StorageDiversity/100slots-16                24.51µ ±  2%   22.09µ ± 1%   -9.85% (p=0.000 n=10)                       
  StorageDiversity/1000slots-16               258.1µ ±  1%   230.2µ ± 1%  -10.82% (p=0.000 n=10)                       
  ERC20Transfer/transfer/100M-16              147.2m ±  2%   139.2m ± 1%   -5.42% (p=0.000 n=10)                       
  ERC20TransferFrom/transferFrom/100M-16      195.2m ±  2%   184.6m ± 3%   -5.43% (p=0.000 n=10)                       
  ERC20BalanceOf/balanceOf/100M-16           147.1m ±  1%   127.6m ± 3%  -13.24% (p=0.000 n=10)                        
  ERC20BatchTransfers/batch-5-16              16.81µ ±  2%   16.35µ ± 2%   -2.75% (p=0.000 n=10)                       
  ERC20BatchTransfers/batch-10-16             32.14µ ±  3%   30.38µ ± 1%   -5.46% (p=0.000 n=10)                       
  ERC20BatchTransfers/batch-50-16             158.4µ ±  5%   147.0µ ± 1%   -7.20% (p=0.000 n=10)                       
  geomean                                      17.25m         16.49m        -4.40%                                     
                                                                                                                       
  pkg: github.com/erigontech/erigon/execution/vm/runtime                                                               
                                         │    main     │           alex/intern2_35            │
                                         │   sec/op    │   sec/op     vs base                 │                        
  EVM_CREATE_500-16                           9.254m ± 1%   8.794m ±  1%  -4.98% (p=0.000 n=10)
  EVM_CREATE2_500-16                          58.11m ± 0%   57.64m ±  0%  -0.82% (p=0.000 n=10)                        
  EVM_CREATE_1200-16                          14.74m ± 2%   14.63m ±  2%       ~ (p=0.143 n=10)                        
  EVM_CREATE2_1200-16                         50.59m ± 0%   49.86m ±  0%  -1.44% (p=0.000 n=10)                        
  EVM_RETURN/1000-16                          990.0n ± 0%   995.2n ±  0%  +0.54% (p=0.007 n=10)                        
  EVM_RETURN/10000-16                         1.811µ ± 1%   1.785µ ±  1%  -1.46% (p=0.000 n=10)                        
  EVM_RETURN/100000-16                        8.188µ ± 1%   7.789µ ±  1%  -4.87% (p=0.000 n=10)                        
  EVM_RETURN/1000000-16                       71.21µ ± 1%   67.06µ ±  2%  -5.83% (p=0.000 n=10)                        
  SimpleLoop/staticcall-identity-100M-16      141.6m ± 0%   153.8m ± 10%       ~ (p=0.481 n=10)                        
  SimpleLoop/call-identity-100M-16            177.8m ± 0%   179.4m ±  0%  +0.89% (p=0.000 n=10)                        
  SimpleLoop/loop-100M-16                     164.7m ± 0%   167.9m ±  2%  +1.98% (p=0.009 n=10)                        
  SimpleLoop/loop2-100M-16                    263.5m ± 1%   259.7m ±  1%  -1.43% (p=0.019 n=10)                        
  SimpleLoop/loop3-100M-16                    261.5m ± 1%   259.4m ±  1%  -0.82% (p=0.000 n=10)                        
  SimpleLoop/call-nonexist-100M-16            196.5m ± 1%   205.8m ±  8%       ~ (p=0.481 n=10)                        
  SimpleLoop/call-EOA-100M-16                 194.5m ± 0%   192.0m ±  1%  -1.24% (p=0.000 n=10)                        
  SimpleLoop/call-reverting-100M-16           242.6m ± 0%   247.4m ±  1%  +1.96% (p=0.000 n=10)                        
  EVM_SWAP1/10k-16                            51.31µ ± 0%   50.25µ ±  1%  -2.07% (p=0.000 n=10)                        
  geomean                                      6.205m        6.175m        -0.49%                     

Member

@yperbasis yperbasis left a comment


Suggestions

  1. opExtCodeCopy — _ = stack.pop() in a var block: This is valid Go but unusual to read. Consider pulling it out as a standalone statement:

addr := scope.peekAddress()
stack := &scope.Stack
stack.pop() // addr already consumed above
var (
	memOffset  = stack.pop()
	codeOffset = stack.pop()
	length     = stack.pop()
)

Or just drop the var block entirely since it's no longer grouping a clean set of pops. Minor style nit.

  2. PR description: The body is empty. Worth adding a sentence about the motivation (avoid double unique.Make() on gas+execute) and any benchmark delta. The prior callAddrTmp commit showed up to -23.6% on repeated-CALL microbenchmarks — it'd be good to show numbers for the storage key side too.
  3. makeCallVariantGasCallEIP2929 (line 180): This still does accounts.InternAddress(callContext.Stack.Back(1).Bytes20()) — reading from stack position 1, not 0, so peekAddress() can't help. Might be worth a follow-up for a position-aware cache, but that's a separate concern.
  4. No coverage of getCallContext initialization: The new cache fields start zero-valued (cachedKeyOk = false, cachedAddrOk = false), which is correct by default. But getCallContext doesn't explicitly reset them — it relies on put() having already cleared them. This is fine as long as every CallContext goes through put() before being returned to the pool, which it does (checked run() in interpreter.go). Just noting for awareness.

Verdict

The logic is correct. No semantic changes to EVM behavior — purely a performance optimization that avoids redundant intern-table lookups. Once the WIP items are resolved (description, benchmarks, possibly the
style nit), this is ready to merge.

@yperbasis yperbasis added this to the 3.5.0 milestone Apr 14, 2026
Contributor

Copilot AI left a comment


Pull request overview

This PR reduces repeated interning work in the EVM by adding small caches on CallContext for the current top-of-stack storage key and address, and then routing relevant gas/opcode paths through those cached helpers.

Changes:

  • Add CallContext.peekStorageKey() / CallContext.peekAddress() with per-context caching to avoid double-interning across dynamic gas + execution.
  • Replace direct accounts.InternKey/InternAddress calls in several opcode and EIP-2929 gas paths with the cached helpers.
  • Reset cache-valid flags when returning CallContext to the pool.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File Description
execution/vm/operations_acl.go Switch EIP-2929 access-list gas paths to use cached stack-to-interned conversions.
execution/vm/interpreter.go Add cached interned key/address fields + helper methods on CallContext.
execution/vm/instructions.go Use cached conversions in BALANCE / EXTCODE* / SLOAD / SSTORE / SELFDESTRUCT opcode implementations.


@AskAlexSharov AskAlexSharov changed the title [wip] evm: less intern [wip] vm: opcode-scoped intern cache on CallContext to eliminate duplicate unique.Make calls Apr 15, 2026
@AskAlexSharov AskAlexSharov changed the title [wip] vm: opcode-scoped intern cache on CallContext to eliminate duplicate unique.Make calls vm: opcode-scoped intern cache on CallContext to eliminate duplicate unique.Make() Apr 15, 2026
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated no new comments.



Member

@yperbasis yperbasis left a comment


Review: vm: opcode-scoped intern cache on CallContext

Summary

Adds a generation-counter cache (cacheGen + cachedKeyGen/cachedAddrGen) to CallContext so that opcodes like SLOAD, SSTORE, BALANCE, EXTCODE*, and SELFDESTRUCT don't call unique.Make twice per dispatch (once in the gas function, once in execute). The approach is clean — cacheGen increments at the top of the interpreter loop, and peekStorageKey()/peekAddress() only call InternKey/InternAddress when their local generation doesn't match.

Correctness: Verified all modified opcodes (SLOAD, SSTORE, BALANCE, EXTCODESIZE, EXTCODECOPY, EXTCODEHASH, SELFDESTRUCT, SELFDESTRUCT6780). Each consistently reads from stack position 0 in both the gas and execute phases, so the single-slot cache is sufficient. Generation invalidation works correctly: cacheGen starts at 0, gets incremented to 1 before the first opcode, and the pool put() resets all three counters to 0. No window where a stale cache value could be read.

The opExtCodeCopy rewrite from pop-first to peek-then-pop is semantically equivalent — peekAddress() reads the top entry, then the explicit pop() discards it.

Concrete concerns

1. put() doesn't nil the handle fields

func (c *CallContext) put() {
    c.Memory.reset()
    c.Stack.Reset()
    c.cacheGen = 0
    c.cachedKeyGen = 0
    c.cachedAddrGen = 0
    // cachedKey and cachedAddr still hold unique.Handle values
    contextPool.Put(c)
}

cachedKey (unique.Handle[common.Hash]) and cachedAddr (unique.Handle[common.Address]) aren't zeroed. While the generation counters prevent them from being used, the live handles keep their entries pinned in the global canonMap (preventing GC) for as long as the CallContext sits idle in the pool. In practice this is negligible — pool size is bounded by goroutine count — but for hygiene consider adding:

var zeroKey accounts.StorageKey
var zeroAddr accounts.Address
c.cachedKey = zeroKey
c.cachedAddr = zeroAddr

Or assign zero composite literals directly (c.cachedKey = accounts.StorageKey{}); neither form heap-allocates. Non-blocking, just flagging.

2. Small regressions on pure-compute benchmarks

PureArithmetic/mul shows +3.16%, MemoryOps/mstore-mload +1.34%, DelegateCallProxy +1.9%. These opcodes don't touch the cache but pay the cacheGen++ cost every dispatch. The increment itself is one instruction (non-atomic u64 on a local struct), so the regression likely comes from the changed struct layout: inserting ~56 bytes of cache fields between Memory and Stack shifts the 32KB Stack.data array, altering cache-line alignment for the hot loop. The -4.4% geomean makes this an acceptable tradeoff overall, but worth being aware of.

Observations (non-blocking)

  • makeCallVariantGasCallEIP2929 (operations_acl.go:180) still does raw InternAddress(callContext.Stack.Back(1).Bytes20()). Since it reads from position 1 instead of 0, the current single-slot cache can't help. A position-aware cache could address this in a follow-up.

  • uint64 overflow on cacheGen: At 1 gas per opcode and max ~30M gas per block, overflow would require ~6×10¹¹ blocks. Not a concern.

Verdict

The logic is correct, the benchmark data is compelling (SLOAD/SSTORE/ERC20 improvements are significant), and the implementation is minimal. The handle-pinning nit in put() is the only concrete suggestion.

Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +82 to +104
// peekStorageKey returns the top-of-stack value as an interned StorageKey.
// The result is cached for the lifetime of one opcode dispatch (gas phase +
// execute phase share the same cacheGen), so unique.Make is called at most
// once per opcode.
func (ctx *CallContext) peekStorageKey() accounts.StorageKey {
	if ctx.cachedKeyGen == ctx.cacheGen {
		return ctx.cachedKey
	}
	ctx.cachedKey = accounts.InternKey(ctx.Stack.peek().Bytes32())
	ctx.cachedKeyGen = ctx.cacheGen
	return ctx.cachedKey
}

// peekAddress returns the top-of-stack value as an interned Address.
// Cached like peekStorageKey.
func (ctx *CallContext) peekAddress() accounts.Address {
	if ctx.cachedAddrGen == ctx.cacheGen {
		return ctx.cachedAddr
	}
	ctx.cachedAddr = accounts.InternAddress(ctx.Stack.peek().Bytes20())
	ctx.cachedAddrGen = ctx.cacheGen
	return ctx.cachedAddr
}

Copilot AI Apr 15, 2026


peekStorageKey/peekAddress cache validity is keyed only by cacheGen. Within the same opcode dispatch, if the stack top is modified (e.g., pop, swap, dup, or writing to *scope.Stack.peek()), subsequent calls in the same generation may return a stale interned key/address while the comment says it returns the current top-of-stack value. To prevent subtle misuse, either (a) explicitly document that these helpers must be called before any stack mutation in that opcode, or (b) include an additional validity check (e.g., stack index/value fingerprint) so the cache is invalidated when the top-of-stack changes.

Comment on lines +129 to +130
c.cachedKeyGen = 0
c.cachedAddrGen = 0

Copilot AI Apr 15, 2026


put() resets cacheGen, cachedKeyGen, and cachedAddrGen all to 0, which makes the cache look valid (cached*Gen == cacheGen) on a freshly pooled CallContext before the interpreter loop has incremented cacheGen. If peekStorageKey/peekAddress is ever called before the first cacheGen++, it can return stale cachedKey/cachedAddr from a prior use. Consider initializing cachedKeyGen/cachedAddrGen to a sentinel value (e.g. ^uint64(0)) on put(), or initializing cacheGen to 1 in getCallContext, so the first peek is always a miss unless explicitly populated in the current dispatch.

Suggested change:

- c.cachedKeyGen = 0
- c.cachedAddrGen = 0
+ c.cachedKeyGen = ^uint64(0)
+ c.cachedAddrGen = ^uint64(0)

