
vm: opcode-scoped intern cache on CallContext to eliminate duplicate unique.Make()#20552

Open
AskAlexSharov wants to merge 9 commits into main from alex/intern_35

Conversation

Collaborator

@AskAlexSharov AskAlexSharov commented Apr 14, 2026

Problem

Every EVM opcode that reads or writes storage (SLOAD, SSTORE,
BALANCE, EXTCODESIZE, etc.) calls accounts.InternKey or
accounts.InternAddress to convert a raw stack value into a
unique.Handle. For opcodes like SLOAD and SSTORE, this conversion
happens twice per dispatch: once in the gas function (e.g.
gasSLoadEIP2929) and once in the execute function (e.g. opSload).
These two phases are separated by the dispatch table and cannot
directly share a local variable.

unique.Make is not free: it traverses a global lock-free hash-trie
(canonMap) with several atomic loads per call. Under parallel
execution with many goroutines, cache misses on the global map become
a measurable bottleneck. Profiling BenchmarkSLOADWarm shows
canonMap.Load consuming ~16% of execution time.

Solution

Add a generation-counter cache to CallContext: cacheGen is incremented once at the top of the interpreter loop, and the peekStorageKey()/peekAddress() helpers re-intern only when their stored generation differs from cacheGen:

cacheGen      uint64              // incremented once per opcode dispatch
cachedKeyGen  uint64              // generation at which cachedKey was populated
cachedAddrGen uint64              // generation at which cachedAddr was populated
cachedKey     accounts.StorageKey // valid only while cachedKeyGen == cacheGen
cachedAddr    accounts.Address    // valid only while cachedAddrGen == cacheGen

                                                                                                                      
goos: linux
  goarch: amd64                                                                                                        
  pkg: github.com/erigontech/erigon/execution/vm/benchmark                                                             
  cpu: AMD EPYC 4344P 8-Core Processor                                                                                 
                                         │    main     │           alex/intern2_35            │                        
                                         │   sec/op    │   sec/op     vs base                 │                        
  NestedStaticCalls/depth-2-16                666.1m ±  1%   665.4m ± 1%        ~ (p=0.631 n=10)                       
  NestedStaticCalls/depth-4-16                725.5m ±  1%   726.3m ± 1%        ~ (p=0.912 n=10)                       
  NestedStaticCalls/depth-8-16                906.0m ±  0%   895.6m ± 0%   -1.16% (p=0.000 n=10)                       
  NestedStaticCalls/depth-16-16               916.8m ±  0%   901.5m ± 0%   -1.67% (p=0.000 n=10)                       
  DelegateCallProxy/1-layers-16               434.9m ±  1%   443.3m ± 2%   +1.92% (p=0.000 n=10)                       
  DelegateCallProxy/2-layers-16               468.5m ±  2%   476.7m ± 1%   +1.74% (p=0.015 n=10)                       
  DelegateCallProxy/4-layers-16               506.9m ±  1%   506.4m ± 2%        ~ (p=0.393 n=10)                       
  CallWithValue/no-value-16                   669.7m ±  1%   681.9m ± 0%   +1.82% (p=0.000 n=10)                       
  CallWithValue/with-value-16                  32.53m ±  1%    33.01m ± 1%   +1.46% (p=0.002 n=10)                     
  DeFiSwapChain/swap/100M-16                  214.9m ±  1%   205.6m ± 1%   -4.32% (p=0.000 n=10)                       
  PureArithmetic/add/1M-16                    1.641m ±  0%   1.648m ± 0%   +0.43% (p=0.035 n=10)                       
  PureArithmetic/add/10M-16                   16.37m ±  0%   16.47m ± 1%   +0.56% (p=0.001 n=10)                       
  PureArithmetic/add/100M-16                  164.6m ±  0%   165.4m ± 1%   +0.45% (p=0.004 n=10)                       
  PureArithmetic/mul/100M-16                  157.8m ±  0%   162.8m ± 0%   +3.16% (p=0.000 n=10)                       
  StackOps/dup-swap/100M-16                   180.9m ± 12%   185.0m ± 7%        ~ (p=0.796 n=10)                       
  MemoryOps/mstore-mload/100M-16              222.4m ±  0%   225.3m ± 1%   +1.34% (p=0.000 n=10)                       
  MemoryOps/mstore-growing/10M-16             3.609m ±  3%   3.604m ± 1%        ~ (p=0.123 n=10)                       
  Keccak256/32B/100M-16                       564.6m ±  0%   557.8m ± 0%   -1.20% (p=0.000 n=10)                       
  Keccak256/256B/100M-16                      604.4m ±  0%   604.2m ± 0%        ~ (p=0.529 n=10)                       
  Keccak256/4KB/100M-16                       812.2m ±  0%   808.8m ± 0%   -0.42% (p=0.002 n=10)                       
  MixedCompute/mixed/100M-16                  187.4m ±  5%   188.8m ± 4%        ~ (p=0.089 n=10)                       
  SLOADCold/10slots-16                        3.479µ ±  2%   3.210µ ± 2%   -7.72% (p=0.000 n=10)                       
  SLOADCold/50slots-16                        12.50µ ±  2%   10.96µ ± 1%  -12.31% (p=0.000 n=10)                       
  SLOADCold/100slots-16                       24.13µ ±  1%   21.55µ ± 1%  -10.68% (p=0.000 n=10)                       
  SLOADCold/500slots-16                       123.0µ ±  3%   110.7µ ± 1%   -9.97% (p=0.000 n=10)                       
  SLOADWarm/10slots-16                        150.6m ±  1%   125.7m ± 1%  -16.53% (p=0.000 n=10)                       
  SLOADWarm/50slots-16                        156.4m ±  1%   128.8m ± 2%  -17.65% (p=0.000 n=10)                       
  SLOADWarm/100slots-16                       159.7m ±  2%   133.9m ± 1%  -16.16% (p=0.000 n=10)                       
  SLOADWarm/500slots-16                       168.9m ±  1%   145.0m ± 1%  -14.12% (p=0.000 n=10)                       
  SSTORE/zero-to-nonzero-16                   148.3µ ±  3%   138.7µ ± 2%   -6.50% (p=0.000 n=10)                       
  SSTORE/nonzero-to-nonzero-16               153.8µ ±  4%   138.7µ ± 2%   -9.81% (p=0.000 n=10)                        
  SSTORE/nonzero-to-zero-16                   152.3µ ±  2%   142.1µ ± 2%   -6.71% (p=0.000 n=10)                       
  TransientStorage/10slots-16                 56.92m ±  1%   57.11m ± 1%        ~ (p=0.123 n=10)                       
  TransientStorage/100slots-16                59.49m ±  1%   59.11m ± 1%        ~ (p=0.218 n=10)                       
  TransientStorage/500slots-16                67.86m ±  1%   68.18m ± 0%        ~ (p=0.436 n=10)                       
  StorageDiversity/100slots-16                24.51µ ±  2%   22.09µ ± 1%   -9.85% (p=0.000 n=10)                       
  StorageDiversity/1000slots-16               258.1µ ±  1%   230.2µ ± 1%  -10.82% (p=0.000 n=10)                       
  ERC20Transfer/transfer/100M-16              147.2m ±  2%   139.2m ± 1%   -5.42% (p=0.000 n=10)                       
  ERC20TransferFrom/transferFrom/100M-16      195.2m ±  2%   184.6m ± 3%   -5.43% (p=0.000 n=10)                       
  ERC20BalanceOf/balanceOf/100M-16           147.1m ±  1%   127.6m ± 3%  -13.24% (p=0.000 n=10)                        
  ERC20BatchTransfers/batch-5-16              16.81µ ±  2%   16.35µ ± 2%   -2.75% (p=0.000 n=10)                       
  ERC20BatchTransfers/batch-10-16             32.14µ ±  3%   30.38µ ± 1%   -5.46% (p=0.000 n=10)                       
  ERC20BatchTransfers/batch-50-16             158.4µ ±  5%   147.0µ ± 1%   -7.20% (p=0.000 n=10)                       
  geomean                                      17.25m         16.49m        -4.40%                                     
                                                                                                                       
  pkg: github.com/erigontech/erigon/execution/vm/runtime                                                               
                                         │    main     │           alex/intern2_35            │
                                         │   sec/op    │   sec/op     vs base                 │                        
  EVM_CREATE_500-16                           9.254m ± 1%   8.794m ±  1%  -4.98% (p=0.000 n=10)
  EVM_CREATE2_500-16                          58.11m ± 0%   57.64m ±  0%  -0.82% (p=0.000 n=10)                        
  EVM_CREATE_1200-16                          14.74m ± 2%   14.63m ±  2%       ~ (p=0.143 n=10)                        
  EVM_CREATE2_1200-16                         50.59m ± 0%   49.86m ±  0%  -1.44% (p=0.000 n=10)                        
  EVM_RETURN/1000-16                          990.0n ± 0%   995.2n ±  0%  +0.54% (p=0.007 n=10)                        
  EVM_RETURN/10000-16                         1.811µ ± 1%   1.785µ ±  1%  -1.46% (p=0.000 n=10)                        
  EVM_RETURN/100000-16                        8.188µ ± 1%   7.789µ ±  1%  -4.87% (p=0.000 n=10)                        
  EVM_RETURN/1000000-16                       71.21µ ± 1%   67.06µ ±  2%  -5.83% (p=0.000 n=10)                        
  SimpleLoop/staticcall-identity-100M-16      141.6m ± 0%   153.8m ± 10%       ~ (p=0.481 n=10)                        
  SimpleLoop/call-identity-100M-16            177.8m ± 0%   179.4m ±  0%  +0.89% (p=0.000 n=10)                        
  SimpleLoop/loop-100M-16                     164.7m ± 0%   167.9m ±  2%  +1.98% (p=0.009 n=10)                        
  SimpleLoop/loop2-100M-16                    263.5m ± 1%   259.7m ±  1%  -1.43% (p=0.019 n=10)                        
  SimpleLoop/loop3-100M-16                    261.5m ± 1%   259.4m ±  1%  -0.82% (p=0.000 n=10)                        
  SimpleLoop/call-nonexist-100M-16            196.5m ± 1%   205.8m ±  8%       ~ (p=0.481 n=10)                        
  SimpleLoop/call-EOA-100M-16                 194.5m ± 0%   192.0m ±  1%  -1.24% (p=0.000 n=10)                        
  SimpleLoop/call-reverting-100M-16           242.6m ± 0%   247.4m ±  1%  +1.96% (p=0.000 n=10)                        
  EVM_SWAP1/10k-16                            51.31µ ± 0%   50.25µ ±  1%  -2.07% (p=0.000 n=10)                        
  geomean                                      6.205m        6.175m        -0.49%                     

Member

@yperbasis yperbasis left a comment


Suggestions

  1. opExtCodeCopy — _ = stack.pop() in a var block: This is valid Go but unusual to read. Consider pulling it out as a standalone statement:

addr := scope.peekAddress()
stack := &scope.Stack
stack.pop() // addr already consumed above
var (
	memOffset  = stack.pop()
	codeOffset = stack.pop()
	length     = stack.pop()
)

Or just drop the var block entirely since it's no longer grouping a clean set of pops. Minor style nit.

  2. PR description: The body is empty. Worth adding a sentence about the motivation (avoid double unique.Make() on gas+execute) and any benchmark delta. The prior callAddrTmp commit showed up to -23.6% on repeated-CALL microbenchmarks — it'd be good to show numbers for the storage key side too.
  3. makeCallVariantGasCallEIP2929 (line 180): This still does accounts.InternAddress(callContext.Stack.Back(1).Bytes20()) — reading from stack position 1, not 0, so peekAddress() can't help. Might be worth a follow-up for a position-aware cache, but that's a separate concern.
  4. No coverage of getCallContext initialization: The new cache fields start zero-valued (cachedKeyOk = false, cachedAddrOk = false), which is correct by default. But getCallContext doesn't explicitly reset them — it relies on put() having already cleared them. This is fine as long as every CallContext goes through put() before being returned to the pool, which it does (checked run() in interpreter.go). Just noting for awareness.

Verdict

The logic is correct. No semantic changes to EVM behavior — purely a performance optimization that avoids redundant intern-table lookups. Once the WIP items are resolved (description, benchmarks, possibly the
style nit), this is ready to merge.

@yperbasis yperbasis added this to the 3.5.0 milestone Apr 14, 2026
Contributor

Copilot AI left a comment


Pull request overview

This PR reduces repeated interning work in the EVM by adding small caches on CallContext for the current top-of-stack storage key and address, and then routing relevant gas/opcode paths through those cached helpers.

Changes:

  • Add CallContext.peekStorageKey() / CallContext.peekAddress() with per-context caching to avoid double-interning across dynamic gas + execution.
  • Replace direct accounts.InternKey/InternAddress calls in several opcode and EIP-2929 gas paths with the cached helpers.
  • Reset cache-valid flags when returning CallContext to the pool.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File Description
execution/vm/operations_acl.go Switch EIP-2929 access-list gas paths to use cached stack-to-interned conversions.
execution/vm/interpreter.go Add cached interned key/address fields + helper methods on CallContext.
execution/vm/instructions.go Use cached conversions in BALANCE / EXTCODE* / SLOAD / SSTORE / SELFDESTRUCT opcode implementations.


@AskAlexSharov AskAlexSharov changed the title [wip] evm: less intern [wip] vm: opcode-scoped intern cache on CallContext to eliminate duplicate unique.Make calls Apr 15, 2026
@AskAlexSharov AskAlexSharov changed the title [wip] vm: opcode-scoped intern cache on CallContext to eliminate duplicate unique.Make calls vm: opcode-scoped intern cache on CallContext to eliminate duplicate unique.Make() Apr 15, 2026
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated no new comments.



Member

@yperbasis yperbasis left a comment


Review: vm: opcode-scoped intern cache on CallContext

Summary

Adds a generation-counter cache (cacheGen + cachedKeyGen/cachedAddrGen) to CallContext so that opcodes like SLOAD, SSTORE, BALANCE, EXTCODE*, and SELFDESTRUCT don't call unique.Make twice per dispatch (once in the gas function, once in execute). The approach is clean — cacheGen increments at the top of the interpreter loop, and peekStorageKey()/peekAddress() only call InternKey/InternAddress when their local generation doesn't match.

Correctness: Verified all modified opcodes (SLOAD, SSTORE, BALANCE, EXTCODESIZE, EXTCODECOPY, EXTCODEHASH, SELFDESTRUCT, SELFDESTRUCT6780). Each consistently reads from stack position 0 in both the gas and execute phases, so the single-slot cache is sufficient. Generation invalidation works correctly: cacheGen starts at 0, gets incremented to 1 before the first opcode, and the pool put() resets all three counters to 0. No window where a stale cache value could be read.

The opExtCodeCopy rewrite from pop-first to peek-then-pop is semantically equivalent — peekAddress() reads the top entry, then the explicit pop() discards it.

Concrete concerns

1. put() doesn't nil the handle fields

func (c *CallContext) put() {
    c.Memory.reset()
    c.Stack.Reset()
    c.cacheGen = 0
    c.cachedKeyGen = 0
    c.cachedAddrGen = 0
    // cachedKey and cachedAddr still hold unique.Handle values
    contextPool.Put(c)
}

cachedKey (unique.Handle[common.Hash]) and cachedAddr (unique.Handle[common.Address]) aren't zeroed. While the generation counters prevent them from being used, the live handles keep their entries pinned in the global canonMap (preventing GC) for as long as the CallContext sits idle in the pool. In practice this is negligible — pool size is bounded by goroutine count — but for hygiene consider adding:

var zeroKey accounts.StorageKey
var zeroAddr accounts.Address
c.cachedKey = zeroKey
c.cachedAddr = zeroAddr

Or assign zero composite literals directly (c.cachedKey = accounts.StorageKey{}); neither form heap-allocates. Non-blocking, just flagging.

2. Small regressions on pure-compute benchmarks

PureArithmetic/mul shows +3.16%, MemoryOps/mstore-mload +1.34%, DelegateCallProxy +1.9%. These opcodes don't touch the cache but pay the cacheGen++ cost every dispatch. The increment itself is one instruction (non-atomic u64 on a local struct), so the regression likely comes from the changed struct layout: inserting ~56 bytes of cache fields between Memory and Stack shifts the 32KB Stack.data array, altering cache-line alignment for the hot loop. The -4.4% geomean makes this an acceptable tradeoff overall, but worth being aware of.

Observations (non-blocking)

  • makeCallVariantGasCallEIP2929 (operations_acl.go:180) still does raw InternAddress(callContext.Stack.Back(1).Bytes20()). Since it reads from position 1 instead of 0, the current single-slot cache can't help. A position-aware cache could address this in a follow-up.

  • uint64 overflow on cacheGen: At 1 gas per opcode and max ~30M gas per block, overflow would require ~6×10¹¹ blocks. Not a concern.

Verdict

The logic is correct, the benchmark data is compelling (SLOAD/SSTORE/ERC20 improvements are significant), and the implementation is minimal. The handle-pinning nit in put() is the only concrete suggestion.

Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +82 to +104
// peekStorageKey returns the top-of-stack value as an interned StorageKey.
// The result is cached for the lifetime of one opcode dispatch (gas phase +
// execute phase share the same cacheGen), so unique.Make is called at most
// once per opcode.
func (ctx *CallContext) peekStorageKey() accounts.StorageKey {
	if ctx.cachedKeyGen == ctx.cacheGen {
		return ctx.cachedKey
	}
	ctx.cachedKey = accounts.InternKey(ctx.Stack.peek().Bytes32())
	ctx.cachedKeyGen = ctx.cacheGen
	return ctx.cachedKey
}

// peekAddress returns the top-of-stack value as an interned Address.
// Cached like peekStorageKey.
func (ctx *CallContext) peekAddress() accounts.Address {
	if ctx.cachedAddrGen == ctx.cacheGen {
		return ctx.cachedAddr
	}
	ctx.cachedAddr = accounts.InternAddress(ctx.Stack.peek().Bytes20())
	ctx.cachedAddrGen = ctx.cacheGen
	return ctx.cachedAddr
}

Copilot AI Apr 15, 2026


peekStorageKey/peekAddress cache validity is keyed only by cacheGen. Within the same opcode dispatch, if the stack top is modified (e.g., pop, swap, dup, or writing to *scope.Stack.peek()), subsequent calls in the same generation may return a stale interned key/address while the comment says it returns the current top-of-stack value. To prevent subtle misuse, either (a) explicitly document that these helpers must be called before any stack mutation in that opcode, or (b) include an additional validity check (e.g., stack index/value fingerprint) so the cache is invalidated when the top-of-stack changes.

Comment on lines +129 to +130
c.cachedKeyGen = 0
c.cachedAddrGen = 0

Copilot AI Apr 15, 2026


put() resets cacheGen, cachedKeyGen, and cachedAddrGen all to 0, which makes the cache look valid (cached*Gen == cacheGen) on a freshly pooled CallContext before the interpreter loop has incremented cacheGen. If peekStorageKey/peekAddress is ever called before the first cacheGen++, it can return stale cachedKey/cachedAddr from a prior use. Consider initializing cachedKeyGen/cachedAddrGen to a sentinel value (e.g. ^uint64(0)) on put(), or initializing cacheGen to 1 in getCallContext, so the first peek is always a miss unless explicitly populated in the current dispatch.

Suggested change:

- c.cachedKeyGen = 0
- c.cachedAddrGen = 0
+ c.cachedKeyGen = ^uint64(0)
+ c.cachedAddrGen = ^uint64(0)

