
index: 3.7x faster posting list construction via direct-indexed ASCII array#1020

Open
clemlesne wants to merge 4 commits into sourcegraph:main from clemlesne:optimize-posting-allocations

Conversation

@clemlesne

Summary

Reduces newSearchableString build time by 3.7x (9.3s → 2.5s on kubernetes) through three targeted changes to the trigram posting list construction hot path, where ~54% of CPU was previously spent on Go runtime memory management.

  • Merge postings and lastOffsets maps into a single map[ngram]*postingList. Pointer-stored values mean the map is only written when a new ngram is first seen (~282K writes) rather than on every trigram occurrence (~169M). Cuts per-trigram map operations from 4 to ~1.
  • Direct-indexed [2M]*postingList array for ASCII trigrams (3 × 7-bit runes = 21-bit index, 16 MB). Eliminates hash/probe overhead for 99%+ of trigrams in source code. Non-ASCII trigrams fall back to the map.
  • Pool postingsBuilder via sync.Pool so sequential/parallel shard builds reuse map, array, and slice allocations across shards.

Also adds an ASCII fast-path (data[0] < utf8.RuneSelf) to skip utf8.DecodeRune call overhead, and pre-sizes maps and posting slices based on shard size.
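
A rough sketch of the hot path these changes describe. The type layout, runesToNGram packing, and method names below are illustrative stand-ins, not the PR's exact code:

    package index

    import "unicode/utf8"

    // Illustrative stand-ins; zoekt's real ngram packing and postingList differ.
    type ngram uint64
    type postingList struct{ deltas []byte }

    func runesToNGram(r0, r1, r2 rune) ngram { // assumed 21-bit-per-rune packing
        return ngram(uint64(r0)<<42 | uint64(r1)<<21 | uint64(r2))
    }

    const asciiBits = 7 // an ASCII rune fits in 7 bits, so a trigram fits in 21 bits

    type postingsBuilder struct {
        ascii    [1 << (3 * asciiBits)]*postingList // 2^21 = ~2M pointers, 16 MB
        nonASCII map[ngram]*postingList             // fallback for the <1% non-ASCII trigrams
    }

    // decodeRune is the ASCII fast path: single-byte runes skip utf8.DecodeRune.
    func decodeRune(data []byte) (rune, int) {
        if data[0] < utf8.RuneSelf {
            return rune(data[0]), 1
        }
        return utf8.DecodeRune(data)
    }

    // lookup returns the posting list for a trigram, creating it on first sight.
    func (pb *postingsBuilder) lookup(r0, r1, r2 rune) *postingList {
        if r0 < utf8.RuneSelf && r1 < utf8.RuneSelf && r2 < utf8.RuneSelf {
            // Direct index: an array load instead of a hash + probe.
            idx := uint32(r0)<<(2*asciiBits) | uint32(r1)<<asciiBits | uint32(r2)
            pl := pb.ascii[idx]
            if pl == nil {
                pl = &postingList{}
                pb.ascii[idx] = pl
            }
            return pl
        }
        ng := runesToNGram(r0, r1, r2)
        pl := pb.nonASCII[ng]
        if pl == nil {
            pl = &postingList{}
            pb.nonASCII[ng] = pl // the only map write for this ngram
        }
        return pl
    }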

Benchmarks

On kubernetes (22.9K files, 169 MB), Apple M1 Max:

Cold path (NewSearchableString)

Metric      Before      After       Change
Time        9.3s        2.5s        -73%
Allocs/op   901K        676K        -25%
B/op        1,358 MB    1,544 MB    +14%

Warm path (pooled reuse across shards)

Metric      Before      After       Change
Time        9.3s        2.3s        -75%
Allocs/op   901K        23K         -97%
B/op        1,358 MB    0.6 MB      -99.96%

Write path (WritePostings)

~155ms → ~130ms (not a target of these changes; the on-disk format is unchanged).

Context

Filed as #1017. seek wraps zoekt as a local CLI for AI coding agents — every search is a blocking tool call, so cold-index speed matters.

Test plan

  • go test ./index/ -race passes (all 50+ tests including golden shard TestBuildv16)
  • go vet and golangci-lint clean (no new issues)
  • Unicode tests pass (TestUnicodeVariableLength, TestAndOrUnicode)
  • Benchmark on a second large repo (e.g., torvalds/linux)
  • Profile with pprof to confirm runtime memory management dropped from ~54% to <20%

🤖 Generated with Claude Code

Three targeted changes to newSearchableString, the hot path where ~54%
of CPU was spent on runtime memory management (memclr, memmove, madvise,
mapassign):

1. Merge postings and lastOffsets maps into a single map[ngram]*postingList
   (sketched after this list). Pointer-stored values mean the map is only
   written when a new ngram is first seen (~282K writes for kubernetes)
   rather than on every trigram occurrence (~169M). This cuts per-trigram
   map operations from 4 to ~1.

2. Pre-size the map using estimateNgrams(shardMaxBytes) and pre-allocate
   each posting list to 1024 bytes, reducing slice growth events and
   eliminating map rehashing.

3. Pool postingsBuilder instances via sync.Pool on the Builder, so
   sequential and parallel shard builds reuse map and slice allocations
   across shards instead of re-creating them.
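
A minimal sketch of changes (1) and (2), assuming the offset deltas are varint-encoded as they are appended; the postingList fields and the addOccurrence helper are illustrative, not the PR's exact code:

    package index

    import "encoding/binary"

    type ngram uint64

    type postingList struct {
        deltas     []byte // varint-encoded offset deltas (simplified)
        lastOffset uint32
    }

    // addOccurrence: one pointer-valued map, written only on the first
    // sighting of an ngram, with the posting slice pre-allocated so later
    // appends rarely grow it.
    func addOccurrence(m map[ngram]*postingList, ng ngram, off uint32) {
        pl := m[ng]
        if pl == nil {
            pl = &postingList{deltas: make([]byte, 0, 1024)} // pre-sized per (2)
            m[ng] = pl // the only map write for this ngram (~282K writes vs ~169M occurrences)
        }
        // Later occurrences mutate through the pointer; no map write, unlike
        // the old postings/lastOffsets scheme that updated both maps per trigram.
        pl.deltas = binary.AppendUvarint(pl.deltas, uint64(off-pl.lastOffset))
        pl.lastOffset = off
    }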

Benchmarked on kubernetes (22.9K files, 169 MB, Apple M1 Max):

  Cold path (NewSearchableString):
    Time:   9.3s → 5.3s   (-43%)
    Allocs: 901K → 677K   (-25%)
    B/op:   1358 → 1536 MB (+13%, from larger pre-alloc)

  Warm path (pooled reuse across shards):
    Time:   9.3s → 5.1s   (-45%)
    Allocs: 901K → 23K    (-97%)
    B/op:   1358 → 0.6 MB (-99.96%)

  WritePostings: ~155ms, unchanged.

Replace map lookups with a direct-indexed [2M]*postingList array for
ASCII trigrams (3 × 7-bit runes = 21-bit index, 16 MB of pointers).
Since 99%+ of trigrams in source code are pure ASCII, this eliminates
nearly all hash and probe overhead from the hot loop. Non-ASCII
trigrams still fall back to the map.

Also inline the ASCII check (data[0] < utf8.RuneSelf) to avoid
utf8.DecodeRune function call overhead on the 95-99% of bytes that
are ASCII.

In writePostings, collect ngrams from both the array and map into a
single sorted slice for writing. The on-disk format is unchanged.
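
A sketch of that collection step, with illustrative types; the ngramPosting field names match the review excerpt further down, and the 21-bit unpacking assumes the same packing as the earlier sketch:

    package index

    import "sort"

    type ngram uint64
    type postingList struct{ deltas []byte }

    type ngramPosting struct {
        ng ngram
        pl *postingList
    }

    // collectNgrams gathers non-nil entries from the direct-indexed ASCII
    // array and the non-ASCII fallback map into one slice, then sorts by
    // ngram so the on-disk write order (and format) is unchanged.
    func collectNgrams(ascii []*postingList, nonASCII map[ngram]*postingList) []ngramPosting {
        all := make([]ngramPosting, 0, len(nonASCII))
        for idx, pl := range ascii {
            if pl == nil {
                continue
            }
            // Unpack the 21-bit ASCII index (3 x 7-bit runes) into an ngram key.
            ng := ngram(uint64(idx>>14)<<42 | uint64((idx>>7)&0x7f)<<21 | uint64(idx&0x7f))
            all = append(all, ngramPosting{ng, pl})
        }
        for ng, pl := range nonASCII {
            all = append(all, ngramPosting{ng, pl})
        }
        sort.Slice(all, func(i, j int) bool { return all[i].ng < all[j].ng })
        return all
    }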

Benchmarked on kubernetes (22.9K files, 169 MB, Apple M1 Max):

  Cold path (NewSearchableString):
    Before (with map opt):  5.3s, 677K allocs, 1536 MB
    After:                  2.5s, 676K allocs, 1544 MB
    Speedup:                2.1x (3.7x vs original baseline)

  Warm path (pooled reuse):
    Before (with map opt):  5.1s, 23K allocs, 0.6 MB
    After:                  2.3s, 23K allocs, 0.6 MB
    Speedup:                2.2x (4.0x vs original baseline)

  WritePostings: ~130ms, unchanged.

The non-ASCII map now holds only ~6K entries (vs ~282K before), since
the vast majority of trigrams are served by the direct array.

Update three comments that still referenced map-only storage after the
direct-indexed ASCII array was added: postingList doc, estimateNgrams
doc, and reset() doc.
@keegancsmith
Member

Haven't looked at code yet, just responding to the description

  • Direct-indexed [2M]*postingList array for ASCII trigrams (3 × 7-bit runes = 21-bit index, 16 MB). Eliminates hash/probe overhead for 99%+ of trigrams in source code. Non-ASCII trigrams fall back to the map.

Clever! Can you quantify how much of a difference this optimization makes? I would have thought we are dominated by what happens in *postingList rather than growing the map / lookups in the map.

  • Pool postingsBuilder via sync.Pool so sequential/parallel shard builds reuse map, array, and slice allocations across shards.

I'd also be interested in how effective this is. I was under the impression the need for sync.Pool was pretty rare these days after improvements to the golang runtime. However, I did just grep the go stdlib and I see it is still pretty heavily used.


@keegancsmith keegancsmith left a comment


After reading all the code, I'd be interested in the impact of this, but not with a normal Go benchmark; rather, by calling something like zoekt-git-index on a repo and using hyperfine to measure the impact.

Comment on lines +278 to +282

// postingsPool reuses postingsBuilder instances across shard builds,
// retaining their map and slice allocations to avoid repeated
// memclr/madvise overhead.
postingsPool sync.Pool
Member


I'm kinda surprised this pool has a hit rate. Under what conditions did you test this? Did you have smaller shard size?

The kubernetes repo is a handful of shards right? With concurrency the opportunity to re-use an older postingsPool before garbage collection has had a chance to free stuff from sync.Pool seems low?

Maybe the GC just doesn't free as often as I would expect from sync.Pool.

If you are just testing this with all the normal defaults (including GOGC being on, etc.) and it's working, cool, let's use this optimization. If you are tweaking the defaults somehow, then we could probably do better by maintaining our own free list instead of relying on sync.Pool.

Author

@clemlesne clemlesne Mar 20, 2026


Default Parallelism is 4, so up to 4 shards build concurrently. sync.Pool has per-P caching — each goroutine gets/puts on its own P cache without contention. After a goroutine finishes and Puts, the next goroutine scheduled on that P Gets the cached builder. Worst case (GC clears pool between shards) = fresh allocation, identical to no pool.

Tested with normal defaults (GOGC=100, no tuning).
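
For reference, a hedged sketch of the Get/Put wrappers being discussed; getPostingsBuilder, newPostingsBuilder, postingsPool, reset and b.opts.ShardMax all appear elsewhere in this PR, but the bodies below are illustrative:

    // Pool miss (first use, or the GC emptied the pool) costs one fresh
    // allocation, i.e. the same as not pooling at all.
    func (b *Builder) getPostingsBuilder() *postingsBuilder {
        if pb, ok := b.postingsPool.Get().(*postingsBuilder); ok {
            pb.reset() // keep map/array/slice capacity, drop the contents
            return pb
        }
        return newPostingsBuilder(b.opts.ShardMax)
    }

    func (b *Builder) putPostingsBuilder(pb *postingsBuilder) {
        // Lands in the current P's cache; the next shard scheduled on this P
        // reuses it without contention.
        b.postingsPool.Put(pb)
    }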

content := b.getPostingsBuilder()
name := b.getPostingsBuilder()
shardBuilder := newShardBuilderWithPostings(content, name)
if err := shardBuilder.setRepository(&desc); err != nil {
Member


An error here is super rare; is it worth complicating matters by doing a put after this?

Author


Agreed, removed in ad39bc9.

index/builder.go Outdated
return newPostingsBuilder(b.opts.ShardMax)
}

func (b *Builder) putPostingsBuilder(pb *postingsBuilder) {
Member


maybe make this take in a shardBuilder instead and have it put both contentPostings and namePostings. And finally have it set those fields to nil, so if we misuse this, something crashes in a more obvious way.

Author


Done — returnPostingsBuilders(*ShardBuilder) puts both and nils the fields. ad39bc9.

Comment on lines +376 to +377
newPostingsBuilder(defaultShardMax),
newPostingsBuilder(defaultShardMax),
Member


In all the places we call newShardBuilder, we should be able to pass in the actual shard max?

Author


Done — newShardBuilder(shardMax int), 0 means default. merge.go and tests pass 0. Builder path already had it via the pool. ad39bc9.

index/write.go Outdated
all = append(all, ngramPosting{k, pl})
}
}
sort.Slice(all, func(i, j int) bool { return all[i].ng < all[j].ng })
Member


minor: slices.SortFunc now exists and generally reads better / is faster.

Author


Switched in ad39bc9.
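
Presumably the replacement looks roughly like this, assuming the ngramPosting element type from the excerpt above (cmp.Compare is an assumption; the thread only names slices.SortFunc):

    // Requires "slices" and "cmp" from the standard library (Go 1.21+).
    slices.SortFunc(all, func(a, b ngramPosting) int { return cmp.Compare(a.ng, b.ng) })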

// git clone --depth=1 https://github.com/kubernetes/kubernetes /tmp/k8s
// ZOEKT_BENCH_REPO=/tmp/k8s go test ./index/ -bench=BenchmarkPostings -benchmem -count=5 -timeout=600s

func requireBenchRepo(b *testing.B) string {
Member


FYI we have some infra to download copies of code for these sorts of tests in internal/e2e/e2e_rank_test.go. Not sure if it's worth using here. But if there is a way to share ideas / infra with that, that would be great.

Author


Good pointer, thanks. Will look into aligning with e2e_rank_test.go infra in a follow-up.

// BenchmarkPostings_Reuse measures the warm path: building postings with a
// reset (pooled) postingsBuilder that retains its map and slice allocations
// from a previous shard build.
func BenchmarkPostings_Reuse(b *testing.B) {
Member


I don't think this accurately measures the actual impact of reuse, since we won't be calling newPostingsBuilder in a hot loop; it will be called once per shard.

Author


You're right — the Reuse benchmark measures allocation reduction potential, not realistic shard-building throughput. The real end-to-end impact measured via hyperfine on kubernetes:

Benchmark 1 (before): 11.373 s ± 0.174 s
Benchmark 2 (after):   7.866 s ± 0.378 s  (1.45x faster, no ctags)

Benchmark 1 (before): 20.478 s ± 0.465 s
Benchmark 2 (after):  16.588 s ± 0.119 s  (1.23x faster, with ctags)

- Replace putPostingsBuilder with returnPostingsBuilders(*ShardBuilder)
  that returns both content and name builders to the pool and nils the
  fields, so any subsequent misuse panics obviously (sketched after this
  list).

- Drop error-path pool returns in newShardBuilder: setRepository errors
  are extremely rare (invalid templates or >64 branches), not worth the
  code complexity.

- Thread shardMax through newShardBuilder(shardMax int). Callers without
  a shard size context (merge, tests, public API) pass 0 for the default
  (100 MB). Builder.newShardBuilder passes b.opts.ShardMax via the pool.

- Switch sort.Slice to slices.SortFunc in writePostings for type safety
  and to avoid interface boxing overhead.
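
A hedged sketch of the two shapes described above; contentPostings, namePostings, newShardBuilderWithPostings, newPostingsBuilder and defaultShardMax appear elsewhere in this review, while the bodies are illustrative:

    // returnPostingsBuilders pools both builders and nils the fields so any
    // later use of this ShardBuilder fails loudly rather than silently.
    func (b *Builder) returnPostingsBuilders(sb *ShardBuilder) {
        b.postingsPool.Put(sb.contentPostings)
        b.postingsPool.Put(sb.namePostings)
        sb.contentPostings = nil
        sb.namePostings = nil
    }

    // newShardBuilder treats 0 as "use the default shard max" (~100 MB), so
    // merge.go, tests and the public API need no shard-size plumbing.
    func newShardBuilder(shardMax int) *ShardBuilder {
        if shardMax <= 0 {
            shardMax = defaultShardMax
        }
        return newShardBuilderWithPostings(
            newPostingsBuilder(shardMax),
            newPostingsBuilder(shardMax),
        )
    }
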
@clemlesne
Author

Per-optimization breakdown (microbenchmark on kubernetes, NewSearchableString):

Optimization                               Time    Speedup
Baseline                                   9.3s    -
+ Pointer-stored map values + pre-sizing   5.3s    1.8x
+ Direct ASCII array                       2.5s    3.7x cumulative

The ASCII array accounts for the majority of the remaining speedup (2.1x on top of the map optimization). The map lookup was indeed the dominant cost after fixing the allocation pattern: 169M hash+probe operations at ~20ns each comes to roughly 3.4s, in line with the ~2.8s removed going from 5.3s to 2.5s.

sync.Pool benefit is mostly allocation reduction (901K → 23K allocs on reuse), not wall-clock. For kubernetes (3 shards) the wall-clock gain is marginal; it matters more for repos that produce many small shards.

End-to-end via hyperfine with zoekt-git-index on kubernetes:

Without ctags: 11.4s → 7.9s (1.45x)
With ctags:    20.5s → 16.6s (1.23x)

Shard output is byte-identical (only metadata timestamp differs).
