
gitindex: replace go-git blob reading with pipelined git cat-file --batch #1021

Open
clemlesne wants to merge 2 commits into sourcegraph:main from clemlesne:catfile-batch

Conversation


@clemlesne clemlesne commented Mar 20, 2026

Summary

  • Replace serial go-git BlobObject calls in indexGitRepo with a single pipelined git cat-file --batch --buffer subprocess
  • A writer goroutine feeds all blob SHAs to stdin while the main goroutine reads responses from stdout, forming a concurrent pipeline
  • --buffer switches git from per-object write_or_die flush to libc stdio buffering, reducing syscalls
  • Submodule blobs fall back to the existing go-git createDocument path
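
The --batch protocol this pipeline drives can be observed by hand in a throwaway repository (standard git commands, nothing from this PR): each request line is an object name, each response is a `<oid> <type> <size>` header followed by the raw content and a newline.

```shell
repo=$(mktemp -d)
git init -q "$repo"
# Store a blob directly, then ask cat-file for it in batch mode.
oid=$(printf 'hello\n' | git -C "$repo" hash-object -w --stdin)
printf '%s\n' "$oid" | git -C "$repo" cat-file --batch --buffer
# → ce013625030ba8dba906f756967f9e9ca394464a blob 6
# → hello
```

Unknown object names produce a `<name> missing` line instead of a header, which is what the Missing flag below maps onto.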

Benchmarks

On kubernetes (29,188 files, 261 MB content), Apple M1 Max, median of 5 runs:

         go-git (before)   cat-file pipelined (after)   Change
Time     3.05s             0.62s                        -80% (4.9x)
Allocs   685K              58K                          -92%
Memory   670 MB            277 MB                       -59%

Reproducing:

git clone --depth=1 https://github.com/kubernetes/kubernetes /tmp/k8s
ZOEKT_BENCH_REPO=/tmp/k8s go test ./gitindex/ -bench='BenchmarkBlobRead_(GoGit|CatfilePipelined)/files=29188' -benchmem -count=5 -timeout=600s

Context

Filed as #1016. seek wraps zoekt as a local CLI for AI coding agents — every search is a blocking tool call, so cold-index speed matters.

Prior art: PR #424 benchmarked git cat-file --batch at 2.68x faster than go-git. PR #940 states "eventually we want to move away from go-git towards a vanilla direct git usage."

Design

The pipelined reader (readBlobsPipelined in gitindex/catfile.go):

  1. Starts git cat-file --batch --buffer with the repo as working directory
  2. Writer goroutine feeds all blob SHAs via a 64KB bufio.Writer, then closes stdin to trigger git's final fflush(stdout)
  3. Main goroutine reads responses in order via a 512KB bufio.Reader, parsing the <oid> <type> <size>\n<content>\n protocol
  4. On early exit (read error), drains stdout so git can flush and exit, then waits for the writer goroutine and cmd.Wait()
  5. Missing objects are marked via blobResult.Missing (matches the existing ErrObjectNotFound / skippedLargeDoc behavior)

Integration in indexGitRepo (gitindex/index.go):

  1. Separates main-repo keys (SubRepoPath == "") from submodule keys
  2. Collects blob SHAs in document-sorted order, calls readBlobsPipelined once
  3. Builds index.Document from pipelined results for main-repo blobs
  4. Falls back to go-git createDocument for submodule blobs

Per-blob allocations minimized to 1 (the content []byte itself) via stack-allocated hex buffers, ReadBytes for headers, and LastIndexByte for size parsing.

Test plan

  • go test ./gitindex/ -race passes (all existing tests + 2 new)
  • go vet ./gitindex/ clean
  • golangci-lint run ./gitindex/ — no new issues
  • New unit tests: TestReadBlobsPipelined (normal, empty, binary, large blobs), TestReadBlobsPipelined_WithMissing
  • New benchmarks: BenchmarkBlobRead_GoGit (baseline), BenchmarkBlobRead_CatfilePipelined (production path)
  • Benchmark on a second large repo (e.g. torvalds/linux)

🤖 Generated with Claude Code

…atch

Replace the serial go-git BlobObject calls in indexGitRepo with a single
pipelined "git cat-file --batch --buffer" subprocess. A writer goroutine
feeds all blob SHAs to stdin while the main goroutine reads responses
from stdout, forming a concurrent pipeline that eliminates per-object
packfile seek overhead and leverages git's internal delta base cache.

Submodule blobs fall back to the existing go-git createDocument path.

Benchmarked on kubernetes (29,188 files, 261 MB), Apple M1 Max, 5 runs:

  go-git BlobObject (before):
    Time:   2.94s     Allocs: 685K    Memory: 691 MB

  cat-file pipelined (after):
    Time:   0.60s     Allocs: 58K     Memory: 276 MB

  Speedup: 4.9x time, 12x fewer allocs, 2.5x less memory
Member

@keegancsmith keegancsmith left a comment


Awesome stuff. See inline, but the reason I am requesting changes is I want to avoid reading into memory files that we decide to not index due to being too large. I believe you should be able to shape the code written here into something which allows that.

Comment on lines +588 to +595
// Build the ordered list of (name, key) pairs and collect blob SHAs for
// the main repo so we can read them all at once via git cat-file --batch.
type indexEntry struct {
    key       fileKey
    blobIndex int // index into blobResults, or -1 for submodule blobs
}
entries := make([]indexEntry, 0, totalFiles)
mainRepoIDs := make([]plumbing.Hash, 0, totalFiles)
Member


I think it makes sense to rather do two loops. One over mainRepoIDs and instead have another list for submoduleIDs. The actual code is quite minimal for submoduleIDs and the duplication then between what we do once we have created a doc is minimal.

Author


Done — split into two separate loops. mainRepoKeys iterated with the streaming reader, submoduleKeys iterated with go-git createDocument. The builder sorts documents internally via sortDocuments so ordering between the two loops does not matter.

var blobResults []blobResult
if len(mainRepoIDs) > 0 {
    var err error
    blobResults, err = readBlobsPipelined(opts.RepoDir, mainRepoIDs)
Member


I think there is a problem with this current API in that you read everything into memory. This just won't scale for very large repos. It's part of why zoekt works with this design of adding documents that at certain thresholds decide to create a shard (which then rely on the GC to free memory).

Additionally previously if a blob was over our max size, we just wouldn't read it in to memory.

I would suggest instead designing the API to be like the archive/tar API. So you would still pass in a huge slice of mainRepoIDs. But it would then return a stateful struct over the output from the process. For each file we get a little bit of metadata (like size, etc) and then the caller here can read that into a []byte or call .Next to get it to skip over the file.

So I think to keep the nice property around just passing in all IDs, we still get git to output all objects. But here in go land we decide between actually reading the object into a []byte or just io.Discarding it.

You don't have to go crazy on the design of the API. Just do something nice and minimal to suit this use case.

Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, reworked. readBlobsPipelined is replaced by a streaming catfileReader, archive/tar.Reader-style:

cr, _ := newCatfileReader(opts.RepoDir, mainRepoIDs)
for _, key := range mainRepoKeys {
    size, missing, _ := cr.Next()
    if size > SizeMax { continue } // unread bytes auto-discarded by next Next()
    content := make([]byte, size)
    io.ReadFull(cr, content)
}
  • Next() reads the header and returns size — content is not read until the caller calls Read
  • Calling Next() again without reading discards unread bytes via bufio.Reader.Discard
  • Large blobs never touch Go memory — only the header is parsed, content is discarded in the bufio buffer
  • Peak memory is now bounded by ShardMax (one shard's worth of content), not total repo size

Replace the bulk readBlobsPipelined (which read all blobs into a
[]blobResult slice) with a streaming catfileReader modeled after
archive/tar.Reader:

  cr, _ := newCatfileReader(repoDir, ids)
  for {
      size, missing, err := cr.Next()
      if size > maxSize { continue } // auto-skipped, never read
      content := make([]byte, size)
      io.ReadFull(cr, content)
  }

Next() reads the cat-file header and returns the blob's size. The
caller decides whether to Read the content or skip it — calling Next()
again automatically discards unread bytes via bufio.Reader.Discard.
Large blobs over SizeMax are never allocated or read into Go memory.

Also split the single interleaved loop into two: one for main-repo
blobs streamed via cat-file, one for submodule blobs via go-git's
createDocument. The builder sorts documents internally so ordering
between the loops does not matter.

Peak memory is now bounded by ShardMax (one shard's worth of content)
rather than total repository size.
@clemlesne
Author

Updated benchmarks after the streaming rewrite (median of 5 runs, Apple M1 Max, kubernetes 29,188 files):

         go-git (before)   catfileReader (after)   Change
Time     2.41s             0.52s                   -78% (4.6x)
Allocs   690K              58K                     -92%
Memory   751 MB            275 MB                  -63%

The streaming API is faster than the previous bulk []blobResult approach (0.62s → 0.52s) because there's no intermediate slice — content goes directly from the pipe into the caller's buffer. And large blobs over SizeMax are now skipped without reading content into memory at all.
