gitindex: replace go-git blob reading with pipelined git cat-file --batch #1021
clemlesne wants to merge 2 commits into sourcegraph:main
Conversation
Replace the serial go-git BlobObject calls in indexGitRepo with a single
pipelined "git cat-file --batch --buffer" subprocess. A writer goroutine
feeds all blob SHAs to stdin while the main goroutine reads responses
from stdout, forming a concurrent pipeline that eliminates per-object
packfile seek overhead and leverages git's internal delta base cache.
Submodule blobs fall back to the existing go-git createDocument path.
Benchmarked on kubernetes (29,188 files, 261 MB), Apple M1 Max, 5 runs:
go-git BlobObject (before):
Time: 2.94s Allocs: 685K Memory: 691 MB
cat-file pipelined (after):
Time: 0.60s Allocs: 58K Memory: 276 MB
Speedup: 4.9x time, 12x fewer allocs, 2.5x less memory
keegancsmith
left a comment
Awesome stuff. See inline, but the reason I am requesting changes is I want to avoid reading into memory files that we decide to not index due to being too large. I believe you should be able to shape the code written here into something which allows that.
gitindex/index.go
Outdated
// Build the ordered list of (name, key) pairs and collect blob SHAs for
// the main repo so we can read them all at once via git cat-file --batch.
type indexEntry struct {
	key       fileKey
	blobIndex int // index into blobResults, or -1 for submodule blobs
}
entries := make([]indexEntry, 0, totalFiles)
mainRepoIDs := make([]plumbing.Hash, 0, totalFiles)
I think it makes sense to rather do two loops. One over mainRepoIDs and instead have another list for submoduleIDs. The actual code is quite minimal for submoduleIDs and the duplication then between what we do once we have created a doc is minimal.
Done — split into two separate loops. mainRepoKeys iterated with the streaming reader, submoduleKeys iterated with go-git createDocument. The builder sorts documents internally via sortDocuments so ordering between the two loops does not matter.
gitindex/index.go
Outdated
var blobResults []blobResult
if len(mainRepoIDs) > 0 {
	var err error
	blobResults, err = readBlobsPipelined(opts.RepoDir, mainRepoIDs)
I think there is a problem with this current API in that you read everything into memory. This just won't scale for very large repos. It's part of why zoekt works with this design of adding documents that at certain thresholds decide to create a shard (which then rely on the GC to free memory).
Additionally previously if a blob was over our max size, we just wouldn't read it in to memory.
I would suggest instead designing the API to be like the archive/tar API. So you would still pass in a huge slice of mainRepoIDs. But it would then return a stateful struct over the output from the process. For each file we get a little bit of metadata (like size, etc) and then the caller here can read that into a []byte or call .Next to get it to skip over the file.
So I think to keep the nice property around just passing in all IDs, we still get git to output all objects. But here in go land we decide between actually reading the object into a []byte or just io.Discarding it.
You don't have to go crazy on the design of the API. Just do something nice and minimal to suit this use case.
Agreed, reworked. readBlobsPipelined is replaced by a streaming catfileReader — archive/tar.Reader-style:
cr, _ := newCatfileReader(opts.RepoDir, mainRepoIDs)
for _, key := range mainRepoKeys {
	size, missing, _ := cr.Next()
	if size > SizeMax { continue } // unread bytes auto-discarded by next Next()
	content := make([]byte, size)
	io.ReadFull(cr, content)
}

- Next() reads the header and returns size — content is not read until the caller calls Read
- Calling Next() again without reading discards unread bytes via bufio.Reader.Discard
- Large blobs never touch Go memory — only the header is parsed, content is discarded in the bufio buffer
- Peak memory is now bounded by ShardMax (one shard's worth of content), not total repo size
Replace the bulk readBlobsPipelined (which read all blobs into a
[]blobResult slice) with a streaming catfileReader modeled after
archive/tar.Reader:
cr, _ := newCatfileReader(repoDir, ids)
for {
	size, missing, err := cr.Next()
	if err == io.EOF {
		break
	}
	if missing || size > maxSize {
		continue // auto-skipped, never read
	}
	content := make([]byte, size)
	io.ReadFull(cr, content)
}
Next() reads the cat-file header and returns the blob's size. The
caller decides whether to Read the content or skip it — calling Next()
again automatically discards unread bytes via bufio.Reader.Discard.
Large blobs over SizeMax are never allocated or read into Go memory.
Also split the single interleaved loop into two: one for main-repo
blobs streamed via cat-file, one for submodule blobs via go-git's
createDocument. The builder sorts documents internally so ordering
between the loops does not matter.
Peak memory is now bounded by ShardMax (one shard's worth of content)
rather than total repository size.
Updated benchmarks after the streaming rewrite (median of 5 runs, Apple M1 Max, kubernetes 29,188 files):
The streaming API is faster than the previous bulk readBlobsPipelined.
Summary
- Replace the serial go-git BlobObject calls in indexGitRepo with a single pipelined git cat-file --batch --buffer subprocess
- --buffer switches git from per-object write_or_die flush to libc stdio buffering, reducing syscalls
- Submodule blobs fall back to the existing go-git createDocument path

Benchmarks
On kubernetes (29,188 files, 261 MB content), Apple M1 Max, median of 5 runs:
Reproducing:
Context
Filed as #1016. seek wraps zoekt as a local CLI for AI coding agents — every search is a blocking tool call, so cold-index speed matters.
Prior art: PR #424 benchmarked git cat-file --batch at 2.68x faster than go-git. PR #940 states "eventually we want to move away from go-git towards a vanilla direct git usage."

Design
The pipelined reader (readBlobsPipelined in gitindex/catfile.go):

- Spawns git cat-file --batch --buffer with the repo as working directory
- A writer goroutine feeds blob SHAs to stdin through a bufio.Writer, then closes stdin to trigger git's final fflush(stdout)
- The main goroutine reads stdout through a bufio.Reader, parsing the <oid> <type> <size>\n<content>\n protocol
- The subprocess is reaped with cmd.Wait()
- Missing objects are flagged via blobResult.Missing (matches existing ErrObjectNotFound → skippedLargeDoc behavior)

Integration in indexGitRepo (gitindex/index.go):

- Splits main-repo keys (SubRepoPath == "") from submodule keys
- Calls readBlobsPipelined once for all main-repo blob SHAs
- Builds each index.Document from pipelined results for main-repo blobs
- Falls back to go-git createDocument for submodule blobs

Per-blob allocations are minimized to 1 (the content []byte itself) via stack-allocated hex buffers, ReadBytes for headers, and LastIndexByte for size parsing.

Test plan

- go test ./gitindex/ -race passes (all existing tests + 2 new)
- go vet ./gitindex/ clean
- golangci-lint run ./gitindex/ — no new issues
- New tests: TestReadBlobsPipelined (normal, empty, binary, large blobs), TestReadBlobsPipelined_WithMissing
- New benchmarks: BenchmarkBlobRead_GoGit (baseline), BenchmarkBlobRead_CatfilePipelined (production path)
- (torvalds/linux)

🤖 Generated with Claude Code