
gitindex: replace go-git blob reading with pipelined git cat-file --batch #1021

Open
clemlesne wants to merge 2 commits into sourcegraph:main from clemlesne:catfile-batch

Conversation


@clemlesne clemlesne commented Mar 20, 2026

Summary

  • Replace serial go-git BlobObject calls in indexGitRepo with a single pipelined git cat-file --batch --buffer subprocess
  • A writer goroutine feeds all blob SHAs to stdin while the main goroutine reads responses from stdout, forming a concurrent pipeline
  • --buffer switches git from per-object write_or_die flush to libc stdio buffering, reducing syscalls
  • Submodule blobs fall back to the existing go-git createDocument path
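
The --batch protocol this pipeline drives can be observed by hand in a throwaway repository (standard git commands, nothing from this PR): each request line is an object name, each response is a `<oid> <type> <size>` header followed by the raw content and a newline.

```shell
repo=$(mktemp -d)
git init -q "$repo"
# Store a blob directly, then ask cat-file for it in batch mode.
oid=$(printf 'hello\n' | git -C "$repo" hash-object -w --stdin)
printf '%s\n' "$oid" | git -C "$repo" cat-file --batch --buffer
# → ce013625030ba8dba906f756967f9e9ca394464a blob 6
# → hello
```

Unknown object names produce a `<name> missing` line instead of a header, which is what the Missing flag below maps onto.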

Benchmarks

On kubernetes (29,188 files, 261 MB content), Apple M1 Max, median of 5 runs:

         go-git (before)   cat-file pipelined (after)   Change
Time     3.05s             0.62s                        -80% (4.9x)
Allocs   685K              58K                          -92%
Memory   670 MB            277 MB                       -59%

Reproducing:

git clone --depth=1 https://github.com/kubernetes/kubernetes /tmp/k8s
ZOEKT_BENCH_REPO=/tmp/k8s go test ./gitindex/ -bench='BenchmarkBlobRead_(GoGit|CatfilePipelined)/files=29188' -benchmem -count=5 -timeout=600s

Context

Filed as #1016. seek wraps zoekt as a local CLI for AI coding agents — every search is a blocking tool call, so cold-index speed matters.

Prior art: PR #424 benchmarked git cat-file --batch at 2.68x faster than go-git. PR #940 states "eventually we want to move away from go-git towards a vanilla direct git usage."

Design

The pipelined reader (readBlobsPipelined in gitindex/catfile.go):

  1. Starts git cat-file --batch --buffer with the repo as working directory
  2. Writer goroutine feeds all blob SHAs via a 64KB bufio.Writer, then closes stdin to trigger git's final fflush(stdout)
  3. Main goroutine reads responses in order via a 512KB bufio.Reader, parsing the <oid> <type> <size>\n<content>\n protocol
  4. On early exit (read error), drains stdout so git can flush and exit, then waits for the writer goroutine and cmd.Wait()
  5. Missing objects are marked via blobResult.Missing (matches the existing ErrObjectNotFound / skippedLargeDoc behavior)

Integration in indexGitRepo (gitindex/index.go):

  1. Separates main-repo keys (SubRepoPath == "") from submodule keys
  2. Collects blob SHAs in document-sorted order, calls readBlobsPipelined once
  3. Builds index.Document from pipelined results for main-repo blobs
  4. Falls back to go-git createDocument for submodule blobs

Per-blob allocations minimized to 1 (the content []byte itself) via stack-allocated hex buffers, ReadBytes for headers, and LastIndexByte for size parsing.

Test plan

  • go test ./gitindex/ -race passes (all existing tests + 2 new)
  • go vet ./gitindex/ clean
  • golangci-lint run ./gitindex/ — no new issues
  • New unit tests: TestReadBlobsPipelined (normal, empty, binary, large blobs), TestReadBlobsPipelined_WithMissing
  • New benchmarks: BenchmarkBlobRead_GoGit (baseline), BenchmarkBlobRead_CatfilePipelined (production path)
  • Benchmark on a second large repo (e.g. torvalds/linux)

🤖 Generated with Claude Code

…atch

Replace the serial go-git BlobObject calls in indexGitRepo with a single
pipelined "git cat-file --batch --buffer" subprocess. A writer goroutine
feeds all blob SHAs to stdin while the main goroutine reads responses
from stdout, forming a concurrent pipeline that eliminates per-object
packfile seek overhead and leverages git's internal delta base cache.

Submodule blobs fall back to the existing go-git createDocument path.

Benchmarked on kubernetes (29,188 files, 261 MB), Apple M1 Max, 5 runs:

  go-git BlobObject (before):
    Time:   2.94s     Allocs: 685K    Memory: 691 MB

  cat-file pipelined (after):
    Time:   0.60s     Allocs: 58K     Memory: 276 MB

  Speedup: 4.9x time, 12x fewer allocs, 2.5x less memory
Member

@keegancsmith keegancsmith left a comment


Awesome stuff. See inline, but the reason I am requesting changes is I want to avoid reading into memory files that we decide to not index due to being too large. I believe you should be able to shape the code written here into something which allows that.

Comment on lines +588 to +595
// Build the ordered list of (name, key) pairs and collect blob SHAs for
// the main repo so we can read them all at once via git cat-file --batch.
type indexEntry struct {
    key       fileKey
    blobIndex int // index into blobResults, or -1 for submodule blobs
}
entries := make([]indexEntry, 0, totalFiles)
mainRepoIDs := make([]plumbing.Hash, 0, totalFiles)
Member


I think it makes sense to rather do two loops. One over mainRepoIDs and instead have another list for submoduleIDs. The actual code is quite minimal for submoduleIDs and the duplication then between what we do once we have created a doc is minimal.

Author


Done — split into two separate loops. mainRepoKeys iterated with the streaming reader, submoduleKeys iterated with go-git createDocument. The builder sorts documents internally via sortDocuments so ordering between the two loops does not matter.

var blobResults []blobResult
if len(mainRepoIDs) > 0 {
    var err error
    blobResults, err = readBlobsPipelined(opts.RepoDir, mainRepoIDs)
Member


I think there is a problem with this current API in that you read everything into memory. This just won't scale for very large repos. It's part of why zoekt works with this design of adding documents that at certain thresholds decide to create a shard (which then rely on the GC to free memory).

Additionally previously if a blob was over our max size, we just wouldn't read it in to memory.

I would suggest instead designing the API to be like the archive/tar API. So you would still pass in a huge slice of mainRepoIDs. But it would then return a stateful struct over the output from the process. For each file we get a little bit of metadata (like size, etc) and then the caller here can read that into a []byte or call .Next to get it to skip over the file.

So I think to keep the nice property around just passing in all IDs, we still get git to output all objects. But here in go land we decide between actually reading the object into a []byte or just io.Discarding it.

You don't have to go crazy on the design of the API. Just do something nice and minimal to suit this use case.

Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, reworked. readBlobsPipelined is replaced by a streaming catfileReader, archive/tar.Reader-style:

cr, _ := newCatfileReader(opts.RepoDir, mainRepoIDs)
for _, key := range mainRepoKeys {
    size, missing, _ := cr.Next()
    if size > SizeMax { continue } // unread bytes auto-discarded by next Next()
    content := make([]byte, size)
    io.ReadFull(cr, content)
}
  • Next() reads the header and returns size — content is not read until the caller calls Read
  • Calling Next() again without reading discards unread bytes via bufio.Reader.Discard
  • Large blobs never touch Go memory — only the header is parsed, content is discarded in the bufio buffer
  • Peak memory is now bounded by ShardMax (one shard's worth of content), not total repo size

Replace the bulk readBlobsPipelined (which read all blobs into a
[]blobResult slice) with a streaming catfileReader modeled after
archive/tar.Reader:

  cr, _ := newCatfileReader(repoDir, ids)
  for {
      size, missing, err := cr.Next()
      if size > maxSize { continue } // auto-skipped, never read
      content := make([]byte, size)
      io.ReadFull(cr, content)
  }

Next() reads the cat-file header and returns the blob's size. The
caller decides whether to Read the content or skip it — calling Next()
again automatically discards unread bytes via bufio.Reader.Discard.
Large blobs over SizeMax are never allocated or read into Go memory.

Also split the single interleaved loop into two: one for main-repo
blobs streamed via cat-file, one for submodule blobs via go-git's
createDocument. The builder sorts documents internally so ordering
between the loops does not matter.

Peak memory is now bounded by ShardMax (one shard's worth of content)
rather than total repository size.
@clemlesne
Author

Updated benchmarks after the streaming rewrite (median of 5 runs, Apple M1 Max, kubernetes 29,188 files):

         go-git (before)   catfileReader (after)   Change
Time     2.41s             0.52s                   -78% (4.6x)
Allocs   690K              58K                     -92%
Memory   751 MB            275 MB                  -63%

The streaming API is faster than the previous bulk []blobResult approach (0.62s → 0.52s) because there's no intermediate slice — content goes directly from the pipe into the caller's buffer. And large blobs over SizeMax are now skipped without reading content into memory at all.
