scalar: Install prefetch packfiles in parallel #876
derrickstolee wants to merge 2 commits into microsoft:vfs-2.53.0
Conversation
Refactor install_prefetch() to process prefetch packs in two distinct phases:

Phase 1 (extraction): Read the multipack stream sequentially, copying each packfile to its own temp file and recording its checksum and timestamp in a prefetch_entry array. This must be sequential because the multipack is a single byte stream.

Phase 2 (indexing): Run 'git index-pack' on each extracted temp file and finalize it into the ODB. Today this still runs sequentially, but the separation makes it straightforward to parallelize in a subsequent commit.

The new extract_packfile_from_multipack() only does I/O against the multipack fd plus temp-file creation. The new index_and_finalize_packfile() only does the index-pack and rename work. Neither depends on the other's state, so they can operate on different entries concurrently once the extraction phase completes.

No behavioral change; this is a pure refactor.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
Replace the sequential index-pack loop in install_prefetch() with run_processes_parallel(), spawning up to four concurrent 'git index-pack' workers.

The packfiles are already ordered by timestamp (oldest first) in the multipack response. In the common fresh-clone scenario the oldest pack is by far the largest, so it starts indexing immediately on the first worker while the remaining workers cycle through the smaller daily and hourly packs. Note that this works for the GVFS prefetch endpoint because all prefetch packfiles are non-thin packs. The bundle URI feature uses thin bundles that must be unpacked sequentially.

The worker count is min(np, PREFETCH_MAX_WORKERS), where PREFETCH_MAX_WORKERS is 4, so we never create more workers than there are packfiles. When there is only a single packfile, the parallel infrastructure is skipped entirely and index-pack runs directly.

The default grouped mode of run_processes_parallel() is used so that child-process completion is detected via poll() on stderr pipes, rather than the ungroup mode's aggressive mark-all-slots-WAIT_CLEANUP approach, which can misfire on slots that never started a process.

The run_processes_parallel() callbacks are always invoked from the main thread, so finalize_prefetch_packfile() (which renames files into the ODB) needs no locking. If any index-pack fails, the error is recorded and the remaining tasks still complete so that successfully-indexed packs are not lost.

I performed manual performance testing on Linux using an internal monorepo. I deleted a set of recent prefetch packfiles, leading to a download of a couple daily packfiles and several hourly packfiles. This led to an improvement from 85.2 seconds to 40.3 seconds.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
Excellent change. From what I've learned implementing similar behavior in VFSForGit, adding just one parallel thread provides the vast majority of the gain because of the structure of the pack files provided by the cache servers: there is one very large file followed by perhaps a dozen much smaller files. A single separate worker handling the smaller files can typically index them all while the large file is still in its single-threaded analysis phase. If the cache server format changes in the future (e.g., to limit files to 1GB instead of having one "everything older than 3 months" file), then more parallelism could have a bigger effect.
While that structure is true for fresh clones, I was seeing performance gains even when fetching only the last few days of packfiles, so this will improve even incremental fetches. And you're right that we may benefit from some further gains by rearranging our prefetch packfile structure, such as breaking the "everything" packfile into smaller chunks, perhaps with monthly or yearly packs as a "maximum time" interval.
When using Scalar clones with microsoft/git against Azure DevOps and GVFS Cache Servers, `git fetch` will download potentially multiple precomputed prefetch packfiles. The current mechanism indexes these files sequentially. Let's make those
`git index-pack` processes run somewhat in parallel.

For now, I've chosen to have a maximum of four parallel processes to limit the potential load on the disk. However, this already has some significant gains. When testing an internal monorepo (that uses Codespaces, for easy Linux testing) and deleting a few days of recent prefetch packfiles, the end-to-end `git fetch` time improved as follows: *(table of new vs. old timings)*. When downloading fewer prefetch packfiles, the improvement is still relevant: *(table of new vs. old timings)*.

I should mention that I first tried streaming data directly from the curl download into a sequence of `git index-pack` processes, but that did not make any serious difference in the performance. Based on these numbers, we are clearly blocked on the CPU time spent computing deltas and evaluating object hashes, and not blocked on the "download to disk, then index from disk" I/O.

I think it would be worthwhile to do some performance testing on Windows, at minimum, before merging this change. I'd like to get some feedback on the concept before going through those actions.
Another question to ask is whether it is worth making this behavior configurable: should it be possible to disable parallel indexing in favor of a sequential process if a certain config option is set? Should we allow increasing the parallelism via config?
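If such knobs were added, the user-facing shape might look like the sketch below. Both key names (`gvfs.prefetchParallel`, `gvfs.prefetchWorkers`) are invented for illustration; the PR only raises the question and neither option exists today. Writing to a scratch config file so the commands run outside any repository.

```shell
# Hypothetical configuration sketch -- key names are invented, not real
# microsoft/git options.

# Disable parallel indexing, falling back to the sequential loop:
git config --file demo.config gvfs.prefetchParallel false

# Raise the worker cap above the hard-coded default of 4:
git config --file demo.config gvfs.prefetchWorkers 8

# Inspect the stored value:
git config --file demo.config gvfs.prefetchWorkers
```

A bool-plus-integer pair like this mirrors how git usually exposes such tuning (compare `fetch.parallel` or `checkout.workers`, where a value of 1 means sequential).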