Parallel access to b-tree and data via cat_ranges and threading #218
Draft
bnlawrence wants to merge 14 commits into main
added 14 commits
November 18, 2025 08:09
Collaborator
Author
These results show the benefit of parallelism for data reading, though they suggest that the parallel b-tree read should not be made the default. Further investigation is necessary. Note that the POSIX results are not believable, as they reflect memory caching by the OS, as discussed here. Note also that the ssh results use `p5rem`, not `fsspec`. It is not clear to what extent server-side caching (for http and s3) is involved.

Description
It is clear that pyfive itself could benefit from internal parallelism. This idea was outlined in #154, and some detailed thinking and architecture design resulted in #216. This pull request is the outcome of that work: it provides both parallel chunk reading and parallel reading of b-tree information. Both are turned on by default. The API to turn them off is somewhat obscure, and might be something to address in the discussion around this pull request.
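For readers new to the idea, here is a minimal sketch of threaded byte-range reading, which is the general technique behind parallel chunk reads. It is not pyfive's implementation: `read_range`, `read_ranges_threaded`, and the demo file are hypothetical names invented for illustration, and it uses only the standard library. For fsspec-backed files, the title suggests that `cat_ranges` plays the equivalent batching role; that is not shown here.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def read_range(path, start, end):
    """Read bytes [start, end) from path using a private file handle."""
    # Each task opens its own handle, so no seek position is shared
    # between threads.
    with open(path, "rb") as fh:
        fh.seek(start)
        return fh.read(end - start)


def read_ranges_threaded(path, ranges, max_workers=4):
    """Fetch several byte ranges concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(read_range, path, s, e) for s, e in ranges]
        # Collect in submission order, not completion order.
        return [f.result() for f in futures]


# Demo: a small temporary file standing in for chunked storage.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(bytes(range(256)) * 4)
    demo_path = tmp.name

# Hypothetical (start, end) byte ranges, as a chunk index might supply.
chunk_ranges = [(0, 16), (256, 272), (1000, 1024)]
chunks = read_ranges_threaded(demo_path, chunk_ranges)
os.unlink(demo_path)
```

Threads suit this workload because each read spends most of its time waiting on I/O, particularly against remote storage such as http or s3.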
This would close #209 and #216 (#154 has already been closed in anticipation).
Considerations:
The use of a mixin class for reading chunks. While concerns have been expressed, I think that in the end this is the right pattern, for now at least.
This retains a nearly complete separation of concerns between pyfive and the environment (POSIX, FSSPEC etc), but it is not perfect. Future work will need to address that, but the benefits are so remarkable that it is worth doing this now and foreshadowing the necessary work (an issue will be forthcoming in the next few days, and will link back here).
This replaces the previous pull request (First cut at adding some parallelism in pyfive #209).
Parallel decompression of chunks is postponed for future work.
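To make the mixin consideration above concrete, here is a hedged sketch of the general pattern only. It does not reproduce the classes in this pull request: `ChunkReadMixin`, `InMemoryDataset`, `chunk_index`, and `_read_bytes` are all hypothetical names. The mixin holds the threading logic, while the host class supplies the environment-specific byte access.

```python
from concurrent.futures import ThreadPoolExecutor


class ChunkReadMixin:
    """Hypothetical mixin: adds threaded chunk reading to any class that
    provides `chunk_index` (key -> (offset, size)) and `_read_bytes`."""

    def read_chunks(self, keys, max_workers=4):
        def fetch(key):
            offset, size = self.chunk_index[key]
            return self._read_bytes(offset, size)

        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            # pool.map yields results in the order of `keys`.
            return list(pool.map(fetch, keys))


class InMemoryDataset(ChunkReadMixin):
    """Stand-in host class; a real one would read from a file or fsspec."""

    def __init__(self, payload, chunk_index):
        self._payload = payload
        self.chunk_index = chunk_index

    def _read_bytes(self, offset, size):
        return self._payload[offset:offset + size]


ds = InMemoryDataset(
    b"aaaabbbbcccc",
    {(0,): (0, 4), (1,): (4, 4), (2,): (8, 4)},
)
result = ds.read_chunks([(2,), (0,)])
```

The trade-off is that the mixin relies on an implicit contract with its host class, which may be one source of the concerns mentioned above.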
Checklist