Add non-power-of-2 shapes for Morton coding to benchmarks#3717
mkitti wants to merge 4 commits into zarr-developers:main
Conversation
Add (30,30,30) to large_morton_shards and (10,10,10), (20,20,20), (30,30,30) to morton_iter_shapes to benchmark the scalar fallback path for non-power-of-2 shapes, which are not fully covered by the vectorized hypercube path. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Documents the performance penalty when a shard shape is just above a power-of-2 boundary, causing n_z to jump from 32,768 to 262,144. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
**Benchmark Results**

These benchmarks were run on this branch (which includes the vectorized hypercube path).
| Shape | Elements | Type | Mean time |
|---|---|---|---|
| (8,8,8) | 512 | power-of-2 | 0.45 ms |
| (16,16,16) | 4,096 | power-of-2 | 3.6 ms |
| (32,32,32) | 32,768 | power-of-2 | 28.9 ms |
| (10,10,10) | 1,000 | non-power-of-2 | 9.6 ms |
| (20,20,20) | 8,000 | non-power-of-2 | 88.2 ms |
| (30,30,30) | 27,000 | non-power-of-2 | 125.6 ms |
| (33,33,33) | 35,937 | near-miss (+1 above 32³) | 767 ms |
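The scalar fallback path these shapes exercise can be sketched as a decode-and-filter loop. This is an illustrative sketch, not zarr-python's actual implementation; `decode_morton_3d` and `morton_order` are hypothetical names:

```python
import math

def decode_morton_3d(code):
    """Deinterleave a 3-D Morton code into per-axis coordinates.
    Illustrative scalar decode: one bit per axis per loop iteration."""
    coords = [0, 0, 0]
    bit = 0
    while code:
        for axis in range(3):
            coords[axis] |= (code & 1) << bit
            code >>= 1
        bit += 1
    return tuple(coords)

def morton_order(shape):
    """Chunk visit order for a shard: scan all codes up to the padded
    power-of-2 bound and keep only in-bounds coordinates. For
    non-power-of-2 shapes, many codes decode to out-of-bounds
    coordinates and are discarded, which is the scalar-fallback cost
    the benchmarks above measure."""
    bits = max(math.ceil(math.log2(s)) for s in shape)
    order = []
    for code in range(1 << (bits * len(shape))):
        c = decode_morton_3d(code)
        if all(ci < si for ci, si in zip(c, shape)):
            order.append(c)
    return order
```

For a power-of-2 shape every scanned code is in bounds; for (33,33,33) only 35,937 of 262,144 scanned codes survive the bounds check.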
The near-miss penalty is striking: (33,33,33) has only ~10% more elements than (32,32,32) but takes 27× longer. This is because the current floor-hypercube approach must scalar-decode many Morton codes beyond the guaranteed in-bounds region.
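A back-of-envelope sketch of why the near-miss is so costly, assuming the scanned-code count comes from padding every axis to the next power of two (consistent with the n_z jump from 32,768 to 262,144 noted in the commit message); `morton_code_count` is a hypothetical helper:

```python
import math

def morton_code_count(shape):
    """Number of Morton codes scanned for a shard shape: each axis is
    padded to the largest per-axis next power of two before the bits
    are interleaved."""
    bits = max(math.ceil(math.log2(s)) for s in shape)
    return 1 << (bits * len(shape))

print(morton_code_count((32, 32, 32)))  # 32768  (5 bits/axis)
print(morton_code_count((33, 33, 33)))  # 262144 (6 bits/axis: 8x more codes)
```

Crossing the 32-per-axis boundary adds one bit to every axis, multiplying the scan space by 2³ = 8 even though the shard itself grew by only ~10%.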
test_sharded_morton_write_single_chunk — write 1 chunk to a large shard, cache cleared each round
| Shape | Chunks/shard | Mean time |
|---|---|---|
| (32,32,32) | 32,768 | 35.7 ms |
| (30,30,30) | 27,000 | 127.5 ms |
| (33,33,33) | 35,937 | 767.8 ms |
test_sharded_morton_single_chunk — read 1 chunk from a large shard (cached after first access)
| Shape | Mean time |
|---|---|
| (32,32,32) | 0.73 ms |
| (30,30,30) | 0.69 ms |
| (33,33,33) | 0.71 ms |
Reads are fast across all shapes once the Morton order cache is warm (the first call pays the penalty, subsequent reads are cached).
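The warm-cache behavior can be illustrated with a simple memoization sketch. This is not zarr-python's actual cache mechanism; `chunk_order` is a hypothetical stand-in that uses plain row-major order in place of the real Morton computation:

```python
from functools import lru_cache
from itertools import product

@lru_cache(maxsize=None)
def chunk_order(shape):
    # Stand-in for the expensive Morton-order computation: memoized
    # per shard shape, so only the first read pays the cost.
    return tuple(product(*(range(s) for s in shape)))

chunk_order((30, 30, 30))             # first call: computes the order
chunk_order((30, 30, 30))             # second call: served from cache
print(chunk_order.cache_info().hits)  # 1
```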
Interpretation
The benchmarks confirm that non-power-of-2 shard shapes carry a significant Morton computation penalty under the current implementation, with near-miss shapes (like (33,33,33)) being especially slow. These benchmarks provide a baseline to measure improvements from follow-on optimization work.