
Commit 787cd1c

Revise paper: emphasize metaprogramming, add comparison table, fix decorator examples
1 parent d39705f commit 787cd1c

1 file changed

Lines changed: 25 additions & 34 deletions

File tree

paper.md

@@ -1,5 +1,5 @@
 ---
-title: 'arraybridge: Unified Array Conversion Across Five GPU Frameworks and NumPy with OOM Recovery and Thread-Local Streams'
+title: 'arraybridge: Zero-Copy GPU Array Conversion with OOM Recovery Across Six Frameworks'
 tags:
   - Python
   - GPU computing
@@ -30,13 +30,16 @@ tensorflow_to_numpy = data.numpy()
 pyclesperanto_to_numpy = cle.pull(data)
 ```
 
-With arraybridge:
+With arraybridge, users declare their function's memory type via decorators:
 
 ```python
-result = convert_memory(data, source_type='cupy', target_type='torch', gpu_id=0)
+@cupy
+def sobel_2d(image):
+    """Edge detection on GPU with automatic OOM recovery."""
+    return cp.gradient(image)
 ```
 
-The library handles DLPack zero-copy transfers when available, falls back to NumPy bridging otherwise, manages per-thread CUDA streams for safe parallelization, detects and recovers from GPU out-of-memory errors across all frameworks, and preserves dtypes through conversions preventing silent precision loss.
+The decorator handles DLPack zero-copy transfers when available, falls back to NumPy bridging otherwise, manages per-thread CUDA streams for safe parallelization, detects and recovers from GPU out-of-memory errors, and preserves dtypes through conversions.
 
 # Statement of Need
 
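The behavior this hunk describes ("DLPack zero-copy when available, NumPy bridging otherwise") can be sketched as a try/except conversion helper. This is a minimal illustration, not arraybridge's actual code; `_convert` and the `Fake*` stand-ins are hypothetical, used so the pattern runs without a GPU:

```python
# Hypothetical sketch of "zero-copy first, copying fallback". Not arraybridge code.
class FakeFramework:
    """Stand-in for a GPU framework so the example runs without a GPU."""
    @staticmethod
    def from_dlpack(data):
        raise TypeError("no __dlpack__ support")  # force the fallback path
    @staticmethod
    def from_numpy(values):
        return ("fake_tensor", values)

class FakeArray:
    """Stand-in source array exposing only a NumPy-style export."""
    def __init__(self, values):
        self.values = values
    def to_numpy(self):
        return list(self.values)

def _convert(data, target):
    """Try DLPack zero-copy; fall back to a copying NumPy bridge."""
    try:
        return target.from_dlpack(data)             # zero-copy when supported
    except (AttributeError, TypeError, RuntimeError):
        return target.from_numpy(data.to_numpy())   # copying fallback

result = _convert(FakeArray([1, 2, 3]), FakeFramework)
print(result)  # ('fake_tensor', [1, 2, 3])
```

The real library must additionally place the result on the right device and preserve dtype, but the control flow is this two-branch shape.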

@@ -51,6 +54,15 @@ The problem is not just syntax—each framework has different:
 
 Writing correct multi-framework code requires understanding all these differences. `arraybridge` consolidates this knowledge into a single declarative configuration, generating 36 conversion methods (6 frameworks × 6 target types) from ~450 lines of configuration rather than hand-written code for each path.
 
+| Feature | DLPack | Framework utils | arraybridge |
+|---------|:------:|:---------------:|:-----------:|
+| Zero-copy transfer | ✓ | ✓ | ✓ |
+| OOM recovery | ✗ | ✗ | ✓ |
+| Thread-local streams | ✗ | ✗ | ✓ |
+| Dtype preservation | ✗ | ✗ | ✓ |
+| Unified API (36 paths) | ✗ | ✗ | ✓ |
+| Automatic fallback | ✗ | ✗ | ✓ |
+
 # State of the Field
 
 **DLPack** [@dlpack] provides a zero-copy tensor sharing protocol adopted by all major frameworks. However, DLPack handles only the data transfer—users must still detect framework types, handle fallbacks when DLPack fails, manage device placement, and deal with framework-specific exceptions.
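The "36 methods from one declarative table" approach mentioned above can be sketched with a small registry metaclass. This is an illustrative stand-in for `AutoRegisterMeta` from the `metaclass-registry` library, not its real API; the two-entry `_CONFIG`, the `AutoRegister` class, and the toy frameworks are hypothetical:

```python
# Illustrative sketch of config-driven converter generation; not arraybridge code.
_CONFIG = {
    "numpy": {"to_numpy": lambda d: d, "from_numpy": lambda a: a},
    "listlib": {"to_numpy": lambda d: list(d), "from_numpy": lambda a: tuple(a)},
}

REGISTRY = {}

class AutoRegister(type):
    """Registers every generated converter class by its memory_type attribute."""
    def __new__(mcls, name, bases, ns):
        cls = super().__new__(mcls, name, bases, ns)
        if ns.get("memory_type"):
            REGISTRY[ns["memory_type"]] = cls   # auto-registration, no manual step
        return cls

def make_converters():
    """Generate one converter class per framework entry in the config table."""
    for mem_type, ops in _CONFIG.items():
        AutoRegister(f"{mem_type.title()}Converter", (), {
            "memory_type": mem_type,
            "to_numpy": staticmethod(ops["to_numpy"]),
            "from_numpy": staticmethod(ops["from_numpy"]),
        })

make_converters()
# Any source -> target path composes the source's to_numpy with the target's from_numpy.
out = REGISTRY["listlib"].from_numpy(REGISTRY["numpy"].to_numpy([1, 2]))
print(out)  # (1, 2)
```

Adding a framework under this scheme means adding one `_CONFIG` entry; the N×N conversion paths fall out of composition rather than hand-written methods.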
@@ -59,16 +71,9 @@ Writing correct multi-framework code requires understanding all these difference
 
 **Framework-specific utilities** (`torch.from_numpy()`, `cupy.asarray()`) handle only their own framework pairs. They provide no OOM recovery, no stream management, and no dtype preservation guarantees.
 
-`arraybridge` differs in four key areas:
-
-1. **Unified conversion API**: Single function for all 36 source/target combinations
-2. **Automatic OOM recovery**: Detects framework-specific exception types and string patterns, clears caches, retries
-3. **Thread-local CUDA streams**: Each thread gets its own stream, enabling true parallel GPU execution
-4. **Dtype preservation**: Scales floating-point results to integer ranges when converting back to integer dtypes, preventing silent precision loss
-
 # Software Design
 
-The architecture is data-driven. All framework-specific behavior is defined in `_FRAMEWORK_CONFIG`:
+The architecture uses data-driven metaprogramming. All framework-specific behavior is declared in `_FRAMEWORK_CONFIG`:
 
 ```python
 _FRAMEWORK_CONFIG = {
@@ -82,33 +87,21 @@ _FRAMEWORK_CONFIG = {
         "oom_clear_cache": "{mod}.get_default_memory_pool().free_all_blocks()",
         "stream_context": "{mod}.cuda.Stream()",
     },
-    MemoryType.TORCH: {
-        "conversion_ops": {
-            "to_numpy": "data.cpu().numpy()",
-            "from_numpy": "{mod}.from_numpy(data).cuda(gpu_id)",
-            "from_dlpack": "{mod}.from_dlpack(data)",
-        },
-        "oom_exception_types": ["{mod}.cuda.OutOfMemoryError"],
-        "oom_clear_cache": "{mod}.cuda.empty_cache()",
-        "stream_context": "{mod}.cuda.Stream()",
-    },
-    # Similar entries for TENSORFLOW, JAX, PYCLESPERANTO, NUMPY
+    # Similar entries for TORCH, TENSORFLOW, JAX, PYCLESPERANTO, NUMPY
 }
 ```
 
-At import time, converter classes are generated dynamically via `AutoRegisterMeta` from the `metaclass-registry` library. Each converter implements `to_numpy()`, `from_numpy()`, `from_dlpack()`, and `to_X()` methods for all target frameworks. The metaclass auto-registers each converter by its `memory_type` attribute, eliminating manual registration.
+At import time, converter classes are generated dynamically via `AutoRegisterMeta` from the `metaclass-registry` library [@metaclassregistry]. Each converter implements `to_numpy()`, `from_numpy()`, `from_dlpack()`, and `to_X()` methods for all target frameworks. Adding a seventh framework requires only adding its entry to `_FRAMEWORK_CONFIG`—no new code paths.
 
-**Thread-local GPU streams** are managed via `threading.local()`. The `@cupy` and `@torch` decorators automatically create per-thread CUDA streams:
+**Thread-local GPU streams** are managed via `threading.local()`. Without this, multiple threads sharing a GPU would serialize on the default stream or corrupt each other's operations. With per-thread streams, true parallel GPU execution is possible:
 
 ```python
-@torch(oom_recovery=True)
+@torch
 def segment_image(image):
     return model(image)  # Runs on thread-local stream
 ```
 
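The per-thread stream pattern described in this hunk can be demonstrated with the standard library alone. `FakeStream` below is a CPU stand-in for what would be `{mod}.cuda.Stream()` in CuPy or PyTorch; `get_thread_stream` is a hypothetical helper name, not arraybridge's API:

```python
# Sketch of thread-local stream management: each thread lazily creates and
# caches its own stream object, so no two threads ever share one.
import itertools
import threading

_local = threading.local()
_ids = itertools.count()

class FakeStream:
    """CPU stand-in for a CUDA stream, so the example runs anywhere."""
    def __init__(self):
        self.id = next(_ids)

def get_thread_stream():
    """Return this thread's stream, creating it on first use."""
    if not hasattr(_local, "stream"):
        _local.stream = FakeStream()   # created exactly once per thread
    return _local.stream

seen = []
def worker():
    seen.append(get_thread_stream().id)

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(seen))  # four distinct stream ids
```

Because `threading.local()` attributes are invisible across threads, each worker sees only its own stream; no locking is needed around stream creation or use.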

109-
This enables true parallelization—multiple threads can execute GPU operations simultaneously without stream conflicts.
110-
111-
**OOM recovery** unifies detection across frameworks. The library checks both exception types and error string patterns (e.g., "out of memory", "resource_exhausted"), clears framework-specific caches, and retries:
104+
**OOM recovery** (enabled by default) unifies detection across frameworks. The library checks both exception types and error string patterns (e.g., "out of memory", "resource_exhausted"), clears framework-specific caches, and retries:
112105

113106
```python
114107
def _execute_with_oom_recovery(func, memory_type, max_retries=2):
@@ -127,13 +120,11 @@ def _execute_with_oom_recovery(func, memory_type, max_retries=2):
 
 `arraybridge` is a core component of OpenHCS, an open-source platform for high-content screening microscopy. In OpenHCS pipelines:
 
-- **GPU-accelerated stitching** (`ashlar_compute_tile_positions_gpu`) uses CuPy for phase correlation
-- **Flatfield correction** (`basic_flatfield_correction_cupy`) uses CuPy with OOM recovery and automatic fallback to CPU
-- **Edge detection** (`sobel_2d_vectorized`) uses CuPy with dtype preservation to maintain uint16 microscopy data
+- **GPU-accelerated stitching** uses CuPy for phase correlation
+- **Flatfield correction** uses CuPy with OOM recovery and automatic fallback to CPU
+- **Edge detection** uses CuPy with dtype preservation to maintain uint16 microscopy data
 - **Deep learning segmentation** integrates PyTorch models via the `@torch` decorator
 
-The stack utilities (`stack_slices`, `unstack_slices`) enable efficient 3D volume processing where 2D slices are stacked to GPU, processed in parallel, and unstacked back to CPU. This pattern is used throughout OpenHCS for processing microscopy Z-stacks.
-
 The thread-local stream management is critical for high-throughput screening where thousands of images must be processed per experiment. Multiple worker threads can process different images on the same GPU without coordination overhead.
 
 # AI Usage Disclosure
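The diff shows only the signature of `_execute_with_oom_recovery`; its body is elided between hunks. The following is a hedged reconstruction of the detect / clear-cache / retry loop the paper describes, with a fake framework so it runs without a GPU; the helpers `_looks_like_oom` and `_CLEAR_CACHE` are illustrative names, not the library's:

```python
# Hedged reconstruction of the OOM retry loop; the commit elides the real body.
_OOM_PATTERNS = ("out of memory", "resource_exhausted")  # string patterns checked
_CLEAR_CACHE = {"fake": lambda: None}  # per-framework cache-clearing hooks (stand-in)

def _looks_like_oom(exc):
    """Match framework error messages against known OOM string patterns."""
    return any(p in str(exc).lower() for p in _OOM_PATTERNS)

def _execute_with_oom_recovery(func, memory_type, max_retries=2):
    for attempt in range(max_retries + 1):
        try:
            return func()
        except RuntimeError as exc:
            if not _looks_like_oom(exc) or attempt == max_retries:
                raise                        # not an OOM, or out of retries
            _CLEAR_CACHE[memory_type]()      # free framework caches, then retry

calls = []
def flaky():
    """Fails once with an OOM-looking error, then succeeds."""
    calls.append(1)
    if len(calls) < 2:
        raise RuntimeError("CUDA error: out of memory")
    return "ok"

result = _execute_with_oom_recovery(flaky, "fake")
print(result)  # ok
```

The real implementation also matches framework-specific exception types from `_FRAMEWORK_CONFIG` (e.g. `torch.cuda.OutOfMemoryError`), not just `RuntimeError` with a matching message.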
