
Numba/CUDA projection kernels for reproject, README update#1046

Open
brendancol wants to merge 35 commits into geotiff-reader-writer from issue-1045

Conversation


@brendancol brendancol commented Mar 21, 2026

Closes #1045.

Summary

  • README narrative sections were stale ("implements common raster analysis functions," "documentation still under construction") -- updated to reflect the actual library
  • Fixed three bugs in reproject grid computation that caused shape mismatches vs rioxarray
  • Ported six PROJ projections to Numba (CPU) and CUDA (GPU), so common CRS pairs no longer go through pyproj

Reproject fixes

_grid.py had three problems that showed up when comparing output grids against rioxarray:

  1. Resolution was estimated by transforming a single diagonal step, coupling x and y through projection non-linearity. Now each axis is transformed independently and the geometric mean gives square pixels.
  2. Edge sampling used 21 points, which could miss the true bounding-box extent on curved projections. Now 101 edge points plus a 21x21 interior grid are sampled.
  3. ceil() turned 677.0000000000001 into 678. Switched to round().
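As a minimal sketch of fixes 1 and 3 above, the per-axis resolution estimate and `round()`-based dimension computation might look like this (standalone illustration; `estimate_resolution`, `grid_dims`, and the `transform` callable are hypothetical names, not xrspatial's internals):

```python
import math

def estimate_resolution(transform, x0, y0, dx, dy):
    """Estimate output pixel size by transforming each axis independently.
    `transform` is any (x, y) -> (x', y') callable, e.g. a pyproj
    Transformer.transform bound method."""
    tx0, ty0 = transform(x0, y0)
    tx1, ty1 = transform(x0 + dx, y0)  # pure x step
    tx2, ty2 = transform(x0, y0 + dy)  # pure y step
    res_x = math.hypot(tx1 - tx0, ty1 - ty0)
    res_y = math.hypot(tx2 - tx0, ty2 - ty0)
    # Geometric mean gives square pixels without coupling the axes
    # through the projection's non-linearity (the old diagonal step did).
    return math.sqrt(res_x * res_y)

def grid_dims(extent, res):
    # round(), not ceil(): 677.0000000000001 must stay 677, not become 678.
    return round(extent / res)
```

With a simple linear transform that scales x by 2 and y by 3, the estimate comes out as the geometric mean sqrt(2 * 3).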

Numba projection kernels

Six projections ported from the PROJ C library. Unsupported CRS pairs fall back to pyproj.

| Projection | EPSG examples | CPU vs pyproj |
| --- | --- | --- |
| Web Mercator | 3857 | 5.7x |
| UTM (Krueger 6th-order) | 326xx, 327xx, 269xx | 7.8x |
| Ellipsoidal Mercator | 3395 | 9.3x |
| Lambert Conformal Conic | 2154, State Plane | 9.3x |
| Albers Equal Area | 5070 | 6.9x |
| Cylindrical Equal Area | 6933 | 5.7x |

CUDA kernels

Same projections as @cuda.jit(device=True) functions. Coordinates stay on-device instead of round-tripping through CPU.
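The spherical Web Mercator case is compact enough to show. Here is a sketch of the forward/inverse math in plain NumPy; the PR compiles the same formulas with `@numba.njit` on CPU and `@cuda.jit(device=True)` on GPU, and the function names here are illustrative, not the PR's:

```python
import numpy as np

R = 6378137.0  # WGS84 semi-major axis in metres; EPSG:3857 treats it as a sphere

def webmerc_forward(lon_deg, lat_deg):
    """Spherical Web Mercator forward projection."""
    lam = np.radians(lon_deg)
    phi = np.radians(lat_deg)
    x = R * lam
    y = R * np.log(np.tan(np.pi / 4.0 + phi / 2.0))
    return x, y

def webmerc_inverse(x, y):
    """Inverse: metres back to degrees."""
    lon = np.degrees(x / R)
    lat = np.degrees(2.0 * np.arctan(np.exp(y / R)) - np.pi / 2.0)
    return lon, lat
```

The forward direction really is "3 lines per direction" as the commit message says, which is why it is such a good candidate for a device function.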

| Projection | GPU time (16.8M px) | vs pyproj | vs CPU Numba |
| --- | --- | --- | --- |
| Web Mercator | 5ms | 165x | 27x |
| UTM | 22ms | 81x | 10x |
| Ell. Mercator | 24ms | 111x | 12x |
| LCC | 35ms | 85x | 9x |
| Albers | 30ms | 41x | 6x |
| CEA | 8ms | 110x | 18x |

All kernels agree with pyproj to machine precision (max error < 5e-14 degrees).

Test plan

  • 67 existing reproject tests pass
  • Numba kernels validated point-by-point against pyproj (sub-nanometer)
  • CUDA kernels validated against CPU Numba (bit-exact)
  • Benchmark vs rioxarray: R^2 > 0.99999 on Landsat, Copernicus, USGS files
  • Verify README renders on GitHub

The intro, GDAL caveat, and docs note were written when the library
was much smaller. Now there are 100+ functions, native GeoTIFF I/O,
4-backend dispatch, and 33+ user guide notebooks. Updated the prose
to say what the library actually does instead of underselling it.
@github-actions github-actions bot added the performance PR touches performance-sensitive code label Mar 21, 2026
:gpu: doesn't render on GitHub -- switched to 🖥️.
Added a quick start snippet showing the read_geotiff -> reproject ->
write_geotiff workflow.
Switched quick start to 'import xrspatial as xrs', moved the
read/reproject/write flow into the main example using the .xrs
accessor, and simplified the standalone function example to use
the xrs namespace.
Three fixes to _grid.py:
- Resolution estimation now transforms each axis independently and
  uses the geometric mean for square pixels (was diagonal step)
- Edge sampling increased from 21 to 101 points plus a 21x21
  interior grid for better bounds coverage
- ceil() replaced with round() for dimension calculation to prevent
  floating-point noise from adding spurious rows/columns
Ports six projections from the PROJ library to Numba (CPU) and
Numba CUDA (GPU), bypassing pyproj for common CRS pairs:

- Web Mercator (EPSG:3857) -- spherical, 3 lines per direction
- Transverse Mercator / UTM (326xx, 327xx, 269xx) -- 6th-order
  Krueger series (Karney 2011), closed-form forward and inverse
- Ellipsoidal Mercator (EPSG:3395) -- Newton inverse
- Lambert Conformal Conic (e.g. EPSG:2154) -- Newton inverse
- Albers Equal Area (e.g. EPSG:5070) -- authalic latitude series
- Cylindrical Equal Area (e.g. EPSG:6933) -- authalic latitude series

CPU Numba kernels are 6-9x faster than pyproj. CUDA kernels are
40-165x faster. Unsupported CRS pairs fall back to pyproj.

_transform_coords now tries Numba first, then pyproj. The CuPy
chunk worker tries CUDA first, keeping coordinates on-device.
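A sketch of that dispatch pattern, with a hypothetical kernel registry standing in for the real dispatch table (`_KERNELS`, `register_kernel`, and `transform_coords` here are illustrative names, not xrspatial's API):

```python
# Hypothetical registry: (src_crs, dst_crs) -> compiled transform function
_KERNELS = {}

def register_kernel(src_crs, dst_crs, fn):
    _KERNELS[(src_crs, dst_crs)] = fn

def try_numba_transform(xs, ys, src_crs, dst_crs):
    """Return transformed coords via a registered fast-path kernel,
    or None when this CRS pair is not supported."""
    fn = _KERNELS.get((src_crs, dst_crs))
    return fn(xs, ys) if fn is not None else None

def transform_coords(xs, ys, src_crs, dst_crs):
    out = try_numba_transform(xs, ys, src_crs, dst_crs)
    if out is not None:
        return out  # fast path handled this CRS pair
    # Fallback: pyproj handles everything, just slower.
    from pyproj import Transformer
    t = Transformer.from_crs(src_crs, dst_crs, always_xy=True)
    return t.transform(xs, ys)
```

The key property is that `try_numba_transform` returns `None` cheaply for unsupported pairs, so the fallback costs nothing extra on the fast path.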
Compares xrspatial (approx, exact, numba) against rioxarray on
synthetic grids and real-world GeoTIFFs. Measures both performance
(median of 5 runs) and pixel-level consistency (RMSE, R^2, NaN
agreement) by forcing both libraries onto the same output grid.
Updated Reproject description and added a table listing the six
projections with native Numba CPU and CUDA GPU kernels.
@brendancol brendancol changed the title Update README narrative to match current scope Numba/CUDA projection kernels for reproject, README update Mar 21, 2026
Three new CPU Numba projections for bathymetric/ocean use cases:

- Sinusoidal (ellipsoidal): MODIS ocean/land products. Uses
  pj_mlfn meridional arc length with 5th-order series.
  Forward: sub-micrometer vs pyproj. Roundtrip: exact.

- Polar Stereographic (N/S): IBCAO Arctic (3996/3413),
  IBCSO Antarctic (3031), UPS. Iterative inverse (15 iter max).
  Forward: sub-nanometer. Roundtrip: exact.

LAEA implemented but disabled in dispatch pending investigation
of ~940m oblique-mode error (kernels are in the file for future
fix, just not wired into the dispatch table).
The oblique-mode LAEA had xmf and ymf swapped (rq/dd vs rq*dd).
Research agent found the bug by comparing against PROJ's laea.cpp
source. Forward accuracy is now sub-millimeter vs pyproj.
State Plane zones that use Transverse Mercator (Maine, New Hampshire,
Wisconsin, etc.) now hit the Numba fast path. Uses the same Krueger
6th-order series as UTM, with a Zb offset for non-zero lat_0.

Meter-based zones only; US survey foot zones fall back to pyproj.
State Plane zones in us-ft (136 zones) and ft (28 zones) now use
the Numba fast path. The Krueger/LCC kernels compute in metres
internally, then divide by the unit conversion factor (0.3048006
for us-ft, 0.3048 for ft) on output. x_0/y_0 from PROJ4 dicts
are already in metres regardless of the units parameter.

- Benchmark table showing Numba vs pyproj for all 12 supported projections
- Fixed dispatch to work with custom PROJ strings (no EPSG code),
  needed for Sinusoidal and other non-registered CRS definitions
- Fixed _utm_params to handle None epsg_code
All 12 projections now have GPU CUDA kernels. Performance on A6000:

- Sinusoidal: 18ms (56x vs pyproj)
- LAEA Europe: 18ms (92x)
- Polar Stere: 57ms (64-67x)
- State Plane tmerc: 23ms (88x)
- State Plane lcc ftUS: 36ms (124x)
- LCC France: 39ms (78x)

All bit-exact against CPU Numba kernels.
Updated README benchmark table and projection support matrix.
CRS on non-WGS84/GRS80 ellipsoids (Airy for BNG, Clarke 1866 for
NAD27, Bessel for Tokyo, etc.) now fall back to pyproj instead of
silently skipping the datum transformation. Without this, BNG had
~100m error, NAD27 ~24m, Tokyo ~900m.

Each _*_params() function now checks _is_wgs84_compatible_ellipsoid()
before returning parameters. EPSG-specific paths (UTM, Web Mercator)
were already safe since they only match WGS84/NAD83 codes.
NAD27 (EPSG:4267) sources and targets now go through the Numba fast
path. After the projection kernel runs in WGS84, a 3-parameter
geocentric Helmert transform (dx=-8, dy=160, dz=176 for Clarke 1866)
shifts coordinates to/from NAD27.
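The geocentric Helmert shift above can be sketched as follows. The ellipsoid constants and the (-8, 160, 176) parameters are from the commit message; the function names and the fixed-point inverse are illustrative, not xrspatial's internals:

```python
import math

# Ellipsoids as (semi-major axis a, flattening f)
CLARKE_1866 = (6378206.4, 1 / 294.978698214)  # NAD27
WGS84 = (6378137.0, 1 / 298.257223563)
DX, DY, DZ = -8.0, 160.0, 176.0  # NAD27 -> WGS84 shift, metres

def geodetic_to_geocentric(lon, lat, h, ell):
    a, f = ell
    e2 = f * (2 - f)
    lam, phi = math.radians(lon), math.radians(lat)
    n = a / math.sqrt(1 - e2 * math.sin(phi) ** 2)  # prime-vertical radius
    x = (n + h) * math.cos(phi) * math.cos(lam)
    y = (n + h) * math.cos(phi) * math.sin(lam)
    z = (n * (1 - e2) + h) * math.sin(phi)
    return x, y, z

def geocentric_to_geodetic(x, y, z, ell, iters=8):
    a, f = ell
    e2 = f * (2 - f)
    lam = math.atan2(y, x)
    p = math.hypot(x, y)
    phi = math.atan2(z, p * (1 - e2))  # initial guess
    h = 0.0
    for _ in range(iters):  # fixed-point iteration on latitude
        n = a / math.sqrt(1 - e2 * math.sin(phi) ** 2)
        h = p / math.cos(phi) - n
        phi = math.atan2(z, p * (1 - e2 * n / (n + h)))
    return math.degrees(lam), math.degrees(phi), h

def nad27_to_wgs84(lon, lat, h=0.0):
    """3-parameter geocentric Helmert shift: NAD27 (Clarke 1866) -> WGS84."""
    x, y, z = geodetic_to_geocentric(lon, lat, h, CLARKE_1866)
    return geocentric_to_geodetic(x + DX, y + DY, z + DZ, WGS84)
```

The ~240 m geocentric offset translates into a horizontal shift of a few metres to tens of metres across CONUS, consistent with the 2.9 m mean accuracy quoted above.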

Accuracy: mean 2.9m, p95 5.8m vs pyproj across CONUS. This matches
PROJ's own accuracy when NADCON grids aren't installed. The framework
is extensible to other datums by adding entries to _DATUM_PARAMS.

Also broadened geographic CRS detection from WGS84/NAD83-only to
include NAD27, so NAD27 -> UTM and NAD27 -> State Plane dispatch
correctly.
Native CUDA nearest, bilinear, and cubic (Catmull-Rom) resampling
kernels replace cupyx.scipy.ndimage.map_coordinates. When the
CUDA projection path produces on-device coordinates, the entire
pipeline now stays on GPU with no CPU roundtrip.

Full reproject pipeline (2048x2048, bilinear, 4326->UTM):
  GPU end-to-end: 78ms
  CPU Numba:      161ms
  Speedup:        2.1x
Merges 4 overlapping WGS84 tiles and compares timing and
pixel-level consistency against rioxarray.merge_arrays.

Baseline results (xrspatial is currently slower on merge):
  512x512 tiles:  59ms vs 56ms (1.1x)
  1024x1024:      293ms vs 114ms (2.6x)
  2048x2048:      2.52s vs 656ms (3.8x)

Consistency: RMSE < 0.012, NaN agreement > 99.8%.
Two optimizations that make merge 4.5-7.3x faster:

1. Same-CRS tiles skip reprojection entirely. When source and
   target CRS match, tiles are placed into the output grid by
   direct coordinate alignment (array slicing). Falls back to
   full reprojection if resolutions differ by >1%.

2. try_numba_transform now bails before allocating coordinate
   arrays when neither CRS side is a supported geographic system.
   This saved ~100ms per tile for unsupported pairs.

Merge benchmark (4 overlapping WGS84 tiles):
  512x512:   13ms (was 59ms, now 2.3x faster than rioxarray)
  1024x1024: 48ms (was 293ms, now 2.6x faster than rioxarray)
  2048x2048: 344ms (was 2520ms, now 1.6x faster than rioxarray)

README now shows full pipeline times (transform + resampling) and
merge times, both compared against rioxarray. More useful than the
previous coordinate-transform-only table since users care about
total wall time.
For dask+cupy inputs, eagerly compute the source to GPU memory and
run the in-memory CuPy reproject in a single pass. This avoids the
per-chunk overhead of 64+ dask.delayed calls, each creating a pyproj
Transformer and launching small CUDA kernels.

Before: 958ms (64 delayed chunks, 512x512 each)
After:   43ms (single CuPy pass, pixel-exact same output)
Speedup: 22x

The output is a plain CuPy array. For truly out-of-core GPU data
that doesn't fit in GPU memory, the old dask.delayed path remains
available by passing the data as dask+numpy.
Replaces the eager .compute() approach with a chunked GPU pipeline
that fetches only the needed source window per output chunk. This
handles sources larger than GPU memory while still being 8-20x
faster than the old dask.delayed path.

The key optimizations vs dask.delayed:
- CRS objects and transformer created once (not per chunk)
- CUDA projection + native CUDA resampling per chunk
- Default 2048x2048 GPU chunks (not 512x512)
- Sequential loop avoids dask scheduler overhead

Performance (4096x4096 WGS84 -> UTM, bilinear):
  CuPy single pass:     34ms
  Dask+CuPy (2048):     49ms  (was 958ms)
  Dask+CuPy (512):      71ms
  Dask+CuPy (256):     124ms

All chunk sizes are pixel-exact vs plain CuPy (max_err < 1e-11).
Vendored two NOAA shift grids into the package (306KB total):
- us_noaa_conus.tif: NADCON classic (121x273, 0.25° resolution)
- us_noaa_nadcon5_nad27_nad83_1986_conus.tif: NADCON5 (105x237)

The grid loader checks the vendored directory first, then a user
cache, then downloads from the PROJ CDN as a last resort. Numba
JIT bilinear interpolation applies the lat/lon arc-second offsets
per pixel, with an iterative inverse for target->source direction.

When a grid covers the data, it replaces the Helmert shift (which
had ~3-5m accuracy). The grid gives sub-meter accuracy matching
PROJ with NADCON grids installed. Points outside grid coverage
fall back to Helmert automatically.
Shipped 8.4MB of NOAA/NTv2 shift grids covering:

US: NAD27 CONUS (NADCON + NADCON5), Alaska, Hawaii, Puerto Rico
UK: OSGB36 -> ETRS89 (Ordnance Survey OSTN15)
Australia: AGD66 -> GDA94 (NT region)
Europe: Germany (DHDN), Austria (MGI), Spain (ED50, eastern coast),
  Netherlands (RD), Belgium (BD72), Switzerland (CH1903),
  Portugal (D73)

Added Helmert fallback parameters for all 14 datums plus Tokyo.
Grid lookup automatically selects the best grid covering a point,
falling back to Helmert for regions without grid coverage.

All grids are Public Domain or Open Government Licence.
New public API for ellipsoidal <-> orthometric height conversion:

  geoid_height(lon, lat)              # geoid undulation N (metres)
  geoid_height_raster(da)             # N for every pixel
  ellipsoidal_to_orthometric(h, ...)  # GPS height -> map height
  orthometric_to_ellipsoidal(H, ...)  # map height -> GPS height
  depth_to_ellipsoidal(depth, ...)    # chart datum depth -> ellipsoidal
  ellipsoidal_to_depth(h, ...)        # ellipsoidal -> chart datum depth

Vendored EGM96 global geoid model (2.6MB, 15-arcmin / ~28km grid,
721x1440 pixels). EGM2008 (77MB, 2.5-arcmin) available via CDN
download on first use.

Numba JIT bilinear interpolation with longitude wrapping for the
global grid. Performance: 80 Mpix/s on CPU (1M points in 12ms).
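A sketch of bilinear geoid interpolation with longitude wrapping, in plain NumPy. The 721x1440 grid shape and 0.25-degree spacing match the EGM96 figures above; the grid origin/orientation and the function name are assumptions for illustration:

```python
import numpy as np

def interp_geoid(grid, lon, lat, lat0=90.0, dlat=-0.25, lon0=0.0, dlon=0.25):
    """Bilinear sample of a global geoid grid (row 0 at lat0, moving
    south; columns start at lon0 and wrap at 360 degrees)."""
    nrow, ncol = grid.shape
    r = (lat - lat0) / dlat            # fractional row
    c = ((lon - lon0) % 360.0) / dlon  # fractional column, wrapped
    r0 = int(np.floor(r))
    c0 = int(np.floor(c))
    if r0 < 0 or r0 >= nrow - 1:
        return float('nan')  # outside latitude range: NaN, not 0.0
    fr, fc = r - r0, c - c0
    c1 = (c0 + 1) % ncol  # east neighbour wraps across the antimeridian
    top = grid[r0, c0] * (1 - fc) + grid[r0, c1] * fc
    bot = grid[r0 + 1, c0] * (1 - fc) + grid[r0 + 1, c1] * fc
    return top * (1 - fr) + bot * fr
```

The `% ncol` on the east neighbour is the longitude wrap: a point at 359.9 degrees interpolates between the last and first columns instead of reading past the array edge.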
14-parameter Helmert (7 static + 7 rates) for converting between
ITRF frames at a given observation epoch. Parameters parsed from
the PROJ data files shipped with pyproj.

Available frames: ITRF2000, ITRF2008, ITRF2014, ITRF2020 (and
all intermediate frames back to ITRF89).

Usage:
  lon2, lat2, h2 = itrf_transform(
      -74.0, 40.7, 10.0,
      src='ITRF2014', tgt='ITRF2020', epoch=2024.0,
  )

Typical shifts: 2-4mm horizontal, 1-3mm vertical between
ITRF2014 and ITRF2020 at epoch 2024. The rates capture tectonic
plate motion (~mm/year) which accumulates over years.

Numba JIT parallelized for batch transforms.
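The epoch propagation at the heart of a 14-parameter transform can be sketched as: propagate the 7 static parameters by their rates to the observation epoch, then apply the small-angle Helmert transform in geocentric coordinates. Units and the IERS position-vector rotation convention here are assumptions, not xrspatial's API:

```python
import numpy as np

def helmert14(xyz, epoch, t0, T, Tdot, s, sdot, R, Rdot):
    """14-parameter Helmert sketch.
    T/Tdot: translations in metres (per year); s/sdot: unitless scale
    (per year); R/Rdot: rotations in radians (per year); t0: reference
    epoch of the parameters."""
    dt = epoch - t0
    t = np.asarray(T, dtype=float) + np.asarray(Tdot, dtype=float) * dt
    sc = s + sdot * dt
    r = np.asarray(R, dtype=float) + np.asarray(Rdot, dtype=float) * dt
    x = np.asarray(xyz, dtype=float)
    # Small-angle rotation: R @ x reduces to the cross product r x x
    # under the IERS position-vector sign convention.
    return x + t + sc * x + np.cross(r, x)
```

The rates are what make the "epoch" argument matter: mm/year of plate motion accumulated over a decade is the 2-4 mm shift quoted above.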
Helmert upgrade (3-param -> 7-param Bursa-Wolf):
- Added rx/ry/rz rotations (arcsec) and ds scale (ppm) to the
  geocentric datum shift kernel
- Updated OSGB36, DHDN, MGI, ED50, BD72 with published 7-param
  values from the EPSG dataset
- OSGB36 Helmert fallback improved from 15.73m to 1.44m vs pyproj
- Same kernel handles 3-param datums (rx=ry=rz=ds=0)

Authalic latitude series (3-term -> 6-term):
- Extended _authalic_apa to 6 coefficients (10th-order in e²)
- Updated _authalic_inv and CUDA _d_authalic_inv to evaluate 5 terms
- Theoretical improvement: sub-mm for the series itself, though the
  q->beta->phi roundtrip error is dominated by the asin(q/qp) step
  at high latitudes (~4.8m at 89°, <0.1m at mid-latitudes)

Numba kernels for two additional projections:

- Oblique Stereographic: Gauss conformal sphere + stereographic.
  Kernel roundtrips perfectly but forward differs from PROJ's
  specific Gauss-Schreiber conformal mapping by ~50km. Needs
  alignment with PROJ's sterea.cpp double-projection approach.

- Oblique Mercator (Hotine): rotation matrix + Mercator on the
  oblique cylinder. Forward has errors in the u/v -> x/y rotation.
  Needs closer alignment with PROJ's omerc.cpp variant handling.

Both kernels are implemented and compile but disabled in the
dispatch table pending accuracy fixes. They fall through to pyproj.

Also: Equidistant Conic skipped -- has zero EPSG definitions in
the PROJ database, essentially unused in practice.
The oblique stereographic now uses the correct PROJ double-projection:
1. Gauss conformal mapping: phi -> chi via scaling factor
   C = sqrt(1 + e²cos⁴φ₀/(1-e²)) and normalization constant K
2. Standard stereographic on the conformal sphere

Forward accuracy: sub-nanometre vs pyproj.
Roundtrip: exact (1.4e-14 degrees).

Also fixed R scaling: R_metric = a * k0 * R_conformal, where
R_conformal is the dimensionless conformal radius from PROJ.

Oblique Mercator (Hotine) remains disabled -- the u/v rotation
and variant handling need more work.

New lcc_inverse_2d and tmerc_inverse_2d kernels take 1D x/y arrays
and write directly to 2D output, avoiding:
- np.tile (199ms for 4096x4096)
- np.repeat (357ms for 4096x4096)
- Separate unit division pass (115ms for ftUS)

The unit conversion (feet -> metres) is fused into the inner loop,
operating on scalars instead of 16.8M-element arrays.

California zone 5 ftUS (4096x4096, bilinear):
  Before: 915ms (0.9x vs rioxarray)
  After:  712ms (1.2x vs rioxarray)
Added _norm_lon_rad() and applied it to all inverse functions that
compute lam + lon0. Without normalization, projections with non-zero
lon0 (e.g. EPSG:3413 Arctic Stere with lon0=-45°) could return
longitudes outside [-180, 180], causing 360° errors and false NaN
pixels in the source lookup.

Polar Stere N (EPSG:3413) consistency:
  Before: RMSE=8.29, NaN agree=90.4%, 1.1M extra NaN
  After:  RMSE=0.008, NaN agree=100%, 79 extra NaN (edge pixels)
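The normalization described above is a one-liner worth showing; this is a sketch of the idea (the PR's helper is `_norm_lon_rad`, the body here is illustrative):

```python
import math

def norm_lon_rad(lam):
    """Wrap a longitude in radians into (-pi, pi].
    Without this, lam + lon0 with lon0 = -45 degrees (EPSG:3413) can
    land outside the range and cause 360-degree errors downstream."""
    lam = math.fmod(lam, 2.0 * math.pi)
    if lam > math.pi:
        lam -= 2.0 * math.pi
    elif lam <= -math.pi:
        lam += 2.0 * math.pi
    return lam
```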
Changed out-of-bounds threshold from -0.5/sh-0.5 to -1.0/sh in
all resampling kernels (nearest, bilinear, cubic, and CUDA).
Pixels within one pixel of the source edge are now clamped to the
nearest valid pixel instead of being set to nodata.

This matches GDAL/rasterio's boundary convention, fixing 2568
false-NaN pixels at the edges of LAEA Europe reprojections.

LAEA NaN agreement: 99.8% -> 100.0%
All other projections: unchanged or improved to 100.0%
Changed bilinear resampling (CPU Numba + CUDA) from clamp-and-use-all
to skip-and-renormalize, matching GDAL's GWKBilinearResample4Sample:

- Out-of-bounds neighbors: skipped, weight redistributed to valid ones
  (was: clamped to edge pixel, counted at full weight)
- NaN neighbors: skipped, interpolated from remaining valid pixels
  (was: any NaN contaminates the whole output pixel)

This eliminates the one-pixel NaN halo around nodata regions that
the old approach produced, and gives mathematically exact results
on linear surfaces at raster boundaries.

The ~0.017 RMSE vs rioxarray on gradient rasters is unchanged --
it comes from sub-pixel floating-point coordinate differences in
the interior, not edge handling.
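The skip-and-renormalize scheme above can be sketched for a single output pixel as follows (plain-NumPy illustration of the GDAL-style behaviour, not the actual CPU/CUDA kernel):

```python
import numpy as np

def bilinear_renorm(src, fy, fx):
    """Bilinear sample at fractional source coords (fy, fx).
    Out-of-bounds and NaN neighbours are skipped and their weight is
    redistributed over the remaining valid neighbours."""
    y0, x0 = int(np.floor(fy)), int(np.floor(fx))
    wy, wx = fy - y0, fx - x0
    h, w = src.shape
    acc, wsum = 0.0, 0.0
    for dy, dx, wgt in ((0, 0, (1 - wy) * (1 - wx)),
                        (0, 1, (1 - wy) * wx),
                        (1, 0, wy * (1 - wx)),
                        (1, 1, wy * wx)):
        yy, xx = y0 + dy, x0 + dx
        if 0 <= yy < h and 0 <= xx < w:  # skip out-of-bounds neighbours
            v = src[yy, xx]
            if not np.isnan(v):          # skip NaN neighbours
                acc += wgt * v
                wsum += wgt
    # Renormalize: divide by the weight of the neighbours actually used.
    return acc / wsum if wsum > 0 else float('nan')
```

With one NaN neighbour among four equal-valued pixels, the old clamp-and-use-all approach would return NaN; this version returns the common value, which is what removes the one-pixel halo.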
Thread safety:
- Added threading.Lock to global caches in _datum_grids.py,
  _vertical.py, and _itrf.py. Concurrent callers no longer race
  on grid loading or ITRF parameter parsing.

All-NaN raster crash:
- np.nanmin on an all-NaN array returns NaN; int(NaN) is undefined.
  Added np.isfinite guards in both numpy and cupy chunk workers.

uint8 cubic overflow:
- Cubic resampling can ring outside [0, 255] on sharp edges.
  Added np.clip clamping before the dtype cast for all integer
  source types (uint8, int16, etc.) across nearest/bilinear/cubic.

Geoid poles:
- _interp_geoid_point now returns NaN (not 0.0) outside the grid's
  latitude range, preventing silent wrong values at the poles.

Exception narrowing:
- Bare except Exception:pass blocks around Numba/CUDA fast paths
  narrowed to except (ImportError, ModuleNotFoundError). Actual
  bugs now propagate instead of silently falling back to pyproj.

New tests:
- test_reproject_1x1_raster: single-pixel source
- test_reproject_all_nan: 100% NaN source
- test_reproject_uint8_cubic_no_overflow: cubic on uint8 sharp edge
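The integer clamp from the uint8 overflow fix above can be sketched as (`cast_resampled` is a hypothetical name):

```python
import numpy as np

def cast_resampled(values, dtype):
    """Clamp before casting so cubic ringing outside the integer range
    (e.g. 260 on a uint8 edge) saturates instead of wrapping around."""
    if np.issubdtype(dtype, np.integer):
        info = np.iinfo(dtype)
        values = np.clip(values, info.min, info.max)
    return values.astype(dtype)
```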
Cubic NaN handling:
- When any of the 16 Catmull-Rom neighbors is NaN, falls back to
  bilinear with weight renormalization instead of returning nodata.
  Eliminates the one-pixel nodata halo around NaN regions that
  cubic resampling previously produced.

Merge strategy tests:
- Added end-to-end tests for last, max, min strategies (were only
  tested at the internal _merge_arrays_numpy level).

Datum grid validation:
- load_grid() now validates band shapes match, grid is >= 2x2,
  and bounds are sensible. Invalid grids return None (Helmert
  fallback) instead of producing garbage.

Documentation:
- reproject() and merge() docstrings now note output CRS is WKT
  format in attrs['crs'], and merge() documents CRS selection
  when target_crs=None.