
WIP: ENH: OOC architecture rewrite — new bulk I/O API and infrastructure #1568

Draft
joeykleingers wants to merge 8 commits into BlueQuartzSoftware:develop from joeykleingers:worktree-ooc-architecture-rewrite

joeykleingers commented Mar 24, 2026

Summary

Rewrites the out-of-core (OOC) architecture in simplnx, replacing the old chunk-based API with a new bulk I/O design built around copyIntoBuffer/copyFromBuffer on AbstractDataStore. Introduces the core infrastructure that the OOC-optimized filter algorithms (separate PR #1575) build upon.

Core Architecture Changes

  • Removed old chunk API from AbstractDataStore / IDataStore (loadChunk, getNumberOfChunks, getChunkLowerBounds, getChunkUpperBounds, getChunkShape)
  • Added copyIntoBuffer / copyFromBuffer pure virtual bulk I/O methods to AbstractDataStore with implementations in DataStore, EmptyDataStore, and HDF5ChunkedStore (in SimplnxOoc plugin)
  • Added StoreType enum (InMemory, OutOfCore, Empty) to IDataStore; IsOutOfCore() now checks StoreType instead of getChunkShape()
  • HDF5ChunkedStore performs I/O via HDF5 hyperslab selections with Z-slice-aligned default chunk shape {1,Y,X} for 3D data
  • copyFromBuffer fast path: skips read-modify-write for tuple-aligned writes
  • copyIntoBuffer fast path: direct span-based readTuples for tuple-aligned reads
  • HDF5 DatasetIO gains readTuples/writeTuples for direct hyperslab-based bulk tuple I/O

New Core Utilities

  • DispatchAlgorithm — Runtime dispatch between in-core (Direct) and OOC (Scanline/CCL) algorithm variants based on data store type
  • SliceBufferedTransfer — Type-dispatched Z-slice buffered tuple copy utility that eliminates per-element OOC overhead during morphological transfer phases
  • UnionFind — Vector-based disjoint set data structure with union-by-rank and path-halving compression for chunk-sequential CCL algorithms
  • SegmentFeatures OOC path — Z-slice CCL-based connected component labeling with UnionFind equivalence tracking, replacing BFS/DFS flood fill for OOC data
  • AlignSections OOC path — Bulk slice read/write with AlignSectionsTransferDataOocImpl
  • DataArrayUtilities bulk I/O — ImportFromBinaryFile, AppendData, CopyData, and mirror swap_ranges updated with chunked bulk I/O (runtime OOC check preserves original in-core performance)
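
The UnionFind utility listed above combines union-by-rank with path-halving compression. The following is a minimal sketch of that technique, not the simplnx class: the name `UnionFindSketch` and its method names are hypothetical and chosen only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical illustration of a vector-based disjoint set.
class UnionFindSketch
{
public:
  explicit UnionFindSketch(std::size_t count)
  : m_Parent(count)
  , m_Rank(count, 0)
  {
    // Every label starts as the root of its own set.
    for(std::size_t i = 0; i < count; i++)
    {
      m_Parent[i] = i;
    }
  }

  // Returns the root of x. Path halving: each visited node is re-pointed
  // at its grandparent, which shortens the chain on every lookup.
  std::size_t find(std::size_t x)
  {
    assert(x < m_Parent.size());
    while(m_Parent[x] != x)
    {
      m_Parent[x] = m_Parent[m_Parent[x]];
      x = m_Parent[x];
    }
    return x;
  }

  // Union by rank: the root with the smaller rank is attached under the
  // root with the larger rank, keeping the trees shallow.
  void unite(std::size_t a, std::size_t b)
  {
    std::size_t rootA = find(a);
    std::size_t rootB = find(b);
    if(rootA == rootB)
    {
      return;
    }
    if(m_Rank[rootA] < m_Rank[rootB])
    {
      std::swap(rootA, rootB);
    }
    m_Parent[rootB] = rootA;
    if(m_Rank[rootA] == m_Rank[rootB])
    {
      m_Rank[rootA]++;
    }
  }

private:
  std::vector<std::size_t> m_Parent;
  std::vector<std::uint8_t> m_Rank;
};
```

In a slice-by-slice CCL pass, provisional labels that touch across a voxel face would be merged with `unite`, and a final relabeling pass would replace each label with `find(label)`.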

OOC Store Management

  • DataIOCollection / IDataIOManager — Updated for OOC store lifecycle management
  • ImportH5ObjectPathsAction — OOC-aware file import with recovery metadata
  • DataStoreIO — Detect OOC recovery attributes in ReadDataStore for safe data restoration
  • Legacy .dream3d support — Handle legacy file formats in OOC backfill operations

Test Infrastructure

  • CompareDataArrays rewritten to use copyIntoBuffer in 40K-element chunks instead of per-element operator[]
  • ForceOocAlgorithmGuard for dual-path test coverage
  • SIMPLNX_TEST_ALGORITHM_PATH CMake option (0=Both, 1=OOC-only, 2=InCore-only) for build-specific test path control
  • Programmatic test data builders with Z-slice batched bulk writes for OOC efficiency
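
The chunked comparison used by the rewritten CompareDataArrays can be pictured roughly as follows. This is a sketch under stated assumptions: `FirstMismatch` is a hypothetical name, plain `std::vector` inputs stand in for data stores, and `std::copy_n` stands in for `copyIntoBuffer`. Only the 40K-element chunk size is taken from the description above.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Returns the index of the first element pair that differs by more than
// epsilon, or a.size() when no such pair exists.
template <typename T>
std::size_t FirstMismatch(const std::vector<T>& a, const std::vector<T>& b, double epsilon, std::size_t chunkSize = 40000)
{
  assert(a.size() == b.size());
  assert(chunkSize > 0);
  std::vector<T> bufferA(chunkSize);
  std::vector<T> bufferB(chunkSize);
  for(std::size_t start = 0; start < a.size(); start += chunkSize)
  {
    const std::size_t count = std::min(chunkSize, a.size() - start);
    // Stand-in for copyIntoBuffer: one bulk read per chunk per array.
    std::copy_n(a.begin() + static_cast<std::ptrdiff_t>(start), count, bufferA.begin());
    std::copy_n(b.begin() + static_cast<std::ptrdiff_t>(start), count, bufferB.begin());
    for(std::size_t i = 0; i < count; i++)
    {
      const double diff = std::abs(static_cast<double>(bufferA[i]) - static_cast<double>(bufferB[i]));
      if(diff > epsilon)
      {
        return start + i;
      }
    }
  }
  return a.size();
}
```

Differences within epsilon are skipped and scanning continues, so a later out-of-tolerance element is still found.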

Related PRs

Test Plan

  • Tests pass on in-core build
  • Tests pass on out-of-core build
  • In-core performance verified: no regression on utility changes (CopyData, AppendData, mirror swaps)

joeykleingers force-pushed the worktree-ooc-architecture-rewrite branch from b4ef97f to 99b49ed on March 24, 2026 18:13
joeykleingers force-pushed the worktree-ooc-architecture-rewrite branch 2 times, most recently from b4ef97f to bb09048 on March 24, 2026 18:51
joeykleingers force-pushed the worktree-ooc-architecture-rewrite branch 4 times, most recently from 102c436 to b4c1358 on April 2, 2026 00:55
joeykleingers changed the title from "WIP: OOC architecture rewrite — new bulk I/O API, SimplnxOoc plugin, and filter optimizations" to "ENH: OOC architecture rewrite — new bulk I/O API and infrastructure" on Apr 2, 2026
joeykleingers force-pushed the worktree-ooc-architecture-rewrite branch 6 times, most recently from 2bd614a to 110c054 on April 8, 2026 17:41
joeykleingers changed the title from "ENH: OOC architecture rewrite — new bulk I/O API and infrastructure" to "WIP: ENH: OOC architecture rewrite — new bulk I/O API and infrastructure" on Apr 8, 2026

Replace the chunk-based DataStore API with a plugin-driven hook
architecture that cleanly separates OOC policy (in the SimplnxOoc
plugin) from mechanism (in the core library). The old API required
every caller to understand chunk geometry; the new design hides OOC
details behind bulk I/O primitives and plugin-registered callbacks.

--- AbstractDataStore / IDataStore API ---

Remove the entire chunk API from AbstractDataStore and IDataStore:
loadChunk, getNumberOfChunks, getChunkLowerBounds, getChunkUpperBounds,
getChunkShape, getChunkSize, getChunkTupleShape, getChunkExtents, and
convertChunkToDataStore. Replace with two bulk I/O primitives:
copyIntoBuffer(startIndex, span<T>) and copyFromBuffer(startIndex,
span<const T>), implemented in DataStore (std::copy on raw memory) and
EmptyDataStore (throws). This shifts the abstraction from "load a
chunk, then index into it" to "copy a contiguous range into a caller-
owned buffer," which works identically for in-core and OOC stores.
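
As a rough illustration of the "copy a contiguous range into a caller-owned buffer" contract, the sketch below models the two primitives on a toy in-memory store. It is not the simplnx interface: `BulkStore`, `InMemoryStore`, and `SumStore` are hypothetical names, and a pointer plus count is used in place of `span<T>` so the example compiles as C++17.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for the bulk I/O interface.
template <typename T>
class BulkStore
{
public:
  virtual ~BulkStore() = default;
  virtual std::size_t size() const = 0;
  // Copies `count` elements starting at `startIndex` into the caller-owned `dst`.
  virtual void copyIntoBuffer(std::size_t startIndex, T* dst, std::size_t count) const = 0;
  // Copies `count` elements from the caller-owned `src` into the store at `startIndex`.
  virtual void copyFromBuffer(std::size_t startIndex, const T* src, std::size_t count) = 0;
};

// In-memory variant: both primitives reduce to a copy on contiguous memory.
template <typename T>
class InMemoryStore : public BulkStore<T>
{
public:
  explicit InMemoryStore(std::size_t count)
  : m_Data(count)
  {
  }

  std::size_t size() const override
  {
    return m_Data.size();
  }

  void copyIntoBuffer(std::size_t startIndex, T* dst, std::size_t count) const override
  {
    checkRange(startIndex, count);
    std::copy_n(m_Data.begin() + static_cast<std::ptrdiff_t>(startIndex), count, dst);
  }

  void copyFromBuffer(std::size_t startIndex, const T* src, std::size_t count) override
  {
    checkRange(startIndex, count);
    std::copy_n(src, count, m_Data.begin() + static_cast<std::ptrdiff_t>(startIndex));
  }

private:
  void checkRange(std::size_t startIndex, std::size_t count) const
  {
    if(startIndex > m_Data.size() || count > m_Data.size() - startIndex)
    {
      throw std::out_of_range("requested range exceeds store size");
    }
  }

  std::vector<T> m_Data;
};

// Caller code only sees the interface, so the same loop would serve a
// disk-backed store. Data is pulled through a fixed-size buffer.
template <typename T>
T SumStore(const BulkStore<T>& store, std::size_t bufferSize)
{
  assert(bufferSize > 0);
  std::vector<T> buffer(bufferSize);
  T total{};
  for(std::size_t start = 0; start < store.size(); start += bufferSize)
  {
    const std::size_t count = std::min(bufferSize, store.size() - start);
    store.copyIntoBuffer(start, buffer.data(), count);
    for(std::size_t i = 0; i < count; i++)
    {
      total += buffer[i];
    }
  }
  return total;
}
```

The point of the design is visible in `SumStore`: the caller never asks about chunk geometry, it only chooses how large a buffer it is willing to hold.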

Simplify StoreType to three values (InMemory, OutOfCore, Empty) by
removing EmptyOutOfCore. IsOutOfCore() now checks StoreType instead
of testing getChunkShape().has_value(). Add getRecoveryMetadata()
virtual to IDataStore for crash-recovery attribute persistence.

--- Plugin Hook System (DataIOCollection / IDataIOManager) ---

Add three plugin-registered callback hooks to DataIOCollection:

  FormatResolverFnc: Decides storage format for a given array based on
    type, shape, and size. Called from DataStoreUtilities::CreateDataStore
    and CreateListStore. Replaces the removed checkStoreDataFormat() and
    TryForceLargeDataFormatFromPrefs — format decisions now live entirely
    in the plugin, with core only calling resolveFormat() when no format
    is already set.

  BackfillHandlerFnc: Post-import callback that lets the plugin finalize
    placeholder stores after all HDF5 objects are read. Called from
    ImportH5ObjectPathsAction after importing all paths. Replaces the
    removed backfillReadOnlyOocStores core implementation.

  WriteArrayOverrideFnc: Intercepts HDF5 writes during recovery file
    creation, allowing the plugin to write lightweight placeholder
    datasets instead of full array data. Activated via RAII
    WriteArrayOverrideGuard, wired into DataStructureWriter.
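
The write-override hook and its RAII activation can be sketched as below. All names here (`HookCollection`, `OverrideGuard`, `WriteArray`) and the callback signature are hypothetical simplifications; the real WriteArrayOverrideFnc signature is not shown in this description.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>

// Hypothetical collection that holds a plugin-registered write override.
class HookCollection
{
public:
  // Assumed signature: returns true when the plugin handled the write.
  using WriteOverrideFn = std::function<bool(const std::string& arrayName)>;

  void setWriteOverride(WriteOverrideFn fn)
  {
    m_Override = std::move(fn);
  }

  void setOverrideActive(bool active)
  {
    m_Active = active;
  }

  // The override is consulted only while a guard has activated it.
  bool tryOverride(const std::string& arrayName) const
  {
    return m_Active && m_Override && m_Override(arrayName);
  }

private:
  WriteOverrideFn m_Override;
  bool m_Active = false;
};

// RAII guard: activates the override for the lifetime of the scope.
class OverrideGuard
{
public:
  explicit OverrideGuard(HookCollection& hooks)
  : m_Hooks(hooks)
  {
    m_Hooks.setOverrideActive(true);
  }

  ~OverrideGuard()
  {
    m_Hooks.setOverrideActive(false);
  }

  OverrideGuard(const OverrideGuard&) = delete;
  OverrideGuard& operator=(const OverrideGuard&) = delete;

private:
  HookCollection& m_Hooks;
};

// Writer side: the registered callback gets first chance at each object.
std::string WriteArray(const HookCollection& hooks, const std::string& arrayName)
{
  if(hooks.tryOverride(arrayName))
  {
    return "placeholder";
  }
  return "full";
}
```

Keeping the activation in a guard means a normal save and a recovery save can share one writer, and the override cannot stay switched on after an early return or exception.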

Add factory registration on IDataIOManager for ListStoreRefCreateFnc,
StringStoreCreateFnc, and FinalizeStoresFnc, with delegating creation
methods on DataIOCollection. Guard against reserved format name
"Simplnx-Default-In-Memory" during IO manager registration.

--- EmptyStringStore Placeholder ---

Add EmptyStringStore, a placeholder class for OOC string array import
that stores only tuple shape metadata. All data access
methods throw std::runtime_error. isPlaceholder() returns true (vs
false for StringStore). StringArrayIO creates EmptyStringStore in OOC mode instead of
allocating numValues empty strings.

--- HDF5 I/O ---

DataStoreIO::ReadDataStore gains two interception paths before the
normal in-core load: (1) recovery file detection via OocBackingFilePath
HDF5 attributes, creating a read-only reference store pointing at the
backing file; (2) OOC format resolution via resolveFormat(), creating a
read-only reference store directly from the source .dream3d file with
no temp copy.

DataArrayIO::writeData always calls WriteDataStore
directly — OOC stores materialize their data through the plugin's
writeHdf5() method; recovery writes use WriteArrayOverrideFnc.

NeighborListIO gains OOC interception: computes total neighbor count,
calls resolveFormat(), and creates a read-only ref list store when an
OOC format is available. Legacy NeighborList reading passes a preflight
flag through the entire call chain (readLegacyNeighborList ->
createLegacyNeighborList -> ReadHdf5Data) so legacy .dream3d imports
create EmptyListStore placeholders instead of eagerly loading per-
element via setList().

DataStructureWriter checks WriteArrayOverrideFnc before normal writes,
giving the registered plugin callback first chance to handle each
data object.

Add explicit template instantiations for DatasetIO::createEmptyDataset
and DatasetIO::writeSpanHyperslab for all numeric types plus bool.
These are needed by the SimplnxOoc plugin's AbstractOocStore::writeHdf5(),
which cannot use writeSpan() because the full array is not in memory.
Instead it creates an empty dataset, then fills it region-by-region
via hyperslab writes as it streams data from the backing file.

--- Preferences ---

Add unified oocMemoryBudgetBytes preference (default 8 GB) that
the ChunkCache, visualization, and stride cache all use. Add k_InMemoryFormat
sentinel constant for explicit in-core format choice. Add migration
logic to erase legacy empty-string and "In-Memory" preference values.
checkUseOoc() now tests against k_InMemoryFormat.
setLargeDataFormat("") removes the key so plugin defaults take effect.

--- Algorithm Infrastructure ---

AlgorithmDispatch: Add ForceInCoreAlgorithm/ForceOocAlgorithm global
flags with RAII guards. Add DispatchAlgorithm template that selects
Direct (in-core) vs Scanline (OOC) algorithm variant based on store
types and force flags. Add SIMPLNX_TEST_ALGORITHM_PATH CMake option
(0=both, 1=OOC-only, 2=InCore-only) for dual-dispatch test control.
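
A minimal sketch of the dispatch idea follows. The names `Dispatch`, `FlagGuard`, `ForceOocFlag`, and `ForceInCoreFlag` are hypothetical, and the rule that the in-core flag wins when both flags are set is an assumption made for the example, not something stated in this PR.

```cpp
#include <cassert>
#include <vector>

enum class StoreType
{
  InMemory,
  OutOfCore,
  Empty
};

// Hypothetical global force flags, wrapped in functions to avoid
// static initialization order problems.
inline bool& ForceOocFlag()
{
  static bool s_Flag = false;
  return s_Flag;
}

inline bool& ForceInCoreFlag()
{
  static bool s_Flag = false;
  return s_Flag;
}

// RAII guard: sets a flag and restores the previous value on scope exit.
class FlagGuard
{
public:
  explicit FlagGuard(bool& flag)
  : m_Flag(flag)
  , m_Previous(flag)
  {
    m_Flag = true;
  }

  ~FlagGuard()
  {
    m_Flag = m_Previous;
  }

  FlagGuard(const FlagGuard&) = delete;
  FlagGuard& operator=(const FlagGuard&) = delete;

private:
  bool& m_Flag;
  bool m_Previous;
};

// Runs `scanline` when any store is out-of-core or the OOC flag is set,
// otherwise runs `direct`. Assumption: the in-core flag takes precedence.
template <typename DirectFn, typename ScanlineFn>
auto Dispatch(const std::vector<StoreType>& stores, DirectFn&& direct, ScanlineFn&& scanline)
{
  bool anyOutOfCore = false;
  for(StoreType type : stores)
  {
    anyOutOfCore = anyOutOfCore || (type == StoreType::OutOfCore);
  }
  const bool useOoc = !ForceInCoreFlag() && (ForceOocFlag() || anyOutOfCore);
  return useOoc ? scanline() : direct();
}
```

A test can wrap a filter call in `FlagGuard guard(ForceOocFlag());` to exercise the OOC variant on purely in-memory data, which is the kind of dual-path coverage the CMake option is meant to control.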

IParallelAlgorithm: Remove blanket TBB disabling for OOC data — OOC
stores are now thread-safe via ChunkCache + HDF5 global mutex.
CheckStoresInMemory/CheckArraysInMemory use StoreType instead of
getDataFormat().

VtkUtilities: Rewrite binary write path to read into 4096-element
buffers via copyIntoBuffer, byte-swap in the buffer, and fwrite —
replacing direct DataStore data() pointer access.
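
The read, swap, and write loop described for VtkUtilities might look roughly like this. It is a sketch only: `WriteByteSwapped` and `ByteSwapInPlace` are hypothetical names, a `std::vector` stands in for the data store, `std::copy_n` stands in for `copyIntoBuffer`, and a `std::ostream` replaces `fwrite` so the output is easy to inspect.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <sstream>
#include <vector>

// Reverses the byte order of each element in place.
template <typename T>
void ByteSwapInPlace(T* values, std::size_t count)
{
  for(std::size_t i = 0; i < count; i++)
  {
    auto* bytes = reinterpret_cast<unsigned char*>(values + i);
    std::reverse(bytes, bytes + sizeof(T));
  }
}

// Streams `source` to `out` in byte-swapped form, using a small reusable
// buffer instead of touching the source memory directly.
template <typename T>
void WriteByteSwapped(const std::vector<T>& source, std::ostream& out, std::size_t bufferElements = 4096)
{
  assert(bufferElements > 0);
  std::vector<T> buffer(bufferElements);
  for(std::size_t start = 0; start < source.size(); start += bufferElements)
  {
    const std::size_t count = std::min(bufferElements, source.size() - start);
    // Stand-in for copyIntoBuffer(start, buffer).
    std::copy_n(source.begin() + static_cast<std::ptrdiff_t>(start), count, buffer.begin());
    ByteSwapInPlace(buffer.data(), count);
    out.write(reinterpret_cast<const char*>(buffer.data()), static_cast<std::streamsize>(count * sizeof(T)));
  }
}
```

Because the swap happens in the scratch buffer, the source data is never modified and never needs to be resident as a whole.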

--- Filter Algorithm Updates ---

FillBadData: Rewrite phaseOneCCL and phaseThreeRelabeling to use
Z-slab buffered I/O via copyIntoBuffer/copyFromBuffer instead of
the removed chunk API (loadChunk, getChunkLowerBounds, etc.).
operator()() scans feature counts in 64K-element chunks via
copyIntoBuffer.

QuickSurfaceMesh: Remove getChunkShape() call in generateTripleLines()
that set ParallelData3DAlgorithm chunk size, as the chunk API no
longer exists on AbstractDataStore.

--- File Import ---

ImportH5ObjectPathsAction: Add deferred-load pattern. When a backfill
handler is registered, pass preflight=true to create placeholder stores
during import, then call runBackfillHandler() after all paths are
imported to let the plugin finalize.

Dream3dIO: Add WriteRecoveryFile() that wraps WriteFile with WriteArrayOverrideGuard.

--- Utility Changes ---

DataStoreUtilities: Remove TryForceLargeDataFormatFromPrefs entirely.
CreateDataStore and CreateListStore call resolveFormat() on the IO
collection. ArrayCreationUtilities: check k_InMemoryFormat sentinel
before skipping memory checks.

ITKArrayHelper/ITKTestBase: OOC checks use getStoreType() instead of
getDataFormat().empty(). IsArrayInMemory simplified from a 40-line
DataType switch to a single StoreType check.

ArraySelectionParameter: Remove EmptyOutOfCore handling; simplify to
just StoreType::Empty.

--- Tests ---

Add EmptyStringStore tests (6 cases: metadata, zero tuples, throwing
access, deep copy placeholder preservation, resize, isPlaceholder).
Add DataIOCollection hooks tests (format resolver, backfill handler).
Add IOFormat tests (7 cases: InMemory sentinel, empty format,
resolveFormat with/without plugin). Add IParallelAlgorithm OOC tests
(8 cases with MockOocDataStore: TBB enablement for in-memory, OOC,
and mixed arrays/stores).

Remove the "Target DataStructure Size" test from IOFormat.cpp — it
was a tautology that re-implemented the same arithmetic as
updateMemoryDefaults() without testing any edge case or behavior.

Fix RodriguesConvertorTest exemplar data: add missing expected values
for the 4th tuple (indices 12-15). The old CompareDataArrays broke
on the first floating-point mismatch regardless of magnitude, masking
this incomplete exemplar. The new chunked comparison correctly
continues past epsilon-close differences, exposing the missing data.

Signed-off-by: Joey Kleingers <joey.kleingers@bluequartz.net>

Add comprehensive documentation to all new methods, type aliases,
classes, and algorithms introduced in the OOC architecture rewrite.
Every new public API now has Doxygen explaining what it does, how it
works, and why it is needed. Algorithm implementations have step-by-
step inline comments explaining the logic.

Signed-off-by: Joey Kleingers <joey.kleingers@bluequartz.net>

…ation layer

Move the format resolver call site from the low-level DataStoreUtilities::
CreateDataStore/CreateListStore functions up to the array creation layer
(ArrayCreationUtilities::CreateArray and ImportH5ObjectPathsAction). This
is a prerequisite for the upcoming data store import handler refactor.

Key architectural changes:

1. FormatResolverFnc signature expanded to (DataStructure, DataPath,
   DataType, dataSizeBytes). The resolver can now walk parent objects to
   determine geometry type, enabling it to force in-core for unstructured/
   poly geometry arrays without caller-side checks.

2. Format resolution removed from DataStoreUtilities::CreateDataStore and
   CreateListStore. These are now simple factories that take an already-
   resolved format string. Callers are responsible for calling the resolver.

3. CreateArrayAction no longer carries a dataFormat member or constructor
   parameter. The k_DefaultDataFormat constant is removed. Format is
   resolved at execute time inside ArrayCreationUtilities::CreateArray.

4. ImportH5ObjectPathsAction gains a format-resolver loop that iterates
   Empty-store DataArrays after preflight import, consulting the resolver
   to decide which arrays to eager-load (in-core) vs leave for the
   backfill handler (OOC).

5. DataStoreIO::ReadDataStore and NeighborListIO::finishImportingData lose
   their inline format-resolution and OOC reference-store creation code.
   Format decisions for imported data are now made at the action level,
   not during raw HDF5 I/O.

6. Geometry actions (CreateGeometry1D/2D/3DAction, CreateVertexGeometry,
   CreateRectGridGeometry) lose their createdDataFormat parameter. They
   now materialize OOC topology arrays into in-core stores when the source
   arrays have StoreType::OutOfCore, since unstructured/poly geometry
   topology must be in-core for the visualization layer.

7. CheckMemoryRequirement simplified to a pure RAM check. OOC fallback
   logic removed since the resolver handles format decisions upstream.

All filter callers updated to drop the dataFormat argument from
CreateArrayAction constructors. Python binding updated (data_format
parameter renamed to fill_value). Test files updated for new
resolveFormat signature.

…arden .dream3d import

Rename the "backfill handler" to "data store import handler" and expand
its role to handle ALL data store loading from .dream3d files — in-core
eager loading, OOC reference stores, and recovery reattachment. This
replaces the split decision-making where ImportH5ObjectPathsAction ran
a format-resolver loop and a separate backfill handler.

Key changes:

1. DataIOCollection: Rename BackfillHandlerFnc to
   DataStoreImportHandlerFnc with expanded signature that includes
   importStructure. Rename set/has/runBackfillHandler to
   set/has/runDataStoreImportHandler. Add format display name registry
   (registerFormatDisplayName/getFormatDisplayNames) for human-readable
   format names in the UI dropdown.

2. DataStoreIO: Rename ReadDataStore to ReadDataStoreIntoMemory. Remove
   recovery reattachment code (OOC-specific HDF5 attribute checks moved
   to SimplnxOoc plugin). Add placeholder detection — compares physical
   HDF5 element count against shape attributes, returns Result<> with
   warning when mismatch detected (guards against loading placeholder
   datasets without the OOC plugin). Change return type to
   Result<shared_ptr<AbstractDataStore<T>>> so callers can accumulate
   warnings across arrays.

3. ImportH5ObjectPathsAction: Remove the format-resolver loop (79 lines).
   The action now delegates entirely to the registered handler when
   present, or falls back to FinishImportingObject for non-OOC builds.

4. CreateArrayAction: Restore dataFormat parameter for per-filter format
   override. When non-empty, bypasses the format resolver. Dropdown shows
   "Automatic" (resolver decides), "In Memory", and plugin-registered
   formats with display names. Fix 12 filter callers where fillValue was
   being passed as dataFormat after parameter reordering.

5. Dream3dIO: Route DREAM3D::ReadFile through ImportH5ObjectPathsAction
   so recovery and OOC hooks fire. Remove unused ImportDataObjectFromFile
   and ImportSelectDataObjectsFromFile.

6. Application: Add getDataStoreFormatDisplayNames() to expose display
   name registry to DataStoreFormatParameter.
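
The placeholder detection mentioned in item 2 compares the number of elements physically stored in the HDF5 dataset with the count implied by the shape attributes. A minimal sketch of that check, with the hypothetical names `LoadCheck` and `CheckForPlaceholder` and a plain struct in place of `Result<>`:

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <numeric>
#include <string>
#include <vector>

// Hypothetical, simplified result type: ok flag plus an optional warning.
struct LoadCheck
{
  bool ok = true;
  std::string warning;
};

// Flags a dataset whose physical element count does not match the count
// implied by its tuple and component shape attributes.
LoadCheck CheckForPlaceholder(std::size_t physicalElementCount, const std::vector<std::size_t>& tupleShape, const std::vector<std::size_t>& componentShape)
{
  const auto product = [](const std::vector<std::size_t>& dims) {
    return std::accumulate(dims.begin(), dims.end(), std::size_t{1}, std::multiplies<std::size_t>());
  };
  const std::size_t expected = product(tupleShape) * product(componentShape);
  if(physicalElementCount != expected)
  {
    return {false, "Dataset holds " + std::to_string(physicalElementCount) + " elements but its shape attributes describe " + std::to_string(expected) +
                       "; it may be a placeholder that needs the OOC plugin to load."};
  }
  return {true, ""};
}
```

Returning a warning rather than throwing lets the caller keep loading the remaining arrays and report all mismatches together.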

Updated callers: DataArrayIO (2 sites), NeighborListIO (2 sites),
Dream3dIO (2 legacy helpers), DataStructureWriter (comment), 12 filter
files, simplnxpy Python binding, DataIOCollectionHooksTest.

Replace the old Dream3dIO public API (ReadFile, ImportDataStructureFromFile,
FinishImportingObject) with four new purpose-specific functions:

  - LoadDataStructure(path) — full load with OOC handler support
  - LoadDataStructureArrays(path, dataPaths) — selective array load with pruning
  - LoadDataStructureMetadata(path) — metadata-only skeleton (preflight)
  - LoadDataStructureArraysMetadata(path, dataPaths) — pruned metadata skeleton

The new API eliminates the bool preflight parameter in favor of distinct
functions, decouples pipeline loading from DataStructure loading, and
centralizes the OOC handler integration in a single internal
LoadDataStructureWithHandler function.

Key changes:

DataIOCollection: Add EagerLoadFnc typedef and pass it through the
DataStoreImportHandlerFnc signature, replacing the importStructure parameter.
The handler can now eager-load individual arrays via callback without knowing
Dream3dIO internals.

ImportH5ObjectPathsAction: Rewrite to use the new API — preflight calls
LoadDataStructureMetadata, execute calls LoadDataStructure. The action no
longer manages HDF5 file handles or deferred loading directly; it merges
source objects into the pipeline DataStructure via shallow copy.

ReadDREAM3DFilter: Switch preflight from ImportDataStructureFromFile(reader,
true) to LoadDataStructureMetadata(path), removing manual HDF5 file open.

Dream3dIO internals: Move LoadDataObjectFromHDF5, EagerLoadDataFromHDF5,
PruneDataStructure, and LoadDataStructureWithHandler into an anonymous
namespace. LoadDataStructureWithHandler implements the shared logic: build
metadata skeleton, optionally delegate to the OOC import handler, fall back
to eager in-core loading.

Test callers: Switch ComputeIPFColorsTest, RotateSampleRefFrameTest,
DREAM3DFileTest, and H5Test to UnitTest::LoadDataStructure. Add
Dream3dLoadingApiTest with coverage for all four new functions.

UnitTestCommon: Simplify LoadDataStructure/LoadDataStructureMetadata helpers
to delegate directly to the new DREAM3D:: functions.

Add the namespace fs = std::filesystem alias to .cpp files that spell
out std::filesystem, consistent with the existing convention used
throughout the codebase (e.g., AtomicFile.cpp, FileUtilities.cpp,
all ITK test files, UnitTestCommon.hpp).

Files updated: Dream3dIO.cpp, ImportH5ObjectPathsAction.cpp,
DataIOCollection.cpp, H5Test.cpp, UnitTestCommon.cpp,
DREAM3DFileTest.cpp, ComputeIPFColorsTest.cpp.

Previously IDataStore provided a default getRecoveryMetadata() that returned
an empty map, which silently disabled recovery metadata for any store
subclass that forgot to override it. Make it pure virtual so every
concrete store must explicitly state what (if any) recovery metadata
it produces.

DataStore overrides it to return an empty map (in-memory stores have
no backing file or external state, so the recovery file's HDF5 dataset
contains all the data needed to reconstruct the store).

EmptyDataStore overrides it to throw std::runtime_error, matching the
fail-fast behavior of every other data-access method on this metadata-
only placeholder class. Querying recovery metadata on a placeholder is
a programming error: the real store that replaces the placeholder
during execution is the one responsible for providing recovery info.

MockOocDataStore in IParallelAlgorithmTest.cpp gains a no-op override
returning an empty map so it remains constructible.
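
The three behaviors described above can be summarized in a small sketch. The class names are hypothetical, the map type is an assumption, and `BackedStoreSketch` is an invented example of a store with external state; only the attribute name OocBackingFilePath comes from this PR's description.

```cpp
#include <cassert>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>

using RecoveryMetadata = std::map<std::string, std::string>;

// Pure virtual: a subclass that forgets to override no longer compiles.
class IStoreSketch
{
public:
  virtual ~IStoreSketch() = default;
  virtual RecoveryMetadata getRecoveryMetadata() const = 0;
};

// In-memory store: nothing external to record, so the map is empty.
class InMemoryStoreSketch : public IStoreSketch
{
public:
  RecoveryMetadata getRecoveryMetadata() const override
  {
    return {};
  }
};

// Placeholder store: querying recovery metadata is a programming error.
class PlaceholderStoreSketch : public IStoreSketch
{
public:
  RecoveryMetadata getRecoveryMetadata() const override
  {
    throw std::runtime_error("placeholder store has no recovery metadata");
  }
};

// Hypothetical file-backed store that records where its data lives.
class BackedStoreSketch : public IStoreSketch
{
public:
  explicit BackedStoreSketch(std::string backingPath)
  : m_BackingPath(std::move(backingPath))
  {
  }

  RecoveryMetadata getRecoveryMetadata() const override
  {
    return {{"OocBackingFilePath", m_BackingPath}};
  }

private:
  std::string m_BackingPath;
};
```

With the method pure virtual, an empty map becomes an explicit statement by the store author rather than a silent default.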

Signed-off-by: Joey Kleingers <joey.kleingers@bluequartz.net>

joeykleingers force-pushed the worktree-ooc-architecture-rewrite branch from 73d697c to 4a7cd61 on April 10, 2026 15:25

…y format sentinel

Addresses code review feedback on DataIOCollection ownership and factory
error messages.

Ownership clarification:
* DataStoreUtilities::GetIOCollection() and Application::getIOCollection()
  now return DataIOCollection& instead of std::shared_ptr. The collection
  is owned by the Application singleton which outlives every caller, so a
  reference expresses non-ownership more clearly than a shared_ptr and
  prevents accidental lifetime extension.
* WriteArrayOverrideGuard stores a DataIOCollection& member. Since the
  guard is already non-copyable and non-movable, a reference member is
  natural and the "may be null no-op" path was dropped (no caller used it).

In-memory format sentinel hygiene:
* CoreDataIOManager::formatName() now returns Preferences::k_InMemoryFormat
  instead of the empty string. Empty means "unset/auto" and k_InMemoryFormat
  means "explicit in-memory"; previously "" was doing double duty.
* DataIOCollection constructor registers the core manager directly into
  the manager map, bypassing the addIOManager() guard. The guard still
  rejects plugin registrations under the reserved name.
* createDataStore/createListStore fallbacks now look up the core manager
  from m_ManagerMap under k_InMemoryFormat instead of constructing a
  fresh local CoreDataIOManager.
* ArrayCreationUtilities no longer translates k_InMemoryFormat to "";
  the RAM-check path recognizes both sentinels as in-core.

Actionable factory errors:
* Added DataIOCollection::generateManagerListString() that produces a
  padded multi-line capability matrix of every registered IO manager and
  the store types it supports (DataStore, ListStore, StringStore,
  ReadOnlyRef(DataStore), ReadOnlyRef(ListStore)). Uses display names
  where registered, falling back to the raw format identifier.
* Wired the helper into the existing CreateArray nullptr-check error
  message so users can immediately see which formats are available when
  a requested format is unknown.

Tests updated to reflect the new reference API.

Signed-off-by: Joey Kleingers <joey.kleingers@bluequartz.net>