
Add opt-in solver-step–based checkpointing via NetCDF output#1913

Closed
CodersAcademy006 wants to merge 1 commit into google-deepmind:main from CodersAcademy006:feature/checkpointing-mvp

@CodersAcademy006 CodersAcademy006 commented Jan 18, 2026

This PR introduces minimal, opt-in checkpointing support for TORAX simulations,
triggered by solver step count.

Checkpoints reuse the existing xarray DataTree output representation and are
written as a dedicated NetCDF file that is overwritten in place and remains
restart-compatible. The feature is disabled by default and does not modify
solver behavior or final output semantics.

This PR intentionally limits scope to solver-step triggering only. Additional
trigger modes (e.g. wall-clock time, simulation time) will be added in
follow-up PRs if desired.
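To make the step-based trigger concrete, here is a minimal sketch of the decision logic and the overwrite step. It is not the code in this PR: `CheckpointConfig`, `should_checkpoint`, and `overwrite_atomically` are hypothetical names, and writing to a temporary file followed by `os.replace` is an assumption about how an "overwritten in place" file could stay readable if a run is interrupted mid-write.

```python
import os
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Callable


@dataclass(frozen=True)
class CheckpointConfig:
    """Hypothetical config; field names are illustrative, not TORAX's schema."""

    enabled: bool = False        # opt-in: disabled by default
    every_n_steps: int = 0       # write after every N completed solver steps
    path: str = "checkpoint.nc"  # single file, overwritten on each checkpoint


def should_checkpoint(config: CheckpointConfig, step: int) -> bool:
    """True only when enabled and `step` is a positive multiple of N."""
    if not config.enabled or config.every_n_steps <= 0:
        return False
    return step > 0 and step % config.every_n_steps == 0


def overwrite_atomically(path: str, write_fn: Callable[[str], None]) -> None:
    """Write to a temp file next to `path`, then swap it into place.

    `write_fn` receives the temp file name and is responsible for writing
    the checkpoint contents there.
    """
    target = Path(path)
    fd, tmp_name = tempfile.mkstemp(dir=target.parent, suffix=".tmp")
    os.close(fd)  # only the name is needed; write_fn opens the file itself
    try:
        write_fn(tmp_name)
        os.replace(tmp_name, target)  # replaces any previous checkpoint
    except BaseException:
        # Leave the previous checkpoint untouched and drop the partial file.
        if os.path.exists(tmp_name):
            os.remove(tmp_name)
        raise
```

In a simulation loop this might be called as `if should_checkpoint(cfg, step): overwrite_atomically(cfg.path, data_tree.to_netcdf)`, assuming `data_tree` is the xarray DataTree that the final output is built from.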

Fix #1894

@CodersAcademy006
Author

@jcitrin Please review this and share any feedback. Thank you!

@Saad-Mallebhari

There is currently one failing CI shard while the rest pass, which suggests this may be related to test behavior rather than the core logic itself. A few things that might be worth checking or adjusting:

  • Checkpoint I/O during tests:
    Since checkpointing writes a NetCDF file, it may help to ensure this is fully isolated from test environments (e.g., using a temp path or guarding against unintended writes during unit tests).

  • Execution frequency:
    You may want to double-check that checkpointing only triggers when explicitly enabled and that every_n_steps is respected strictly (no accidental execution on step 0 or finalization).

  • Determinism / test flakiness:
    If the failing test involves timing or filesystem access, it might help to gate checkpointing behind an additional check or mock it in tests.

  • Safety guard:
    Consider adding a small guard to ensure checkpointing only runs when torax_config.checkpointing.enabled is explicitly true and path is valid, even in edge cases.
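The points above could be covered by tests along these lines. This is a sketch using only the standard library; `run_with_checkpoints` is a hypothetical stand-in for the simulation loop, not a TORAX function, and it writes a text file where the real feature would write NetCDF.

```python
import tempfile
import unittest
from pathlib import Path


def run_with_checkpoints(num_steps, every_n_steps, enabled, path):
    """Hypothetical stand-in for a run; returns the steps that wrote a file."""
    written = []
    # Steps are counted from 1 (completed solver steps), so step 0 never
    # triggers, and there is no extra write after the loop finishes.
    for step in range(1, num_steps + 1):
        if enabled and every_n_steps > 0 and step % every_n_steps == 0:
            Path(path).write_text(f"step={step}\n")  # overwrite in place
            written.append(step)
    return written


class CheckpointIsolationTest(unittest.TestCase):

    def test_disabled_writes_nothing(self):
        # All filesystem access stays inside a throwaway directory.
        with tempfile.TemporaryDirectory() as tmp:
            path = Path(tmp) / "checkpoint.nc"
            written = run_with_checkpoints(10, 2, enabled=False, path=path)
            self.assertEqual(written, [])
            self.assertFalse(path.exists())

    def test_every_n_steps_is_respected(self):
        with tempfile.TemporaryDirectory() as tmp:
            path = Path(tmp) / "checkpoint.nc"
            written = run_with_checkpoints(10, 4, enabled=True, path=path)
            self.assertEqual(written, [4, 8])
            # Only the latest checkpoint survives the in-place overwrite.
            self.assertEqual(path.read_text(), "step=8\n")
```

Saved as a module, these can be run with `python -m unittest`. Keeping every write under `tempfile.TemporaryDirectory` means a failing shard cannot be caused by leftover files from another test.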

@jcitrin
Collaborator

jcitrin commented Mar 24, 2026

Apologies, but I'm closing this for now due to lack of activity, low prioritization, and lack of review capacity. We can consider reopening it later.

@jcitrin jcitrin closed this Mar 24, 2026

Development

Successfully merging this pull request may close these issues.

Design discussion: Simulation checkpoint & restart support for TORAX

3 participants