Skip to content

fix: Infinite hang when MPI process crashes#4001

Open
arng40 wants to merge 22 commits intodevelopfrom
bugfix/dudes/endless-wait-crash-mpi
Open

fix: Infinite hang when MPI process crashes#4001
arng40 wants to merge 22 commits intodevelopfrom
bugfix/dudes/endless-wait-crash-mpi

Conversation

@arng40
Copy link
Copy Markdown
Contributor

@arng40 arng40 commented Mar 18, 2026

The bug occurs when you launch a simulation with multiple ranks and one of the ranks calls a GEOS_THROW. This results in an infinite hang

@arng40 arng40 self-assigned this Mar 18, 2026
@arng40 arng40 added type: bug Something isn't working flag: no rebaseline Does not require rebaseline labels Mar 18, 2026
@arng40 arng40 changed the title bugfix: Infinite hang when MPI process crashes fix: Infinite hang when MPI process crashes Mar 18, 2026
@jhuang2601 jhuang2601 added ci: run CUDA builds Allows to triggers (costly) CUDA jobs ci: run integrated tests Allows to run the integrated tests in GEOS CI ci: run code coverage enables running of the code coverage CI jobs labels Mar 18, 2026
@rrsettgast
Copy link
Copy Markdown
Member

rrsettgast commented Mar 18, 2026

The second call to basicCleanup() was done intentionally in:
#3332

to avoid cleanup issues. Are you sure this doesn't reintroduce the issues that it fixed?

@arng40 arng40 marked this pull request as draft March 19, 2026 08:33
Copy link
Copy Markdown
Contributor

@MelReyCG MelReyCG left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, so let's change the strategy:

  • separate anything that is not about cleaning up from basicCleanup() to a new function (finalizeRun()? something else?),
  • ensure to call basicCleanup() in any scenario, to ensure it is called, but any logging of a potencial error must be done before.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR changes GEOSX shutdown/cleanup behavior to avoid an infinite MPI hang when a rank crashes after throwing a GEOS exception, by threading an inError flag through cleanup/finalization and adjusting MPI finalization behavior.

Changes:

  • Add an inError parameter to basicCleanup, cleanupEnvironment, finalizeMPI, and MpiWrapper::finalize to distinguish normal vs. error shutdown.
  • Avoid normal MPI finalization/reporting on error paths (terminate via the error handler instead) to prevent collective/finalize hangs.
  • Update main entrypoints and a broad set of unit/integration tests to use the new cleanup signatures.

Reviewed changes

Copilot reviewed 88 out of 88 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
src/pygeosx/pygeosx.cpp Update Python finalize path to call basicCleanup(false).
src/main/main.cpp Add memory stats report on normal exit and pass inError to cleanup; adjust exception paths.
src/coreComponents/physicsSolvers/fluidFlow/unitTests/testFlowStatistics.cpp Update test cleanup call to basicCleanup(false).
src/coreComponents/physicsSolvers/fluidFlow/kernels/singlePhase/unitTests/testSinglePhaseMobilityKernel.cpp Update test cleanup call to basicCleanup(false).
src/coreComponents/mesh/unitTests/testGraphColoringMPI.cpp Update MPI finalize call signature to finalize(false).
src/coreComponents/mainInterface/initialization.hpp Change basicCleanup API to basicCleanup(bool inError) and document param.
src/coreComponents/mainInterface/initialization.cpp Wire basicCleanup(inError) to cleanupEnvironment(inError).
src/coreComponents/linearAlgebra/utilities/unitTests/testLAIHelperFunctions.cpp Update test cleanup call to basicCleanup(false).
src/coreComponents/linearAlgebra/unitTests/testLinearAlgebraUtils.hpp Update cleanupEnvironment(false) usage in test scope destructor.
src/coreComponents/integrationTests/xmlTests/testXMLFile.cpp Update integration test cleanup call to basicCleanup(false).
src/coreComponents/integrationTests/xmlTests/testXML.cpp Update integration test cleanup call to basicCleanup(false).
src/coreComponents/integrationTests/wellsTests/testThermalReservoirCompositionalMultiphaseSSWells.cpp Update integration test cleanup call to basicCleanup(false).
src/coreComponents/integrationTests/wellsTests/testThermalReservoirCompositionalMultiphaseMSWells.cpp Update integration test cleanup call to basicCleanup(false).
src/coreComponents/integrationTests/wellsTests/testReservoirThermalSinglePhaseMSWells_RateInj.cpp Update integration test cleanup call to basicCleanup(false).
src/coreComponents/integrationTests/wellsTests/testReservoirThermalSinglePhaseMSWells.cpp Update integration test cleanup call to basicCleanup(false).
src/coreComponents/integrationTests/wellsTests/testReservoirSinglePhaseMSWells.cpp Update integration test cleanup call to basicCleanup(false).
src/coreComponents/integrationTests/wellsTests/testReservoirCompositionalMultiphaseMSWells.cpp Update integration test cleanup call to basicCleanup(false).
src/coreComponents/integrationTests/wellsTests/testOpenClosePerf.cpp Update integration test cleanup call to basicCleanup(false).
src/coreComponents/integrationTests/wellsTests/testIsothermalReservoirCompositionalMultiphaseSSWells.cpp Update integration test cleanup call to basicCleanup(false).
src/coreComponents/integrationTests/wellsTests/testIsothermalReservoirCompositionalMultiphaseMSWells.cpp Update integration test cleanup call to basicCleanup(false).
src/coreComponents/integrationTests/wavePropagationTests/testWavePropagationTaper.cpp Update integration test cleanup call to basicCleanup(false).
src/coreComponents/integrationTests/wavePropagationTests/testWavePropagationQ2.cpp Update integration test cleanup call to basicCleanup(false).
src/coreComponents/integrationTests/wavePropagationTests/testWavePropagationElasticVTI.cpp Update integration test cleanup call to basicCleanup(false).
src/coreComponents/integrationTests/wavePropagationTests/testWavePropagationElasticTTI.cpp Update integration test cleanup call to basicCleanup(false).
src/coreComponents/integrationTests/wavePropagationTests/testWavePropagationElasticFirstOrder.cpp Update integration test cleanup call to basicCleanup(false).
src/coreComponents/integrationTests/wavePropagationTests/testWavePropagationDG.cpp Update integration test cleanup call to basicCleanup(false).
src/coreComponents/integrationTests/wavePropagationTests/testWavePropagationDAS.cpp Update integration test cleanup call to basicCleanup(false).
src/coreComponents/integrationTests/wavePropagationTests/testWavePropagationAttenuationElastic.cpp Update integration test cleanup call to basicCleanup(false).
src/coreComponents/integrationTests/wavePropagationTests/testWavePropagationAttenuationAcousticVTI.cpp Update integration test cleanup call to basicCleanup(false).
src/coreComponents/integrationTests/wavePropagationTests/testWavePropagationAttenuationAcoustic.cpp Update integration test cleanup call to basicCleanup(false).
src/coreComponents/integrationTests/wavePropagationTests/testWavePropagationAdjoint1.cpp Update integration test cleanup call to basicCleanup(false).
src/coreComponents/integrationTests/wavePropagationTests/testWavePropagationAcousticFirstOrder.cpp Update integration test cleanup call to basicCleanup(false).
src/coreComponents/integrationTests/wavePropagationTests/testWavePropagation.cpp Update integration test cleanup call to basicCleanup(false).
src/coreComponents/integrationTests/tableFunctionsFileTests/testTableFunctionsOutput.cpp Update integration test cleanup call to basicCleanup(false).
src/coreComponents/integrationTests/solverStatisticsTests/testSolverStats.cpp Update integration test cleanup call to basicCleanup(false).
src/coreComponents/integrationTests/meshTests/testVTKImport.cpp Update integration test cleanup call to basicCleanup(false).
src/coreComponents/integrationTests/meshTests/testNeighborCommunicator.cpp Update integration test cleanup call to basicCleanup(false).
src/coreComponents/integrationTests/meshTests/testMeshGeneration.cpp Update integration test cleanup call to basicCleanup(false).
src/coreComponents/integrationTests/meshTests/testElementRegions.cpp Update integration test cleanup call to basicCleanup(false).
src/coreComponents/integrationTests/fluidFlowTests/testTransmissibility.cpp Update integration test cleanup call to basicCleanup(false).
src/coreComponents/integrationTests/fluidFlowTests/testThermalSinglePhaseFlow.cpp Update integration test cleanup call to basicCleanup(false).
src/coreComponents/integrationTests/fluidFlowTests/testThermalCompMultiphaseFlow.cpp Update integration test cleanup call to basicCleanup(false).
src/coreComponents/integrationTests/fluidFlowTests/testSinglePhaseReactiveTransport.cpp Update integration test cleanup call to basicCleanup(false).
src/coreComponents/integrationTests/fluidFlowTests/testSinglePhaseMFDPolyhedral.cpp Update integration test cleanup call to basicCleanup(false).
src/coreComponents/integrationTests/fluidFlowTests/testReactiveCompositionalMultiphaseOBL.cpp Update integration test cleanup call to basicCleanup(false).
src/coreComponents/integrationTests/fluidFlowTests/testImmiscibleMultiphaseFlow.cpp Update integration test cleanup call to basicCleanup(false).
src/coreComponents/integrationTests/fluidFlowTests/testFixedDimHydrostaticEquilibrium.cpp Update integration test cleanup call to basicCleanup(false).
src/coreComponents/integrationTests/fluidFlowTests/testCompositionalMultiPhaseMFDPolyhedral.cpp Update integration test cleanup call to basicCleanup(false).
src/coreComponents/integrationTests/fluidFlowTests/testCompMultiphaseFlowHybrid.cpp Update integration test cleanup call to basicCleanup(false).
src/coreComponents/integrationTests/fluidFlowTests/testCompMultiphaseFlow.cpp Update integration test cleanup call to basicCleanup(false).
src/coreComponents/integrationTests/fieldSpecificationTests/testRecursiveFieldApplication.cpp Update integration test cleanup call to basicCleanup(false).
src/coreComponents/integrationTests/fieldSpecificationTests/testFieldSpecification.cpp Update integration test cleanup call to basicCleanup(false).
src/coreComponents/integrationTests/fieldSpecificationTests/testAquiferBoundaryCondition.cpp Update integration test cleanup call to basicCleanup(false).
src/coreComponents/integrationTests/dataRepositoryTests/testRestartExtended.cpp Update integration test cleanup call to basicCleanup(false).
src/coreComponents/integrationTests/dataRepositoryTests/testRestartBasic.cpp Update integration test cleanup call to basicCleanup(false).
src/coreComponents/integrationTests/dataRepositoryTests/testGroupPath.cpp Update integration test cleanup call to basicCleanup(false).
src/coreComponents/integrationTests/constitutiveTests/testTriaxial.cpp Update integration test cleanup call to basicCleanup(false).
src/coreComponents/integrationTests/constitutiveTests/testRelPermHysteresis.cpp Update integration test cleanup call to basicCleanup(false).
src/coreComponents/integrationTests/constitutiveTests/testReactiveFluid.cpp Update integration test cleanup call to basicCleanup(false).
src/coreComponents/integrationTests/constitutiveTests/testPVT.cpp Update integration test cleanup call to basicCleanup(false).
src/coreComponents/functions/unitTests/testFunctions.cpp Update commented-out cleanup call to new signature.
src/coreComponents/finiteVolume/mimeticInnerProducts/unitTests/testMimeticInnerProducts.cpp Update test cleanup call to basicCleanup(false).
src/coreComponents/finiteElement/elementFormulations/unitTests/testConformingVirtualElementOrder1.cpp Update test cleanup call to basicCleanup(false).
src/coreComponents/fileIO/timeHistory/unitTests/testHDFParallelFile.cpp Update test cleanup call to basicCleanup(false).
src/coreComponents/fileIO/timeHistory/unitTests/testHDFFile.cpp Update test cleanup call to basicCleanup(false).
src/coreComponents/fileIO/Outputs/unitTests/testMemoryStats.cpp Update test cleanup call to basicCleanup(false).
src/coreComponents/denseLinearAlgebra/unitTests/testBlasLapack.cpp Update test environment cleanup call to cleanupEnvironment(false).
src/coreComponents/dataRepository/unitTests/testWrapperHelpers.cpp Update test cleanup call to basicCleanup(false).
src/coreComponents/dataRepository/unitTests/testObjectCatalog.cpp Update test cleanup call to basicCleanup(false).
src/coreComponents/dataRepository/unitTests/testErrorHandling.cpp Update test environment cleanup call to cleanupEnvironment(false).
src/coreComponents/constitutive/unitTests/testTwoPhaseImmiscibleFluid.cpp Update test environment cleanup call to cleanupEnvironment(false).
src/coreComponents/constitutive/unitTests/testMultiFluidTwoPhaseCompositionalMultiphase.cpp Update test environment cleanup call to cleanupEnvironment(false).
src/coreComponents/constitutive/unitTests/testMultiFluidThreePhaseCompositionalMultiphase.cpp Update test environment cleanup call to cleanupEnvironment(false).
src/coreComponents/constitutive/unitTests/testMultiFluidDeadOil.cpp Update test environment cleanup call to cleanupEnvironment(false).
src/coreComponents/constitutive/unitTests/testMultiFluidCO2Brine.cpp Update test environment cleanup call to cleanupEnvironment(false).
src/coreComponents/constitutive/unitTests/testMultiFluidBlackOil.cpp Update test environment cleanup call to cleanupEnvironment(false).
src/coreComponents/constitutive/unitTests/testInverseCapillaryPressure.cpp Update test environment cleanup call to cleanupEnvironment(false).
src/coreComponents/constitutive/unitTests/testCO2SpycherPruessModels.cpp Update test environment cleanup call to cleanupEnvironment(false).
src/coreComponents/constitutive/unitTests/testCO2BrinePVTModels.cpp Update test environment cleanup call to cleanupEnvironment(false).
src/coreComponents/constitutive/relativePermeability/unitTests/testRelPerm.cpp Update test cleanup call to basicCleanup(false).
src/coreComponents/constitutive/capillaryPressure/unitTests/testCapillaryPressure.cpp Update test cleanup call to basicCleanup(false).
src/coreComponents/common/unitTests/testMpiWrapper.cpp Update test environment cleanup call to cleanupEnvironment(false).
src/coreComponents/common/initializeEnvironment.hpp Update finalizeMPI/cleanupEnvironment APIs to accept inError.
src/coreComponents/common/initializeEnvironment.cpp Thread inError through environment cleanup and MPI finalize; remove memory report from cleanup.
src/coreComponents/common/format/table/unitTests/testMpiTable.cpp Update test environment cleanup call to cleanupEnvironment(false).
src/coreComponents/common/MpiWrapper.hpp Update MPI finalize API to finalize(bool inError) and document param.
src/coreComponents/common/MpiWrapper.cpp Implement finalize(bool) to skip MPI_Finalize() on error and terminate via error handler.
src/coreComponents/codingUtilities/tests/testParallelTestUtilities.cpp Update test cleanup call to basicCleanup(false).

@arng40 arng40 marked this pull request as ready for review March 27, 2026 09:13
@MelReyCG MelReyCG self-requested a review March 31, 2026 14:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ci: run code coverage enables running of the code coverage CI jobs ci: run CUDA builds Allows to triggers (costly) CUDA jobs ci: run integrated tests Allows to run the integrated tests in GEOS CI flag: no rebaseline Does not require rebaseline type: bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants