Description
When running the distributed LAV (Least Absolute Value) solver test with a custom LPAffineCtrl configuration where print, progress, and time flags are enabled, the test stalls and exceeds the 5-minute watchdog timeout. This occurs specifically when executed across multiple MPI ranks (e.g., using mpiexec with 4 processes).
Environment
- Julia Versions: 1.10 and release candidate builds
- MPI Setup: MPICH_jll or OpenMPI via mpiexec -n 4
- Trigger: Fresh dependency resolution (common in CI/CD environments)
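The stall can be reproduced with an invocation along these lines (the test-file path is illustrative; the actual entry point in the repository may differ):

```shell
# Illustrative reproduction command, assuming the LAV test lives at test/lav.jl.
# With 4 ranks and the custom control object, the run exceeds the 5-minute watchdog.
mpiexec -n 4 julia --project -e 'include("test/lav.jl")'
```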
Problem Details
Custom Control Configuration Causes Stall
A custom LPAffineCtrl is constructed using Float64 with an embedded MehrotraCtrl. The following configuration triggers the hang:
- print, progress, and time flags enabled (both in MehrotraCtrl and at the top level).
- outerEquil set to true.
- Passing this object into El.lav during a distributed run.
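A minimal sketch of the triggering configuration is shown below. The field and constructor names follow the issue description; the exact Elemental.jl API (constructor signatures, field names) is an assumption and may differ:

```julia
using Elemental
const El = Elemental

# Hypothetical sketch of the control object described above; exact field
# names in Elemental.jl may differ.
mehrotra = El.MehrotraCtrl(Float64)
mehrotra.print = true        # per-iteration logging, emitted by every rank
mehrotra.time = true         # timing instrumentation
mehrotra.outerEquil = true   # outer equilibration enabled

ctrl = El.LPAffineCtrl(Float64)
ctrl.mehrotraCtrl = mehrotra

# Passing the custom control object is what triggers the stall under
# mpiexec -n 4 (A and b are the distributed problem data from the test):
# x = El.lav(A, b, ctrl)
```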
Workaround: Default Behavior Works
Calling El.lav(A, b) without a custom control object completes within the timeout; the hang does not occur.
Root Cause Hypothesis
I/O Contention & Synchronization: Multi-rank logging and progress instrumentation likely introduce bottlenecks. When multiple MPI processes attempt to emit output simultaneously, it may lead to deadlocks or extreme performance degradation.
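This hypothesis follows a familiar MPI failure mode: unsynchronized writes from all ranks to a shared stdout. A common mitigation, shown here with MPI.jl as a general pattern (not Elemental's actual internals), is to gate instrumentation output on a single rank:

```julia
using MPI

MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)

# Gate diagnostic output on rank 0 so per-iteration logging cannot
# interleave or serialize across processes.
if rank == 0
    println("iteration diagnostics would be printed here")
end

MPI.Barrier(comm)  # keep ranks in lockstep around the logging point
MPI.Finalize()
```

If El.lav's print/progress paths lack this kind of gating, contention on collective output would grow with rank count, consistent with the observations below.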
Supporting Observations:
- Default Success: The test completes immediately without logging/progress flags.
- MPI Specificity: The stall is exclusive to multi-rank settings with enabled instrumentation.
- Environment Factor: The issue is more prevalent after fresh dependency resolution; cached local builds are less susceptible.
Impact
To prevent CI timeouts, the test suite currently uses a simplified solver invocation. This reduces test coverage for:
- Custom solver control options (LPAffineCtrl).
- Progress reporting functionality.
- Timing and performance instrumentation in distributed environments.
Suggested Investigation
- Profiling: Profile solver behavior in multi-rank mode with logging enabled vs. disabled.
- I/O Inspection: Inspect MPI synchronization and I/O patterns inside El.lav when flags are active.
- Algorithm Isolation: Determine if this is specific to the Mehrotra interior point algorithm or affects other configurations.
- Scaling Tests: Test across different MPI process counts (2, 8, 16) to observe scaling behavior.
References
- Fix History: Previous stabilization involved reducing the test matrix size (50 to 12) and reverting to the default solver path.
- Test File: lav.jl (lines 5–6 problem size, line 97 solver invocation).