In the current workflow, if we encounter a compute node where the GPU cannot be found (which is a common occurrence), it will ultimately kill the entire workflow.
- It would be beneficial if the workflows could run a health check before starting jobs. E.g., a simple script that verifies the expected GPU can be found and, if not, logs a descriptive message identifying the problem node and returns a specific error code.
- It looks like Dask has a `retire_workers` function that could be called to remove problematic nodes from consideration, in response to the health check.
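A minimal sketch of what this could look like. The `gpu_health_check` helper is hypothetical (here it probes `nvidia-smi`, assuming NVIDIA GPUs); `Client.run` and `Client.retire_workers` are real `dask.distributed` APIs:

```python
import shutil
import socket
import subprocess


def gpu_health_check():
    """Hypothetical health check: return (ok, message) for this node.

    Probes `nvidia-smi -L`; adapt to however GPUs are expected to
    appear on the compute nodes in question.
    """
    host = socket.gethostname()
    if shutil.which("nvidia-smi") is None:
        return False, f"node {host}: nvidia-smi not found on PATH"
    try:
        out = subprocess.run(
            ["nvidia-smi", "-L"], capture_output=True, text=True, timeout=30
        )
    except subprocess.TimeoutExpired:
        return False, f"node {host}: nvidia-smi timed out"
    if out.returncode != 0 or "GPU" not in out.stdout:
        return False, f"node {host}: no GPU visible ({out.stderr.strip()})"
    return True, f"node {host}: {out.stdout.strip()}"


def retire_unhealthy_workers(client):
    """Run the check on every worker, log failures, retire the bad ones.

    `client` is a dask.distributed.Client. Client.run executes the
    function on each worker and returns {worker_address: result}.
    """
    results = client.run(gpu_health_check)
    bad = []
    for addr, (ok, msg) in results.items():
        if not ok:
            print(msg)  # descriptive log line naming the problem node
            bad.append(addr)
    if bad:
        # Real Dask API: removes the listed workers from the scheduler
        client.retire_workers(workers=bad)
    return bad
```

Run at workflow startup (before submitting GPU jobs), this would prune nodes whose GPUs are missing instead of letting a single bad node kill the whole workflow.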