In the current workflow, if we encounter a compute node where the GPU cannot be found (which is a common occurrence), it will ultimately kill the entire workflow.
- It would be beneficial if the workflows could run a health check before starting jobs. E.g., a simple script that verifies the expected GPU can be found and, if not, logs a descriptive message identifying the problem node and returns a specific error code.
- It looks like Dask has a `retire_workers` function that could be called to remove problematic nodes from consideration, in response to the health check.
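A minimal sketch of what this could look like. The `gpu_health_check` helper is hypothetical (here it probes `nvidia-smi`, assuming NVIDIA GPUs); `Client.run` and `Client.retire_workers` are real `dask.distributed` APIs:

```python
import shutil
import socket
import subprocess


def gpu_health_check():
    """Hypothetical health check: return (ok, message) for this node.

    Probes `nvidia-smi -L`; adapt to however GPUs are expected to
    appear on the compute nodes in question.
    """
    host = socket.gethostname()
    if shutil.which("nvidia-smi") is None:
        return False, f"node {host}: nvidia-smi not found on PATH"
    try:
        out = subprocess.run(
            ["nvidia-smi", "-L"], capture_output=True, text=True, timeout=30
        )
    except subprocess.TimeoutExpired:
        return False, f"node {host}: nvidia-smi timed out"
    if out.returncode != 0 or "GPU" not in out.stdout:
        return False, f"node {host}: no GPU visible ({out.stderr.strip()})"
    return True, f"node {host}: {out.stdout.strip()}"


def retire_unhealthy_workers(client):
    """Run the check on every worker, log failures, retire the bad ones.

    `client` is a dask.distributed.Client. Client.run executes the
    function on each worker and returns {worker_address: result}.
    """
    results = client.run(gpu_health_check)
    bad = []
    for addr, (ok, msg) in results.items():
        if not ok:
            print(msg)  # descriptive log line naming the problem node
            bad.append(addr)
    if bad:
        # Real Dask API: removes the listed workers from the scheduler
        client.retire_workers(workers=bad)
    return bad
```

Run at workflow startup (before submitting GPU jobs), this would prune nodes whose GPUs are missing instead of letting a single bad node kill the whole workflow.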