
Fault tolerance #21

@chrisiacovella

In the current workflow, a single compute node where the GPU cannot be found (a common occurrence) ultimately kills the entire workflow.

  • It would be beneficial if the workflows ran a health check before starting jobs. For example, a simple script could check whether the expected GPU can be found and, if not, log a descriptive message naming the node that is having problems and return a specific error message.
  • It looks like Dask has a retire_workers function that could be called to remove a problematic node from consideration in response to the health check.
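
A rough sketch of what the two points above might look like together. This is only an illustration under assumptions, not the project's implementation: it assumes NVIDIA GPUs with nvidia-smi on the PATH of each node and a dask.distributed Client passed in by the caller. The names check_gpu, unhealthy_workers, and retire_unhealthy are hypothetical. Client.run executes a function on every worker and returns a dict keyed by worker address, and Client.retire_workers accepts a list of those addresses.

```python
import shutil
import socket
import subprocess


def check_gpu(expected_gpus=1):
    """Hypothetical health check, meant to run on a worker node.

    Assumes NVIDIA hardware: counts the "GPU N: ..." lines printed by
    `nvidia-smi -L` and compares the count against expected_gpus.
    """
    host = socket.gethostname()
    if shutil.which("nvidia-smi") is None:
        return {"host": host, "ok": False,
                "message": f"{host}: nvidia-smi not found on PATH"}
    try:
        proc = subprocess.run(["nvidia-smi", "-L"], capture_output=True,
                              text=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired) as exc:
        return {"host": host, "ok": False,
                "message": f"{host}: nvidia-smi failed to run ({exc})"}
    found = sum(1 for line in proc.stdout.splitlines()
                if line.startswith("GPU "))
    ok = proc.returncode == 0 and found >= expected_gpus
    return {"host": host, "ok": ok,
            "message": f"{host}: found {found} GPU(s), expected {expected_gpus}"}


def unhealthy_workers(results):
    """Return the worker addresses whose health check did not pass."""
    return [addr for addr, res in results.items() if not res["ok"]]


def retire_unhealthy(client, log=print):
    """Run check_gpu on every worker, log failures, retire failing workers.

    `client` is assumed to be a dask.distributed Client (or anything that
    offers compatible run() and retire_workers() methods).
    """
    results = client.run(check_gpu)  # dict: worker address -> check result
    bad = unhealthy_workers(results)
    for addr in bad:
        log(f"GPU health check failed on worker {addr}: "
            f"{results[addr]['message']}")
    if bad:
        client.retire_workers(workers=bad)
    return bad
```

Calling retire_unhealthy(client) once after the cluster comes up, and before any jobs are submitted, would leave only workers that passed the check. Whether the remaining workers are enough to proceed is a separate decision the workflow would still need to make.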
