NCCL Error 1: unhandled cuda error #9

When I run the training script (./run.sh), the process terminates with an instance of 'std::runtime_error':
what(): NCCL Error 1: unhandled cuda error


This happens every time in the evaluation step of the train.py script, after the 'convert squad examples to features' step completes successfully and right after 'Evaluating: 0%' is printed.


I have made sure torch can pick up the cuda info:
>>> print(torch.cuda.is_available())
True
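
`torch.cuda.is_available()` only confirms that a CUDA device is visible; it does not exercise NCCL. In case it helps narrow this down, here is a minimal sketch that collects a bit more detail for the report. The helper name `collect_cuda_diagnostics` is my own (hypothetical, not part of this repo), and it assumes only a standard PyTorch install:

```python
import os


def collect_cuda_diagnostics():
    """Gather version and device info that may be useful in an NCCL bug report."""
    # NCCL_DEBUG is a real NCCL environment variable; "<unset>" is just a placeholder.
    info = {"NCCL_DEBUG": os.environ.get("NCCL_DEBUG", "<unset>")}
    try:
        import torch
    except ImportError:
        info["torch"] = "not installed"
        return info

    info["torch"] = torch.__version__
    info["cuda_build"] = torch.version.cuda  # None on CPU-only builds
    info["cuda_available"] = torch.cuda.is_available()
    info["device_count"] = torch.cuda.device_count()

    if info["cuda_available"]:
        info["devices"] = [
            torch.cuda.get_device_name(i) for i in range(info["device_count"])
        ]
        try:
            # Version of the NCCL library that PyTorch was built against.
            info["nccl"] = torch.cuda.nccl.version()
        except Exception as exc:  # e.g. a build without NCCL support
            info["nccl"] = f"unavailable ({exc})"
    return info


if __name__ == "__main__":
    for key, value in collect_cuda_diagnostics().items():
        print(f"{key}: {value}")
```

Re-running with `NCCL_DEBUG=INFO ./run.sh` should also make NCCL log more about the underlying CUDA failure, which might show what 'unhandled cuda error' actually refers to.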


![image](https://user-images.githubusercontent.com/36672023/121572325-0baba900-c9f2-11eb-9331-305ef2f0ffd4.png)


