When I run the training script (./run.sh), I run into an instance of 'std::runtime_error':
what(): NCCL Error 1: unhandled cuda error
This happens every time in the evaluation step of train.py: after the 'convert squad examples to features' step completes successfully, and right after 'Evaluating: 0%' is printed.
I have confirmed that torch can see CUDA:
print(torch.cuda.is_available())
True
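
Since torch.cuda.is_available() only shows that a CUDA device is visible, a slightly fuller check might help narrow this down. The sketch below is only an illustration: the helper name collect_cuda_info is hypothetical and not part of train.py or run.sh. It reports the CUDA version torch was built against, the number of visible devices, and whether the NCCL backend is available. Setting the NCCL_DEBUG environment variable to INFO before launching should also make NCCL log more detail about the failure.

```python
import os


def collect_cuda_info():
    """Hypothetical helper: gather the CUDA/NCCL facts that torch reports."""
    info = {"torch_installed": False, "cuda_available": False}
    try:
        import torch
    except ImportError:
        # torch is not installed in this environment; nothing more to report.
        return info

    info["torch_installed"] = True
    info["torch_version"] = torch.__version__
    # CUDA version torch was built against (None for CPU-only builds).
    info["cuda_build"] = torch.version.cuda
    info["cuda_available"] = torch.cuda.is_available()
    info["device_count"] = torch.cuda.device_count()
    if info["cuda_available"]:
        info["devices"] = [
            torch.cuda.get_device_name(i) for i in range(info["device_count"])
        ]
    # Check the distributed package first so the NCCL query is only made
    # on builds that include it.
    info["nccl_available"] = bool(
        torch.distributed.is_available() and torch.distributed.is_nccl_available()
    )
    return info


if __name__ == "__main__":
    # Ask NCCL for verbose logs; this must be set before NCCL initializes.
    os.environ.setdefault("NCCL_DEBUG", "INFO")
    for key, value in collect_cuda_info().items():
        print(f"{key}: {value}")
```

If device_count is lower than the number of GPUs run.sh expects, or cuda_build does not match the installed driver, that would be a plausible place to look first, though this is a guess rather than a confirmed cause.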
