When fine-tuning with SFTTrainer on the Ray platform, I encounter the following error:
[rank5]:[E ProcessGroupNCCL.cpp:1293] [PG 0 Rank 5] Heartbeat monitor timed out! Process will be terminated after dumping debug info. workMetaList_.size()=1
My current workaround is to reduce the size of the evaluation dataset, but that is not ideal for my fine-tuning task. Has anyone else encountered this issue, and how did you resolve it?
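
For context, the workaround amounts to something like the sketch below. It is only an illustration: the helper name `subsample_indices` and the sizes are hypothetical, and it assumes the evaluation split is a Hugging Face `datasets.Dataset`, so the chosen indices can be passed to `Dataset.select`.

```python
import random


def subsample_indices(total: int, keep: int, seed: int = 0) -> list[int]:
    """Pick `keep` distinct row indices out of `total`, reproducibly."""
    if keep >= total:
        # Nothing to trim: keep every row.
        return list(range(total))
    rng = random.Random(seed)  # fixed seed so every rank picks the same rows
    return sorted(rng.sample(range(total), keep))


# Hypothetical sizes: shrink a 10,000-row evaluation split to 500 rows.
indices = subsample_indices(total=10_000, keep=500)

# Assumption: `eval_dataset` is a datasets.Dataset, so the smaller split is
# small_eval = eval_dataset.select(indices)
```

A fixed seed matters here because each rank builds its own copy of the evaluation set, and all ranks need to agree on which rows are kept.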