The "Heartbeat monitor timed out!" error in SFTTrainer on the Ray platform

When fine-tuning with SFTTrainer on the Ray platform, I hit the following error:
[rank5]:[E ProcessGroupNCCL.cpp:1293] [PG 0 Rank 5] Heartbeat monitor timed out! Process will be terminated after dumping debug info. workMetaList_.size()=1.

My current workaround is to reduce the size of the evaluation dataset, but that is not ideal for my fine-tuning task. Has anyone else encountered this issue, and how did you resolve it?

Can you share the full stack trace? Also, can you elaborate on why reducing the evaluation dataset size resolves the issue?
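
In the meantime, here is a rough sketch of settings that may be worth trying. It rests on an unconfirmed guess: that a long evaluation pass leaves some ranks blocked on a collective for longer than the NCCL watchdog allows. `TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC` is a PyTorch environment variable, and `ddp_timeout` and `per_device_eval_batch_size` are `TrainingArguments` fields. The helper names and the numeric values below are hypothetical, chosen only for illustration.

```python
import os


def nccl_timeout_env(heartbeat_sec: int) -> dict:
    """Build env vars that raise the NCCL heartbeat monitor timeout.

    These must be visible in every worker process before the process group
    is created. On Ray that usually means passing them through the runtime
    environment rather than only setting them on the driver.
    """
    return {"TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC": str(heartbeat_sec)}


def eval_friendly_training_kwargs(ddp_timeout_sec: int, eval_batch_size: int) -> dict:
    """Keyword arguments intended for TrainingArguments (or a subclass of it).

    ddp_timeout raises the collective timeout in seconds, and a smaller
    per-device eval batch size is a guess at reducing stalls during evaluation.
    """
    return {
        "ddp_timeout": ddp_timeout_sec,
        "per_device_eval_batch_size": eval_batch_size,
    }


if __name__ == "__main__":
    # Hypothetical values; tune them to how long one evaluation pass takes.
    env_vars = nccl_timeout_env(heartbeat_sec=1800)
    os.environ.update(env_vars)  # affects this process only

    kwargs = eval_friendly_training_kwargs(ddp_timeout_sec=7200, eval_batch_size=4)
    print(env_vars)
    print(kwargs)
```

The kwargs dict would then be unpacked into whatever arguments object you already pass to SFTTrainer. If the timeout still fires with these raised, the full stack trace should show which collective is stuck.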