Ray Training on Spark Is Unstable and Errors Out

When using Ray on Spark (ray.util.spark.cluster_init) for multi-GPU training, the job fails in several different ways:

- Sometimes training never starts: the dashboard shows all GPUs at 100%, but the job hangs there indefinitely.
- When training does start, it often fails partway through with a stream of "no handle found for ____" messages.
- When a run does make it all the way through, it errors at the very end without reporting a reason.

I'm testing this with the HuggingFace Trainer and haven't found a clear cause for any of these failures.
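For reference, this is the rough shape of what I'm running. It is a sketch rather than my exact code: the `setup_ray_cluster` parameter names (`num_worker_nodes`, `num_gpus_per_node`, `collect_log_to_path`) and the log path are assumptions based on my reading of the Ray-on-Spark docs and vary across Ray versions, so treat them accordingly. The GPU sanity check at the end is something I added to try to separate cluster-init hangs from training hangs.

```python
import ray
from ray.util.spark import setup_ray_cluster, shutdown_ray_cluster

# Start a Ray cluster on the Spark worker nodes.
# Parameter names are assumed from the docs; check your Ray version.
setup_ray_cluster(
    num_worker_nodes=2,                    # assumed: number of Ray worker nodes
    num_gpus_per_node=1,                   # assumed: GPUs reserved per worker node
    collect_log_to_path="/dbfs/ray_logs",  # assumed: persist logs to debug hangs
)
ray.init()  # connect to the cluster started above

# Sanity check: confirm each worker can actually see a GPU before
# launching the HuggingFace Trainer.
@ray.remote(num_gpus=1)
def gpu_visible():
    import torch
    return torch.cuda.is_available()

print(ray.get([gpu_visible.remote() for _ in range(2)]))

shutdown_ray_cluster()
```

If all workers report True here but training still hangs with GPUs pinned at 100%, that would suggest the problem is in the training launch rather than in cluster initialization.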