Ray Training on Spark Is Unstable and Errors Out

When using Ray on Spark (ray.util.spark.cluster_init) for multi-GPU training, the job fails in several different ways:

- Sometimes training never starts: the dashboard shows all GPUs at 100%, but the job hangs there indefinitely.
- When training does start, it often fails partway through with a stream of "no handle found for ____" messages.
- When a run does make it all the way through, it errors at the very end without reporting a reason.

I'm testing this with the HuggingFace Trainer and haven't found a clear cause for any of these failures.
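For reference, this is the rough shape of what I'm running. It is a sketch rather than my exact code: the `setup_ray_cluster` parameter names (`num_worker_nodes`, `num_gpus_per_node`, `collect_log_to_path`) and the log path are assumptions based on my reading of the Ray-on-Spark docs and vary across Ray versions, so treat them accordingly. The GPU sanity check at the end is something I added to try to separate cluster-init hangs from training hangs.

```python
import ray
from ray.util.spark import setup_ray_cluster, shutdown_ray_cluster

# Start a Ray cluster on the Spark worker nodes.
# Parameter names are assumed from the docs; check your Ray version.
setup_ray_cluster(
    num_worker_nodes=2,                    # assumed: number of Ray worker nodes
    num_gpus_per_node=1,                   # assumed: GPUs reserved per worker node
    collect_log_to_path="/dbfs/ray_logs",  # assumed: persist logs to debug hangs
)
ray.init()  # connect to the cluster started above

# Sanity check: confirm each worker can actually see a GPU before
# launching the HuggingFace Trainer.
@ray.remote(num_gpus=1)
def gpu_visible():
    import torch
    return torch.cuda.is_available()

print(ray.get([gpu_visible.remote() for _ in range(2)]))

shutdown_ray_cluster()
```

If all workers report True here but training still hangs with GPUs pinned at 100%, that would suggest the problem is in the training launch rather than in cluster initialization.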