When using Ray on Spark (ray.util.spark.cluster_init) for multi-GPU training, it fails for me most of the time. Sometimes training never starts: the dashboard shows every GPU at 100%, but nothing progresses and the job hangs indefinitely. When training does start, it often fails partway through with a flood of “no handle found for ____” messages. And when it does run all the way to the end, it errors at the very last step without giving any reason. I’m testing this with the HuggingFace Trainer and can’t see a clear cause for any of these failures.
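For reference, the general shape of my setup looks like the sketch below. This is not my exact failing script: the worker counts are placeholders, the per-worker function is reduced to a GPU check, and the `setup_ray_cluster` argument names vary between Ray versions (newer releases use `max_worker_nodes`/`min_worker_nodes` instead of `num_worker_nodes`).

```python
# Minimal sketch of how the Ray-on-Spark cluster is brought up and training
# launched. Not the exact failing code; sizes and argument names are placeholders.
import ray
from ray.util.spark import setup_ray_cluster, shutdown_ray_cluster
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

# Start a Ray cluster on top of the existing Spark cluster.
# (Newer Ray releases rename the sizing arguments to max_worker_nodes/min_worker_nodes.)
setup_ray_cluster(num_worker_nodes=2, num_gpus_per_node=1)
ray.init()  # connects to the Ray-on-Spark cluster started above


def train_func():
    # In the real script the HuggingFace Trainer is built and run here;
    # this placeholder only verifies that each Ray worker sees its GPU.
    import torch
    print("worker sees CUDA:", torch.cuda.is_available())


trainer = TorchTrainer(
    train_func,
    scaling_config=ScalingConfig(num_workers=2, use_gpu=True),
)
result = trainer.fit()

shutdown_ray_cluster()
```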