I encountered an issue while using Ray Train. When I initialize Ray against a remote Kubernetes cluster with ray.init() and the cluster address, the training job gets stuck at .fit(): the run stays in "PENDING" status and loops indefinitely. I have a screenshot of the pending status for reference.
In my Kubernetes Ray cluster setup, I assigned 1 CPU to the head node and 1 CPU to each of the 4 worker nodes. When running the trainer, I requested num_workers=3. Regular tasks decorated with @ray.remote run fine on the same cluster, so this appears to be specific to Ray Train.
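Roughly, the setup looks like the following minimal sketch (assuming TorchTrainer; the address and the train_func body are placeholders, and the ScalingConfig import path may differ by Ray version):

```python
import ray
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

# Connect to the remote KubeRay cluster (address is a placeholder).
ray.init(address="ray://<head-node-address>:10001")

def train_func():
    # Placeholder training loop; the real one builds the model, loads data,
    # and reports metrics via ray.train.report().
    pass

trainer = TorchTrainer(
    train_loop_per_worker=train_func,
    # 3 Train workers, each requesting 1 CPU by default. The trainer's
    # coordinating actor also reserves 1 CPU by default, so this setup
    # needs roughly 4 free CPUs in the cluster.
    scaling_config=ScalingConfig(num_workers=3, use_gpu=False),
)

result = trainer.fit()  # this is where the run hangs in PENDING
```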
I have tried to resolve the issue by ensuring the number of workers specified in ScalingConfig() is smaller than the number of worker pods in the Kubernetes cluster, checking worker nodes’ logs, and examining the head node’s logs, but the problem persists.
Has anyone encountered a similar issue with Ray Train? If so, how did you resolve it? Are there any known issues with Ray Train or its dependencies that could be causing this behavior? Any help or suggestions would be greatly appreciated.
Can you post your ScalingConfig() or the entire Trainer setup here?
Also, in the pending state, if you wait a minute, Ray will print a message along the lines of "you are requesting xyz resources, but the cluster only has uvw resources; ignore this if autoscaling is enabled."
Do you see this message? If so, it will help us understand how many resources you are actually requesting from Ray Train's perspective.
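As a quick check, something like the sketch below (the address is a placeholder) prints what the cluster actually registers versus what is currently free, so you can compare it against what Train is asking for:

```python
import ray

# Connect to the same cluster the trainer uses (address is a placeholder).
ray.init(address="ray://<head-node-address>:10001")

# Total resources registered by all nodes vs. what is currently free.
# With default settings, ScalingConfig(num_workers=3) needs ~3 CPUs for the
# workers plus 1 CPU for the trainer actor, so compare against these numbers.
print("cluster resources:  ", ray.cluster_resources())
print("available resources:", ray.available_resources())
print("nodes:", [(n["NodeManagerAddress"], n["Resources"]) for n in ray.nodes()])
```

Running ray status on the head node gives a similar summary, including any pending resource demands.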