Ray k8s cluster, cannot run new task when previous task failed

yic · June 14, 2022, 5:49pm

Thanks for the details here. So the cluster is started by the admin, and you log in to one of the workers and call ray.init, right?

I notice the actual error is:

ModuleNotFoundError: No module named 'components'

Can you verify that this module is available on all worker nodes? Also, if it's a local module, can you try passing it via py_modules in the runtime env?
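For reference, here is a minimal sketch of the py_modules suggestion. It assumes the local module is a directory named `components` next to the driver script and that the driver runs on a machine that can reach the cluster with `address="auto"`; both are guesses about your setup, and the helper name `build_runtime_env` is made up for illustration.

```python
import os


def build_runtime_env(module_dir):
    """Build a runtime_env dict that ships a local module directory to every worker.

    `module_dir` is assumed to be the path to the local package
    (for example ./components); adjust it to your layout.
    """
    return {"py_modules": [os.path.abspath(module_dir)]}


def main():
    # Imported here so the helper above can be used without Ray installed.
    import ray

    # Assumption: we are connecting to an already running cluster.
    ray.init(address="auto", runtime_env=build_runtime_env("./components"))

    @ray.remote
    def uses_components():
        # With py_modules set, this import should resolve on whichever
        # node the task is scheduled on.
        import components
        return components.__name__

    print(ray.get(uses_components.remote()))
```

Calling `main()` from the driver should then work regardless of which worker node picks up the task, since Ray uploads the directory and adds it to the import path of the workers for that job.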

One explanation I can think of: the first time you ran it, the Ray Python worker started on the node where your local module lives, and the next time you ran it (without restarting the cluster), the task was scheduled on a different worker node. The root cause might be that the two workers run in different environments, which is why you see different results.
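If you want to check this theory, something like the following sketch could help. It probes whether a module is importable and reports the hostname, and the driver part fans the probe out with the "SPREAD" scheduling strategy so that tasks tend to land on different nodes. The function names and the task count of 20 are arbitrary choices for illustration, and spreading is best effort, so a node may still be missed.

```python
import importlib.util
import socket


def probe_module(name):
    """Report whether `name` is importable in the current Python environment."""
    try:
        found = importlib.util.find_spec(name) is not None
    except ImportError:
        # Raised when a parent package of a dotted name is missing.
        found = False
    return {"host": socket.gethostname(), "module": name, "found": found}


def probe_cluster(name, num_tasks=20):
    # Imported here so probe_module stays usable without Ray installed.
    import ray

    ray.init(address="auto")  # assumption: an existing cluster is reachable
    remote_probe = ray.remote(probe_module).options(scheduling_strategy="SPREAD")
    results = ray.get([remote_probe.remote(name) for _ in range(num_tasks)])
    # Collapse to one entry per host; a host counts as OK only if every probe found it.
    by_host = {}
    for r in results:
        by_host[r["host"]] = by_host.get(r["host"], True) and r["found"]
    return by_host
```

Running `probe_cluster("components")` would return a mapping from hostname to True/False. Any host showing False is a node where the task would fail with the ModuleNotFoundError above.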
