Ray k8s cluster, cannot run new task when previous task failed

Thanks for the details. So the cluster is started by the admin, and you log in to one of the worker nodes and call ray.init from there, right?

I notice the actual error is:

ModuleNotFoundError: No module named 'components'

Can you verify that this module is available on all worker nodes? Also, if it's a local module, can you try passing it via py_modules in the runtime_env?
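For reference, here is a minimal sketch of what the py_modules suggestion could look like. It assumes the module lives in a local directory named ./components next to your driver script (that path is a guess based on the error message) and that the driver attaches to the running cluster with address="auto". The helper names build_runtime_env and connect are just for illustration.

```python
from pathlib import Path


def build_runtime_env(module_dir: str) -> dict:
    """Build a runtime_env that ships a local module directory to the workers."""
    # py_modules accepts paths to local module directories (among other forms).
    # Resolving to an absolute path avoids surprises if the cwd changes later.
    return {"py_modules": [str(Path(module_dir).resolve())]}


def connect(module_dir: str = "./components") -> None:
    """Attach to the running cluster with the local module uploaded."""
    # Imported here so build_runtime_env can be inspected without Ray installed.
    import ray

    # "./components" is an assumed location; point it at your real module.
    ray.init(address="auto", runtime_env=build_runtime_env(module_dir))
```

With this, the directory gets uploaded when the job starts and is put on the import path of the workers that run your tasks, so it should no longer matter which node a task lands on.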

One possible explanation: the first time you ran it, the Ray Python worker started on the node where your local module lives, and the next time you ran it (without restarting the cluster), the task got scheduled on another worker node that doesn't have the module. If so, the root cause is that the worker nodes are running in different environments, which is why you see different results between runs.
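If you want to test this theory, a rough sketch like the one below could help. It spreads a handful of small probe tasks across the cluster and reports, per host, whether the module can be found. The names probe_module and check_cluster are made up for this example, the module name 'components' is taken from your error, and num_probes=20 is an arbitrary choice. Note that the SPREAD strategy is best effort, so it may not hit every node.

```python
import importlib.util
import socket


def probe_module(name: str) -> dict:
    """Report whether a top-level module is importable on the current host."""
    found = importlib.util.find_spec(name) is not None
    return {"host": socket.gethostname(), "importable": found}


def check_cluster(name: str = "components", num_probes: int = 20) -> list:
    """Run the probe on several workers and collect the results."""
    # Imported here so probe_module can also be run locally without Ray.
    import ray

    if not ray.is_initialized():
        ray.init(address="auto")

    # SPREAD asks the scheduler to distribute tasks across available nodes.
    remote_probe = ray.remote(probe_module).options(scheduling_strategy="SPREAD")
    return ray.get([remote_probe.remote(name) for _ in range(num_probes)])
```

Run check_cluster() without any runtime_env set. If some hosts come back with importable=False and others with True, that would be consistent with the nodes having different environments.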