How severely does this issue affect your experience of using Ray?
Low: It annoys or frustrates me for a moment.
Greetings!
We sometimes (presumably when Ray is under higher load) observe that our Ray client creates a job, but it never starts. We currently work around this by using a timeout when waiting for the job result. That naturally takes quite some time, since the timeout needs to account for the usual job duration as well.
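Roughly, the workaround looks like this (a simplified sketch; the address, task, and timeout value here are just illustrative):

```python
import ray

# Connect via the Ray client (address is illustrative).
ray.init("ray://ray-head.example.svc:10001")

@ray.remote
def our_job():
    ...  # actual workload

ref = our_job.remote()
# Wait with a generous timeout that also has to cover the normal job duration.
ready, _ = ray.wait([ref], timeout=3600)
if not ready:
    # The job presumably never started; cancel it and retry/alert.
    ray.cancel(ref, force=True)
else:
    result = ray.get(ref)
```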
My 2 questions are:
Where should we look to find what is causing this problem? I checked the ray-operator log (see below), but it didn't help me.
Is there a direct way to query Ray about jobs that are actually running, other than listing processes in the underlying pods (which we do for our makeshift dashboard) or creating a lockfile on shared storage from the job itself (which we're considering; see the rough sketch below)?
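The lockfile idea would be roughly the following (just a sketch; the paths and naming are made up):

```python
from pathlib import Path

import ray

# Hypothetical directory on storage shared by all Ray pods.
MARKER_DIR = Path("/mnt/shared/ray-job-markers")

@ray.remote
def tracked_job(job_id: str):
    MARKER_DIR.mkdir(parents=True, exist_ok=True)
    marker = MARKER_DIR / f"{job_id}.running"
    marker.touch()  # written only once the job has actually started
    try:
        ...  # real work
    finally:
        marker.unlink(missing_ok=True)

# Anything that can read the shared storage can then list the *.running
# files to see which jobs are really executing.
```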
Thank you
PS: Below is one complete status from our operator log, where I see 10+ pending tasks/actors but no pending nodes. Can you point us towards what might be wrong in our setup?
It is a bug in the Ray client. In this case, you can confirm it by not using the Ray client, but I am not sure if that is an option for you.
There's a possibility that no node has {1 CPU + 1 worker1Gi} free at the same time. For example, imagine 5 nodes A with {CPU: 1, worker1Gi: 1} and 5 nodes B with {CPU: 1}. If there are 5 tasks that each use 1 CPU, they can take the 1 CPU from each A node. In that case, the remaining resources will be
A: {CPU: 0, worker1Gi: 1} * 5
B: {CPU: 1} * 5
In this case, although the total cluster resources are {CPU: 10, worker1Gi: 5}, we cannot schedule 5 tasks of {CPU: 1, worker1Gi: 1}. We are also in the process of exposing this information better as part of the same REP. But for now, you can verify it using ray.state.state._available_resources_per_node() (note that it is a private API, and it is encouraged to use the official APIs from the above REP once they are available).
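For example, something along these lines (a rough sketch, assuming the private API returns a mapping from node to its currently available resources; run it from a driver attached to the cluster):

```python
import ray

ray.init(address="auto")  # attach to the running cluster

# Private API: per-node view of currently available resources.
per_node = ray.state.state._available_resources_per_node()

task_request = {"CPU": 1, "worker1Gi": 1}
for node_id, avail in per_node.items():
    # The task fits only if a *single* node has all requested resources free.
    fits = all(avail.get(name, 0) >= amount for name, amount in task_request.items())
    print(node_id, avail, "fits:", fits)
```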
This part of the logs is concerning: Cluster status: 23 nodes (2 failed to update)
If you're using K8s, also take a look at KubeRay, which works in more or less the same way but is more stable.
(The main thing KubeRay is missing at the moment is stable autoscaling support – that’s in progress.)
ray status shows the same information as the operator log status (or part of it). Being able to observe Ray's inner state would probably help us quite a lot here.
It is a bug in the Ray client. In this case, you can confirm it by not using the Ray client, but I am not sure if that is an option for you.
We are using the Ray client. Are there other options to start tasks from Python?
There's a possibility that no node has {1 CPU + 1 worker1Gi} free at the same time…
That's when scaling should kick in, right? The Ray operator should then launch a new k8s pod with such resources (and the underlying k8s node autoscaler would launch a new node if there is a pending pod). Or am I missing something?
But for now, you can verify it using ray.state.state._available_resources_per_node() (note that it is a private API, and it is encouraged to use the official APIs from the above REP once they are available).
We tried it, and it resulted in the following error:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/state.py", line 678, in _available_resources_per_node
    self._check_connected()
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/state.py", line 49, in _check_connected
    raise ray.exceptions.RaySystemError(
ray.exceptions.RaySystemError: System error: Ray has not been started yet. You can start Ray with 'ray.init()'.
This happens even though ray.init() has been called and, for example, ray.cluster_resources() works.
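One thing we might try next (just an idea, not verified): wrap the call in a remote task so that it executes inside the cluster, where the global state should be connected, e.g.:

```python
import ray

ray.init("ray://ray-head.example.svc:10001")  # our usual Ray client connection

@ray.remote
def available_per_node():
    # Runs on a worker inside the cluster, not on the client side.
    return ray.state.state._available_resources_per_node()

print(ray.get(available_per_node.remote()))
```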