Pending tasks not starting up

How severe does this issue affect your experience of using Ray?

  • Low: It annoys or frustrates me for a moment.

Greetings!

We sometimes (presumably when Ray is under higher load) observe that our Ray client creates a job, but it never starts. We currently work around this by using a timeout when waiting for the job result. That naturally takes quite some time, as the timeout needs to account for the usual job duration as well.
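
For reference, a rough sketch of the workaround (the job function and timeout value below are placeholders, assuming the job is an ordinary Ray remote function):

import ray

ray.init(address="ray://<head-node>:10001")  # placeholder address; we connect via the Ray client

JOB_TIMEOUT_S = 600  # placeholder: usual job duration plus a safety margin

@ray.remote
def job():  # placeholder for the real workload
    return "done"

ref = job.remote()
try:
    result = ray.get(ref, timeout=JOB_TIMEOUT_S)
except ray.exceptions.GetTimeoutError:
    # The job presumably never started; give up, cancel it, and retry or alert.
    ray.cancel(ref, force=True)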

My 2 questions are:

  1. Where should we look to find what is causing this problem? I checked the ray-operator log (see below), but it didn’t help me.

  2. Is there a direct way to query Ray about which jobs are actually running? Other than listing processes in the underlying pods (which we do for our makeshift dashboard) or creating a lockfile on shared storage from the job itself (which we’re considering).

Thank you

PS: This is one complete status from our operator log, where I see 10+ pending tasks/actors but no pending nodes. Can you point us towards what might be wrong in our setup?

The best bet for now is to use ray status. We are aware that this has limited functionality, and we are in the process of supporting 2 with this REP. The project is in active progress: State Observability API REP by rkooo567 · Pull Request #8 · ray-project/enhancements · GitHub.

For

{'CPU': 1.0, 'worker1Gi': 1.0}: 10+ pending tasks/actors

For 1, I think there are 2 possibilities.

  1. It is a bug in the Ray client. In that case, you can confirm it by not using the Ray client, but I am not sure whether that is an option for you.
  2. There’s a possibility that no single node has {1 CPU + 1 worker1Gi} available. For example, imagine 5 nodes A with {CPU: 1, worker1Gi: 1} and 5 nodes B with {CPU: 1}. If there are 5 tasks that each use 1 CPU, they can take the 1 CPU from each A node. In this case, the resources will be

A: {CPU: 0, worker1Gi: 1} * 5
B: {CPU: 1} * 5

In this case, although the aggregate cluster resources are {CPU: 10, worker1Gi: 5}, we cannot schedule 5 {CPU: 1, worker1Gi: 1} tasks. We are also making progress on exposing this information better as part of the same REP. For now, you can verify it using ray.state.state._available_resources_per_node() (note that it is a private API, and you are encouraged to use the official APIs from the above REP once they are available).
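
As a rough sketch (assuming the check is run from a driver connected directly to the cluster rather than through the Ray client), it could look like this:

import ray

ray.init(address="auto")  # attach to the running cluster

# Private API: maps node ID -> resources currently available on that node.
per_node = ray.state.state._available_resources_per_node()
for node_id, resources in per_node.items():
    # A node with a free CPU but no free worker1Gi cannot run a
    # {CPU: 1, worker1Gi: 1} task, even if the cluster totals look sufficient.
    print(node_id, resources.get("CPU", 0), resources.get("worker1Gi", 0))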

This part of the logs is concerning: Cluster status: 23 nodes (2 failed to update)
If you’re using K8s, also take a look at KubeRay, which works in more or less the same way but is more stable.
(The main thing KubeRay is missing at the moment is stable autoscaling support – that’s in progress.)

Thanks for the quick reply!

ray status shows the same information as the operator log status (or part of it). The ability to observe Ray’s internal state would probably help us quite a lot here.

  1. It is a bug in the Ray client. In that case, you can confirm it by not using the Ray client, but I am not sure whether that is an option for you.

We are using the Ray client. Are there other options to start tasks from Python?

  1. There’s a possibility that no single node has {1 CPU + 1 worker1Gi} available…

That’s when autoscaling should kick in, right? The Ray operator should then launch a new k8s pod with such resources (and the underlying k8s node autoscaler would launch a new node if there is a pending pod). Or am I missing something?
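
For context, this is roughly the shape of the demand our tasks create (the task name below is a placeholder), which we expect the autoscaler to satisfy by adding a pod that advertises the worker1Gi custom resource:

import ray

ray.init(address="auto")  # or via the Ray client, as in our setup

@ray.remote(num_cpus=1, resources={"worker1Gi": 1})
def burst_task():  # placeholder name for one of our burst tasks
    ...

# Each pending call adds a {CPU: 1, worker1Gi: 1} demand that should show up
# in ray status / the operator log and trigger scale-up.
refs = [burst_task.remote() for _ in range(10)]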

For now, you can verify it using ray.state.state._available_resources_per_node() (note that it is a private API, and you are encouraged to use the official APIs from the above REP once they are available).

We tried it, and it resulted in the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/state.py", line 678, in _available_resources_per_node
    self._check_connected()
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/state.py", line 49, in _check_connected
    raise ray.exceptions.RaySystemError(
ray.exceptions.RaySystemError: System error: Ray has not been started yet. You can start Ray with 'ray.init()'.

This happens even though ray.init() has been called and, for example, ray.cluster_resources() works.
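
If the private API only fails on the Ray client side (an assumption on our part, given the error above), one option might be to wrap the call in a remote task so that it runs on a worker inside the cluster, e.g.:

import ray

# Assumption: the private state API works when called from a worker process
# connected directly to the cluster, not through the Ray client.
# ray.init() is assumed to have been called already, as above.
@ray.remote
def available_resources_per_node():
    return ray.state.state._available_resources_per_node()

print(ray.get(available_resources_per_node.remote()))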

We are using k8s, and autoscaling is one of the main features we rely on. Our workflow is that we run a lot of tasks in bursts a couple of times a day.


A couple of things would help here:

  • The configs used to launch the cluster.

  • If at all possible, a minimal reproduction of the problem.

I’ll try to prepare a reproduction. Is it okay to send it to you via Slack?

Slack works, but the Ray GitHub is better!