Pending tasks not starting up

How severe does this issue affect your experience of using Ray?

  • Low: It annoys or frustrates me for a moment.

Greetings!

We sometimes (presumably when Ray is under higher load) observe that our Ray client creates a job, but it never starts. We currently work around this by using a timeout when waiting for the job result. That naturally takes quite some time, as the timeout needs to account for the usual job duration as well.
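
For reference, a rough sketch of the workaround (the job function and timeout value below are placeholders, assuming the job is an ordinary Ray remote function):

import ray

ray.init(address="ray://<head-node>:10001")  # placeholder address; we connect via the Ray client

JOB_TIMEOUT_S = 600  # placeholder: usual job duration plus a safety margin

@ray.remote
def job():  # placeholder for the real workload
    return "done"

ref = job.remote()
try:
    result = ray.get(ref, timeout=JOB_TIMEOUT_S)
except ray.exceptions.GetTimeoutError:
    # The job presumably never started; give up, cancel it, and retry or alert.
    ray.cancel(ref, force=True)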

My 2 questions are:

  1. Where should we look to find what is causing this problem? I checked the ray-operator log (see below), but it didn’t help me.

  2. Is there a direct way to query Ray about which jobs are actually running? Other than listing processes in the underlying pods (which we do for our makeshift dashboard) or creating a lockfile on shared storage from the job itself (which we’re considering).

Thank you

PS: This is one complete status from our operator log, where I see 10+ pending tasks/actors but no pending nodes. Can you point us towards what might be wrong in our setup?

The best bet for now is to use ray status. We are aware that this has limited functionality, and we are in the process of supporting 2 with this REP. The project is in active progress: State Observability API REP by rkooo567 · Pull Request #8 · ray-project/enhancements · GitHub.

For

{'CPU': 1.0, 'worker1Gi': 1.0}: 10+ pending tasks/actors

For 1, I think there are 2 possibilities.

  1. It is a bug in the Ray client. In that case, you can confirm it by not using the Ray client, but I am not sure whether that is an option for you.
  2. There’s a possibility that no single node has {1 CPU + 1 worker1Gi} available. For example, imagine 5 nodes A with {CPU: 1, worker1Gi: 1} and 5 nodes B with {CPU: 1}. If there are 5 tasks that each use 1 CPU, they can take the 1 CPU from each A node. In this case, the resources will be

A: {CPU: 0, worker1Gi: 1} * 5
B: {CPU: 1} * 5

In this case, although the aggregate cluster resources are {CPU: 10, worker1Gi: 5}, we cannot schedule 5 {CPU: 1, worker1Gi: 1} tasks. We are also making progress on exposing this information better as part of the same REP. For now, you can verify it using ray.state.state._available_resources_per_node() (note that it is a private API, and you are encouraged to use the official APIs from the above REP once they are available).
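
As a rough sketch (assuming the check is run from a driver connected directly to the cluster rather than through the Ray client), it could look like this:

import ray

ray.init(address="auto")  # attach to the running cluster

# Private API: maps node ID -> resources currently available on that node.
per_node = ray.state.state._available_resources_per_node()
for node_id, resources in per_node.items():
    # A node with a free CPU but no free worker1Gi cannot run a
    # {CPU: 1, worker1Gi: 1} task, even if the cluster totals look sufficient.
    print(node_id, resources.get("CPU", 0), resources.get("worker1Gi", 0))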

This part of the logs is concerning: Cluster status: 23 nodes (2 failed to update)
If you’re using K8s, also take a look at KubeRay, which works in more or less the same way but is more stable.
(The main thing KubeRay is missing at the moment is stable autoscaling support – that’s in progress.)

Thanks for the quick reply!

ray status shows the same information as the operator log status (or part of it). The ability to observe Ray’s internal state would probably help us quite a lot here.

  1. It is a bug in the Ray client. In that case, you can confirm it by not using the Ray client, but I am not sure whether that is an option for you.

We are using the Ray client. Are there other options to start tasks from Python?

  1. There’s a possibility that no single node has {1 CPU + 1 worker1Gi} available…

That’s when autoscaling should kick in, right? The Ray operator should then launch a new k8s pod with such resources (and the underlying k8s node autoscaler would launch a new node if there is a pending pod). Or am I missing something?
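
For context, this is roughly the shape of the demand our tasks create (the task name below is a placeholder), which we expect the autoscaler to satisfy by adding a pod that advertises the worker1Gi custom resource:

import ray

ray.init(address="auto")  # or via the Ray client, as in our setup

@ray.remote(num_cpus=1, resources={"worker1Gi": 1})
def burst_task():  # placeholder name for one of our burst tasks
    ...

# Each pending call adds a {CPU: 1, worker1Gi: 1} demand that should show up
# in ray status / the operator log and trigger scale-up.
refs = [burst_task.remote() for _ in range(10)]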

For now, you can verify it using ray.state.state._available_resources_per_node() (note that it is a private API, and you are encouraged to use the official APIs from the above REP once they are available).

We tried it, and it resulted in the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/state.py", line 678, in _available_resources_per_node
    self._check_connected()
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/state.py", line 49, in _check_connected
    raise ray.exceptions.RaySystemError(
ray.exceptions.RaySystemError: System error: Ray has not been started yet. You can start Ray with 'ray.init()'.

This happens even though ray.init() has been called and, for example, ray.cluster_resources() works.
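
If the private API only fails on the Ray client side (an assumption on our part, given the error above), one option might be to wrap the call in a remote task so that it runs on a worker inside the cluster, e.g.:

import ray

# Assumption: the private state API works when called from a worker process
# connected directly to the cluster, not through the Ray client.
# ray.init() is assumed to have been called already, as above.
@ray.remote
def available_resources_per_node():
    return ray.state.state._available_resources_per_node()

print(ray.get(available_resources_per_node.remote()))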

We are using k8s, and autoscaling is one of the main features we rely on. Our workflow is that we run a lot of tasks in bursts a couple of times a day.


A couple of things would help here:

  • The configs used to launch the cluster.

  • If at all possible, a minimal reproduction of the problem.

I’ll try to prepare a reproduction. Is it okay to send it to you via Slack?

Slack works, but the Ray GitHub is better!