How severely does this issue affect your experience of using Ray?
High: It blocks me from completing my task.
I’m finding that tasks sometimes get stuck in PENDING_NODE_ASSIGNMENT and never start running. I’m wondering how to debug this.
My setup
A single machine with 8 GPUs and one Docker container per GPU. 7 of the containers are Ray workers; the 8th is the Ray head.
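For reference, the driver connects to the head container through the Ray client, roughly like this (the hostname here is a placeholder; 10001 is the Ray client server’s default port):

import ray

# Connect to the Ray head container from the driver via the Ray client.
# "ray-head" is a placeholder hostname; 10001 is the Ray client server's default port.
ray.init("ray://ray-head:10001")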
What I observe
When I submit a task, one of the following happens:
The task runs as expected in about one second
The task hangs for a minute or more, and then runs
The task hangs in PENDING_NODE_ASSIGNMENT, and then after 15 minutes I get the following error:
Task failed due to an exception
Traceback (most recent call last):
File "broker.py", line 57, in get_task_result
train_result: Any = ray.get(task)
File "/usr/local/lib/python3.8/dist-packages/ray/_private/client_mode_hook.py", line 104, in wrapper
return getattr(ray, func.__name__)(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/ray/util/client/api.py", line 42, in get
return self.worker.get(vals, timeout=timeout)
File "/usr/local/lib/python3.8/dist-packages/ray/util/client/worker.py", line 434, in get
res = self._get(to_get, op_timeout)
File "/usr/local/lib/python3.8/dist-packages/ray/util/client/worker.py", line 462, in _get
raise err
ray.exceptions.RuntimeEnvSetupError: Failed to set up runtime environment.
Could not create the actor because its associated runtime env failed to be created.
Failed to request agent.
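For context, the submission in broker.py boils down to something like the sketch below; the function name, the runtime_env contents, and the timeout are illustrative rather than my exact code. Adding a timeout at least makes the hang surface as a GetTimeoutError instead of blocking for the full 15 minutes.

import ray

# Illustrative sketch of a task submission; num_gpus and the runtime_env
# contents are placeholders, not the exact code from broker.py.
@ray.remote(num_gpus=1, runtime_env={"pip": ["torch"]})
def train(config):
    # ... the actual training would happen here ...
    return {"loss": 0.0}

task = train.remote({"lr": 1e-3})

# A timeout turns the PENDING_NODE_ASSIGNMENT hang into a GetTimeoutError
# after two minutes instead of waiting for the runtime env setup to give up.
try:
    result = ray.get(task, timeout=120)
except ray.exceptions.GetTimeoutError:
    print("task still not scheduled/finished after 120s")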
Oddly, all three of these behaviors show up when I’m running what seem to me to be near-identical tasks.
I’ve also noticed that while a task is blocked, if I submit another task, that newer task may actually be scheduled. So there’s no system-wide blocking. In addition, this sort of blocking can happen when there’s just a single task.
My question
How do I go about debugging this? I don’t know the Ray internals well enough to have any ideas on the directions I should pursue.
“Runtime_env fails when running Ray in Docker” looks like a similar issue, but all of my ports are shared between all of my Ray workers. Here’s me checking that the dashboard agent ports are good:
So you have 1 machine, with 7 containers connected to the head node (which is running directly on the machine)?
When the runtime env is downloaded to the worker nodes (the Docker containers), those nodes need to reach the GCS, which runs on the head node. I think the issue is that your Docker containers somehow cannot connect to the head node’s GCS.
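If you want to rule out basic connectivity, something like this run from inside one of the worker containers should tell you whether the head node’s ports are reachable (the IP is a placeholder; 6379, 8265, and 10001 are the default GCS, dashboard, and Ray client ports):

import socket

HEAD_IP = "172.17.0.2"  # placeholder: use your head container's address

# Default ports: 6379 = GCS (ray start --port), 8265 = dashboard, 10001 = Ray client.
for port in (6379, 8265, 10001):
    sock = socket.socket()
    sock.settimeout(2)
    try:
        sock.connect((HEAD_IP, port))
        print(f"port {port}: reachable")
    except OSError as exc:
        print(f"port {port}: NOT reachable ({exc})")
    finally:
        sock.close()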
If you set up the cluster and go to the dashboard (open localhost:8265 in your browser), do you see all 7 worker nodes?
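If you prefer a programmatic check over the dashboard, ray.nodes() lists every registered node and whether its raylet is still alive, for example:

import ray

ray.init(address="auto")  # or ray.init("ray://<head-ip>:10001") via the Ray client

# One entry per node; "Alive" is False for nodes whose raylet has disconnected.
for node in ray.nodes():
    print(node["NodeManagerAddress"], node["Alive"], node["Resources"].get("GPU", 0))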
I’ve got 7 Docker ray-workers and 1 Docker ray-head. That means there are 8 nodes in total, because the ray-head also runs a worker. All 8 seem to show up in the dashboard as expected:
Mm! Does Ray not support running multiple workers on the same IP address? I’ve gone ahead and put all the Docker containers on different IP addresses; maybe that’ll help.
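A quick way to confirm the containers now register under distinct addresses would be to check that every alive node reports a unique NodeManagerAddress, e.g.:

import ray

ray.init(address="auto")

# Collect the address each alive raylet registered with; duplicates would mean
# two containers are still sharing an IP address.
addresses = [n["NodeManagerAddress"] for n in ray.nodes() if n["Alive"]]
assert len(addresses) == len(set(addresses)), f"duplicate raylet addresses: {addresses}"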
This seems to have been the fix. Thank you all! I’ve been trying things on and off for the past few days and I can’t reproduce the problem anymore. It’s hard to be certain, because the issue was sporadic, but this is the most stable I’ve seen it.