Ray tasks sometimes hang in PENDING_NODE_ASSIGNMENT

How severe does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

I’m finding that tasks sometimes get stuck in PENDING_NODE_ASSIGNMENT and never start running. I’m wondering how to debug this.

My setup
A single machine with 8 GPUs, one docker instance per GPU. 7 of the docker instances are ray workers, the 8th is the ray head.
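
Roughly, tasks get submitted through the Ray Client along these lines (a simplified sketch; the client address and the resource request are placeholders, not my exact code):

import ray

# Connect to the head container through the Ray Client server
# (10001 is the default client server port; the IP is a placeholder).
ray.init("ray://192.168.1.10:10001")

@ray.remote(num_gpus=1)
def train_step(config):
    # stand-in for the real training work
    return {"loss": 0.0, "config": config}

task = train_step.remote({"lr": 1e-3})
result = ray.get(task)  # this is the call that sometimes never returns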

What I observe
When I submit a task, one of the following happens:

  1. The task runs as expected in about one second
  2. The task hangs for a minute or more, and then runs
  3. The task hangs in PENDING_NODE_ASSIGNMENT, and then after 15 minutes I get the following error:
Task failed due to an exception
Traceback (most recent call last):
  File "broker.py", line 57, in get_task_result
    train_result: Any = ray.get(task)
  File "/usr/local/lib/python3.8/dist-packages/ray/_private/client_mode_hook.py", line 104, in wrapper
    return getattr(ray, func.__name__)(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/ray/util/client/api.py", line 42, in get
    return self.worker.get(vals, timeout=timeout)
  File "/usr/local/lib/python3.8/dist-packages/ray/util/client/worker.py", line 434, in get
    res = self._get(to_get, op_timeout)
  File "/usr/local/lib/python3.8/dist-packages/ray/util/client/worker.py", line 462, in _get
    raise err
ray.exceptions.RuntimeEnvSetupError: Failed to set up runtime environment.
Could not create the actor because its associated runtime env failed to be created.
Failed to request agent.

Oddly, these 3 behaviors all show up when I’m running what seems to me to be near-identical tasks.

I’ve also noticed that while a task is blocked, if I submit another task, that newer task may actually be scheduled. So there’s no system-wide blocking. In addition, this sort of blocking can happen when there’s just a single task.
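
For what it’s worth, here is a sketch of how the hang can be spotted by polling with ray.wait instead of blocking in ray.get (the thresholds are arbitrary, not values from my actual code):

import time
import ray

def get_with_warning(ref, warn_after_s=60.0, poll_s=5.0):
    # Poll the object ref instead of blocking forever in ray.get, so a
    # task stuck in PENDING_NODE_ASSIGNMENT gets noticed early.
    start = time.time()
    while True:
        ready, _ = ray.wait([ref], timeout=poll_s)
        if ready:
            return ray.get(ready[0])
        waited = time.time() - start
        if waited > warn_after_s:
            print(f"still waiting after {waited:.0f}s; the task may be stuck in PENDING_NODE_ASSIGNMENT")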

My question
How do I go about debugging this? I don’t know the Ray internals well enough to have any ideas on the directions I should pursue.


Could you check the log of runtime env setup? It should be at /tmp/ray/session_latest/logs/. I’m not sure why the runtime env setup failed.

Another thing to try is to bake your runtime env into the docker image and not use a runtime env at all.
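
Concretely (just a sketch; the client address and package names below are placeholders), that would mean going from something like this:

import ray

# Dependencies get installed on the fly by the runtime env agent on each node.
ray.init(
    "ray://192.168.1.10:10001",
    runtime_env={"pip": ["torch", "numpy"]},
)

to pre-installing those packages in the docker image and connecting without any runtime_env:

import ray

# Dependencies are already baked into the image, so no runtime env
# setup has to happen at submission time.
ray.init("ray://192.168.1.10:10001")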

But please check the logs of runtime env first so that we can know what happened.

The problem has stopped happening since I restarted the Ray nodes. I’ll do that when the issue comes up again. Thank you!

Alright, the problem’s back. I took a Loom video of the issue.

I tried looking in /tmp/ray/session_latest/logs/:

  • There’s a file called runtime_env_setup-ray_client_server_23000.log, but it’s empty
  • I searched for “runtime” in all files in that directory, but nothing looked that helpful (a rough sketch of that search is below)
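
Roughly what that search looked like (a sketch, not my exact commands):

import glob

# Scan every log in the current Ray session for runtime-env and agent
# related messages (assumes the default /tmp/ray log location).
patterns = ("runtime_env", "runtime env", "Failed to request agent", "dashboard_agent")
for path in glob.glob("/tmp/ray/session_latest/logs/*"):
    try:
        with open(path, errors="ignore") as f:
            for lineno, line in enumerate(f, start=1):
                if any(p in line for p in patterns):
                    print(f"{path}:{lineno}: {line.rstrip()}")
    except (IsADirectoryError, PermissionError):
        continue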

Any other debugging ideas?

raylet.out has this:

The topic “Runtime_env fails when running Ray in Docker” looks like a similar issue, but all my ports are shared between all my ray workers. Here’s me checking that the dashboard agent ports are good:

So you have 1 machine, with 7 containers connected to the head node (which is running directly on the machine)?

When the runtime env is downloaded to the worker nodes (the docker containers), those workers need to reach the GCS, which runs on the head node. I think the issue is that your docker containers somehow cannot connect to the head node’s GCS.
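
A quick way to check that from inside one of the worker containers (a sketch; the head IP below is a placeholder and 6379 is only the default GCS port, yours may differ):

import socket

HEAD_IP = "192.168.1.10"   # placeholder: your head container's address
GCS_PORT = 6379            # default port used by `ray start --head`

# If this fails or times out, the worker container cannot reach the head
# node's GCS, which would explain the runtime env setup errors.
with socket.create_connection((HEAD_IP, GCS_PORT), timeout=5):
    print("GCS port is reachable from this container")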

If you set up the cluster and go to the dashboard (open localhost:8265 in your browser), do you see all 7 worker nodes?

I’ve got 7 docker ray-workers and 1 docker ray-head. So that means there are 8 workers in total, because the ray-head also has a worker. Those 8 seem to show up in the dashboard as expected:

Note that they all have the same IP for some reason.

cc @sangcho this is probably yet another public/private IP problem

Mm! Does Ray not support running workers on the same IP address? I’ve gone ahead and put all the docker instances on different IP addresses; maybe that’ll help.
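
To double-check that each raylet now registers with its own address, something like this can be run from inside one of the containers (a sketch; it assumes ray.init(address="auto") attaches to the node already running there):

import ray
from collections import Counter

# Attach to the Ray node that is already running in this container.
ray.init(address="auto")

# Each entry from ray.nodes() describes one raylet; NodeManagerAddress is
# the IP it registered with. Every alive node should now be distinct.
addresses = [n["NodeManagerAddress"] for n in ray.nodes() if n["Alive"]]
duplicates = [ip for ip, count in Counter(addresses).items() if count > 1]
print("duplicate raylet IPs:", duplicates or "none")
print("all raylet IPs:", sorted(set(addresses)))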

Let us know how that works after you change the IP addr!

This seems like the fix :crossed_fingers:. Thank you all! I’ve been trying things off and on for the past few days and I can’t reproduce the problem anymore. It’s hard to be certain, because the issue was sporadic, but this is the most stable I’ve seen it.

@sangcho can we raise an error or a warning if a user does this (many raylets with the same IP)? Seems like an easy gotcha.