How severely does this issue affect your experience of using Ray?
High: It blocks me from completing my task.
I’m finding that tasks sometimes get stuck in PENDING_NODE_ASSIGNMENT and never start running. I’m wondering how to debug this.
My setup
A single machine with 8 GPUs and one Docker container per GPU. 7 of the containers are Ray workers; the 8th is the Ray head.
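For reference, the driver connects to the head container through the Ray client, roughly like this (the hostname here is a placeholder; 10001 is the Ray client server’s default port):

import ray

# Connect to the Ray head container from the driver via the Ray client.
# "ray-head" is a placeholder hostname; 10001 is the Ray client server's default port.
ray.init("ray://ray-head:10001")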
What I observe
When I submit a task, one of the following happens:
The task runs as expected in about one second
The task hangs for a minute or more, and then runs
The task hangs in PENDING_NODE_ASSIGNMENT, and then after 15 minutes I get the following error:
Task failed due to an exception
Traceback (most recent call last):
File "broker.py", line 57, in get_task_result
train_result: Any = ray.get(task)
File "/usr/local/lib/python3.8/dist-packages/ray/_private/client_mode_hook.py", line 104, in wrapper
return getattr(ray, func.__name__)(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/ray/util/client/api.py", line 42, in get
return self.worker.get(vals, timeout=timeout)
File "/usr/local/lib/python3.8/dist-packages/ray/util/client/worker.py", line 434, in get
res = self._get(to_get, op_timeout)
File "/usr/local/lib/python3.8/dist-packages/ray/util/client/worker.py", line 462, in _get
raise err
ray.exceptions.RuntimeEnvSetupError: Failed to set up runtime environment.
Could not create the actor because its associated runtime env failed to be created.
Failed to request agent.
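For context, the submission in broker.py boils down to something like the sketch below; the function name, the runtime_env contents, and the timeout are illustrative rather than my exact code. Adding a timeout at least makes the hang surface as a GetTimeoutError instead of blocking for the full 15 minutes.

import ray

# Illustrative sketch of a task submission; num_gpus and the runtime_env
# contents are placeholders, not the exact code from broker.py.
@ray.remote(num_gpus=1, runtime_env={"pip": ["torch"]})
def train(config):
    # ... the actual training would happen here ...
    return {"loss": 0.0}

task = train.remote({"lr": 1e-3})

# A timeout turns the PENDING_NODE_ASSIGNMENT hang into a GetTimeoutError
# after two minutes instead of waiting for the runtime env setup to give up.
try:
    result = ray.get(task, timeout=120)
except ray.exceptions.GetTimeoutError:
    print("task still not scheduled/finished after 120s")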
Oddly, all three of these behaviors show up when I’m running what seem to me to be near-identical tasks.
I’ve also noticed that while a task is blocked, if I submit another task, that newer task may actually be scheduled. So there’s no system-wide blocking. In addition, this sort of blocking can happen when there’s just a single task.
My question
How do I go about debugging this? I don’t know the Ray internals well enough to have any ideas on the directions I should pursue.
“Runtime_env fails when running Ray in Docker” looks like a similar issue, but all of my ports are shared between all of my Ray workers. Here’s me checking that the dashboard agent ports are good:
So you have 1 machine, with 7 containers connected to the head node (which is running directly on the machine)?
When the runtime env is downloaded to the worker nodes (the Docker containers), those nodes need to reach the GCS, which runs on the head node. I think the issue is that your Docker containers somehow cannot connect to the head node’s GCS.
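If you want to rule out basic connectivity, something like this run from inside one of the worker containers should tell you whether the head node’s ports are reachable (the IP is a placeholder; 6379, 8265, and 10001 are the default GCS, dashboard, and Ray client ports):

import socket

HEAD_IP = "172.17.0.2"  # placeholder: use your head container's address

# Default ports: 6379 = GCS (ray start --port), 8265 = dashboard, 10001 = Ray client.
for port in (6379, 8265, 10001):
    sock = socket.socket()
    sock.settimeout(2)
    try:
        sock.connect((HEAD_IP, port))
        print(f"port {port}: reachable")
    except OSError as exc:
        print(f"port {port}: NOT reachable ({exc})")
    finally:
        sock.close()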
If you set up the cluster and go to the dashboard (open localhost:8265 in your browser), do you see all 7 worker nodes?
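If you prefer a programmatic check over the dashboard, ray.nodes() lists every registered node and whether its raylet is still alive, for example:

import ray

ray.init(address="auto")  # or ray.init("ray://<head-ip>:10001") via the Ray client

# One entry per node; "Alive" is False for nodes whose raylet has disconnected.
for node in ray.nodes():
    print(node["NodeManagerAddress"], node["Alive"], node["Resources"].get("GPU", 0))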
I’ve got 7 Docker ray-workers and 1 Docker ray-head. That means there are 8 nodes in total, because the ray-head also runs a worker. All 8 seem to show up in the dashboard as expected:
Mm! Does Ray not support running multiple workers on the same IP address? I’ve gone ahead and put all the Docker containers on different IP addresses; maybe that’ll help.
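A quick way to confirm the containers now register under distinct addresses would be to check that every alive node reports a unique NodeManagerAddress, e.g.:

import ray

ray.init(address="auto")

# Collect the address each alive raylet registered with; duplicates would mean
# two containers are still sharing an IP address.
addresses = [n["NodeManagerAddress"] for n in ray.nodes() if n["Alive"]]
assert len(addresses) == len(set(addresses)), f"duplicate raylet addresses: {addresses}"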
This seems to have been the fix. Thank you all! I’ve been trying things on and off for the past few days and I can’t reproduce the problem anymore. It’s hard to be certain, because the issue was sporadic, but this is the most stable I’ve seen it.