Ray Cluster on a Docker Swarm (manual setup)

Hello.

I’m having trouble getting a Ray cluster working behind a Docker Swarm. I can start the Ray head and the workers, and all of them show up in the dashboard. But as soon as I try to connect a client, I get this error:

ray.exceptions.WorkerCrashedError: The worker died unexpectedly while executing this task. Check python-core-worker-*.log files for more information.
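For reference, this is roughly how I’m connecting the client. The hostname below is just a placeholder for the actual Swarm service address; the port is the default Ray client server port:

import ray

# Connect through the Ray client server (default port 10001).
# "ray-head.example.org" stands in for my real Swarm hostname.
ray.init("ray://ray-head.example.org:10001")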

After increasing the logging level, I see the following:

2022-04-27 07:07:32,020	DEBUG worker.py:314 -- client gRPC channel state change: ChannelConnectivity.IDLE
2022-04-27 07:07:32,222	DEBUG worker.py:314 -- client gRPC channel state change: ChannelConnectivity.CONNECTING
2022-04-27 07:07:32,225	DEBUG worker.py:314 -- client gRPC channel state change: ChannelConnectivity.READY
2022-04-27 07:07:32,225	DEBUG worker.py:660 -- Pinging server.
2022-04-27 07:07:34,825	DEBUG worker.py:532 -- Retaining 00ffffffffffffffffffffffffffffffffffffff0100000001000000
2022-04-27 07:07:34,826	DEBUG worker.py:459 -- Scheduling name: "get_dashboard_url"
payload_id: "\000\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\001\000\000\000\001\000\000\000"
options {
  pickled_options: "\200\003}q\000(X\013\000\000\000num_returnsq\001NX\010\000\000\000num_cpusq\002K\000X\010\000\000\000num_gpusq\003NX\006\000\000\000memoryq\004NX\023\000\000\000object_store_memoryq\005NX\020\000\000\000accelerator_typeq\006NX\t\000\000\000resourcesq\007NX\013\000\000\000max_retriesq\010NX\020\000\000\000retry_exceptionsq\tNX\017\000\000\000placement_groupq\nX\007\000\000\000defaultq\013X\034\000\000\000placement_group_bundle_indexq\014J\377\377\377\377X#\000\000\000placement_group_capture_child_tasksq\rNX\013\000\000\000runtime_envq\016NX\004\000\000\000nameq\017X\000\000\000\000q\020X\023\000\000\000scheduling_strategyq\021Nu."
}
baseline_options {
  pickled_options: "\200\003}q\000(X\010\000\000\000num_cpusq\001K\001X\010\000\000\000num_gpusq\002NX\t\000\000\000max_callsq\003K\000X\013\000\000\000max_retriesq\004K\003X\t\000\000\000resourcesq\005NX\020\000\000\000accelerator_typeq\006NX\013\000\000\000num_returnsq\007K\001X\006\000\000\000memoryq\010NX\013\000\000\000runtime_envq\tNX\023\000\000\000scheduling_strategyq\nNu."
}

2022-04-27 07:07:34,835	DEBUG worker.py:532 -- Retaining c8ef45ccd0112571ffffffffffffffffffffffff0100000001000000
A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: 6517ee019f808f3e8d8701756b9cd92738a3726801000000 Worker ID: 2c67780c16348a36d2cc0109d00a1e90b36f4e43159e8e061a79470f Node ID: 07776b753bca533a3e634abbd7055fb34098a673f26fb964d9bf184c Worker IP address: ray.a100.int.allencell.org Worker port: 20002 Worker PID: 941
A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: 384f82286a9c80c2c558839523431ca8c9226e7c01000000 Worker ID: a6072f171b6ffcfcb0a6b666e7f6dfe45c1c31b34437d5c8bebf76b8 Node ID: 07776b753bca533a3e634abbd7055fb34098a673f26fb964d9bf184c Worker IP address: ray.a100.int.allencell.org Worker port: 20003 Worker PID: 977
2022-04-27 07:07:36,842	DEBUG worker.py:364 -- Internal retry for get [ClientObjectRef(c8ef45ccd0112571ffffffffffffffffffffffff0100000001000000)]
A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: 96590e244c5dd52d0063008f9aeb917801aaa05801000000 Worker ID: 93869105795baf443833a16a85e45c8adedecd4533b652d0969e8332 Node ID: 07776b753bca533a3e634abbd7055fb34098a673f26fb964d9bf184c Worker IP address: ray.a100.int.allencell.org Worker port: 20004 Worker PID: 1008
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: 981414b51b438a40f761d9e1910fcf6eb2f7427c01000000 Worker ID: 76c178bd4017df99d7d784d156cfbf768c88029d55cc87b60e6085d3 Node ID: 07776b753bca533a3e634abbd7055fb34098a673f26fb964d9bf184c Worker IP address: ray.a100.int.allencell.org Worker port: 20005 Worker PID: 1043
  File "/home/guilherme.pires/.cache/pypoetry/virtualenvs/ray-utils--V89UuFz-py3.7/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/guilherme.pires/.cache/pypoetry/virtualenvs/ray-utils--V89UuFz-py3.7/lib/python3.7/site-packages/ray/worker.py", line 800, in init
    return builder.connect()
  File "/home/guilherme.pires/.cache/pypoetry/virtualenvs/ray-utils--V89UuFz-py3.7/lib/python3.7/site-packages/ray/client_builder.py", line 157, in connect
    dashboard_url = ray.get(get_dashboard_url.options(num_cpus=0).remote())
  File "/home/guilherme.pires/.cache/pypoetry/virtualenvs/ray-utils--V89UuFz-py3.7/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 104, in wrapper
    return getattr(ray, func.__name__)(*args, **kwargs)
  File "/home/guilherme.pires/.cache/pypoetry/virtualenvs/ray-utils--V89UuFz-py3.7/lib/python3.7/site-packages/ray/util/client/api.py", line 42, in get
    return self.worker.get(vals, timeout=timeout)
  File "/home/guilherme.pires/.cache/pypoetry/virtualenvs/ray-utils--V89UuFz-py3.7/lib/python3.7/site-packages/ray/util/client/worker.py", line 359, in get
    res = self._get(to_get, op_timeout)
  File "/home/guilherme.pires/.cache/pypoetry/virtualenvs/ray-utils--V89UuFz-py3.7/lib/python3.7/site-packages/ray/util/client/worker.py", line 386, in _get
    raise err

The failure seems to come from not being able to run, or get the result of, this call inside the client builder:

dashboard_url = ray.get(get_dashboard_url.options(num_cpus=0).remote())

I tried checking the logs suggested in the error messages, but I can’t make anything of them. This happens even if I create the cluster with just the head node, and I’ve tried it with Ray 1.11 and 1.12.
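In case it matters, this is roughly how I’ve been pulling those worker logs on the head node. It assumes the default /tmp/ray log layout; the exact path may differ depending on how the container is set up:

import glob

# Tail the per-worker logs referenced by the error message
# (default Ray log directory on the head node).
for path in sorted(glob.glob("/tmp/ray/session_latest/logs/python-core-worker-*.log")):
    print("====", path)
    with open(path) as f:
        print(f.read()[-2000:])  # last ~2000 characters of each log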

Might this have to do with my network configuration? I’m publishing the ports recommended in the Ray docs, so I don’t know what I’m missing. I’m also struggling with how to debug this further.
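The kind of rough check I’ve been running from the client machine to confirm the published ports are reachable looks like this (the hostname is a placeholder and the ports are the documented defaults for GCS, dashboard, and client server, not necessarily my exact values):

import socket

HEAD = "ray-head.example.org"  # placeholder for the actual Swarm hostname
for port in (6379, 8265, 10001):  # GCS, dashboard, client server defaults
    with socket.socket() as s:
        s.settimeout(3)
        try:
            s.connect((HEAD, port))
            print(port, "reachable")
        except OSError as exc:
            print(port, "NOT reachable:", exc)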

On a slightly related topic: while debugging the problem above, I tried running the Ray head with --num-cpus=0 to avoid scheduling work on the head node, but it still shows up in the dashboard with the full number of CPUs, and work was still being allocated to the head node (I know this because the error messages above mention workers on the head node).
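For completeness, this is roughly how I’ve been checking what the head advertises, by attaching a driver from inside the head container (a sketch; I expected the head’s CPU contribution to be 0 with --num-cpus=0):

import ray

# Attach to the already-running cluster from inside the head container.
ray.init(address="auto")

# On a head-only cluster started with --num-cpus=0 I expected "CPU" to be 0,
# but the dashboard still reports the full core count.
print(ray.cluster_resources())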