Problems using Ray across multiple Docker containers

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

I am attempting to set up a cluster on multiple Windows machines, running Ray inside Docker containers.
My problem is that with only one machine running a head node, I can use the cluster from Python without issues, but as soon as I connect a worker node on another machine I run into problems.

To test, I used two Windows host machines to run the nodes. Both Docker containers publish ports 22, 6379, 8265, 500-509, 10000-10030, and 23000-23050.

I then start the head node on one machine using:

ray start --head --port=6379 --dashboard-host 0.0.0.0 --node-ip-address XXX.XXX.XXX.XXX --metrics-export-port 502 --object-manager-port 500 --node-manager-port 501 --ray-client-server-port 10001 --redis-shard-ports 505,506,507,508 --min-worker-port 10002 --max-worker-port=10030 --dashboard-agent-grpc-port 503 --dashboard-agent-listen-port 504

And a worker node on the other machine using:

ray start --address=XXX.XXX.XXX.XXX:6379 --node-ip-address YYY.YYY.YYY.YYY --object-manager-port 500 --node-manager-port 501 --min-worker-port 10002 --max-worker-port 10030 --metrics-export-port 502 --dashboard-agent-grpc-port 503 --dashboard-agent-listen-port 504

Where XXX.XXX.XXX.XXX and YYY.YYY.YYY.YYY are the local IPs of the head and worker host machines, respectively.
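
A quick way to double-check the Docker port mappings is a small reachability test run from the worker host; the IP and port list below are placeholders for the head's address and its published ports, not something taken from the setup above:

# Minimal sketch: check from the worker host that the head's published
# ports are reachable through the Docker port mappings.
# HEAD_IP and PORTS_TO_CHECK are placeholders.
import socket

HEAD_IP = "XXX.XXX.XXX.XXX"
PORTS_TO_CHECK = [6379, 8265, 10001, 500, 501, 502, 503, 504]

for port in PORTS_TO_CHECK:
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(2)
        result = sock.connect_ex((HEAD_IP, port))
        status = "open" if result == 0 else f"unreachable (errno {result})"
        print(f"{HEAD_IP}:{port} -> {status}")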

However, when running the same Python script I now get the following error message:

File "C:\Users\IvoKersten\Documents\GitHub\data-processing-lib\SQ_statistics\ray_utils.py", line 40, in initialize_ray
    ray.init(address=address, runtime_env=runtime_env, log_to_driver=log_to_driver)
  File "C:\Users\IvoKersten\Documents\GitHub\ray\python\ray\_private\client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "C:\Users\IvoKersten\Documents\GitHub\ray\python\ray\_private\worker.py", line 1248, in init
    ctx = builder.connect()
  File "C:\Users\IvoKersten\Documents\GitHub\ray\python\ray\client_builder.py", line 178, in connect
    client_info_dict = ray.util.client_connect.connect(
  File "C:\Users\IvoKersten\Documents\GitHub\ray\python\ray\util\client_connect.py", line 47, in connect
    conn = ray.connect(
  File "C:\Users\IvoKersten\Documents\GitHub\ray\python\ray\util\client\__init__.py", line 252, in connect
    conn = self.get_context().connect(*args, **kw_args)
  File "C:\Users\IvoKersten\Documents\GitHub\ray\python\ray\util\client\__init__.py", line 102, in connect
    self.client_worker._server_init(job_config, ray_init_kwargs)
  File "C:\Users\IvoKersten\Documents\GitHub\ray\python\ray\util\client\worker.py", line 838, in _server_init
    raise ConnectionAbortedError(
ConnectionAbortedError: Initialization failure from server:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/ray/util/client/server/proxier.py", line 678, in Datapath
    if not self.proxy_manager.start_specific_server(
  File "/usr/local/lib/python3.10/site-packages/ray/util/client/server/proxier.py", line 304, in start_specific_server
    serialized_runtime_env_context = self._create_runtime_env(
  File "/usr/local/lib/python3.10/site-packages/ray/util/client/server/proxier.py", line 253, in _create_runtime_env
    raise RuntimeError(
RuntimeError: Failed to create runtime_env for Ray client server, it is caused by:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/ray/dashboard/modules/runtime_env/runtime_env_agent.py", line 341, in _create_runtime_env_with_retry
    runtime_env_context = await asyncio.wait_for(
  File "/usr/local/lib/python3.10/asyncio/tasks.py", line 445, in wait_for
    return fut.result()
  File "/usr/local/lib/python3.10/site-packages/ray/dashboard/modules/runtime_env/runtime_env_agent.py", line 296, in _setup_runtime_env
    await create_for_plugin_if_needed(
  File "/usr/local/lib/python3.10/site-packages/ray/_private/runtime_env/plugin.py", line 251, in create_for_plugin_if_needed
    size_bytes = await plugin.create(uri, runtime_env, context, logger=logger)
  File "/usr/local/lib/python3.10/site-packages/ray/_private/runtime_env/py_modules.py", line 201, in create
    module_dir = await download_and_unpack_package(
  File "/usr/local/lib/python3.10/site-packages/ray/_private/runtime_env/packaging.py", line 596, in download_and_unpack_package
    raise IOError(f"Failed to fetch URI {pkg_uri} from GCS.")
OSError: Failed to fetch URI gcs://_ray_pkg_9af13bfea390025e.zip from GCS.

It appears there is some problem with communication between the nodes when sharing the provided runtime environment (the traceback shows the runtime_env agent failing to fetch the uploaded py_modules package from GCS).
Looking at the logs, all of the ports listed above are reported as open, except for the dashboard head gRPC port shown in dashboard.log:

2022-11-11 09:36:33,722	INFO head.py:97 -- Dashboard head grpc address: 0.0.0.0:41075
2022-11-11 09:36:33,730	INFO utils.py:105 -- Get all modules by type: DashboardHeadModule
2022-11-11 09:36:34,011	WARNING tune_head.py:23 -- tune module is not available: No module named 'tensorboard'
2022-11-11 09:36:34,011	INFO utils.py:138 -- Available modules: [<class 'ray.dashboard.modules.actor.actor_head.ActorHead'>, <class 'ray.dashboard.modules.event.event_head.EventHead'>, <class 'ray.dashboard.modules.healthz.healthz_head.HealthzHead'>, <class 'ray.dashboard.modules.job.job_head.JobHead'>, <class 'ray.dashboard.modules.log.log_head.LogHead'>, <class 'ray.dashboard.modules.node.node_head.NodeHead'>, <class 'ray.dashboard.modules.reporter.reporter_head.ReportHead'>, <class 'ray.dashboard.modules.snapshot.snapshot_head.APIHead'>, <class 'ray.dashboard.modules.state.state_head.StateHead'>, <class 'ray.dashboard.modules.tune.tune_head.TuneController'>, <class 'ray.dashboard.modules.usage_stats.usage_stats_head.UsageStatsHead'>]
2022-11-11 09:36:34,011	INFO head.py:158 -- Loading DashboardHeadModule: <class 'ray.dashboard.modules.actor.actor_head.ActorHead'>
2022-11-11 09:36:34,011	INFO head.py:158 -- Loading DashboardHeadModule: <class 'ray.dashboard.modules.event.event_head.EventHead'>
....
2022-11-11 09:36:34,016	INFO http_server_head.py:60 -- Setup static dir for dashboard: /usr/local/lib/python3.10/site-packages/ray/dashboard/client/build
2022-11-11 09:36:34,019	INFO http_server_head.py:135 -- Dashboard head http address: 172.17.0.2:8265
2022-11-11 09:36:34,019	INFO http_server_head.py:141 -- <ResourceRoute [GET] <PlainResource  /logical/actor_groups> -> <function ActorHead.get_actor_groups at 0x7fc33be94d30>
...
2022-11-11 09:36:34,029	INFO actor_head.py:111 -- Getting all actor info from GCS.
2022-11-11 09:36:34,034	INFO actor_head.py:137 -- Received 0 actor info from GCS.
2022-11-11 09:36:45,553	WARNING node_head.py:192 -- Head node is not registered even after 10 seconds. The API server might not work correctly. Please report a Github issue. Internal states :{'head_node_registration_time_s': None, 'registered_nodes': 1, 'registered_agents': 1, 'node_update_count': 100, 'module_lifetime_s': 11.538177728652954}
2022-11-11 09:36:50,577	WARNING node_head.py:192 -- Head node is not registered even after 10 seconds. The API server might not work correctly. Please report a Github issue. Internal states :{'head_node_registration_time_s': None, 'registered_nodes': 1, 'registered_agents': 1, 'node_update_count': 101, 'module_lifetime_s': 16.562352180480957}
...

Could this be related to the problem?
If so, how can I specify the dashboard head gRPC port so that I can publish it in the Docker container?

Any help would be appreciated.

Thanks for the report. Did you also try configuring the ports per the docs: Configuring Ray — Ray 2.8.0?

It might not require setting the addresses, or it can be configured with --include-dashboard.

Let us know.

I believe I set all port configurations that are available, as --include-dashboard is true by default.
From Python I connect to the cluster using:

ray.init(address='ray://XXX.XXX.XXX.XXX:10001', runtime_env={'py_modules':[LOCALLY DEVELOPED PACKAGES]})
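
To be concrete (the module name below is a placeholder, not one of the actual packages), the py_modules entries are the locally developed packages passed as imported module objects. Connecting without the runtime_env argument is also a useful sanity check, since it bypasses the package upload/download path that fails in the traceback above:

# Sketch only: "my_local_package" is a placeholder for the locally
# developed packages passed via py_modules.
import ray
import my_local_package

ray.init(
    address="ray://XXX.XXX.XXX.XXX:10001",
    runtime_env={"py_modules": [my_local_package]},
)

# Sanity check: connecting without a runtime_env skips the package
# fetch from GCS that fails in the traceback above.
# ray.init(address="ray://XXX.XXX.XXX.XXX:10001")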

@Ivo did you manage to find a solution to this problem? I think I’m experiencing a similar issue running Ray as a collection of Docker containers (scheduled using HashiCorp Nomad).

After spinning up the cluster, I encounter an HTTP 500 error when POSTing to /api/jobs via the Job Submission API. The tail of the stack trace looks like this:

  File "/opt/conda/lib/python3.9/site-packages/aiohttp/streams.py", line 616, in read
    await self._waiter
aiohttp.client_exceptions.ServerDisconnectedError: Server disconnected

This results in a single worker being marked as ‘Dead’ (although the worker’s Ray container doesn’t report any issues on stdout/stderr). Repeated job submissions cause the same problem, with a single worker marked as ‘Dead’ each time.
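
For reference, the submission is roughly equivalent to the following Python SDK call (the address and entrypoint are placeholders), which issues the same POST to /api/jobs under the hood:

# Sketch: submitting a job through the Python SDK; equivalent to POSTing
# to /api/jobs on the dashboard server. Address and entrypoint are placeholders.
from ray.job_submission import JobSubmissionClient

client = JobSubmissionClient("http://XXX.XXX.XXX.XXX:8265")
job_id = client.submit_job(entrypoint="python my_script.py")
print(job_id)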

In dashboard.log I can see many warnings of the form:

WARNING node_head.py:221 -- Head node is not registered even after 10 seconds. The API server might not work correctly. Please report a Github issue. Internal states :{'head_node_registration_time_s': None, 'registered_nodes': 3, 'registered_agents': 2, 'node_update_count': 203, 'module_lifetime_s': 526.1184296607971}

Neither of the following ports is mentioned in the Port Configuration docs:

INFO head.py:135 -- Dashboard head grpc address: 0.0.0.0:45594
2023-03-20 21:38:11,420 INFO head.py:239 -- Starting dashboard metrics server on port 44227

Do these also need to be exposed to other containers? I don’t believe they’re currently configurable.