How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
I am attempting to set up a cluster across multiple Windows machines, with Ray running inside Docker containers.
When I use a single machine with only a head node, I can successfully use it from Python, but as soon as I connect a worker node I run into problems.
To test this, I used two Windows host machines, one per node. Both Docker containers publish ports 22, 6379, 8265, 500-509, 10000-10030 and 23000-23050.
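For reference, the port publishing described above can be sketched as a small helper that builds the `docker run` arguments. This is only an illustration of the setup, not the exact command I ran: the image name `my-ray-image` and the `-d` flag are assumptions, and each port or range is simply mapped to the same number on the host.

```python
# Port list copied from the post; everything else here is illustrative.
PUBLISHED_PORTS = ["22", "6379", "8265", "500-509", "10000-10030", "23000-23050"]


def publish_args(ports):
    """Return the `-p host:container` arguments for `docker run`."""
    args = []
    for spec in ports:
        # Map each port (or port range) to the same port (or range) on the host.
        args += ["-p", f"{spec}:{spec}"]
    return args


def docker_run_command(image, ports=PUBLISHED_PORTS):
    """Build the argument list for a detached `docker run` publishing `ports`."""
    return ["docker", "run", "-d", *publish_args(ports), image]


if __name__ == "__main__":
    # "my-ray-image" is a hypothetical image name.
    print(" ".join(docker_run_command("my-ray-image")))
```

Any port that Ray picks at random at startup would not be covered by a fixed list like this, which is why every port needs to be pinned on the `ray start` command line.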
I then start the head node on one machine using:
ray start --head --port=6379 --dashboard-host 0.0.0.0 --node-ip-address XXX.XXX.XXX.XXX --metrics-export-port 502 --object-manager-port 500 --node-manager-port 501 --ray-client-server-port 10001 --redis-shard-ports 505,506,507,508 --min-worker-port 10002 --max-worker-port=10030 --dashboard-agent-grpc-port 503 --dashboard-agent-listen-port 504
And a worker node on the other machine using:
ray start --address=XXX.XXX.XXX.XXX:6379 --node-ip-address YYY.YYY.YYY.YYY --object-manager-port 500 --node-manager-port 501 --min-worker-port 10002 --max-worker-port 10030 --metrics-export-port 502 --dashboard-agent-grpc-port 503 --dashboard-agent-listen-port 504
Here XXX.XXX.XXX.XXX and YYY.YYY.YYY.YYY are the local IPs of the head and worker host machines, respectively.
However, when I now run the same Python script, I get the following error message:
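One way to check that the pinned ports are actually reachable from the other host is a plain TCP probe. The sketch below uses only the standard library; the port list is taken from the head node command above, and the function names are my own. Note that a port only shows as open while a process is listening on it, so this has to be run after `ray start`.

```python
import socket
import sys

# Ports pinned on the head node command line: GCS, dashboard, Ray client,
# plus 500-508 (object manager, node manager, metrics, agent, redis shards).
HEAD_PORTS = [6379, 8265, 10001, *range(500, 509)]


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def closed_ports(host, ports, timeout=1.0):
    """Return the subset of `ports` that could not be reached on `host`."""
    return [p for p in ports if not port_is_open(host, p, timeout)]


if __name__ == "__main__":
    # Usage: python probe.py <head-ip>
    if len(sys.argv) > 1:
        print("unreachable:", closed_ports(sys.argv[1], HEAD_PORTS))
```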
File "C:\Users\IvoKersten\Documents\GitHub\data-processing-lib\SQ_statistics\ray_utils.py", line 40, in initialize_ray
ray.init(address=address, runtime_env=runtime_env, log_to_driver=log_to_driver)
File "C:\Users\IvoKersten\Documents\GitHub\ray\python\ray\_private\client_mode_hook.py", line 105, in wrapper
return func(*args, **kwargs)
File "C:\Users\IvoKersten\Documents\GitHub\ray\python\ray\_private\worker.py", line 1248, in init
ctx = builder.connect()
File "C:\Users\IvoKersten\Documents\GitHub\ray\python\ray\client_builder.py", line 178, in connect
client_info_dict = ray.util.client_connect.connect(
File "C:\Users\IvoKersten\Documents\GitHub\ray\python\ray\util\client_connect.py", line 47, in connect
conn = ray.connect(
File "C:\Users\IvoKersten\Documents\GitHub\ray\python\ray\util\client\__init__.py", line 252, in connect
conn = self.get_context().connect(*args, **kw_args)
File "C:\Users\IvoKersten\Documents\GitHub\ray\python\ray\util\client\__init__.py", line 102, in connect
self.client_worker._server_init(job_config, ray_init_kwargs)
File "C:\Users\IvoKersten\Documents\GitHub\ray\python\ray\util\client\worker.py", line 838, in _server_init
raise ConnectionAbortedError(
ConnectionAbortedError: Initialization failure from server:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/site-packages/ray/util/client/server/proxier.py", line 678, in Datapath
if not self.proxy_manager.start_specific_server(
File "/usr/local/lib/python3.10/site-packages/ray/util/client/server/proxier.py", line 304, in start_specific_server
serialized_runtime_env_context = self._create_runtime_env(
File "/usr/local/lib/python3.10/site-packages/ray/util/client/server/proxier.py", line 253, in _create_runtime_env
raise RuntimeError(
RuntimeError: Failed to create runtime_env for Ray client server, it is caused by:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/site-packages/ray/dashboard/modules/runtime_env/runtime_env_agent.py", line 341, in _create_runtime_env_with_retry
runtime_env_context = await asyncio.wait_for(
File "/usr/local/lib/python3.10/asyncio/tasks.py", line 445, in wait_for
return fut.result()
File "/usr/local/lib/python3.10/site-packages/ray/dashboard/modules/runtime_env/runtime_env_agent.py", line 296, in _setup_runtime_env
await create_for_plugin_if_needed(
File "/usr/local/lib/python3.10/site-packages/ray/_private/runtime_env/plugin.py", line 251, in create_for_plugin_if_needed
size_bytes = await plugin.create(uri, runtime_env, context, logger=logger)
File "/usr/local/lib/python3.10/site-packages/ray/_private/runtime_env/py_modules.py", line 201, in create
module_dir = await download_and_unpack_package(
File "/usr/local/lib/python3.10/site-packages/ray/_private/runtime_env/packaging.py", line 596, in download_and_unpack_package
raise IOError(f"Failed to fetch URI {pkg_uri} from GCS.")
OSError: Failed to fetch URI gcs://_ray_pkg_9af13bfea390025e.zip from GCS.
It appears there is a communication problem between the nodes when the provided runtime environment is shared.
Looking at the logs, all ports listed are ones I have opened, except for the dashboard head gRPC port listed in the dashboard.log file:
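For context, the failing call has roughly the following shape. This is a sketch rather than my actual `ray_utils.py`: the use of `py_modules` is inferred from `py_modules.py` appearing in the traceback, the `ray://` address with port 10001 follows from the `--ray-client-server-port` setting, and the helper names and module paths are placeholders.

```python
def client_address(head_ip, client_port=10001):
    """Ray client address for the head node (port from --ray-client-server-port)."""
    return f"ray://{head_ip}:{client_port}"


def build_init_kwargs(head_ip, module_paths, log_to_driver=True):
    """Assemble the keyword arguments passed to ray.init in the traceback."""
    return {
        "address": client_address(head_ip),
        # Local packages listed here are uploaded to the cluster, and every
        # node then has to download them again as a gcs://_ray_pkg_....zip URI.
        "runtime_env": {"py_modules": list(module_paths)},
        "log_to_driver": log_to_driver,
    }


if __name__ == "__main__":
    import sys

    # Usage: python connect.py <head-ip> <module-dir> [<module-dir> ...]
    if len(sys.argv) > 2:
        import ray  # imported lazily so the helpers work without Ray installed

        ray.init(**build_init_kwargs(sys.argv[1], sys.argv[2:]))
```

The `OSError` above is raised at the download step, after the package was uploaded, which is why I suspect a blocked port rather than a problem with the runtime environment itself.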
1 2022-11-11 09:36:33,722 INFO head.py:97 -- Dashboard head grpc address: 0.0.0.0:41075
2 2022-11-11 09:36:33,730 INFO utils.py:105 -- Get all modules by type: DashboardHeadModule
3 2022-11-11 09:36:34,011 WARNING tune_head.py:23 -- tune module is not available: No module named 'tensorboard'
4 2022-11-11 09:36:34,011 INFO utils.py:138 -- Available modules: [<class 'ray.dashboard.modules.actor.actor_head.ActorHead'>, <class 'ray.dashboard.modules.event.event_head.EventHead'>, <class 'ray.dashboard.modules.healthz.healthz_head.HealthzHead'>, <class 'ray.dashboard.modules.job.job_head.JobHead'>, <class 'ray.dashboard.modules.log.log_head.LogHead'>, <class 'ray.dashboard.modules.node.node_head.NodeHead'>, <class 'ray.dashboard.modules.reporter.reporter_head.ReportHead'>, <class 'ray.dashboard.modules.snapshot.snapshot_head.APIHead'>, <class 'ray.dashboard.modules.state.state_head.StateHead'>, <class 'ray.dashboard.modules.tune.tune_head.TuneController'>, <class 'ray.dashboard.modules.usage_stats.usage_stats_head.UsageStatsHead'>]
5 2022-11-11 09:36:34,011 INFO head.py:158 -- Loading DashboardHeadModule: <class 'ray.dashboard.modules.actor.actor_head.ActorHead'>
6 2022-11-11 09:36:34,011 INFO head.py:158 -- Loading DashboardHeadModule: <class 'ray.dashboard.modules.event.event_head.EventHead'>
....
17 2022-11-11 09:36:34,016 INFO http_server_head.py:60 -- Setup static dir for dashboard: /usr/local/lib/python3.10/site-packages/ray/dashboard/client/build
18 2022-11-11 09:36:34,019 INFO http_server_head.py:135 -- Dashboard head http address: 172.17.0.2:8265
19 2022-11-11 09:36:34,019 INFO http_server_head.py:141 -- <ResourceRoute [GET] <PlainResource /logical/actor_groups> -> <function ActorHead.get_actor_groups at 0x7fc33be94d30>
...
74 2022-11-11 09:36:34,029 INFO actor_head.py:111 -- Getting all actor info from GCS.
75 2022-11-11 09:36:34,034 INFO actor_head.py:137 -- Received 0 actor info from GCS.
76 2022-11-11 09:36:45,553 WARNING node_head.py:192 -- Head node is not registered even after 10 seconds. The API server might not work correctly. Please report a Github issue. Internal states :{'head_node_registration_time_s': None, 'registered_nodes': 1, 'registered_agents': 1, 'node_update_count': 100, 'module_lifetime_s': 11.538177728652954}
77 2022-11-11 09:36:50,577 WARNING node_head.py:192 -- Head node is not registered even after 10 seconds. The API server might not work correctly. Please report a Github issue. Internal states :{'head_node_registration_time_s': None, 'registered_nodes': 1, 'registered_agents': 1, 'node_update_count': 101, 'module_lifetime_s': 16.562352180480957}
...
Could this be related to the problem?
If so, how can I specify the dashboard head gRPC port so that I can publish it from the Docker container?
Any help would be appreciated.