I am setting up a Ray cluster across two Windows machines, with Ray running inside a Docker container on each machine. Both containers expose the following ports: 22, 500-509, 6379, 8265, 10001-10030 and 23000-23200.
I start the head node using:
ray start --head --port=6379 --autoscaling-config=~/ray_bootstrap_config.yaml --dashboard-host 0.0.0.0 --node-ip-address XXX.XXX.XXX.XXX --metrics-export-port 502 --object-manager-port 500 --node-manager-port 501 --ray-client-server-port 10001 --redis-shard-ports 503,504,505,506,507,508 --max-worker-port=10030
And the worker node using:
ray start --address=XXX.XXX.XXX.XXX:6379 --node-ip-address YYY.YYY.YYY.YYY --object-manager-port 500 --node-manager-port 501 --max-worker-port 10030 --metrics-export-port 502
where XXX.XXX.XXX.XXX and YYY.YYY.YYY.YYY are the local IP addresses of the host machines.
I am able to connect to the Ray client server and run code on the cluster without problems until I pass a runtime_env to ray.init(). For example, the following code runs fine when the runtime_env argument is omitted:
from collections import Counter
import socket
import time

import ray
import camelcase


@ray.remote
def f():
    time.sleep(0.001)
    # Return IP address.
    return socket.gethostbyname(socket.gethostname())


if __name__ == "__main__":
    runtime_env = {'py_modules': [camelcase]}
    ray.init(address='ray://XXX.XXX.XXX.XXX:10001', runtime_env=runtime_env)
    print(ray.nodes())
    print('''This cluster consists of
    {} nodes in total
    {} CPU resources in total
'''.format(len(ray.nodes()), ray.cluster_resources()['CPU']))
    object_ids = [f.remote() for _ in range(10000)]
    ip_addresses = ray.get(object_ids)
    print('Tasks executed')
    for ip_address, num_tasks in Counter(ip_addresses).items():
        print('    {} tasks on {}'.format(num_tasks, ip_address))
But with this runtime_env included, I get the following error messages:
2022-03-22 10:00:53,565 INFO packaging.py:363 -- Creating a file package for local directory '/home/ubuntu/.local/lib/python3.8/site-packages/camelcase'.
2022-03-22 10:00:53,567 INFO packaging.py:223 -- Pushing file package 'gcs://_ray_pkg_1397935d9fbe73aa.zip' (0.00MiB) to Ray cluster...
2022-03-22 10:00:53,584 INFO packaging.py:226 -- Successfully pushed file package 'gcs://_ray_pkg_1397935d9fbe73aa.zip'.
(raylet) [2022-03-22 10:00:43,928 E 131 131] (raylet) agent_manager.cc:196: Failed to create the runtime env: {"pythonRuntimeEnv": {"pyModules": ["gcs://_ray_pkg_e4637d32ba040e4a.zip"]}, "uris": {"pyModulesUris": ["gcs://_ray_pkg_e4637d32ba040e4a.zip"]}}, status = GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details: , maybe there are some network problems, will retry it later.
(raylet) [2022-03-22 10:00:43,940 E 131 131] (raylet) agent_manager.cc:196: Failed to create the runtime env: {"pythonRuntimeEnv": {"pyModules": ["gcs://_ray_pkg_e4637d32ba040e4a.zip"]}, "uris": {"pyModulesUris": ["gcs://_ray_pkg_e4637d32ba040e4a.zip"]}}, status = GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details: , maybe there are some network problems, will retry it later.
(raylet) [2022-03-22 10:00:43,944 E 131 131] (raylet) agent_manager.cc:196: Failed to create the runtime env: {"pythonRuntimeEnv": {"pyModules": ["gcs://_ray_pkg_e4637d32ba040e4a.zip"]}, "uris": {"pyModulesUris": ["gcs://_ray_pkg_e4637d32ba040e4a.zip"]}}, status = GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details: , maybe there are some network problems, will retry it later.
(raylet) [2022-03-22 10:00:43,945 E 131 131] (raylet) agent_manager.cc:237: Failed to delete URIs, status = GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details: , maybe there are some network problems, will retry it later.
(raylet) [2022-03-22 10:00:44,121 E 131 131] (raylet) agent_manager.cc:196: Failed to create the runtime env: {"pythonRuntimeEnv": {"pyModules": ["gcs://_ray_pkg_1397935d9fbe73aa.zip"]}, "uris": {"pyModulesUris": ["gcs://_ray_pkg_1397935d9fbe73aa.zip"]}}, status = GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details: , maybe there are some network problems, will retry it later.
(raylet) [2022-03-22 10:00:44,720 E 131 131] (raylet) agent_manager.cc:196: Failed to create the runtime env: {"pythonRuntimeEnv": {"pyModules": ["gcs://_ray_pkg_1397935d9fbe73aa.zip"]}, "uris": {"pyModulesUris": ["gcs://_ray_pkg_1397935d9fbe73aa.zip"]}}, status = GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details: , maybe there are some network problems, will retry it later.
(raylet) [2022-03-22 10:00:44,929 E 131 131] (raylet) agent_manager.cc:196: Failed to create the runtime env: {"pythonRuntimeEnv": {"pyModules": ["gcs://_ray_pkg_e4637d32ba040e4a.zip"]}, "uris": {"pyModulesUris": ["gcs://_ray_pkg_e4637d32ba040e4a.zip"]}}, status = GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details: , maybe there are some network problems, will retry it later.
(raylet) [2022-03-22 10:00:44,941 E 131 131] (raylet) agent_manager.cc:196: Failed to create the runtime env: {"pythonRuntimeEnv": {"pyModules": ["gcs://_ray_pkg_e4637d32ba040e4a.zip"]}, "uris": {"pyModulesUris": ["gcs://_ray_pkg_e4637d32ba040e4a.zip"]}}, status = GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details: , maybe there are some network problems, will retry it later.
(raylet) [2022-03-22 10:00:44,945 E 131 131] (raylet) agent_manager.cc:196: Failed to create the runtime env: {"pythonRuntimeEnv": {"pyModules": ["gcs://_ray_pkg_e4637d32ba040e4a.zip"]}, "uris": {"pyModulesUris": ["gcs://_ray_pkg_e4637d32ba040e4a.zip"]}}, status = GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details: , maybe there are some network problems, will retry it later.
(raylet) [2022-03-22 10:00:44,946 E 131 131] (raylet) agent_manager.cc:237: Failed to delete URIs, status = GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details: , maybe there are some network problems, will retry it later.
This error, combined with the fact that the same program runs without errors when the two Docker containers run on the same host machine rather than on two different machines, leads me to think that more ports need to be exposed in the containers, but I cannot find out which ones.
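To narrow down which ports are unreachable between the two machines, a small TCP check along these lines could be run from one container against the other while Ray is running. This is only a sketch: the helper names `port_is_open` and `closed_ports` are made up for illustration, and `RAY_PORTS` lists only the fixed ports from the `ray start` commands above, so it will not detect any port that Ray picks on its own. A port is also reported as closed when nothing is listening on it, not only when it is blocked.

```python
import socket

# Fixed ports taken from the `ray start` commands in this post.
# Any port Ray chooses by itself is NOT covered by this list.
RAY_PORTS = [6379, 8265, 10001, 500, 501, 502]


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unresolvable hosts.
        return False


def closed_ports(host, ports, timeout=1.0):
    """Return the subset of ports on host that could not be connected to."""
    return [port for port in ports if not port_is_open(host, port, timeout)]
```

For example, calling `closed_ports("XXX.XXX.XXX.XXX", RAY_PORTS)` from inside the worker container, with the real head node IP substituted for the placeholder, would list the head node ports that the worker cannot reach.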
Any help would be appreciated, as being able to provide runtime environments to the cluster is important to the application being developed.