Runtime_env fails when running Ray in Docker

I am setting up a cluster on two Windows machines where I want to run Ray within a Docker environment. Both Docker containers expose the following ports: 22, 500-509, 6379, 8265, 10001-10030, and 23000-23200.

I start the head node using:

ray start --head --port=6379 --autoscaling-config=~/ray_bootstrap_config.yaml --dashboard-host 0.0.0.0 --node-ip-address XXX.XXX.XXX.XXX --metrics-export-port 502 --object-manager-port 500 --node-manager-port 501 --ray-client-server-port 10001 --redis-shard-ports 503,504,505,506,507,508 --max-worker-port=10030

And the worker node using:

ray start --address=XXX.XXX.XXX.XXX:6379 --node-ip-address YYY.YYY.YYY.YYY --object-manager-port 500 --node-manager-port 501 --max-worker-port 10030 --metrics-export-port 502

where XXX.XXX.XXX.XXX and YYY.YYY.YYY.YYY are the local IP addresses of the host machines.

I am able to connect to the Ray client server and run code on the cluster fine, until I try to provide a runtime_env to ray.init(). For example, the following code runs fine if the runtime_env argument is omitted from the ray.init() call:

from collections import Counter
import socket
import time
import ray
import camelcase


@ray.remote
def f():
    time.sleep(0.001)
    # Return IP address.
    return socket.gethostbyname(socket.gethostname())


if __name__ == "__main__":
    runtime_env = {'py_modules': [camelcase]}
    ray.init(address='ray://XXX.XXX.XXX.XXX:10001', runtime_env=runtime_env)

    print(ray.nodes())
    print('''This cluster consists of
        {} nodes in total
        {} CPU resources in total
    '''.format(len(ray.nodes()), ray.cluster_resources()['CPU']))

    object_ids = [f.remote() for _ in range(10000)]
    ip_addresses = ray.get(object_ids)

    print('Tasks executed')
    for ip_address, num_tasks in Counter(ip_addresses).items():
        print('    {} tasks on {}'.format(num_tasks, ip_address))

But with this runtime_env included, I get the following error messages:

2022-03-22 10:00:53,565 INFO packaging.py:363 -- Creating a file package for local directory '/home/ubuntu/.local/lib/python3.8/site-packages/camelcase'.
2022-03-22 10:00:53,567 INFO packaging.py:223 -- Pushing file package 'gcs://_ray_pkg_1397935d9fbe73aa.zip' (0.00MiB) to Ray cluster...
2022-03-22 10:00:53,584 INFO packaging.py:226 -- Successfully pushed file package 'gcs://_ray_pkg_1397935d9fbe73aa.zip'.
(raylet) [2022-03-22 10:00:43,928 E 131 131] (raylet) agent_manager.cc:196: Failed to create the runtime env: {"pythonRuntimeEnv": {"pyModules": ["gcs://_ray_pkg_e4637d32ba040e4a.zip"]}, "uris": {"pyModulesUris": ["gcs://_ray_pkg_e4637d32ba040e4a.zip"]}}, status = GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details: , maybe there are some network problems, will retry it later.
(raylet) [2022-03-22 10:00:43,940 E 131 131] (raylet) agent_manager.cc:196: Failed to create the runtime env: {"pythonRuntimeEnv": {"pyModules": ["gcs://_ray_pkg_e4637d32ba040e4a.zip"]}, "uris": {"pyModulesUris": ["gcs://_ray_pkg_e4637d32ba040e4a.zip"]}}, status = GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details: , maybe there are some network problems, will retry it later.
(raylet) [2022-03-22 10:00:43,944 E 131 131] (raylet) agent_manager.cc:196: Failed to create the runtime env: {"pythonRuntimeEnv": {"pyModules": ["gcs://_ray_pkg_e4637d32ba040e4a.zip"]}, "uris": {"pyModulesUris": ["gcs://_ray_pkg_e4637d32ba040e4a.zip"]}}, status = GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details: , maybe there are some network problems, will retry it later.
(raylet) [2022-03-22 10:00:43,945 E 131 131] (raylet) agent_manager.cc:237: Failed to delete URIs, status = GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details: , maybe there are some network problems, will retry it later.
(raylet) [2022-03-22 10:00:44,121 E 131 131] (raylet) agent_manager.cc:196: Failed to create the runtime env: {"pythonRuntimeEnv": {"pyModules": ["gcs://_ray_pkg_1397935d9fbe73aa.zip"]}, "uris": {"pyModulesUris": ["gcs://_ray_pkg_1397935d9fbe73aa.zip"]}}, status = GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details: , maybe there are some network problems, will retry it later.
(raylet) [2022-03-22 10:00:44,720 E 131 131] (raylet) agent_manager.cc:196: Failed to create the runtime env: {"pythonRuntimeEnv": {"pyModules": ["gcs://_ray_pkg_1397935d9fbe73aa.zip"]}, "uris": {"pyModulesUris": ["gcs://_ray_pkg_1397935d9fbe73aa.zip"]}}, status = GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details: , maybe there are some network problems, will retry it later.
(raylet) [2022-03-22 10:00:44,929 E 131 131] (raylet) agent_manager.cc:196: Failed to create the runtime env: {"pythonRuntimeEnv": {"pyModules": ["gcs://_ray_pkg_e4637d32ba040e4a.zip"]}, "uris": {"pyModulesUris": ["gcs://_ray_pkg_e4637d32ba040e4a.zip"]}}, status = GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details: , maybe there are some network problems, will retry it later.
(raylet) [2022-03-22 10:00:44,941 E 131 131] (raylet) agent_manager.cc:196: Failed to create the runtime env: {"pythonRuntimeEnv": {"pyModules": ["gcs://_ray_pkg_e4637d32ba040e4a.zip"]}, "uris": {"pyModulesUris": ["gcs://_ray_pkg_e4637d32ba040e4a.zip"]}}, status = GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details: , maybe there are some network problems, will retry it later.
(raylet) [2022-03-22 10:00:44,945 E 131 131] (raylet) agent_manager.cc:196: Failed to create the runtime env: {"pythonRuntimeEnv": {"pyModules": ["gcs://_ray_pkg_e4637d32ba040e4a.zip"]}, "uris": {"pyModulesUris": ["gcs://_ray_pkg_e4637d32ba040e4a.zip"]}}, status = GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details: , maybe there are some network problems, will retry it later.
(raylet) [2022-03-22 10:00:44,946 E 131 131] (raylet) agent_manager.cc:237: Failed to delete URIs, status = GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details: , maybe there are some network problems, will retry it later.

This error message, combined with the fact that the same program runs without errors when the two Docker containers run on the same host machine instead of on two different machines, leads me to think that more ports need to be opened in the containers, but I cannot find out which ports these are.
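One way to narrow down a suspected blocked port is to probe it from the other machine. The following is a minimal sketch using only the Python standard library; the helper name port_is_reachable and the host and port values in the usage section are placeholders chosen for illustration, not part of Ray.

```python
import socket


def port_is_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Placeholder host; replace with the IP address of the node being probed.
    host = "127.0.0.1"
    # Ports taken from the `ray start` commands above; extend as needed.
    for port in (6379, 8265, 10001):
        state = "reachable" if port_is_reachable(host, port) else "NOT reachable"
        print(f"{host}:{port} is {state}")
```

A successful TCP connection only shows that something is listening and the port is published; it does not prove that the listener is the expected Ray process.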

Any help would be appreciated as being able to provide runtime environments to the cluster is important to the application being developed.

@architkulkarni could you help answer this question?

Hi @Ivo, sorry you’re running into this. It’s interesting that the program works when both Docker containers run on the same machine. The ports you listed look fine. Does the problem persist on the Ray nightly build? Also, is there any relevant information in the logs (dashboard_agent.log or runtime_env_setup.log)? By default these are at /tmp/ray/session_latest/logs, at least on Linux/Mac; I’m not 100% sure about the location on Windows.

Hi @architkulkarni, thanks for your response.

The dashboard_agent.log and runtime_env_setup-ray_client_server_23000.log files don’t seem to have any relevant information. The runtime_env_setup-ray_client_server_23000.log is empty, and the content of the dashboard_agent.log is:

2022-03-23 12:40:48,581	INFO agent.py:100 -- Parent pid is 95
2022-03-23 12:40:48,582	INFO agent.py:105 -- Dashboard agent grpc address: 0.0.0.0:65406
2022-03-23 12:40:48,584	INFO utils.py:79 -- Get all modules by type: DashboardAgentModule
2022-03-23 12:40:48,871	INFO agent.py:118 -- Loading DashboardAgentModule: <class 'ray.dashboard.modules.event.event_agent.EventAgent'>
2022-03-23 12:40:48,872	INFO event_agent.py:31 -- Event agent cache buffer size: 10240
2022-03-23 12:40:48,872	INFO agent.py:118 -- Loading DashboardAgentModule: <class 'ray.dashboard.modules.log.log_agent.LogAgent'>
2022-03-23 12:40:48,872	INFO agent.py:118 -- Loading DashboardAgentModule: <class 'ray.dashboard.modules.reporter.reporter_agent.ReporterAgent'>
2022-03-23 12:40:48,874	INFO agent.py:118 -- Loading DashboardAgentModule: <class 'ray.dashboard.modules.runtime_env.runtime_env_agent.RuntimeEnvAgent'>
2022-03-23 12:40:48,875	INFO agent.py:123 -- Loaded 4 modules.
2022-03-23 12:40:48,875	INFO agent.py:205 -- Dashboard agent http address: 0.0.0.0:45889
2022-03-23 12:40:48,875	INFO agent.py:213 -- <ResourceRoute [GET] <StaticResource  /logs -> PosixPath('/tmp/ray/session_2022-03-23_12-40-44_697946_19/logs')> -> <bound method StaticResource._handle of <StaticResource  /logs -> PosixPath('/tmp/ray/session_2022-03-23_12-40-44_697946_19/logs')>>
2022-03-23 12:40:48,876	INFO agent.py:213 -- <ResourceRoute [OPTIONS] <StaticResource  /logs -> PosixPath('/tmp/ray/session_2022-03-23_12-40-44_697946_19/logs')> -> <bound method _PreflightHandler._preflight_handler of <aiohttp_cors.cors_config._CorsConfigImpl object at 0x7fed688a4970>>
2022-03-23 12:40:48,876	INFO agent.py:214 -- Registered 2 routes.
2022-03-23 12:40:48,897	INFO event_agent.py:48 -- Report events to b'172.17.0.2:33729'
2022-03-23 12:40:48,898	INFO event_utils.py:123 -- Monitor events logs modified after 1648037448.593545 on /tmp/ray/session_2022-03-23_12-40-44_697946_19/logs/events, the source types are ['CORE_WORKER', 'RAYLET', 'COMMON'].
2022-03-23 12:43:47,280	INFO runtime_env_agent.py:179 -- Creating runtime env: {"pythonRuntimeEnv": {"pyModules": ["gcs://_ray_pkg_e4637d32ba040e4a.zip"]}, "uris": {"pyModulesUris": ["gcs://_ray_pkg_e4637d32ba040e4a.zip"]}}
2022-03-23 12:43:47,289	INFO runtime_env_agent.py:208 -- Successfully created runtime env: {"pythonRuntimeEnv": {"pyModules": ["gcs://_ray_pkg_e4637d32ba040e4a.zip"]}, "uris": {"pyModulesUris": ["gcs://_ray_pkg_e4637d32ba040e4a.zip"]}}, the context: {"command_prefix": [], "env_vars": {"PYTHONPATH": "/tmp/ray/session_2022-03-23_12-40-44_697946_19/runtime_resources/py_modules_files/_ray_pkg_e4637d32ba040e4a"}, "py_executable": "/usr/local/bin/python", "resources_dir": null, "container": {}}

On the Ray nightly build the problem persists but with a different error:

2022-03-23 13:00:03,747 INFO packaging.py:388 -- Creating a file package for local directory '/home/ubuntu/.local/lib/python3.8/site-packages/camelcase'.
2022-03-23 13:00:03,749 INFO packaging.py:241 -- Pushing file package 'gcs://_ray_pkg_6c7ede3b10b2cb86.zip' (0.00MiB) to Ray cluster...
2022-03-23 13:00:03,760 INFO packaging.py:243 -- Successfully pushed file package 'gcs://_ray_pkg_6c7ede3b10b2cb86.zip'.
Traceback (most recent call last):
  File "algos/scripts/test.py", line 27, in <module>
    ray.init(address='ray://XXX.XXX.XXX.XXX:10001', runtime_env=runtime_env)
  File "/usr/local/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/ray/worker.py", line 882, in init
    return builder.connect()
  File "/usr/local/lib/python3.8/site-packages/ray/client_builder.py", line 167, in connect
    dashboard_url = ray.get(get_dashboard_url.options(num_cpus=0).remote())
  File "/usr/local/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 104, in wrapper
    return getattr(ray, func.__name__)(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/ray/util/client/api.py", line 43, in get
    return self.worker.get(vals, timeout=timeout)
  File "/usr/local/lib/python3.8/site-packages/ray/util/client/worker.py", line 433, in get
    res = self._get(to_get, op_timeout)
  File "/usr/local/lib/python3.8/site-packages/ray/util/client/worker.py", line 461, in _get
    raise err
ray.exceptions.RuntimeEnvSetupError: Failed to setup runtime environment.
Failed to request agent.

The dashboard_agent.log and runtime_env_setup.log contain the same information as before, but looking into the dashboard.log file, it does contain the line

2022-03-23 12:57:52,666	INFO http_server_head.py:131 -- Dashboard head http address: 172.17.0.2:8265

where 172.17.0.2 is the internal IP of the Docker, not the IP of the host machine. Could this be related to the problem?

Hi @architkulkarni, any ideas as to what I could try to resolve this problem?

Sorry for the delay, and thanks for looking into the logs! @GuyangSong do you have any ideas about what could be causing this? Is it related to the IP addresses as @Ivo suggested? It looks like this is happening in the place where the Raylet is making an RPC to the runtime env agent: ray/agent_manager.cc at commit 69af9764b25e76178d23579506f57f043d070a89 in ray-project/ray on GitHub.

It seems the dashboard agent used port 65406, which is not exposed.
Can you set a valid port with the command line option --dashboard-agent-grpc-port of ray start?
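The same check can be done mechanically: read the agent addresses out of dashboard_agent.log and compare them with the port ranges the containers expose. This is only a sketch; the regular expression is based on the two log lines quoted earlier in this thread, the range list is copied from the original post, and the function names are made up for illustration.

```python
import re

# Port ranges published by the containers, as listed in the original post.
EXPOSED_RANGES = [(22, 22), (500, 509), (6379, 6379), (8265, 8265),
                  (10001, 10030), (23000, 23200)]

# Matches lines such as "... Dashboard agent grpc address: 0.0.0.0:65406".
AGENT_ADDR_RE = re.compile(r"Dashboard agent (grpc|http) address: \S+:(\d+)")


def agent_ports(log_text):
    """Return a dict like {'grpc': 65406, 'http': 45889} parsed from the log text."""
    return {kind: int(port) for kind, port in AGENT_ADDR_RE.findall(log_text)}


def is_exposed(port, ranges=EXPOSED_RANGES):
    """Return True if `port` falls inside one of the (low, high) inclusive ranges."""
    return any(low <= port <= high for low, high in ranges)


if __name__ == "__main__":
    sample = (
        "2022-03-23 12:40:48,582\tINFO agent.py:105 -- Dashboard agent grpc address: 0.0.0.0:65406\n"
        "2022-03-23 12:40:48,875\tINFO agent.py:205 -- Dashboard agent http address: 0.0.0.0:45889\n"
    )
    for kind, port in agent_ports(sample).items():
        print(kind, port, "exposed" if is_exposed(port) else "not exposed")
```

With the log excerpt from this thread, both the gRPC port and the HTTP port fall outside the exposed ranges.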

Hi @GuyangSong, thank you very much. Indeed, specifying both --dashboard-agent-grpc-port and --dashboard-agent-listen-port with ports that are exposed makes it work.
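For anyone landing here later, a rough sketch of what the adjusted worker command could look like, assembled in Python so the port choices are easy to change. The two agent port values (508 and 509) are assumptions picked from the exposed 500-509 range purely for illustration, and the function name is hypothetical. The other flags are the ones from the worker command at the top of the thread.

```python
import shlex


def worker_start_command(head_address, node_ip, agent_grpc_port, agent_listen_port):
    """Build a `ray start` command line for a worker node with fixed agent ports."""
    args = [
        "ray", "start",
        f"--address={head_address}",
        "--node-ip-address", node_ip,
        "--object-manager-port", "500",
        "--node-manager-port", "501",
        "--max-worker-port", "10030",
        "--metrics-export-port", "502",
        # The two options that resolved the issue in this thread.
        "--dashboard-agent-grpc-port", str(agent_grpc_port),
        "--dashboard-agent-listen-port", str(agent_listen_port),
    ]
    return shlex.join(args)


if __name__ == "__main__":
    # 508 and 509 are assumed values; use any free ports the container exposes.
    print(worker_start_command("XXX.XXX.XXX.XXX:6379", "YYY.YYY.YYY.YYY", 509, 508))
```

On the head node the same two options apply, but the chosen ports must not collide with ports already assigned there, such as the Redis shard ports 503-508 in the head command above.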

Do you know where I could find a full list of all command line options? I have not found these options anywhere in the Ray command line API and only encountered --dashboard-agent-listen-port by accident when trying to change the http port of the dashboard agent.

@Ivo It’s a hidden parameter. You can find the full list of parameters in the source code. In the long term, @sangcho, can we make the --dashboard-agent-grpc-port parameter public?