Ray client fails when specifying Conda Environment

Hello,

I am trying to deploy a Ray cluster on Kubernetes while specifying a particular conda environment for the workers but am unable to do so.

I am using Ray v1.7.0 with Python 3.8 and MicroK8s v1.22.2 as the Kubernetes environment, and am following the steps in "Installing the Ray Operator with Helm".

The cluster works great when I call:

ray.init("ray://192.168.1.191:10001")

and I am able to set runtime environments like

ray.init("ray://192.168.1.191:10001",
         runtime_env={"env_vars": {
             "OMP_NUM_THREADS": "32",
             "TF_WARNINGS": "none",
         }})

with no issue.

The problem arises when I try to specify a conda environment.

ray.init("ray://192.168.1.191:10001",
         runtime_env={"conda": {
             "dependencies": ["pip", {"pip": ["pendulum"]}]
         }})

I get the following error:

ConnectionAbortedError: Initialization failure from server:
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/util/client/server/proxier.py", line 612, in Datapath
    if not self.proxy_manager.start_specific_server(
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/util/client/server/proxier.py", line 269, in start_specific_server
    serialized_runtime_env_context = self._create_runtime_env(
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/util/client/server/proxier.py", line 225, in _create_runtime_env
    raise RuntimeError(
RuntimeError: Failed to create runtime_env for Ray client server: [Errno 2] No such file or directory: '/tmp/ray/session_2021-10-14_09-40-18_105353_113/runtime_resources/conda/ray-e10fe98776459d9b5c7be1d91a5dcb02493e4749'

I have also tried building the conda environment into the rayImage and referencing that prebuilt environment instead. In that case I get:

ConnectionAbortedError: Initialization failure from server:
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/util/client/server/proxier.py", line 612, in Datapath
    if not self.proxy_manager.start_specific_server(
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/util/client/server/proxier.py", line 269, in start_specific_server
    serialized_runtime_env_context = self._create_runtime_env(
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/util/client/server/proxier.py", line 250, in _create_runtime_env
    raise TimeoutError(
TimeoutError: CreateRuntimeEnv request failed after 5 attempts.

Any help on this would be greatly appreciated!

Hi @sfigueroa, sorry you’re running into this issue, and thanks for reporting it. This looks like a bug; we’ll get a fix out as soon as possible.

Hi @sfigueroa, can you try pip install "ray[default]" and see if that fixes things? It installs some dependencies that are required for runtime environments.

We’ll add a better error message that detects when ray[default] isn’t installed, instead of failing mysteriously as you experienced. Sorry about that!
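Until that error message exists, one rough way to check a node is to test whether the extra packages can be imported there. The sketch below rests on assumptions: the module names in `ASSUMED_DEFAULT_EXTRAS` are only a guess at what `ray[default]` installs for this Ray version, and `missing_modules` is a hypothetical helper, not part of Ray. The `extras_require` section of Ray's setup.py for your version has the actual list.

```python
import importlib.util

# Assumed examples of modules that `ray[default]` installs. This list is
# illustrative only; verify it against the Ray version you are running.
ASSUMED_DEFAULT_EXTRAS = ("aiohttp", "aiohttp_cors")


def missing_modules(names=ASSUMED_DEFAULT_EXTRAS):
    """Return the top-level module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Not importable on this node:", ", ".join(missing))
    else:
        print("All checked modules are importable.")
```

Running this inside the head and worker containers, not just on the client machine, shows whether the image itself is missing the extras.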

I’m running into the same problem with Python 3.7 and Ray 1.8.0, both locally and on Kubernetes (GKE) deployed with a current Ray Helm chart. Running pip install ray[default] locally doesn’t seem to help. Any suggestions on how to fix this?

I should add that instead of conda dependencies I am specifying pip dependencies in the runtime_env dict.
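For reference, a pip-only runtime environment of this kind can be sketched as below. The helper names `pip_runtime_env` and `connect` are hypothetical, and the address and the `pendulum` package are simply reused from the original post as placeholders; this is not necessarily the exact configuration that fails.

```python
def pip_runtime_env(packages):
    """Build a runtime_env dict that requests pip packages only."""
    return {"pip": list(packages)}


def connect(address, packages):
    """Connect through Ray Client with a pip-based runtime environment."""
    import ray  # imported lazily so pip_runtime_env works without Ray installed

    return ray.init(address, runtime_env=pip_runtime_env(packages))
```

A call would then look like connect("ray://192.168.1.191:10001", ["pendulum"]).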

My wild guess is that this has something to do with the ephemeral nature of container storage where /tmp resides? I could be way off, though.

Dmitry

Hi @dmitry.karpeyev, sorry for the late response here. You would need ray[default] installed on every node of the cluster, not just locally. Could you share the logs if you have them? The pip installation logs should be in dashboard_agent.log or the ray_client_server_... files.
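If it helps with collecting those logs, here is a minimal sketch that prints the tail of each matching file. It assumes the logs live under /tmp/ray/session_latest/logs, which is Ray's usual default location and may differ if a custom temp directory is configured; `tail_ray_logs` is a hypothetical helper, not a Ray API.

```python
from collections import deque
from pathlib import Path


def tail_ray_logs(log_dir="/tmp/ray/session_latest/logs", lines=50):
    """Return {file name: last `lines` lines} for the relevant log files."""
    base = Path(log_dir)
    if not base.is_dir():
        return {}
    tails = {}
    # File name patterns taken from the names mentioned above.
    for pattern in ("dashboard_agent.log", "ray_client_server_*"):
        for path in sorted(base.glob(pattern)):
            if not path.is_file():
                continue
            with path.open(errors="replace") as handle:
                # deque with maxlen keeps only the final `lines` lines.
                tails[path.name] = list(deque(handle, maxlen=lines))
    return tails


if __name__ == "__main__":
    for name, tail in tail_ray_logs().items():
        print(f"==== {name} ====")
        print("".join(tail), end="")
```

On Kubernetes this would need to run inside the head pod (and any worker pod of interest), since each container has its own /tmp.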