Using a custom path to a conda installation with `runtime_env`

I’m setting up a Ray cluster on Kubernetes. I have conda environments installed on a persistent disk that I’d like my workers to use. I’m trying to use the runtime_env feature to activate one of these environments with:

import ray

ray.init(
    "ray://<cluster-ip>:10001",
    runtime_env={
        "conda": "domino"
    },
)

However, I get RuntimeError: Starting up Server Failed! Check ray_client_server_[port].err on the cluster.
When I check the logs, I see Could not find conda environment: domino.

This is unsurprising – CONDA_EXE="/home/ray/anaconda3/bin/conda" on the workers, while my environments live in a conda installation at /pd/common/envs/conda/bin/conda (note: I’ve confirmed that pd is properly mounted).
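To double-check which environments each conda installation can actually see, something like the following could be run inside a worker pod. This is only a minimal sketch: it assumes both conda binaries support `conda env list --json`, and the helper names `env_names_from_json` and `list_env_names` are made up for illustration (they are not part of Ray or conda).

```python
import json
import os
import subprocess


def env_names_from_json(payload: str) -> list:
    """Extract environment names from the output of `conda env list --json`."""
    env_paths = json.loads(payload).get("envs", [])
    # Each entry is an absolute path; its last component is the env name
    # (for the base env this is the installation directory name).
    return [os.path.basename(p.rstrip("/")) for p in env_paths]


def list_env_names(conda_exe: str) -> list:
    """Ask one specific conda executable which environments it knows about."""
    result = subprocess.run(
        [conda_exe, "env", "list", "--json"],
        capture_output=True, text=True, check=True,
    )
    return env_names_from_json(result.stdout)


if __name__ == "__main__":
    # The two paths from the post: the image's conda and the one on the disk.
    for exe in ("/home/ray/anaconda3/bin/conda", "/pd/common/envs/conda/bin/conda"):
        if os.path.exists(exe):
            print(exe, "->", list_env_names(exe))
        else:
            print(exe, "-> not found on this machine")
```

If domino only shows up for the second executable, that would be consistent with the error in the logs.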

My question is: how can I specify a custom path to a conda installation when using runtime_env?

I’ve tried, to no avail, setting the environment variables RAY_CONDA_HOME and CONDA_EXE in the runtime_env dictionary:

runtime_env={
    "env_vars": {
        "RAY_CONDA_HOME": "/home/common/envs/conda/bin/conda",
        "CONDA_EXE": "/home/common/envs/conda/bin/conda"
    },
    "conda": "domino"
},

Any help would be appreciated!

Thanks!

I just tried creating a custom image that sets these environment variables.
Dockerfile:

FROM rayproject/ray:nightly-py38

ENV CONDA_EXE=/pd/common/envs/conda/bin/conda
ENV RAY_CONDA_HOME=/pd/common/envs/conda

RUN echo 'export PATH=/pd/common/envs/conda/bin:$PATH' >> /home/ray/.bashrc
RUN echo 'export CONDA_EXE=/pd/common/envs/conda/bin/conda' >> /home/ray/.bashrc

Using this image for the head and the worker nodes solves the above issue: the right conda installation is used and the domino environment is presumably found.

However, in the actual remote function, the wrong environment seems to be activated.

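@ray.remote(runtime_env={"conda": "domino"})
def gethostname(x):
    # Imports live inside the function so they resolve in the worker's environment.
    import platform
    import sys
    import time

    # Show which Python interpreter the worker is actually running.
    print(sys.executable)
    time.sleep(0.01)
    # x is a tuple; append this worker's hostname to it.
    return x + (platform.node(),)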

This prints the base conda python, not the one for the “domino” environment.
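To get a fuller picture than sys.executable alone, a small diagnostic like the one below could be used. It is a sketch, not anything Ray provides: `describe_python_env` is a hypothetical helper, and it assumes that an activated conda environment sets the usual CONDA_DEFAULT_ENV and CONDA_PREFIX variables.

```python
import os
import shutil
import sys


def describe_python_env() -> dict:
    """Collect details that show which interpreter and conda env are in use."""
    return {
        "executable": sys.executable,
        "prefix": sys.prefix,
        # Normally set by `conda activate`; None if no env was activated.
        "conda_default_env": os.environ.get("CONDA_DEFAULT_ENV"),
        "conda_prefix": os.environ.get("CONDA_PREFIX"),
        "conda_exe": os.environ.get("CONDA_EXE"),
        # First `conda` found on PATH, or None if there is none.
        "conda_on_path": shutil.which("conda"),
    }


if __name__ == "__main__":
    for key, value in describe_python_env().items():
        print(f"{key}: {value}")
```

Returning this dict from the remote function instead of printing would make it easier to compare the worker's view with the driver's.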

In addition, the new image I built is causing other weird behavior. For example, I’m unable to connect to the Ray dashboard when port-forwarding 8265:

E0920 18:10:37.437134   47273 portforward.go:400] an error occurred forwarding 8265 -> 8265: error forwarding port 8265 to pod 75d79cf91892e52a742a66027df53604349b75f3390413bc4dd2f5aa9e735638, uid : failed to execute portforward in network namespace "/var/run/netns/cni-b1d3fa65-f13c-e388-705e-c1ffb7a30c52": failed to dial 8265: dial tcp4 127.0.0.1:8265: connect: connection refused
E0920 18:10:37.438783   47273 portforward.go:400] an error occurred forwarding 8265 -> 8265: error forwarding port 8265 to pod 75d79cf91892e52a742a66027df53604349b75f3390413bc4dd2f5aa9e735638, uid : failed to execute portforward in network namespace "/var/run/netns/cni-b1d3fa65-f13c-e388-705e-c1ffb7a30c52": failed to dial 8265: dial tcp4 127.0.0.1:8265: connect: connection refused

What is the best way to get custom code dependencies on the workers?

The Ray documentation says: “To achieve this, you can build a custom container image, using one of the official Ray images as the base.” But I’m unsure how to build such an image, given that the simple image I created above is causing unexpected issues that are hard to debug.