I’m setting up a Ray cluster on Kubernetes. I have conda environments installed on a persistent disk that I’d like my workers to use. I’m trying to use the runtime_env feature to activate these environments with:
ray.init(
    "ray://<cluster-ip>:10001",
    runtime_env={
        "conda": "domino"
    },
)
However, I get RuntimeError: Starting up Server Failed! Check ray_client_server_[port].err on the cluster.
When I check the logs, I see Could not find conda environment: domino.
This is unsurprising: CONDA_EXE="/home/ray/anaconda3/bin/conda" on the workers, while my environments live in a conda installation at /pd/common/envs/conda/bin/conda (note: I’ve confirmed that pd is properly mounted).
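One way to double-check what a worker actually sees is to list the environments under the conda installation on the persistent disk, and compare that with the CONDA_EXE the process inherits. The following is only a minimal sketch: the helper names `find_conda_envs` and `conda_report` are hypothetical, and it assumes conda's default layout, where environments live under `<conda_root>/envs/<name>` and each contains a `conda-meta` directory.

```python
import os
from pathlib import Path


def find_conda_envs(conda_root):
    """Map env name -> path for environments under <conda_root>/envs.

    Assumption: default conda layout, where every environment
    directory contains a conda-meta/ subdirectory.
    """
    envs_dir = Path(conda_root) / "envs"
    if not envs_dir.is_dir():
        return {}
    return {
        p.name: str(p)
        for p in sorted(envs_dir.iterdir())
        if (p / "conda-meta").is_dir()
    }


def conda_report(conda_root):
    """Collect what the current process knows about conda."""
    return {
        # The conda executable this process inherited, if any.
        "CONDA_EXE": os.environ.get("CONDA_EXE"),
        # Environments found under the given installation root.
        "envs": find_conda_envs(conda_root),
    }
```

Running `conda_report("/pd/common/envs/conda")` inside a Ray task (for example via `ray.remote(conda_report).remote(...)`) should show whether "domino" is visible from the worker and which conda executable the worker process points at. The root path here is inferred from the path quoted above, so adjust it if the layout differs.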
My question is: how can I specify a custom path to a conda installation when using runtime_env?
I’ve tried, to no avail, setting the environment variables RAY_CONDA_HOME and CONDA_EXE in the runtime_env dictionary:
runtime_env={
    "env_vars": {
        "RAY_CONDA_HOME": "/home/common/envs/conda/bin/conda",
        "CONDA_EXE": "/home/common/envs/conda/bin/conda"
    },
    "conda": "domino"
},
Any help would be appreciated!
Thanks!
I just tried creating a custom image that sets these environment variables.
Dockerfile:
FROM rayproject/ray:nightly-py38
ENV CONDA_EXE=/pd/common/envs/conda/bin/conda
ENV RAY_CONDA_HOME=/pd/common/envs/conda
RUN echo 'export PATH=/pd/common/envs/conda/bin:$PATH' >> /home/ray/.bashrc
RUN echo 'export CONDA_EXE=/pd/common/envs/conda/bin/conda' >> /home/ray/.bashrc
Using this image for the head and the workers solves the above issue: the right conda installation is being used, and the domino environment is presumably found.
However, in the actual remote function, the wrong environment seems to be activated.
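import ray

@ray.remote(
    runtime_env={"conda": "domino"},
)
def gethostname(x):
    import os
    import platform
    import sys
    import time
    # Print the interpreter this task runs under.
    print(sys.executable)
    time.sleep(0.01)
    # x is expected to be a tuple; append this node's hostname.
    return x + (platform.node(),)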
This prints the base conda python, not the one for the “domino” environment.
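To get more detail than sys.executable alone, the task could also return the interpreter prefix and the variables that `conda activate` normally sets. This is a rough sketch with hypothetical helper names (`describe_interpreter`, `runs_in_env`). It assumes the environment directory is named after the environment, and note that CONDA_DEFAULT_ENV and CONDA_PREFIX may be unset if the interpreter was launched directly rather than through activation.

```python
import os
import sys


def describe_interpreter():
    """Report which Python and which conda env this process runs in."""
    return {
        "executable": sys.executable,
        "prefix": sys.prefix,
        # Set by `conda activate`; may be None otherwise.
        "CONDA_DEFAULT_ENV": os.environ.get("CONDA_DEFAULT_ENV"),
        "CONDA_PREFIX": os.environ.get("CONDA_PREFIX"),
    }


def runs_in_env(env_name, info=None):
    """True if the last component of the interpreter prefix is env_name.

    Assumption: the environment directory carries the environment's name,
    e.g. <conda_root>/envs/domino.
    """
    info = info or describe_interpreter()
    prefix = os.path.normpath(info["prefix"])
    return os.path.basename(prefix) == env_name
```

Returning `describe_interpreter()` from the remote function, instead of only printing, makes it easier to compare results across nodes from the driver.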
In addition, the new image I built is causing other weird behavior. For example, I’m unable to connect to the Ray dashboard:
E0920 18:10:37.437134 47273 portforward.go:400] an error occurred forwarding 8265 -> 8265: error forwarding port 8265 to pod 75d79cf91892e52a742a66027df53604349b75f3390413bc4dd2f5aa9e735638, uid : failed to execute portforward in network namespace "/var/run/netns/cni-b1d3fa65-f13c-e388-705e-c1ffb7a30c52": failed to dial 8265: dial tcp4 127.0.0.1:8265: connect: connection refused
E0920 18:10:37.438783 47273 portforward.go:400] an error occurred forwarding 8265 -> 8265: error forwarding port 8265 to pod 75d79cf91892e52a742a66027df53604349b75f3390413bc4dd2f5aa9e735638, uid : failed to execute portforward in network namespace "/var/run/netns/cni-b1d3fa65-f13c-e388-705e-c1ffb7a30c52": failed to dial 8265: dial tcp4 127.0.0.1:8265: connect: connection refused
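The "connection refused" on 127.0.0.1:8265 suggests that nothing was accepting connections on that port inside the pod at the time. A small check like the one below, run from inside the head pod, could confirm whether anything is listening there. It is only a sketch using the standard library; `port_is_open` is a hypothetical helper name.

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        # create_connection raises OSError on refusal or timeout.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, `port_is_open("127.0.0.1", 8265)` returning False inside the head pod would point at the dashboard process not running (or not bound to that address), rather than at the port-forward itself.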
What is the best way to get custom code dependencies on the workers?
The Ray documentation explains: “To achieve this, you can build a custom container image, using one of the official Ray images as the base.” But I’m unsure how to build such an image, given that the simple image I created above is causing unexpected issues that are hard to debug.