Ray submit fails on EKS, runtime_env creation error

Hi Team,

I have deployed Ray on EKS using the Helm charts. I am also specifying a pip runtime_env in ray.init.

This is how I am initializing:

ray.init(address="ray://example-cluster-ray-head:10001", namespace="ray", runtime_env={"pip":  ["ray[default]", "ray[serve]", "flair==0.9", "nltk==3.6.2"]})

Here is the complete traceback:
❯ ray submit example-full.yaml FlairModel.py
2022-01-13 14:23:54,613	INFO util.py:282 -- setting max workers for head node type to 0
Loaded cached provider configuration
If you experience issues with the cloud provider, try re-running the command with --no-config-cache.
2022-01-13 14:24:06,024	INFO util.py:282 -- setting max workers for head node type to 0
2022-01-13 14:24:06,148	INFO command_runner.py:172 -- NodeUpdater: example-cluster-ray-head-type-s59z8: Running kubectl -n ray exec -it example-cluster-ray-head-type-s59z8 -- bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (python ~/FlairModel.py)'
Traceback (most recent call last):
  File "/home/ray/FlairModel.py", line 2, in <module>
    ray.init(address="ray://example-cluster-ray-head:10001", namespace="ray", runtime_env={"pip":  ["ray[default]", "ray[serve]", "flair==0.9", "nltk==3.6.2"]})
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 775, in init
    return builder.connect()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/client_builder.py", line 155, in connect
    ray_init_kwargs=self._remote_init_kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/client_connect.py", line 42, in connect
    ray_init_kwargs=ray_init_kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/client/__init__.py", line 228, in connect
    conn = self.get_context().connect(*args, **kw_args)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/client/__init__.py", line 88, in connect
    self.client_worker._server_init(job_config, ray_init_kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/client/worker.py", line 698, in _server_init
    f"Initialization failure from server:\n{response.msg}")
ConnectionAbortedError: Initialization failure from server:
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/client/server/proxier.py", line 624, in Datapath
    client_id, job_config):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/client/server/proxier.py", line 281, in start_specific_server
    specific_server=specific_server,
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/client/server/proxier.py", line 234, in _create_runtime_env
    "Failed to create runtime_env for Ray client "
RuntimeError: Failed to create runtime_env for Ray client server: [Errno 2] No such file or directory: '/tmp/ray/session_2022-01-12_16-25-08_622146_119/runtime_resources/conda/6aa502b772617ee2bec7c93351970b6ed85aa479'

command terminated with exit code 1
2022-01-13 14:28:45,460	ERROR command_runner.py:182 -- NodeUpdater: example-cluster-ray-head-type-s59z8: Command failed:

  kubectl -n ray exec -it example-cluster-ray-head-type-s59z8 --'bash --login -c -i '"'"'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (python ~/FlairModel.py)'"'"''

Thanks in advance!

The error is reported as ConnectionAbortedError: Initialization failure from server, but it fails at exactly the line where I pass runtime_env.

I logged in to the head node and tried to install these libraries manually, but the install failed because no network or proxy is available there.

I am not sure whether my understanding is correct or whether there is some other issue.

Hi, are you saying that the head node itself doesn’t have network access? If so, the runtime env might not be able to download the necessary dependencies, which would cause the error you’re seeing.
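
One quick way to confirm this from inside the head pod is a small connectivity check. This is only a rough sketch: the helper name can_reach is made up for illustration, and it assumes pip would be talking directly to pypi.org on port 443. If your cluster goes through an HTTP proxy or a private package index, a direct TCP check like this can fail even though pip would work once the proxy is configured, so substitute the host and port you actually use.

```python
import socket


def can_reach(host: str, port: int = 443, timeout: float = 5.0) -> bool:
    """Return True if a plain TCP connection to host:port succeeds.

    Hypothetical helper for illustration; it does not go through any
    HTTP proxy, so it only tests direct network access.
    """
    try:
        # create_connection resolves the name and opens a TCP socket;
        # DNS failures, refusals, and timeouts all raise OSError subclasses.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Example (run inside the head pod, e.g. via kubectl exec):
# print("PyPI reachable:", can_reach("pypi.org"))
```

If this returns False for your package index, the pip runtime_env has no way to download flair and nltk on the cluster, which would be consistent with the failure above.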