Failed to create runtime_env for Ray client server: [Errno 2] No such file or directory

Hi,

I followed the steps here to install Ray on Kubernetes using Helm.
The Python version is 3.7.7 and the Ray version is 1.9.2.

I port-forwarded the head service's port 10001 to my local machine.
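The port-forward was roughly the following (the service name is a placeholder; check kubectl get svc for the actual head service name in your release):

kubectl port-forward svc/<ray-head-service> 10001:10001

Then I ran the following script: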

import ray
import requests

# runtime_env asks Ray to pip-install these packages for this job's workers on the cluster
runtime_env = {"pip": ["requests", "ray[serve]"]}
ray.init(address="ray://127.0.0.1:10001", runtime_env=runtime_env)


@ray.remote
def reqs():
    return requests.get("https://www.ray.io/")


if __name__ == "__main__":
    print(ray.get(reqs.remote()))

Getting this error:

ConnectionAbortedError: Initialization failure from server:
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/client/server/proxier.py", line 624, in Datapath
    client_id, job_config):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/client/server/proxier.py", line 281, in start_specific_server
    specific_server=specific_server,
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/client/server/proxier.py", line 234, in _create_runtime_env
    "Failed to create runtime_env for Ray client "
RuntimeError: Failed to create runtime_env for Ray client server: [Errno 2] No such file or directory: '/tmp/ray/session_2022-02-04_08-56-23_262496_116/runtime_resources/conda/c390bf3cf7e61e5e2ee55126ce55ec8f1eb8e565'

The same script works against a local Ray server. Any ideas?

Thanks in advance!

@Championzb Can you try rerunning and checking the contents of the /tmp/ray/session_***/runtime_resources directory on the head node? (Match session_*** to the path in the error you're seeing.)
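Something along these lines should work from outside the cluster (the pod name is a placeholder; kubectl get pods will show the actual head pod):

kubectl exec -it <ray-head-pod> -- bash -c 'ls -R /tmp/ray/session_*/runtime_resources'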

cc @architkulkarni any guesses here?

@Championzb Sorry you’re running into this, I’m not sure what the problem could be off the top of my head… Is it possible to see if the issue persists on Ray 1.10.0?
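Roughly, that would mean upgrading the client locally and using a matching image on the cluster side, since the client and server Ray versions need to line up (the image tag here is my assumption of the matching release):

pip install -U "ray[default]==1.10.0"

plus switching the head/worker pods to an image such as rayproject/ray:1.10.0.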

Also, are there any relevant logs in /tmp/ray/session_***/logs on the head node? For example, dashboard_agent.log or ray_client_server logs?
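For example, something like this (again, the pod name is a placeholder):

kubectl exec -it <ray-head-pod> -- bash -c 'cat /tmp/ray/session_*/logs/dashboard_agent.log'
kubectl exec -it <ray-head-pod> -- bash -c 'ls /tmp/ray/session_*/logs | grep ray_client_server'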

Hey @ckw017 @architkulkarni, thank you for the responses. I've found the issue: by default, the head node was deployed with 512Mi of memory. After increasing it to 2Gi, the issue is gone.
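In case it helps anyone else, the change was roughly a Helm upgrade along these lines (release name, chart path, and the exact values key are placeholders here; look for the head pod's memory setting in your chart's values.yaml):

helm upgrade <release-name> <chart-path> --set podTypes.rayHeadType.memory=2Gi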

That’s great! Any idea what exactly was running up against the memory limit? What gave you the idea to increase the memory? It could be helpful for us to know as we improve our error messages and failure handling.

Hi @architkulkarni, no idea, I'm just running a very simple script. I only happened to notice on the dashboard that memory usage went up when I ran the job.