Cannot set up the runtime env

I am trying to set up the runtime env on the worker nodes, and I submit the job using:

ray job submit --runtime-env-json='{"working_dir": "./", "pip": ["requests==2.26.0", "torch==1.10.2"]}' -- "python test_ray_script.py"

And this comes back with:

-------------------------------------------------------
Job 'raysubmit_7wziNAjmKgDLXtat' submitted successfully
-------------------------------------------------------

and then it fails with:

---------------------------------------
Job 'raysubmit_7wziNAjmKgDLXtat' failed
---------------------------------------

Status message: runtime_env setup failed: The runtime_env failed to be set up.

The generated logs are not very helpful either.

When I run it without the "torch==1.10.2" dependency, it works. I can also install torch without problems by connecting to the pod directly:

kubectl attach test-pod -c test-pod -i -t -n ray
pip install torch==1.10.2

Hi @pamparana, which Ray version are you using? Also, are you able to access the logs on the head node of your cluster? By default they are at /tmp/ray/session_latest/logs. There you may find dashboard_agent.log or other runtime env setup logs which should contain the traceback from pip.
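For anyone following along, here is a minimal sketch for pulling the tail of the relevant log files on a node in one go. It assumes the default log directory mentioned above; the file names are the ones that come up in this thread (runtime_env_setup-[job_id].log, dashboard_agent.log, raylet.err), and the helper name tail_runtime_env_logs is made up for illustration.

```python
from collections import deque
from pathlib import Path


def tail_runtime_env_logs(log_dir="/tmp/ray/session_latest/logs", lines=20):
    """Return {file name: last `lines` lines} for logs related to runtime env setup."""
    log_path = Path(log_dir)
    # File names as mentioned in this thread; adjust if your Ray version differs.
    patterns = ("runtime_env_setup-*.log", "dashboard_agent.log", "raylet.err")
    tails = {}
    for pattern in patterns:
        for path in sorted(log_path.glob(pattern)):
            with path.open(errors="replace") as handle:
                # deque with maxlen keeps only the last `lines` lines of the file.
                tails[path.name] = list(deque(handle, maxlen=lines))
    return tails


if __name__ == "__main__":
    for name, tail in tail_runtime_env_logs().items():
        print(f"==> {name} <==")
        print("".join(tail), end="")
```

Run it on the head node and on a worker node, since the two can report different errors.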

I am on ray 1.10.0. Same as the cloud deployment.

So the error message is not the most helpful. Looking into raylet.err on the head node, I have:

[2022-02-17 17:25:47,224 E 185 185] agent_manager.cc:190: Failed to create runtime env: {"workingDir":"gcs://_ray_pkg_afd927254bbfea17.zip","uris":{"workingDirUri":"gcs://_ray_pkg_afd927254bbfea17.zip","pipUri":"pip://468bfc785de311165ce6b9a0297743fe9c87bdb1"},"extensions":{"_ray_commit":"5ea565317a8104c04ae7892bb9bb41c6d72f12df"},"pipRuntimeEnv":{"config":{"packages":["requests==2.25.1","torch==1.10.2"]}}}, error message: Failed to install pip requirements:
Collecting requests==2.25.1
Using cached requests-2.25.1-py2.py3-none-any.whl (61 kB)
Collecting torch==1.10.2

It does not really point to what actually failed. Not sure where else to look. This was the most useful message…

Looking into the worker node, I see this:

cat raylet.err 
[2022-02-17 04:30:03,988 E 128 128] agent_manager.cc:231: Failed to delete URIs, error message: Local files for URI(s) ['gcs://_ray_pkg_bd5d1281efde2694.zip'] not found.
[2022-02-17 04:30:08,852 E 128 128] agent_manager.cc:190: Failed to create runtime env: {"extensions": {"_ray_commit": "5ea565317a8104c04ae7892bb9bb41c6d72f12df"}, "pipRuntimeEnv": {"config": {"packages": ["requests==2.26.0", "torch==1.10.1"]}}, "uris": {"pipUri": "pip://117f8e5b8c624ffefb6479d46f89922b934ca89e", "workingDirUri": "gcs://_ray_pkg_bd5d1281efde2694.zip"}, "workingDir": "gcs://_ray_pkg_bd5d1281efde2694.zip"}, error message: [Errno 2] No such file or directory: '/tmp/ray/session_2022-02-17_03-41-34_828948_115/runtime_resources/pip/117f8e5b8c624ffefb6479d46f89922b934ca89e'
[2022-02-17 04:30:08,853 E 128 128] worker_pool.cc:623: [Eagerly] Couldn't create a runtime environment for job 03000000.
[2022-02-17 04:30:08,853 E 128 128] agent_manager.cc:190: Failed to create runtime env: {"extensions": {"_ray_commit": "5ea565317a8104c04ae7892bb9bb41c6d72f12df"}, "pipRuntimeEnv": {"config": {"packages": ["requests==2.26.0", "torch==1.10.1"]}}, "uris": {"pipUri": "pip://117f8e5b8c624ffefb6479d46f89922b934ca89e", "workingDirUri": "gcs://_ray_pkg_bd5d1281efde2694.zip"}, "workingDir": "gcs://_ray_pkg_bd5d1281efde2694.zip"}, error message: [Errno 2] No such file or directory: '/tmp/ray/session_2022-02-17_03-41-34_828948_115/runtime_resources/pip/117f8e5b8c624ffefb6479d46f89922b934ca89e'
[2022-02-17 04:30:08,853 E 128 128] worker_pool.cc:623: [Eagerly] Couldn't create a runtime environment for job 03000000.
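These raylet.err lines are long, so it can help to pull out just the requested pip packages and the trailing error message. Below is a rough sketch that assumes the line layout seen in the logs above ("Failed to create runtime env: " followed by a JSON object and ", error message: ..."); that layout is inferred from this thread rather than from any documented format, and parse_runtime_env_failure is a hypothetical helper name.

```python
import json

MARKER = "Failed to create runtime env: "
ERROR_PREFIX = ", error message: "


def parse_runtime_env_failure(line):
    """Return (packages, error_message) for a matching raylet.err line, else None."""
    start = line.find(MARKER)
    if start == -1:
        return None
    payload = line[start + len(MARKER):]
    # raw_decode parses the leading JSON object and reports where it ends,
    # so the text after the JSON can be handled separately.
    spec, end = json.JSONDecoder().raw_decode(payload)
    packages = spec.get("pipRuntimeEnv", {}).get("config", {}).get("packages", [])
    rest = payload[end:]
    if rest.startswith(ERROR_PREFIX):
        rest = rest[len(ERROR_PREFIX):]
    return packages, rest.strip()
```

Feeding each line of raylet.err through this makes it easier to compare which package pins each failed attempt actually used.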

There should be a log file under the log directory with runtime-env-specific output, named runtime_env_setup-[job_id].log. I believe that is the one to check.

Unfortunately, this log file on the head node does not give much information. I see:

2022-02-18 05:28:53,472 INFO conda_utils.py:198 -- Collecting requests==2.25.1
2022-02-18 05:28:53,490 INFO conda_utils.py:198 -- Using cached requests-2.25.1-py2.py3-none-any.whl (61 kB)
2022-02-18 05:28:53,625 INFO conda_utils.py:198 -- Collecting torch==1.10.2

Can you also check dashboard_agent.log?

That does not show anything special either.

2022-02-19 23:34:33,609 INFO agent.py:191 -- Dashboard agent http address: 0.0.0.0:41335
2022-02-19 23:34:33,609 INFO agent.py:199 -- <ResourceRoute [GET] <StaticResource  /logs -> PosixPath('/tmp/ray/session_2022-02-19_23-34-29_285231_420/logs')> -> <bound method StaticResource._handle of <StaticResource  /logs -> PosixPath('/tmp/ray/session_2022-02-19_23-34-29_285231_420/logs')>>
2022-02-19 23:34:33,609 INFO agent.py:199 -- <ResourceRoute [OPTIONS] <StaticResource  /logs -> PosixPath('/tmp/ray/session_2022-02-19_23-34-29_285231_420/logs')> -> <bound method _PreflightHandler._preflight_handler of <aiohttp_cors.cors_config._CorsConfigImpl object at 0x7feb3ed96160>>
2022-02-19 23:34:33,609 INFO agent.py:200 -- Registered 2 routes.

I am facing a similar issue. It looks like the torch installation is failing because the installation command for torch on a Linux machine is as follows:

pip3 install torch==1.10.2+cpu torchvision==0.11.3+cpu torchaudio==0.10.2+cpu -f https://download.pytorch.org/whl/cpu/torch_stable.html

Source: Start Locally | PyTorch

I am not sure how to add this line to requirements.txt or runtime_env

Ah, thanks for finding this! It’s too bad that pip doesn’t show an error message in this case (according to the log output posted earlier, it seems to hang at Collecting torch==1.10.2).

"I am not sure how to add this line to requirements.txt or runtime_env"

The Stack Overflow question "python - How to format requirements.txt when package source is from specific websites?" has a detailed answer; you can just include the -f option as a line in requirements.txt.
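As a rough sketch of what that could look like: the snippet below writes a requirements.txt whose first line is the -f option, then points the runtime env at the file. The URL and the torch pin come from the pip command quoted earlier in the thread and the requests pin from the original job submission; write_requirements is a hypothetical helper, and passing a file path as the "pip" value is my understanding of the runtime_env option, so please check it against the docs for your Ray version.

```python
from pathlib import Path

# Wheel index URL taken from the PyTorch install command quoted in this thread.
TORCH_CPU_INDEX = "https://download.pytorch.org/whl/cpu/torch_stable.html"


def write_requirements(path="requirements.txt"):
    """Write a requirements file that tells pip where to find the +cpu torch wheels."""
    lines = [
        f"-f {TORCH_CPU_INDEX}",  # same as --find-links on the command line
        "requests==2.26.0",
        "torch==1.10.2+cpu",
    ]
    target = Path(path)
    target.write_text("\n".join(lines) + "\n")
    return target


# Assumption: "pip" may be a path to a requirements file instead of a list.
runtime_env = {"working_dir": "./", "pip": "./requirements.txt"}
```

Call write_requirements() in the working directory before submitting the job so the file is there to be picked up.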

Note that we unfortunately had a regression in Ray 1.10 for including such options in requirements.txt ([Bug] Python: --extra-index-url does not work in requirements file anymore · Issue #22056 · ray-project/ray · GitHub). For this to work, you would need to use Ray 1.9 or earlier, Ray 1.11.0rc0 (pip install ray[default]==1.11.0rc0), or a nightly build that contains the bugfix.

Alternatively, you can set the -f / --find-links flag using an environment variable (see Configuration - pip documentation v22.0.3). But note that this environment variable should be set on the cluster before Ray is started (e.g. with ray start); otherwise the runtime env pip process won’t see it.
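To illustrate the environment variable route: pip reads its command-line options from variables named PIP_<OPTION>, so --find-links becomes PIP_FIND_LINKS. Here is a minimal sketch, assuming Ray is started by hand with ray start --head; the helper names are made up, and on Kubernetes you would more likely set the variable in the pod spec or container image than in Python.

```python
import os
import subprocess

# Wheel index URL taken from the PyTorch install command quoted in this thread.
TORCH_CPU_INDEX = "https://download.pytorch.org/whl/cpu/torch_stable.html"


def ray_start_env(base_env=None):
    """Return a copy of the environment with PIP_FIND_LINKS set for pip."""
    env = dict(os.environ if base_env is None else base_env)
    env["PIP_FIND_LINKS"] = TORCH_CPU_INDEX
    return env


def start_ray_head():
    """Start the Ray head node so that its child processes inherit PIP_FIND_LINKS."""
    # Roughly the same as: PIP_FIND_LINKS=<url> ray start --head
    return subprocess.run(["ray", "start", "--head"], env=ray_start_env(), check=True)
```

The point is that the variable has to be present in the environment of the Ray processes themselves; setting it only in the job's own environment would be too late for the runtime env pip install.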

Thanks @architkulkarni. Adding to the alternatives: installing pytorch with conda and the rest of the requirements with pip seems to work for me.

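runtime_env = {
    # conda installs pytorch; the nested "pip" entry installs the remaining requirements.
    # list_of_reqs is a placeholder for your list of pip requirement strings
    # (or, alternatively, the path to a requirements.txt file).
    "conda": {"dependencies": ["pytorch", {"pip": list_of_reqs}]},
    # ... (remaining runtime_env fields omitted)
}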