I am trying to set up the runtime env on the worker nodes, and I submit the job using:
ray job submit --runtime-env-json='{"working_dir": "./", "pip": ["requests==2.26.0", "torch==1.10.2"]}' -- "python test_ray_script.py"
And this comes back with:
-------------------------------------------------------
Job 'raysubmit_7wziNAjmKgDLXtat' submitted successfully
-------------------------------------------------------
and then it fails with:
---------------------------------------
Job 'raysubmit_7wziNAjmKgDLXtat' failed
---------------------------------------
Status message: runtime_env setup failed: The runtime_env failed to be set up.
The generated logs are not much help either.
When I run it without the "torch==1.10.2" dependency, it works. I can install torch fine by connecting to the pod:
kubectl attach test-pod -c test-pod -i -t -n ray
pip install torch==1.10.2
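For reference, the submit command above can be assembled programmatically so the JSON quoting is not written by hand. This is only a minimal sketch using the standard library; `build_submit_command` is a hypothetical helper, not part of Ray, and it only builds the command line rather than running it.

```python
import json
import shlex


def build_submit_command(working_dir, pip_packages, entrypoint):
    """Return the argv list for `ray job submit` with a JSON runtime env.

    Hypothetical helper for illustration; it mirrors the command in the post.
    """
    runtime_env = {"working_dir": working_dir, "pip": list(pip_packages)}
    return [
        "ray", "job", "submit",
        "--runtime-env-json=" + json.dumps(runtime_env),
        "--",
        entrypoint,
    ]


cmd = build_submit_command(
    "./", ["requests==2.26.0", "torch==1.10.2"], "python test_ray_script.py"
)
# shlex.join quotes each argument so the line can be pasted into a shell.
print(shlex.join(cmd))
```

Dropping one package from the list and resubmitting is a quick way to confirm which dependency breaks the setup, as was done by hand above.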
Hi @pamparana, which Ray version are you using? Also, are you able to access the logs on the head node of your cluster? By default they are at /tmp/ray/session_latest/logs. There you may find dashboard_agent.log or other runtime env setup logs which should contain the traceback from pip.
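To look at the relevant logs in one pass, something like the following could be run on the head node (or a worker). It is a rough sketch: the file name patterns are the ones mentioned in this thread and may differ in other Ray versions, and `tail_logs` is a hypothetical helper.

```python
from pathlib import Path

# Default log location according to the reply above.
LOG_DIR = Path("/tmp/ray/session_latest/logs")
# File names mentioned in this thread; other Ray versions may use different ones.
PATTERNS = ("dashboard_agent.log", "runtime_env_setup-*.log", "raylet.err")


def tail_logs(log_dir, patterns=PATTERNS, n=20):
    """Return {file name: last n lines} for every log file matching a pattern."""
    result = {}
    for pattern in patterns:
        for path in sorted(Path(log_dir).glob(pattern)):
            lines = path.read_text(errors="replace").splitlines()
            result[path.name] = lines[-n:]
    return result


if __name__ == "__main__":
    for name, lines in tail_logs(LOG_DIR).items():
        print(f"==> {name} <==")
        print("\n".join(lines))
```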
I am on ray 1.10.0. Same as the cloud deployment.
So the error message is not the most helpful. Looking into raylet.err on the head node, I have:
[2022-02-17 17:25:47,224 E 185 185] agent_manager.cc:190: Failed to create runtime env: {"workingDir":"gcs://_ray_pkg_afd927254bbfea17.zip","uris":{"workingDirUri":"gcs://_ray_pkg_afd927254bbfea17.zip","pipUri":"pip://468bfc785de311165ce6b9a0297743fe9c87bdb1"},"extensions":{"_ray_commit":"5ea565317a8104c04ae7892bb9bb41c6d72f12df"},"pipRuntimeEnv":{"config":{"packages":["requests==2.25.1","torch==1.10.2"]}}}, error message: Failed to install pip requirements:
Collecting requests==2.25.1
Using cached requests-2.25.1-py2.py3-none-any.whl (61 kB)
Collecting torch==1.10.2
It does not really point to what actually failed. Not sure where else to look. This was the most useful message…
Looking into the worker node, I see this:
cat raylet.err
[2022-02-17 04:30:03,988 E 128 128] agent_manager.cc:231: Failed to delete URIs, error message: Local files for URI(s) ['gcs://_ray_pkg_bd5d1281efde2694.zip'] not found.
[2022-02-17 04:30:08,852 E 128 128] agent_manager.cc:190: Failed to create runtime env: {"extensions": {"_ray_commit": "5ea565317a8104c04ae7892bb9bb41c6d72f12df"}, "pipRuntimeEnv": {"config": {"packages": ["requests==2.26.0", "torch==1.10.1"]}}, "uris": {"pipUri": "pip://117f8e5b8c624ffefb6479d46f89922b934ca89e", "workingDirUri": "gcs://_ray_pkg_bd5d1281efde2694.zip"}, "workingDir": "gcs://_ray_pkg_bd5d1281efde2694.zip"}, error message: [Errno 2] No such file or directory: '/tmp/ray/session_2022-02-17_03-41-34_828948_115/runtime_resources/pip/117f8e5b8c624ffefb6479d46f89922b934ca89e'
[2022-02-17 04:30:08,853 E 128 128] worker_pool.cc:623: [Eagerly] Couldn't create a runtime environment for job 03000000.
[2022-02-17 04:30:08,853 E 128 128] agent_manager.cc:190: Failed to create runtime env: {"extensions": {"_ray_commit": "5ea565317a8104c04ae7892bb9bb41c6d72f12df"}, "pipRuntimeEnv": {"config": {"packages": ["requests==2.26.0", "torch==1.10.1"]}}, "uris": {"pipUri": "pip://117f8e5b8c624ffefb6479d46f89922b934ca89e", "workingDirUri": "gcs://_ray_pkg_bd5d1281efde2694.zip"}, "workingDir": "gcs://_ray_pkg_bd5d1281efde2694.zip"}, error message: [Errno 2] No such file or directory: '/tmp/ray/session_2022-02-17_03-41-34_828948_115/runtime_resources/pip/117f8e5b8c624ffefb6479d46f89922b934ca89e'
[2022-02-17 04:30:08,853 E 128 128] worker_pool.cc:623: [Eagerly] Couldn't create a runtime environment for job 03000000.
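These raylet.err entries are hard to read because the runtime env JSON and the error message share one long line. A small sketch for splitting them apart, assuming the line layout shown above ("Failed to create runtime env: <JSON>, error message: <text>"); `parse_runtime_env_failure` is a hypothetical helper, and the sample is one of the worker lines quoted above.

```python
import json
import re

MARKER = "Failed to create runtime env: "


def parse_runtime_env_failure(text):
    """Return (runtime_env dict, error message), or None if the marker is absent."""
    start = text.find(MARKER)
    if start == -1:
        return None
    payload = text[start + len(MARKER):]
    # raw_decode parses the leading JSON object and reports where it ends.
    env, end = json.JSONDecoder().raw_decode(payload)
    match = re.search(r"error message:\s*(.*)", payload[end:], re.DOTALL)
    return env, (match.group(1).strip() if match else "")


# One of the worker-node lines quoted above, split only for readability.
SAMPLE = (
    '[2022-02-17 04:30:08,852 E 128 128] agent_manager.cc:190: '
    'Failed to create runtime env: {"extensions": {"_ray_commit": '
    '"5ea565317a8104c04ae7892bb9bb41c6d72f12df"}, "pipRuntimeEnv": '
    '{"config": {"packages": ["requests==2.26.0", "torch==1.10.1"]}}, '
    '"uris": {"pipUri": "pip://117f8e5b8c624ffefb6479d46f89922b934ca89e", '
    '"workingDirUri": "gcs://_ray_pkg_bd5d1281efde2694.zip"}, '
    '"workingDir": "gcs://_ray_pkg_bd5d1281efde2694.zip"}, '
    "error message: [Errno 2] No such file or directory: "
    "'/tmp/ray/session_2022-02-17_03-41-34_828948_115/runtime_resources/"
    "pip/117f8e5b8c624ffefb6479d46f89922b934ca89e'"
)

env, error = parse_runtime_env_failure(SAMPLE)
print(env["pipRuntimeEnv"]["config"]["packages"])
print(error)
```

Extracting the package list this way also makes it easier to compare entries: the worker lines above list torch==1.10.1, while the head-node line lists torch==1.10.2, so they appear to come from different submissions.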
sangcho, February 19, 2022, 2:15pm
There should be a log file in the log directory that contains runtime-env-specific logs, named runtime_env_setup-[job_id].log
→ I believe it is this
Unfortunately, this log file on the head node does not give much information. I see:
2022-02-18 05:28:53,472 INFO conda_utils.py:198 -- Collecting requests==2.25.1
2022-02-18 05:28:53,490 INFO conda_utils.py:198 -- Using cached requests-2.25.1-py2.py3-none-any.whl (61 kB)
2022-02-18 05:28:53,625 INFO conda_utils.py:198 -- Collecting torch==1.10.2
sangcho, February 21, 2022, 12:57am
Can you also check dashboard_agent.log?
That does not show anything special either.
2022-02-19 23:34:33,609 INFO agent.py:191 -- Dashboard agent http address: 0.0.0.0:41335
2022-02-19 23:34:33,609 INFO agent.py:199 -- <ResourceRoute [GET] <StaticResource /logs -> PosixPath('/tmp/ray/session_2022-02-19_23-34-29_285231_420/logs')> -> <bound method StaticResource._handle of <StaticResource /logs -> PosixPath('/tmp/ray/session_2022-02-19_23-34-29_285231_420/logs')>>
2022-02-19 23:34:33,609 INFO agent.py:199 -- <ResourceRoute [OPTIONS] <StaticResource /logs -> PosixPath('/tmp/ray/session_2022-02-19_23-34-29_285231_420/logs')> -> <bound method _PreflightHandler._preflight_handler of <aiohttp_cors.cors_config._CorsConfigImpl object at 0x7feb3ed96160>>
2022-02-19 23:34:33,609 INFO agent.py:200 -- Registered 2 routes.
I am facing a similar issue. It looks like the torch installation is failing because the install command for torch on a Linux machine is as follows:
pip3 install torch==1.10.2+cpu torchvision==0.11.3+cpu torchaudio==0.10.2+cpu -f https://download.pytorch.org/whl/cpu/torch_stable.html
Source: Start Locally | PyTorch
I am not sure how to add this line to requirements.txt or runtime_env.
Ah, thanks for finding this! It's too bad that pip doesn't show an error message in this case (it seems to hang at Collecting torch==1.10.2 according to the traceback posted earlier).
Regarding "I am not sure how to add this line to requirements.txt or runtime_env": python - How to format requirements.txt when package source is from specific websites? - Stack Overflow has a detailed answer; you can just include it as a line in requirements.txt.
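As a sketch of what such a requirements.txt could look like, the snippet below renders one with a --find-links line followed by the pinned packages. The URL and versions are copied from the pip command quoted earlier in the thread; `render_requirements` is a hypothetical helper, and the output is printed rather than written so nothing on disk is overwritten.

```python
# URL and pinned versions are taken from the pip command quoted in this thread.
FIND_LINKS = "https://download.pytorch.org/whl/cpu/torch_stable.html"
PACKAGES = [
    "torch==1.10.2+cpu",
    "torchvision==0.11.3+cpu",
    "torchaudio==0.10.2+cpu",
]


def render_requirements(find_links, packages):
    """Return requirements.txt content: one --find-links line, then one package per line."""
    lines = [f"--find-links {find_links}", *packages]
    return "\n".join(lines) + "\n"


requirements_text = render_requirements(FIND_LINKS, PACKAGES)
print(requirements_text, end="")
```

Whether Ray passes the option line through to pip depends on the Ray version, so it is worth checking against the version running on the cluster.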
Note that we unfortunately had a regression for including such options in requirements.txt in Ray 1.10 ([Bug] Python: --extra-index-url does not work in requirements file anymore · Issue #22056 · ray-project/ray · GitHub), so you would need to use Ray 1.9 or earlier, use Ray 1.11.0rc0 (pip install ray[default]==1.11.0rc0), or use a nightly build that contains the bugfix for this to work. Alternatively, you can set the -f / --find-links flag using an environment variable: Configuration - pip documentation v23.3.1. But note that this environment variable should be set on the cluster before Ray is started (e.g. with ray start), otherwise the runtime env pip process won't see it.
Thanks @architkulkarni. Adding to the alternatives, installing pytorch with conda and the rest of the requirements with pip seems to work for me:
runtime_env = (
{ "conda": {"dependencies": ["pytorch", {"pip": [list_of_reqs or path_to_reqs.txt file]}]},
...
..
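A complete version of the conda-plus-pip idea might look like the sketch below. It follows the dictionary shape from the snippet above and assumes the pip requirements are passed as a list of strings; `conda_runtime_env` is a hypothetical helper, the requests pin is just the one from the original post, and the remaining runtime_env keys that were elided above are not reconstructed here.

```python
import json


def conda_runtime_env(conda_packages, pip_requirements):
    """Build a runtime_env dict that installs some packages via conda and the rest via pip.

    Hypothetical helper: the shape follows the snippet in this thread.
    """
    return {
        "conda": {
            "dependencies": [
                *conda_packages,
                {"pip": list(pip_requirements)},
            ]
        }
    }


runtime_env = conda_runtime_env(["pytorch"], ["requests==2.26.0"])
print(json.dumps(runtime_env, indent=2))
```

Note that a runtime_env built this way uses the "conda" field instead of the top-level "pip" field from the original submit command.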