Failed to lease worker from node

I have been struggling for days to get Ray to install and import PyTorch through a runtime environment, and I am close to giving up and finding another solution.

Basically my submission is as follows:

ray job submit --runtime-env-json='{"working_dir": "./", "pip": ["requests==2.25.1", "torch==1.10.2", "torchvision==0.11.3"]}' -- "python test_ray_script.py"
Job submission server address: http://127.0.0.1:8265
2022-02-21 16:39:29,898	INFO sdk.py:189 -- Uploading package gcs://_ray_pkg_cc1f90b5cd6565c7.zip.
2022-02-21 16:39:29,900	INFO packaging.py:353 -- Creating a file package for local directory './'.

It usually comes back with:

-------------------------------------------------------
Job 'raysubmit_a1GgQCA5jvemK1FR' submitted successfully
-------------------------------------------------------

Next steps
  Query the logs of the job:
    ray job logs raysubmit_a1GgQCA5jvemK1FR
  Query the status of the job:
    ray job status raysubmit_a1GgQCA5jvemK1FR
  Request the job to be stopped:
    ray job stop raysubmit_a1GgQCA5jvemK1FR

Tailing logs until the job exits (disable with --no-wait):

---------------------------------------
Job 'raysubmit_a1GgQCA5jvemK1FR' failed
---------------------------------------
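For reference, test_ray_script.py is nothing fancy; it roughly boils down to importing torch and exercising a trivial Counter actor, something like:

# test_ray_script.py (roughly): import torch and exercise a trivial Counter actor
import ray
import torch

ray.init()  # picks up the running cluster when launched via ray job submit

@ray.remote
class Counter:
    def __init__(self):
        self.value = 0

    def increment(self):
        self.value += 1
        return self.value

counter = Counter.remote()
print(ray.get(counter.increment.remote()), torch.__version__)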

Logs have been pretty unhelpful so far. Looking at gcs_server.err, I see something like:

[2022-02-21 07:25:20,785 E 161 161] gcs_actor_scheduler.cc:316: Failed to lease worker from node cac9e236ed0dac867f7a77d4f9b953e20fbdf42b535be63dbc5466ef for actor 3f78dc477a4de988d752876302000000(Counter.__init__) as the runtime environment setup failed, job id = 02000000
[2022-02-21 07:29:22,805 E 161 161] gcs_actor_scheduler.cc:316: Failed to lease worker from node cac9e236ed0dac867f7a77d4f9b953e20fbdf42b535be63dbc5466ef for actor 2f36444a140ac223e8c1d28a01000000(JobSupervisor.__init__) as the runtime environment setup failed, job id = 01000000
[2022-02-21 07:40:03,152 E 161 161] gcs_actor_scheduler.cc:316: Failed to lease worker from node cac9e236ed0dac867f7a77d4f9b953e20fbdf42b535be63dbc5466ef for actor 2ea9d20139d71f06e51fe3bb01000000(JobSupervisor.__init__) as the runtime environment setup failed, job id = 01000000

Could this be something or is it because of the previous error?

Looking at the log of job id=01000000:

2022-02-21 07:39:53,109 INFO conda_utils.py:198 -- Using cached requests-2.25.1-py2.py3-none-any.whl (61 kB)
2022-02-21 07:39:53,185 INFO conda_utils.py:198 -- Collecting torch==1.10.2
2022-02-21 07:40:02,147 INFO conda_utils.py:198 -- 

Looking at raylet.err:

Using cached requests-2.25.1-py2.py3-none-any.whl (61 kB)
Collecting torch
[2022-02-21 07:40:03,152 E 218 218] agent_manager.cc:190: Failed to create runtime env: {"workingDir":"gcs://_ray_pkg_cc1f90b5cd6565c7.zip","uris":{"workingDirUri":"gcs://_ray_pkg_cc1f90b5cd6565c7.zip","pipUri":"pip://23ada401a940d4ca08aab4cb77421f32396eeb8b"},"extensions":{"_ray_commit":"5ea565317a8104c04ae7892bb9bb41c6d72f12df"},"pipRuntimeEnv":{"config":{"packages":["requests==2.25.1","torch==1.10.2","torchvision==0.11.3"]}}}, error message: Failed to install pip requirements:
Collecting requests==2.25.1
Using cached requests-2.25.1-py2.py3-none-any.whl (61 kB)
Collecting torch==1.10.2

Again, it says the packages could not be installed, but not really why.

I have spent three days on this without making much progress, and at this point I am close to trying some other tool.

From the log, it seems the runtime env can't be set up here: the pip packages failed to install.
Could you check the logs of ray_client_server.[err|out]?

cc @architkulkarni

I came across this post: Cannot setup the runtime env - #9 by architkulkarni
It seems like the same issue. Could you follow the instructions there?

Is this the same issue as Cannot setup the runtime env - #9 by architkulkarni? Could you check the latest response there? It shows that installing PyTorch requires an extra pip option.
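In case it helps while you check that thread: my guess is the option in question is pip's -f/--find-links flag pointing at the PyTorch wheel index. A rough sketch of how that might be expressed in the runtime env, assuming option lines in the pip list are passed through to the generated requirements file (I haven't verified that on your Ray version), with the version pins taken from your submission:

runtime_env = {
    "working_dir": "./",
    "pip": [
        "requests==2.25.1",
        # assumption: option lines are forwarded to the generated requirements file
        "-f https://download.pytorch.org/whl/torch_stable.html",
        "torch==1.10.2+cpu",  # CPU-only wheel keeps the download much smaller
        "torchvision==0.11.3",
    ],
}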

Another idea, which is more of a guess: it looks like the pip install is failing partway through. Usually the logs say Killed when that happens; I'm not sure why it didn't show up in the logs here, though. Maybe try increasing the memory on the node? See python - pip install - killed - Stack Overflow.

As another option, if you are okay with manually preinstalling PyTorch on the cluster in advance instead of dynamically installing it per-job using runtime_env, you can do that instead, and import torch will still work in your Ray script.
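To illustrate (a minimal sketch; the function is just a placeholder): with torch preinstalled on the cluster, it no longer appears in the runtime_env pip list, and the import resolves from the cluster's own site-packages:

# hypothetical sketch: torch is preinstalled on every node, so it is not part of the runtime_env
import ray
import torch  # resolved from the cluster's preinstalled packages

ray.init()

@ray.remote
def torch_version():
    return torch.__version__  # runs in a worker that inherits the preinstalled torch

print(ray.get(torch_version.remote()))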

@pamparana

Adding to the alternatives: installing PyTorch with conda and the rest of the requirements with pip seems to work for me:

runtime_env = {
    "conda": {
        "dependencies": [
            "pytorch", "torchvision",
            {"pip": list_of_reqs},  # remaining pip requirements, or load them from a requirements.txt
        ]
    },  # plus other fields such as "working_dir"
}

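A quick way to sanity-check that dict locally before wiring it into ray job submit (a sketch; the pins are just the ones from this thread, and it assumes conda is available where Ray runs):

import ray

# sketch: verify the conda-based runtime env resolves before submitting the real job
ray.init(runtime_env={
    "conda": {"dependencies": ["pytorch", "torchvision", {"pip": ["requests==2.25.1"]}]},
})

@ray.remote
def versions():
    import torch, torchvision  # imported inside the task so the driver env doesn't need them
    return torch.__version__, torchvision.__version__

print(ray.get(versions.remote()))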
@architkulkarni By installing on the cluster in advance, do you mean adding it to the Ray Docker image?

I have built a custom Docker image on top of the Ray image as follows:

ARG BASE_IMAGE=rayproject/ray:1.10.0
FROM $BASE_IMAGE

# CPU-only torch wheel pulled from the PyTorch wheel index
RUN pip install --no-cache-dir torch==1.10.2+cpu -f https://download.pytorch.org/whl/torch_stable.html
RUN pip install pytorch-lightning==1.5.0

and removed the torch installation from requirements.txt so that the runtime env is set up without error, but I still got a ModuleNotFoundError.

Is the Ray client code somehow built before the Docker image is used?

Thanks for adding the alternative!

Hmm, building it into the Docker image should definitely work. The Ray code is just Python, so there's no compilation step before or after the Docker image is built. Does import torch work on the cluster as expected (without using Ray)?

(To add more detail: Ray just calls pip install, and Ray is already running in the Docker container, so the new packages are installed into the container at runtime, while the existing packages remain importable. See the PR that added this functionality: [runtime_env] Make pip installs incremental by edoakes · Pull Request #20341 · ray-project/ray · GitHub.)
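Concretely, once torch and pytorch-lightning are baked into the image, the runtime_env only needs to carry what isn't preinstalled, e.g. (a sketch, with the pin taken from earlier in the thread):

runtime_env = {
    "working_dir": "./",
    # torch and pytorch-lightning come from the custom Docker image;
    # only the packages that are not baked in go through pip here
    "pip": ["requests==2.25.1"],
}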

Thanks @architkulkarni. Yeah, Docker is behaving as expected. My deployment is on k8s, and I had missed updating the operator image to the new Docker image. Once that was changed, it worked as expected.
