I have been struggling for days to get Ray to import PyTorch in a submitted job, and I am close to giving up and looking for another solution.
My submission is as follows:
ray job submit --runtime-env-json='{"working_dir": "./", "pip": ["requests==2.25.1", "torch==1.10.2", "torchvision==0.11.3"]}' -- "python test_ray_script.py"
Job submission server address: http://127.0.0.1:8265
2022-02-21 16:39:29,898 INFO sdk.py:189 -- Uploading package gcs://_ray_pkg_cc1f90b5cd6565c7.zip.
2022-02-21 16:39:29,900 INFO packaging.py:353 -- Creating a file package for local directory './'.
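For reference, the same submission can be built in Python, which should at least rule out a shell-quoting problem in the JSON. This is only a sketch: build_submit_command is a helper name I made up for illustration, and the script only prints the command rather than running it.

```python
import json
import shlex

# Same runtime environment as in the command above.
runtime_env = {
    "working_dir": "./",
    "pip": ["requests==2.25.1", "torch==1.10.2", "torchvision==0.11.3"],
}


def build_submit_command(runtime_env, entrypoint):
    """Return the `ray job submit` argument list (hypothetical helper)."""
    return [
        "ray", "job", "submit",
        # json.dumps guarantees valid JSON with double-quoted keys and values.
        "--runtime-env-json=" + json.dumps(runtime_env),
        "--", entrypoint,
    ]


command = build_submit_command(runtime_env, "python test_ray_script.py")
# shlex.join quotes each argument so the line can be pasted into a shell.
print(shlex.join(command))
```

The list in command can also be passed directly to subprocess.run, which avoids the shell entirely.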
It usually comes back with:
-------------------------------------------------------
Job 'raysubmit_a1GgQCA5jvemK1FR' submitted successfully
-------------------------------------------------------
Next steps
Query the logs of the job:
ray job logs raysubmit_a1GgQCA5jvemK1FR
Query the status of the job:
ray job status raysubmit_a1GgQCA5jvemK1FR
Request the job to be stopped:
ray job stop raysubmit_a1GgQCA5jvemK1FR
Tailing logs until the job exits (disable with --no-wait):
---------------------------------------
Job 'raysubmit_a1GgQCA5jvemK1FR' failed
---------------------------------------
The logs have been pretty unhelpful so far. Looking at gcs_server.err, I see something like:
[2022-02-21 07:25:20,785 E 161 161] gcs_actor_scheduler.cc:316: Failed to lease worker from node cac9e236ed0dac867f7a77d4f9b953e20fbdf42b535be63dbc5466ef for actor 3f78dc477a4de988d752876302000000(Counter.__init__) as the runtime environment setup failed, job id = 02000000
[2022-02-21 07:29:22,805 E 161 161] gcs_actor_scheduler.cc:316: Failed to lease worker from node cac9e236ed0dac867f7a77d4f9b953e20fbdf42b535be63dbc5466ef for actor 2f36444a140ac223e8c1d28a01000000(JobSupervisor.__init__) as the runtime environment setup failed, job id = 01000000
[2022-02-21 07:40:03,152 E 161 161] gcs_actor_scheduler.cc:316: Failed to lease worker from node cac9e236ed0dac867f7a77d4f9b953e20fbdf42b535be63dbc5466ef for actor 2ea9d20139d71f06e51fe3bb01000000(JobSupervisor.__init__) as the runtime environment setup failed, job id = 01000000
Could this be the actual problem, or is it just a consequence of an earlier error?
Looking at the log of job id = 01000000:
2022-02-21 07:39:53,109 INFO conda_utils.py:198 -- Using cached requests-2.25.1-py2.py3-none-any.whl (61 kB)
2022-02-21 07:39:53,185 INFO conda_utils.py:198 -- Collecting torch==1.10.2
2022-02-21 07:40:02,147 INFO conda_utils.py:198 --
Looking at raylet.err:
Using cached requests-2.25.1-py2.py3-none-any.whl (61 kB)
Collecting torch
[2022-02-21 07:40:03,152 E 218 218] agent_manager.cc:190: Failed to create runtime env: {"workingDir":"gcs://_ray_pkg_cc1f90b5cd6565c7.zip","uris":{"workingDirUri":"gcs://_ray_pkg_cc1f90b5cd6565c7.zip","pipUri":"pip://23ada401a940d4ca08aab4cb77421f32396eeb8b"},"extensions":{"_ray_commit":"5ea565317a8104c04ae7892bb9bb41c6d72f12df"},"pipRuntimeEnv":{"config":{"packages":["requests==2.25.1","torch==1.10.2","torchvision==0.11.3"]}}}, error message: Failed to install pip requirements:
Collecting requests==2.25.1
Using cached requests-2.25.1-py2.py3-none-any.whl (61 kB)
Collecting torch==1.10.2
Again, it says the pip requirements could not be installed, but not what actually went wrong.
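One way to get the full pip output might be to run the same install outside of Ray. A minimal sketch, assuming the failure comes from pip itself rather than from Ray; build_pip_command and run_install are hypothetical helper names, and the actual install is left commented out because it should be run in a throwaway virtualenv on the node where the job fails.

```python
import subprocess
import sys

# Same pinned packages as in the runtime_env of the failing job.
PACKAGES = ["requests==2.25.1", "torch==1.10.2", "torchvision==0.11.3"]


def build_pip_command(packages, python=sys.executable):
    """Return the pip install argument list for the given interpreter."""
    return [python, "-m", "pip", "install", *packages]


def run_install(packages):
    """Run pip and return (exit code, combined stdout and stderr)."""
    result = subprocess.run(
        build_pip_command(packages), capture_output=True, text=True
    )
    return result.returncode, result.stdout + result.stderr


# Print the command only; uncomment the lines below to actually install.
print(" ".join(build_pip_command(PACKAGES)))
# code, output = run_install(PACKAGES)
# print(output)
# print("pip exit code:", code)
```

If pip exits with a non-zero code here as well, the captured output should contain the part of the error that the Ray logs cut off.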
I have spent 3 days on this without making much progress, and at this point I am close to trying some other tool.