Ray job submit errors on Kubernetes

I am trying to use ray job submit from my laptop to launch a job on a Ray cluster running on GKE. So far I am having trouble setting up the runtime environment.

I have a sample TensorFlow job which is built from the code here: Ray Train: Distributed Deep Learning — Ray 1.11.0

I am using the following command:

ray job submit --runtime-env-json='{"working_dir": "./", "pip": ["requests==2.26.0", "tensorflow==2.5.1"]}' -- python ray-trainer.py

I always get back a generic error:

Job 'raysubmit_1XF28vZJfVx7FDqY' failed

Status message: runtime_env setup failed: The runtime_env failed to be set up.

There don’t seem to be any useful logs explaining why the environment was not set up. I also tried changing the pip field to point to a requirements.txt file, which is located in the same directory that I upload as working_dir:

$ ls -la ./
total 20
drwxr-x--- 2 ricliu primarygroup 4096 Mar 17 21:43 .
drwxr-x--- 4 ricliu primarygroup 4096 Mar 14 21:49 ..
-rwxrwxrwx 1 ricliu primarygroup 2084 Mar 17 18:33 ray-trainer.py
-rw-r--r-- 1 ricliu primarygroup   32 Mar 17 19:18 requirements.txt
-rw-r----- 1 ricliu primarygroup  378 Mar 17 03:14 script.py

$ ray job submit --runtime-env-json='{"working_dir": "./", "pip": "./requirements.txt"}' -- python ray-trainer.py

But the response seems to suggest that the file is not found:

Job 'raysubmit_ik2GYGXRRFWsAG1W' failed

Status message: Error occurred while starting the job: requirements.txt is not a valid file

Does anyone know what’s wrong here?

Hi, this was fixed recently in ray-project/ray Pull Request #22849, "[jobs] Make local pip/conda requirements files work with jobs" by architkulkarni.

We should have a Ray 1.12.0 release candidate in a week or so for you to try out with the fix. You can also try using the nightly wheels: Installing Ray — Ray 1.11.0
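Until a release with the fix is available, one possible workaround is to read requirements.txt on the laptop and pass its entries inline as the pip list, which is the form used in the first command. The following is only a sketch under assumptions: the helper names read_requirements and build_submit_command are hypothetical, and the parser handles plain requirement lines only (no "-r" includes or pip options). It prints the command rather than running it.

```python
import json
import shlex
from pathlib import Path


def read_requirements(path):
    """Return requirement specifiers from a simple requirements file.

    Blank lines and '#' comment lines are skipped. Option lines such as
    '-r other.txt' are not handled in this sketch.
    """
    requirements = []
    for raw_line in Path(path).read_text().splitlines():
        line = raw_line.strip()
        if line and not line.startswith("#"):
            requirements.append(line)
    return requirements


def build_submit_command(working_dir, requirements_path, entrypoint):
    """Build a `ray job submit` argument list with the pip list passed inline."""
    runtime_env = {
        "working_dir": working_dir,
        "pip": read_requirements(requirements_path),
    }
    return [
        "ray", "job", "submit",
        "--runtime-env-json=" + json.dumps(runtime_env),
        "--",
        *shlex.split(entrypoint),
    ]


if __name__ == "__main__":
    requirements_file = Path("requirements.txt")
    if requirements_file.exists():
        command = build_submit_command(
            "./", requirements_file, "python ray-trainer.py"
        )
        # Print a shell-quoted command to copy and run; nothing is submitted here.
        print(shlex.join(command))
```

Run from the directory that contains requirements.txt, this prints a command equivalent to the first one in the question, with the pip list taken from the file.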

I just realized this fix was only for the second error, not the first. I would expect the first command to work with Ray 1.11.0.

Are there any relevant runtime_env setup logs on the head node? By default these are located at /tmp/ray/session_latest/logs, and the relevant files would be dashboard_agent.log, runtime_env_setup-[job_id].log, or runtime_env_setup-ray_client_server_[port].log.
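To collect those files quickly, something like the following could be run on the head node. It is a minimal sketch: the function names find_runtime_env_logs and tail are hypothetical, the default directory is the one mentioned above, and the single pattern runtime_env_setup-*.log is assumed to cover both the per-job and the ray_client_server variants.

```python
from collections import deque
from pathlib import Path

# File name patterns based on the log names mentioned above.
LOG_PATTERNS = (
    "dashboard_agent.log",
    "runtime_env_setup-*.log",
)


def find_runtime_env_logs(log_dir="/tmp/ray/session_latest/logs"):
    """Return matching log files in log_dir, most recently modified first."""
    directory = Path(log_dir)
    matches = set()
    for pattern in LOG_PATTERNS:
        matches.update(directory.glob(pattern))
    return sorted(matches, key=lambda p: p.stat().st_mtime, reverse=True)


def tail(path, max_lines=50):
    """Return the last max_lines lines of a text file as a list of strings."""
    with open(path, errors="replace") as handle:
        return list(deque(handle, maxlen=max_lines))


if __name__ == "__main__":
    for log_path in find_runtime_env_logs():
        print(f"==> {log_path} <==")
        print("".join(tail(log_path)), end="")
```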

The error messages are automatically propagated to the CLI on the nightly wheels as well, in case it speeds up debugging:

❯ ray job submit --runtime-env-json='{"pip": ["doesnotexist"]}' -- echo hello
Job submission server address:

Job 'raysubmit_SVzZdUCRfFcaaG8y' submitted successfully

Next steps
  Query the logs of the job:
    ray job logs raysubmit_SVzZdUCRfFcaaG8y
  Query the status of the job:
    ray job status raysubmit_SVzZdUCRfFcaaG8y
  Request the job to be stopped:
    ray job stop raysubmit_SVzZdUCRfFcaaG8y

Tailing logs until the job exits (disable with --no-wait):

Job 'raysubmit_SVzZdUCRfFcaaG8y' failed

Status message: runtime_env setup failed: Failed to setup runtime environment.
Could not create the actor because its associated runtime env failed to be created.
Traceback (most recent call last):
  File "/Users/archit/ray/python/ray/dashboard/modules/runtime_env/runtime_env_agent.py", line 260, in CreateRuntimeEnv
    runtime_env_context = await _setup_runtime_env(
  File "/Users/archit/ray/python/ray/dashboard/modules/runtime_env/runtime_env_agent.py", line 149, in _setup_runtime_env
    size_bytes = await manager.create(
  File "/Users/archit/ray/python/ray/_private/runtime_env/pip.py", line 423, in create
    return await task
  File "/Users/archit/ray/python/ray/_private/runtime_env/pip.py", line 414, in _create_for_hash
    await PipProcessor(target_dir, runtime_env, logger)
  File "/Users/archit/ray/python/ray/_private/runtime_env/pip.py", line 326, in _run
    await self._install_pip_packages(
  File "/Users/archit/ray/python/ray/_private/runtime_env/pip.py", line 302, in _install_pip_packages
    await check_output_cmd(pip_install_cmd, logger=logger, cwd=cwd, env=pip_env)
  File "/Users/archit/ray/python/ray/_private/runtime_env/utils.py", line 101, in check_output_cmd
    raise SubprocessCalledProcessError(
ray._private.runtime_env.utils.SubprocessCalledProcessError: Run cmd[18] failed with the following details.
Command '['/tmp/ray/session_2022-03-17_15-22-36_124070_59807/runtime_resources/pip/8df9437b29c3f273fa6587bdffec5e399e705087/virtualenv/bin/python', '-m', 'pip', 'install', '--disable-pip-version-check', '--no-cache-dir', '-r', '/tmp/ray/session_2022-03-17_15-22-36_124070_59807/runtime_resources/pip/8df9437b29c3f273fa6587bdffec5e399e705087/requirements.txt']' returned non-zero exit status 1.
Last 50 lines of stdout:
    ERROR: Could not find a version that satisfies the requirement doesnotexist (from versions: none)
    ERROR: No matching distribution found for doesnotexist


❯ ray job submit --runtime-env-json='{"working_dir": "doesnotexist"}' -- echo hello
Job submission server address:
Traceback (most recent call last):
  File "/Users/archit/ray/python/ray/_private/runtime_env/working_dir.py", line 61, in upload_working_dir_if_needed
    working_dir_uri = get_uri_for_directory(working_dir, excludes=excludes)
  File "/Users/archit/ray/python/ray/_private/runtime_env/packaging.py", line 359, in get_uri_for_directory
    raise ValueError(f"directory {directory} must be an existing" " directory")
ValueError: directory /Users/archit/ray/doesnotexist must be an existing directory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/archit/anaconda3/envs/ray-py38/bin/ray", line 33, in <module>
    sys.exit(load_entry_point('ray', 'console_scripts', 'ray')())
  File "/Users/archit/ray/python/ray/scripts/scripts.py", line 2264, in main
    return cli()
  File "/Users/archit/anaconda3/envs/ray-py38/lib/python3.8/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/Users/archit/anaconda3/envs/ray-py38/lib/python3.8/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/Users/archit/anaconda3/envs/ray-py38/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/archit/anaconda3/envs/ray-py38/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/archit/anaconda3/envs/ray-py38/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/archit/anaconda3/envs/ray-py38/lib/python3.8/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/Users/archit/ray/python/ray/autoscaler/_private/cli_logger.py", line 808, in wrapper
    return f(*args, **kwargs)
  File "/Users/archit/ray/python/ray/dashboard/modules/job/cli.py", line 152, in submit
    job_id = client.submit_job(
  File "/Users/archit/ray/python/ray/dashboard/modules/job/sdk.py", line 71, in submit_job
  File "/Users/archit/ray/python/ray/dashboard/modules/dashboard_sdk.py", line 324, in _upload_working_dir_if_needed
    upload_working_dir_if_needed(runtime_env, upload_fn=_upload_fn)
  File "/Users/archit/ray/python/ray/_private/runtime_env/working_dir.py", line 65, in upload_working_dir_if_needed
    raise ValueError(
ValueError: directory doesnotexist must be an existing directory or a zip package

Thanks for the quick reply. I actually just ran into another issue: after some amount of uptime, it seems that nothing is listening on port 8265 anymore:

$ kubectl -n ray port-forward service/example-cluster-ray-head 8265:8265
Forwarding from -> 8265
Forwarding from [::1]:8265 -> 8265
Handling connection for 8265
E0317 22:21:58.501309  135973 portforward.go:406] an error occurred forwarding 8265 -> 8265: error forwarding port 8265 to pod 755dce9c462676627f07602de97f7bf9c52ab727336cda0d7d02f0b566c5d292, uid : failed to execute portforward in network namespace "/var/run/netns/cni-76ec8b4c-cc4c-49cb-aa89-f87ec7f5f9e4": failed to dial 8265: dial tcp4 connect: connection refused
E0317 22:21:58.501671  135973 portforward.go:234] lost connection to pod

Restarting the port-forwarding does not work either. Is this known?
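One way to tell whether anything answers behind the forward is to send an HTTP request rather than only opening a TCP connection, since the output above shows kubectl accepting the local connection before the forward to the pod fails. This is a rough sketch: the function name dashboard_responds is hypothetical, and it assumes the port-forward is bound to 127.0.0.1:8265 and that any HTTP response at all means a server is listening in the pod.

```python
import http.client
import urllib.error
import urllib.request


def dashboard_responds(host="127.0.0.1", port=8265, timeout=3.0):
    """Return True if any HTTP response comes back from host:port."""
    url = f"http://{host}:{port}/"
    # Ignore proxy settings from the environment; this is a local probe.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered, even if with a 4xx/5xx status.
        return True
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Refused, reset, closed without a response, or timed out.
        return False


if __name__ == "__main__":
    print("dashboard reachable:", dashboard_responds())
```

If this prints False while kubectl port-forward is running, the process serving port 8265 inside the head pod is likely the part that stopped, rather than the forward itself.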

The error I encountered above seems related to memory usage. After the memory on the head node exceeds the limit, there doesn’t seem to be a way to recover the port.

For the original error, this is what I found in the dashboard agent logs:

2022-03-17 18:24:38,214	INFO runtime_env_agent.py:179 -- Creating runtime env: {"workingDir":"gcs://_ray_pkg_8f66ee69b7e65e4b.zip","extensions":{"_ray_commit":"fec30a25dbb5f3fa81d2bf419f75f5d40bc9fc39"},"pythonRuntimeEnv":{"pipRuntimeEnv":{"config":{"packages":["requests==2.26.0","tensorflow==2.5.1"]}}},"uris":{"workingDirUri":"gcs://_ray_pkg_8f66ee69b7e65e4b.zip","pipUri":"pip://c450be8aadc76a3483ccbcce45f0470140a4d8c9"}}
2022-03-17 18:24:39,022	INFO conda_utils.py:198 -- Collecting requests==2.26.0
2022-03-17 18:24:39,046	INFO conda_utils.py:198 -- Using cached requests-2.26.0-py2.py3-none-any.whl (62 kB)
2022-03-17 18:24:39,357	INFO conda_utils.py:198 -- Collecting tensorflow==2.5.1
2022-03-17 18:24:45,269	ERROR runtime_env_agent.py:189 -- Runtime env creation failed.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/dashboard/modules/runtime_env/runtime_env_agent.py", line 186, in CreateRuntimeEnv
    request.serialized_allocated_resource_instances)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/dashboard/modules/runtime_env/runtime_env_agent.py", line 148, in _setup_runtime_env
    return await loop.run_in_executor(None, run_setup_with_logger)
  File "/home/ray/anaconda3/lib/python3.7/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/dashboard/modules/runtime_env/runtime_env_agent.py", line 106, in run_setup_with_logger
    runtime_env, context, logger=per_job_logger)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/runtime_env/pip.py", line 110, in setup
    _install_pip_list_to_dir(pip_packages, target_dir, logger=logger)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/runtime_env/pip.py", line 47, in _install_pip_list_to_dir
    f"Failed to install pip requirements:\n{output}")
RuntimeError: Failed to install pip requirements:
Collecting requests==2.26.0
Using cached requests-2.26.0-py2.py3-none-any.whl (62 kB)
Collecting tensorflow==2.5.1

I debugged this a bit further and I think the original error happened because the same dependencies were already present on the nodes.

If I deploy a brand new cluster then I can get past the failure above. But if I submit the same job again with the same runtime_env_json, I will get the above error. Retrying with a different runtime_env_json does not run into the same error.

Is there a recommended pattern for managing dependencies on long-running clusters?

Sorry for the late reply here. This is definitely the recommended approach for a long-running cluster. The failure should not happen on the second submission; the same runtime_env should be reused. Is this happening on Ray 1.12? Happy to help debug further. I’m also curious whether the issue is still reproducible with less heavyweight pip requirements, because the pip output halts in the middle of “Collecting tensorflow”, which could point to some sort of memory limit.
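To illustrate what reuse is expected to mean: the agent log above records a pipUri of the form pip://<hash>, which suggests that environments are keyed by their contents. The toy model below is not Ray's implementation; env_cache_key and EnvCache are hypothetical names, and the hashing scheme is an assumption chosen only to show the idea that an identical runtime_env maps to the same key and is built once, while a different one gets its own entry.

```python
import hashlib
import json


def env_cache_key(runtime_env):
    """Return a stable key for a runtime_env dict (illustrative only)."""
    canonical = json.dumps(runtime_env, sort_keys=True)
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()


class EnvCache:
    """Toy cache: build an environment once per distinct key, then reuse it."""

    def __init__(self, build):
        self._build = build  # callable that creates an environment
        self._envs = {}

    def get(self, runtime_env):
        key = env_cache_key(runtime_env)
        if key not in self._envs:
            # First time this exact runtime_env is seen: build it.
            self._envs[key] = self._build(runtime_env)
        return self._envs[key]
```

Under this model, submitting the same runtime_env_json twice would trigger a single build, which matches the expectation that the second submission should not fail during setup.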

I’m currently experiencing the port-forwarding error.

an error occurred forwarding 8265 -> 8265: error forwarding port 8265 to pod...

Increasing memory limits didn’t seem to help… Will investigate some more.