Error in `ray job submit` on local machine if multiple clusters are running at the same time

M_S · May 17, 2024, 5:55am

How severe does this issue affect your experience of using Ray?

Medium: It contributes to significant difficulty to complete my task, but I can work around it.

Hi there,

We have a server where we run multiple Ray clusters simultaneously.

Clusters are started like this:

ray start --head --num-gpus=0 --temp-dir=/tmp/ray --port=45521 --dashboard-port=40925 --ray-client-server-port=52097

If I have one cluster running, I can easily submit a job via:

ray job submit --no-wait --address=http://127.0.0.1:40925/ -- python ray_cluster_example.py

and the job runs without problems.

If I have two clusters running at the same time (with distinct ports of course), I run into the following error upon submitting a job to one of them:

Job submission server address: http://127.0.0.1:36403
Traceback (most recent call last):
  File "/home/aaa/dev/miniconda3/envs/xxx/bin/ray", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/aaa/dev/miniconda3/envs/xxx/lib/python3.11/site-packages/ray/scripts/scripts.py", line 2612, in main
    return cli()
           ^^^^^
  File "/home/aaa/dev/miniconda3/envs/xxx/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/aaa/dev/miniconda3/envs/xxx/lib/python3.11/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/home/aaa/dev/miniconda3/envs/xxx/lib/python3.11/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/aaa/dev/miniconda3/envs/xxx/lib/python3.11/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/aaa/dev/miniconda3/envs/xxx/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/aaa/dev/miniconda3/envs/xxx/lib/python3.11/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/aaa/dev/miniconda3/envs/xxx/lib/python3.11/site-packages/ray/dashboard/modules/job/cli_utils.py", line 54, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/aaa/dev/miniconda3/envs/xxx/lib/python3.11/site-packages/ray/autoscaler/_private/cli_logger.py", line 856, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/home/aaa/dev/miniconda3/envs/xxx/lib/python3.11/site-packages/ray/dashboard/modules/job/cli.py", line 273, in submit
    job_id = client.submit_job(
             ^^^^^^^^^^^^^^^^^^
  File "/home/aaa/dev/miniconda3/envs/xxx/lib/python3.11/site-packages/ray/dashboard/modules/job/sdk.py", line 254, in submit_job
    self._raise_error(r)
  File "/home/aaa/dev/miniconda3/envs/xxx/lib/python3.11/site-packages/ray/dashboard/modules/dashboard_sdk.py", line 283, in _raise_error
    raise RuntimeError(
RuntimeError: Request failed with status code 500: No available agent to submit job, please try again later..

Note that the contents of the python file do not matter, because this error happens before it is even called.

Is this a bug, or am I missing something?

Thanks!

davidxia · June 4, 2024, 8:35pm

@M_S, what Ray version are you using? This happens for me on both Ray 2.23.0 and 2.9.3.

Here’s a minimal repro.

create test file:

echo 'print("Hello, World!")' >> test.py

create clusters:

ray start --head --port=45521 \
  --dashboard-port=40925 --ray-client-server-port=52097

ray start --head --port=45522 \
  --dashboard-port=40926 --ray-client-server-port=52098

submit job to first cluster (runs fine):

ray job submit --address=http://127.0.0.1:40925/ -- python test.py

Job submission server address: http://127.0.0.1:40925

-------------------------------------------------------
Job 'raysubmit_SNWMzwRLriF5gQJQ' submitted successfully
-------------------------------------------------------

Next steps
  Query the logs of the job:
    ray job logs raysubmit_SNWMzwRLriF5gQJQ
  Query the status of the job:
    ray job status raysubmit_SNWMzwRLriF5gQJQ
  Request the job to be stopped:
    ray job stop raysubmit_SNWMzwRLriF5gQJQ

Tailing logs until the job exits (disable with --no-wait):
2024-06-04 21:25:56,172 INFO job_manager.py:530 -- Runtime env is setting up.
Hello, World!

------------------------------------------
Job 'raysubmit_SNWMzwRLriF5gQJQ' succeeded
------------------------------------------

submit job to second cluster (always hangs and fails):

ray job submit --address=http://127.0.0.1:40926/ -- python test.py

RuntimeError: Request failed with status code 500: No available agent to submit job, please try again later..

Job submission to the second cluster always fails, even if no job has been submitted to the first. This suggests that when creating multiple local clusters on a single host, all clusters after the first are broken, at least with regard to job submission. It seems the second cluster is not created with any agents at all.

Sam_Chan · June 4, 2024, 10:10pm

Just to make sure I’m understanding correctly, you are running multiple clusters on the same host (localhost)?

davidxia · June 4, 2024, 11:42pm

Yes, that is correct.

Sam_Chan · June 5, 2024, 7:28am

That’s not a supported scenario…it might work but multiple Ray Clusters sharing the same physical infrastructure isn’t something that is supported in all of the Core components.

We have an REP out to establish reliable virtual isolation for Ray Clusters sharing the same host but that is still work in progress (see details here: [REP] Virtual Cluster by jjyao · Pull Request #49 · ray-project/enhancements · GitHub)

davidxia · June 5, 2024, 2:05pm

@Sam_Chan, thanks! Is there a place in the Ray docs that states that starting multiple local Ray clusters on a single machine isn’t recommended? If not, I would love to see that stated a bit more explicitly.

This sentence in ray.init()'s docstring seems to imply starting multiple local Ray instances is OK?

If the provided address is “local”, start a new local Ray instance, even if there is already an existing local Ray instance.

— ray.init — Ray 2.23.0

Sam_Chan · June 10, 2024, 5:18pm

This is a good catch - would you be willing to submit a PR to our docs with some better language?

davidxia · June 10, 2024, 6:00pm

Like this? docs: warn about running multiple local Ray instances by davidxia · Pull Request #45836 · ray-project/ray · GitHub

M_S · June 13, 2024, 7:56am

Hi there,

thanks for the replies and thanks for picking up my question @davidxia .

I’d like to elaborate on the reason why we have multiple clusters on a single machine and ask for best practices and how to do this alternatively @Sam_Chan.

We have a server with many GPUs, where developers debug their code and run preliminary experiments with some DL models. This means that one user might need to run the code from their own branch. They therefore need to launch their own Ray cluster, since some of the nodes/endpoints they are testing might behave differently from the main branch. Another developer might need the behavior of the main branch (or of a completely different branch).

Thus, in that scenario it does not make sense to just have one ray cluster on the machine that everyone can use.

What is the recommended way to deal with this situation? The amount of compute required is too much to run on a developer's local machine.

Thanks!

Sam_Chan · June 18, 2024, 11:08pm

How are your developers packaging their dependencies? I would imagine each use case has different library and package dependencies.

M_S · June 19, 2024, 4:55am

We mostly have a big monorepo where dependencies are shared.

If dependencies differ, they just change their conda environment for developing. So basically from a single dev perspective, each whole cluster has one big set of shared dependencies, but this will likely differ between devs.

M_S · July 5, 2024, 5:21am

So are there any recommendations @Sam_Chan?
Thanks!

Sam_Chan · July 5, 2024, 10:31pm

See above @M_S ; Ray would need to support Virtual Clusters to ensure reliability in the physical resource sharing you described.

We have the project scoped but are not currently working on it. We’ll revisit prioritization at the beginning of August.

jjyao · July 9, 2024, 7:31pm

This is because of a dashboard agent port conflict. You need to specify distinct ports for the two Ray instances: Configuring Ray — Ray 2.31.0

davidxia · July 9, 2024, 7:56pm

@jjyao, are you referring to --dashboard-port? If so, my minimal repro above uses different ones. Any explanation of why the second cluster in that example doesn’t have agents?

jjyao · July 9, 2024, 8:17pm

no, it’s --dashboard-agent-listen-port
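For anyone who wants to confirm the conflict before starting a second instance, here is a rough Python sketch (not part of Ray) that tries to bind the agent port and reports whether it is already taken. The helper name `port_is_free` is made up for illustration, and 52365 is assumed to be the default for --dashboard-agent-listen-port; check the port configuration docs for your Ray version.

```python
import socket

# Assumption: 52365 is the default --dashboard-agent-listen-port.
# Verify this against the docs for the Ray version you run.
ASSUMED_DEFAULT_AGENT_PORT = 52365


def port_is_free(port: int, host: str = "0.0.0.0") -> bool:
    """Return True if a TCP socket can currently be bound to (host, port).

    Hypothetical helper: binding on all interfaces fails if any process
    already holds the port, which is the situation a second cluster hits.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
    return True


if __name__ == "__main__":
    state = "free" if port_is_free(ASSUMED_DEFAULT_AGENT_PORT) else "in use"
    print(f"port {ASSUMED_DEFAULT_AGENT_PORT} is {state}")
```

If the port shows as in use while one cluster is already running, the second `ray start` most likely needs its own --dashboard-agent-listen-port value. The check is only a snapshot: another process can still grab the port between the check and `ray start`.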

davidxia · July 10, 2024, 11:57am

Thanks, that worked for me!

create test file:

echo 'print("Hello, World!")' >> test.py

create clusters:

ray start --head --port=45521 \
  --dashboard-port=40925 \
  --dashboard-agent-listen-port=52365 \
  --ray-client-server-port=52097

ray start --head --port=45522 \
  --dashboard-port=40926 \
  --dashboard-agent-listen-port=52366 \
  --ray-client-server-port=52098

submit job to first cluster (successful):

ray job submit --address=http://127.0.0.1:40925/ -- python test.py

Job submission server address: http://127.0.0.1:40925

-------------------------------------------------------
Job 'raysubmit_SNWMzwRLriF5gQJQ' submitted successfully
-------------------------------------------------------

Next steps
  Query the logs of the job:
    ray job logs raysubmit_SNWMzwRLriF5gQJQ
  Query the status of the job:
    ray job status raysubmit_SNWMzwRLriF5gQJQ
  Request the job to be stopped:
    ray job stop raysubmit_SNWMzwRLriF5gQJQ

Tailing logs until the job exits (disable with --no-wait):
2024-06-04 21:25:56,172 INFO job_manager.py:530 -- Runtime env is setting up.
Hello, World!

------------------------------------------
Job 'raysubmit_SNWMzwRLriF5gQJQ' succeeded
------------------------------------------

submit job to second cluster (successful):

ray job submit --address=http://127.0.0.1:40926/ -- python test.py

Job submission server address: http://127.0.0.1:40926

-------------------------------------------------------
Job 'raysubmit_LJpfSimruNCb6kyb' submitted successfully
-------------------------------------------------------

Next steps
  Query the logs of the job:
    ray job logs raysubmit_LJpfSimruNCb6kyb
  Query the status of the job:
    ray job status raysubmit_LJpfSimruNCb6kyb
  Request the job to be stopped:
    ray job stop raysubmit_LJpfSimruNCb6kyb

Tailing logs until the job exits (disable with --no-wait):
2024-07-10 11:53:19,976	INFO job_manager.py:530 -- Runtime env is setting up.
Hello, World!

------------------------------------------
Job 'raysubmit_LJpfSimruNCb6kyb' succeeded
------------------------------------------
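If you start clusters like this often, the port bookkeeping can be scripted. Below is a minimal Python sketch that derives a distinct port set per cluster by offsetting the values from the example above and prints the matching `ray start` command. `ClusterPorts`, `ports_for`, and `ray_start_command` are hypothetical names, not part of Ray, and the sketch only covers the four flags used in this thread; other Ray ports may still need attention.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterPorts:
    """The four ports set explicitly in this thread (hypothetical container)."""
    port: int
    dashboard_port: int
    dashboard_agent_listen_port: int
    ray_client_server_port: int


# Base values copied from the first cluster in the example above.
BASE = ClusterPorts(
    port=45521,
    dashboard_port=40925,
    dashboard_agent_listen_port=52365,
    ray_client_server_port=52097,
)


def ports_for(index: int, base: ClusterPorts = BASE) -> ClusterPorts:
    """Offset every port in `base` by `index` so no two clusters share one."""
    return ClusterPorts(
        port=base.port + index,
        dashboard_port=base.dashboard_port + index,
        dashboard_agent_listen_port=base.dashboard_agent_listen_port + index,
        ray_client_server_port=base.ray_client_server_port + index,
    )


def ray_start_command(ports: ClusterPorts) -> str:
    """Build the `ray start --head` command line for one cluster."""
    return " ".join([
        "ray", "start", "--head",
        f"--port={ports.port}",
        f"--dashboard-port={ports.dashboard_port}",
        f"--dashboard-agent-listen-port={ports.dashboard_agent_listen_port}",
        f"--ray-client-server-port={ports.ray_client_server_port}",
    ])


if __name__ == "__main__":
    for i in range(2):
        print(ray_start_command(ports_for(i)))
```

With index 0 and 1 this reproduces the two `ray start` commands shown above. The sketch does not check that the ports are actually free, and it does not change the fact that running multiple clusters on one host is described earlier in this thread as unsupported.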

davidxia · July 10, 2024, 12:32pm

@jjyao even though it’s possible to run multiple Ray instances on a single machine, is it not recommended? Both @Sam_Chan here and @Stephanie_Wang in this other discussion thread have said it’s not recommended. If it’s not, I think it’s helpful for the ray.init() documentation to state that, right?

I also commented on the PR "docs: warn about running multiple local Ray instances", if you have time to take another look.
