Error in `ray job submit` on local machine if multiple clusters are running at the same time

How severely does this issue affect your experience of using Ray?

  • Medium: It contributes to significant difficulty to complete my task, but I can work around it.

Hi there,

We have a server where we simultaneously run multiple Ray clusters.

Clusters are started like this:

ray start --head --num-gpus=0 --temp-dir=/tmp/ray --port=45521 --dashboard-port=40925 --ray-client-server-port=52097

If I have one cluster running, I can easily submit a job via:

ray job submit --no-wait --address=http://127.0.0.1:40925/ -- python ray_cluster_example.py

and the job runs without problems.

If I have two clusters running at the same time (with distinct ports of course), I run into the following error upon submitting a job to one of them:

Job submission server address: http://127.0.0.1:36403
Traceback (most recent call last):
  File "/home/aaa/dev/miniconda3/envs/xxx/bin/ray", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/aaa/dev/miniconda3/envs/xxx/lib/python3.11/site-packages/ray/scripts/scripts.py", line 2612, in main
    return cli()
           ^^^^^
  File "/home/aaa/dev/miniconda3/envs/xxx/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/aaa/dev/miniconda3/envs/xxx/lib/python3.11/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/home/aaa/dev/miniconda3/envs/xxx/lib/python3.11/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/aaa/dev/miniconda3/envs/xxx/lib/python3.11/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/aaa/dev/miniconda3/envs/xxx/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/aaa/dev/miniconda3/envs/xxx/lib/python3.11/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/aaa/dev/miniconda3/envs/xxx/lib/python3.11/site-packages/ray/dashboard/modules/job/cli_utils.py", line 54, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/aaa/dev/miniconda3/envs/xxx/lib/python3.11/site-packages/ray/autoscaler/_private/cli_logger.py", line 856, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/home/aaa/dev/miniconda3/envs/xxx/lib/python3.11/site-packages/ray/dashboard/modules/job/cli.py", line 273, in submit
    job_id = client.submit_job(
             ^^^^^^^^^^^^^^^^^^
  File "/home/aaa/dev/miniconda3/envs/xxx/lib/python3.11/site-packages/ray/dashboard/modules/job/sdk.py", line 254, in submit_job
    self._raise_error(r)
  File "/home/aaa/dev/miniconda3/envs/xxx/lib/python3.11/site-packages/ray/dashboard/modules/dashboard_sdk.py", line 283, in _raise_error
    raise RuntimeError(
RuntimeError: Request failed with status code 500: No available agent to submit job, please try again later..

Note that the contents of the Python file do not matter, because this error happens before the script is even executed.

Is this a bug, or am I missing something?

Thanks!


@M_S, what Ray version are you using? This happens for me on both Ray 2.23.0 and 2.9.3.

Here’s a minimal repro.

  1. create a test file: echo 'print("Hello, World!")' >> test.py

  2. create clusters

    ray start --head --port=45521 \
      --dashboard-port=40925 --ray-client-server-port=52097
    
    ray start --head --port=45522 \
      --dashboard-port=40926 --ray-client-server-port=52098
    
  3. submit job to first cluster (runs fine):

    ray job submit --address=http://127.0.0.1:40925/ -- python test.py
    
    Job submission server address: http://127.0.0.1:40925
    
    -------------------------------------------------------
    Job 'raysubmit_SNWMzwRLriF5gQJQ' submitted successfully
    -------------------------------------------------------
    
    Next steps
      Query the logs of the job:
        ray job logs raysubmit_SNWMzwRLriF5gQJQ
      Query the status of the job:
        ray job status raysubmit_SNWMzwRLriF5gQJQ
      Request the job to be stopped:
        ray job stop raysubmit_SNWMzwRLriF5gQJQ
    
    Tailing logs until the job exits (disable with --no-wait):
    2024-06-04 21:25:56,172 INFO job_manager.py:530 -- Runtime env is setting up.
    Hello, World!
    
    ------------------------------------------
    Job 'raysubmit_SNWMzwRLriF5gQJQ' succeeded
    ------------------------------------------
    
  4. submit job to second cluster (always hangs and fails):

    ray job submit --address=http://127.0.0.1:40926/ -- python test.py
    
    RuntimeError: Request failed with status code 500: No available agent to submit job, please try again later..
    

Job submission to the second cluster always fails, even if no job has been submitted to the first. This suggests that when creating multiple local clusters on a single host, all clusters after the first are broken, at least with regard to job submission. It seems like the second cluster is not created with any agents at all.
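
FWIW, I'd expect the same error through the Python Job Submission SDK as well, since the CLI ends up in the same code path (the traceback above goes through sdk.py). A minimal, untested sketch using the dashboard ports from the repro:

    from ray.job_submission import JobSubmissionClient

    # First cluster's dashboard (port 40925): submission succeeds.
    client_a = JobSubmissionClient("http://127.0.0.1:40925")
    print(client_a.submit_job(entrypoint="python test.py"))

    # Second cluster's dashboard (port 40926): presumably fails with the same
    # RuntimeError ("No available agent to submit job").
    client_b = JobSubmissionClient("http://127.0.0.1:40926")
    print(client_b.submit_job(entrypoint="python test.py"))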

Just to make sure I’m understanding correctly: you are running multiple clusters on the same localhost?

Yes, that is correct.

That’s not a supported scenario… it might work, but multiple Ray Clusters sharing the same physical infrastructure isn’t something that is supported across all of the Core components.

We have a REP out to establish reliable virtual isolation for Ray Clusters sharing the same host, but that is still a work in progress (see details here: [REP] Virtual Cluster by jjyao · Pull Request #49 · ray-project/enhancements · GitHub)

@Sam_Chan, thanks! Is there a place in the Ray docs that states that starting multiple local Ray clusters on a single machine isn’t recommended? If not, I’d love to see that stated a bit more explicitly.

This sentence in ray.init()'s docstring seems to imply that starting multiple local Ray instances is OK?

If the provided address is “local”, start a new local Ray instance, even if there is already an existing local Ray instance.

— ray.init — Ray 2.23.0
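
For example, my reading of that sentence is that the following (a minimal sketch based only on the docstring quoted above) is expected to work:

    import ray

    # Per the docstring sentence quoted above, address="local" always starts a
    # *new* local Ray instance, even if another local instance is already running.
    ray.init(address="local")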

This is a good catch. Would you be willing to submit a PR to our docs with some better language?

Like this? docs: warn about running multiple local Ray instances by davidxia · Pull Request #45836 · ray-project/ray · GitHub

Hi there,

Thanks for the replies, and thanks for picking up my question, @davidxia.

I’d like to elaborate on why we have multiple clusters on a single machine and ask @Sam_Chan for best practices and possible alternatives.

We have a server with many GPUs where developers debug their code and run preliminary experiments with some DL models. This means that one user might need to run our code from their own branch, so they launch their own Ray cluster, since some of the nodes/endpoints they are testing might behave differently from the main branch. Another developer might need the behavior of the main branch (or a completely different branch).

Thus, in that scenario it does not make sense to have just one Ray cluster on the machine that everyone uses.

What is the recommended way to deal with this situation? The amount of compute required is too much to run on the developers’ local machines.

Thanks!

How are your developers packaging their dependencies? I would imagine each use case has differing library and package dependencies.

We mostly have a big monorepo where dependencies are shared.

If dependencies differ, they just switch their conda environment for development. So basically, from a single dev’s perspective, each cluster has one big set of shared dependencies, but these will likely differ between devs.

So are there any recommendations, @Sam_Chan?
Thanks!

See above, @M_S; Ray would need to support Virtual Clusters to ensure reliability for the physical resource sharing you described.

We have the project scoped right now but are not currently working on it. We’ll revisit prioritization at the beginning of August.

This is because there are dashboard agent port conflicts. You need to specify distinct ports for these two Ray instances: Configuring Ray — Ray 2.31.0

@jjyao, are you referring to --dashboard-port? If so, my minimal repro above uses different ones. Any explanation of why the second cluster in that example doesn’t have agents?

No, it’s --dashboard-agent-listen-port.

Thanks, that worked for me!

  1. create a test file: echo 'print("Hello, World!")' >> test.py

  2. create clusters

    ray start --head --port=45521 \
      --dashboard-port=40925 \
      --dashboard-agent-listen-port=52365 \
      --ray-client-server-port=52097
    
    ray start --head --port=45522 \
      --dashboard-port=40926 \
      --dashboard-agent-listen-port=52366 \
      --ray-client-server-port=52098
    
  3. submit job to first cluster (successful):

    ray job submit --address=http://127.0.0.1:40925/ -- python test.py
    
    Job submission server address: http://127.0.0.1:40925
    
    -------------------------------------------------------
    Job 'raysubmit_SNWMzwRLriF5gQJQ' submitted successfully
    -------------------------------------------------------
    
    Next steps
      Query the logs of the job:
        ray job logs raysubmit_SNWMzwRLriF5gQJQ
      Query the status of the job:
        ray job status raysubmit_SNWMzwRLriF5gQJQ
      Request the job to be stopped:
        ray job stop raysubmit_SNWMzwRLriF5gQJQ
    
    Tailing logs until the job exits (disable with --no-wait):
    2024-06-04 21:25:56,172 INFO job_manager.py:530 -- Runtime env is setting up.
    Hello, World!
    
    ------------------------------------------
    Job 'raysubmit_SNWMzwRLriF5gQJQ' succeeded
    ------------------------------------------
    
  4. submit job to second cluster (successful):

    ray job submit --address=http://127.0.0.1:40926/ -- python test.py
    
    Job submission server address: http://127.0.0.1:40926
    
    -------------------------------------------------------
    Job 'raysubmit_LJpfSimruNCb6kyb' submitted successfully
    -------------------------------------------------------
    
    Next steps
      Query the logs of the job:
        ray job logs raysubmit_LJpfSimruNCb6kyb
      Query the status of the job:
        ray job status raysubmit_LJpfSimruNCb6kyb
      Request the job to be stopped:
        ray job stop raysubmit_LJpfSimruNCb6kyb
    
    Tailing logs until the job exits (disable with --no-wait):
    2024-07-10 11:53:19,976	INFO job_manager.py:530 -- Runtime env is setting up.
    Hello, World!
    
    ------------------------------------------
    Job 'raysubmit_LJpfSimruNCb6kyb' succeeded
    ------------------------------------------
    

@jjyao, even though it’s possible to run multiple Ray instances on a single machine, is it still not recommended? Both @Sam_Chan here and @Stephanie_Wang in this other discussion thread have said it’s not recommended. If it’s not recommended, I think it would be helpful for the ray.init() documentation to say so, right?

I also commented on the PR docs: warn about running multiple local Ray instances if you have time to take another look.