Error in `ray job submit` on local machine if multiple clusters are running at the same time

@M_S, what Ray version are you using? This happens for me on both Ray 2.23.0 and 2.9.3.

Here’s a minimal repro.

  1. create test file:

    echo 'print("Hello, World!")' >> test.py

  2. create two clusters on the same host:

    ray start --head --port=45521 \
      --dashboard-port=40925 --ray-client-server-port=52097
    
    ray start --head --port=45522 \
      --dashboard-port=40926 --ray-client-server-port=52098
    
  3. submit job to first cluster (runs fine):

    ray job submit --address=http://127.0.0.1:40925/ -- python test.py
    
    Job submission server address: http://127.0.0.1:40925
    
    -------------------------------------------------------
    Job 'raysubmit_SNWMzwRLriF5gQJQ' submitted successfully
    -------------------------------------------------------
    
    Next steps
      Query the logs of the job:
        ray job logs raysubmit_SNWMzwRLriF5gQJQ
      Query the status of the job:
        ray job status raysubmit_SNWMzwRLriF5gQJQ
      Request the job to be stopped:
        ray job stop raysubmit_SNWMzwRLriF5gQJQ
    
    Tailing logs until the job exits (disable with --no-wait):
    2024-06-04 21:25:56,172 INFO job_manager.py:530 -- Runtime env is setting up.
    Hello, World!
    
    ------------------------------------------
    Job 'raysubmit_SNWMzwRLriF5gQJQ' succeeded
    ------------------------------------------
    
  4. submit job to second cluster (always hangs and fails):

    ray job submit --address=http://127.0.0.1:40926/ -- python test.py
    
    RuntimeError: Request failed with status code 500: No available agent to submit job, please try again later..
    

Job submission to the second cluster always fails, even if no job has been submitted to the first. This suggests that when multiple local clusters are created on a single host, every cluster after the first is broken, at least with regard to job submission. It looks like the second cluster is created without any agents at all.
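
In case it helps anyone reproduce this without the CLI, here is a rough sketch of the same two submissions using the Ray Jobs Python SDK (`JobSubmissionClient` from `ray.job_submission`). The helper names `ray_submit` and `try_submit` are my own and purely illustrative, and I'm assuming the dashboard ports from the repro above and that `test.py` is visible to the job the same way it is for `ray job submit`. I haven't confirmed that the SDK path fails in exactly the same way as the CLI.

```python
from typing import Callable, Dict, Iterable

# Dashboard addresses taken from the repro steps above.
DASHBOARDS = ["http://127.0.0.1:40925", "http://127.0.0.1:40926"]


def ray_submit(address: str, entrypoint: str) -> str:
    """Submit a job through the Ray Jobs SDK and return its submission ID."""
    # Imported lazily so try_submit() can be exercised without Ray installed.
    from ray.job_submission import JobSubmissionClient

    return JobSubmissionClient(address).submit_job(entrypoint=entrypoint)


def try_submit(
    addresses: Iterable[str],
    entrypoint: str = "python test.py",
    submit: Callable[[str, str], str] = ray_submit,
) -> Dict[str, str]:
    """Submit the same entrypoint to every dashboard and record the outcome.

    Returns a dict mapping each address to "submitted: <id>" or
    "failed: <error message>". `submit` is a hypothetical hook that lets
    the reporting logic be checked with a fake submitter.
    """
    results: Dict[str, str] = {}
    for address in addresses:
        try:
            results[address] = "submitted: " + submit(address, entrypoint)
        except Exception as exc:  # broad on purpose: we only report the error
            results[address] = "failed: " + str(exc)
    return results
```

With both clusters running, `print(try_submit(DASHBOARDS))` should show at a glance which dashboards accept a job and which reject it, which makes it easier to compare Ray versions.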