Ray job submit API fails with "No available agent to submit job"

1. Severity of the issue: (select one)
None: I’m just curious or want clarification.
Low: Annoying but doesn’t hinder my work.
Medium: Significantly affects my productivity but can find a workaround.
High: Completely blocks me.

2. Environment:

  • Ray version: 2.43.0
  • Python version: 3.11.11
  • OS: CentOS
  • Cloud/Infrastructure: None
  • Other libs/tools (if relevant): torch

3. Repro steps / sample code: (optional, but helps a lot!)

ray start --head --node-ip-address=127.0.0.1 --port=6379 --dashboard-host=127.0.0.1 --dashboard-port=8265 --num-gpus 8 --temp-dir ~/.cache/ray
ray list nodes
RAY_ADDRESS="http://127.0.0.1:8265" ray job submit --verbose --working-dir test -- python test.py

4. What happened vs. what you expected:

  • Expected: The job submits successfully.
  • Actual: The CLI reports that it is packaging `test` (line 575 of packaging.py), then hangs for about 5 minutes before failing with the error below. I can submit jobs successfully on other machines with the same configuration, so I suspect the difference is that this machine has more network interfaces (many NICs and InfiniBand cards). As a cross-check, I skipped `ray job submit` and called `ray.init()` directly in the Python script instead; that job runs and appears in `ray list jobs`, so I don't understand why the job submission API fails.
Job submission server address: http://127.0.0.1:8265
2025-03-18 13:44:30,466 INFO dashboard_sdk.py:338 -- Uploading package gcs://_ray_pkg_0a62fe483960e4e9.zip.
2025-03-18 13:44:30,466 INFO packaging.py:575 -- Creating a file package for local module 'test'.
Traceback (most recent call last):
  File "/root/nfs/anaconda3/envs/lmm-r1/bin/ray", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/root/nfs/anaconda3/envs/lmm-r1/lib/python3.11/site-packages/ray/scripts/scripts.py", line 2691, in main
    return cli()
           ^^^^^
  File "/root/nfs/anaconda3/envs/lmm-r1/lib/python3.11/site-packages/click/core.py", line 1161, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/nfs/anaconda3/envs/lmm-r1/lib/python3.11/site-packages/click/core.py", line 1082, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/root/nfs/anaconda3/envs/lmm-r1/lib/python3.11/site-packages/click/core.py", line 1697, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/nfs/anaconda3/envs/lmm-r1/lib/python3.11/site-packages/click/core.py", line 1697, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/nfs/anaconda3/envs/lmm-r1/lib/python3.11/site-packages/click/core.py", line 1443, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/nfs/anaconda3/envs/lmm-r1/lib/python3.11/site-packages/click/core.py", line 788, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/nfs/anaconda3/envs/lmm-r1/lib/python3.11/site-packages/ray/dashboard/modules/job/cli_utils.py", line 54, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/root/nfs/anaconda3/envs/lmm-r1/lib/python3.11/site-packages/ray/autoscaler/_private/cli_logger.py", line 823, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/root/nfs/anaconda3/envs/lmm-r1/lib/python3.11/site-packages/ray/dashboard/modules/job/cli.py", line 276, in submit
    job_id = client.submit_job(
             ^^^^^^^^^^^^^^^^^^
  File "/root/nfs/anaconda3/envs/lmm-r1/lib/python3.11/site-packages/ray/dashboard/modules/job/sdk.py", line 250, in submit_job
    self._raise_error(r)
  File "/root/nfs/anaconda3/envs/lmm-r1/lib/python3.11/site-packages/ray/dashboard/modules/dashboard_sdk.py", line 283, in _raise_error
    raise RuntimeError(
RuntimeError: Request failed with status code 500: No available agent to submit job, please try again later..
  • I can provide more information if needed. Thanks.
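
To sanity-check the multi-NIC hypothesis, the machine's interfaces can be enumerated with the standard library alone (a diagnostic sketch, not part of Ray; `socket.if_nameindex` is POSIX-only):

```python
# List this machine's network interfaces, to compare against whatever
# address the Ray dashboard agent actually bound to.
import socket

interfaces = socket.if_nameindex()  # [(index, name), ...]
for index, name in interfaces:
    print(index, name)
```

On a machine with many NICs and IB cards this list is long, which is exactly the situation where an agent can end up registered under a different address than the one the head node expects.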

This seems to trigger the `asyncio.TimeoutError` handled around line 391 of `dashboard/modules/job/job_head.py`:

    @routes.post("/api/jobs/")
    async def submit_job(self, req: Request) -> Response:
        result = await parse_and_validate_request(req, JobSubmitRequest)
        # Request parsing failed, returned with Response object.
        if isinstance(result, Response):
            return result
        else:
            submit_request: JobSubmitRequest = result

        try:
            job_agent_client = await asyncio.wait_for(
                self.get_target_agent(),
                timeout=WAIT_AVAILABLE_AGENT_TIMEOUT,
            )
            resp = await job_agent_client.submit_job_internal(submit_request)
        except asyncio.TimeoutError:
            return Response(
                text="No available agent to submit job, please try again later.",
                status=aiohttp.web.HTTPInternalServerError.status_code,
            )
        except (TypeError, ValueError):
            return Response(
                text=traceback.format_exc(),
                status=aiohttp.web.HTTPBadRequest.status_code,
            )
        except Exception:
            return Response(
                text=traceback.format_exc(),
                status=aiohttp.web.HTTPInternalServerError.status_code,
            )

        return Response(
            text=json.dumps(dataclasses.asdict(resp)),
            content_type="application/json",
            status=aiohttp.web.HTTPOk.status_code,
        )
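
Scaled down to a self-contained sketch, the failure mode looks like this. Here the timeout is shortened from the real value, and `get_target_agent` is replaced by a stub that never completes, which appears to be what happens when no job agent ever registers with the head:

```python
import asyncio

WAIT_AVAILABLE_AGENT_TIMEOUT = 0.1  # scaled down from the real ~300 s for the demo

async def get_target_agent():
    # Stand-in for JobHead.get_target_agent(): if no agent ever becomes
    # available, this coroutine simply never finishes.
    await asyncio.Event().wait()

async def main():
    try:
        await asyncio.wait_for(
            get_target_agent(), timeout=WAIT_AVAILABLE_AGENT_TIMEOUT
        )
    except asyncio.TimeoutError:
        # Mirrors the handler above: the client sees a 500 with this text.
        return 500, "No available agent to submit job, please try again later."

print(asyncio.run(main()))
```

So the 500 is not the root cause; it is only the symptom of `get_target_agent()` never resolving on this machine.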

The dashboard.log shows the following:

2025-03-18 14:25:31,498	INFO head.py:303 -- http server initialized at 127.0.0.1:8265
2025-03-18 14:25:31,522	INFO metrics_head.py:435 -- Generated prometheus and grafana configurations in: /root/.cache/ray/session_2025-03-18_14-25-26_718918_2997427/metrics
2025-03-18 14:25:31,525	INFO event_utils.py:130 -- Monitor events logs modified after 1742277330.4554777 on /root/.cache/ray/session_2025-03-18_14-25-26_718918_2997427/logs/events, the source types are all.
2025-03-18 14:25:31,526	INFO usage_stats_head.py:201 -- Usage reporting is enabled.
2025-03-18 14:25:31,534	INFO actor_head.py:136 -- Getting all actor info from GCS.
2025-03-18 14:25:31,535	INFO actor_head.py:153 -- Received 0 actor info from GCS.
2025-03-18 14:26:15,944	INFO web_log.py:214 -- 127.0.0.1 [18/Mar/2025:14:26:15 +0800] 'GET /api/version HTTP/1.1' 200 322 bytes 532 us '-' 'python-requests/2.32.3'
2025-03-18 14:26:15,946	INFO web_log.py:214 -- 127.0.0.1 [18/Mar/2025:14:26:15 +0800] 'GET /api/version HTTP/1.1' 200 322 bytes 217 us '-' 'python-requests/2.32.3'
2025-03-18 14:26:15,952	INFO web_log.py:214 -- 127.0.0.1 [18/Mar/2025:14:26:15 +0800] 'GET /api/packages/gcs/_ray_pkg_0a62fe483960e4e9.zip HTTP/1.1' 404 219 bytes 1247 us '-' 'python-requests/2.32.3'
2025-03-18 14:26:15,955	INFO job_head.py:357 -- Uploading package gcs://_ray_pkg_0a62fe483960e4e9.zip to the GCS.
2025-03-18 14:26:15,956	INFO packaging.py:367 -- Pushing file package 'gcs://_ray_pkg_0a62fe483960e4e9.zip' (0.00MiB) to Ray cluster...
2025-03-18 14:26:15,956	INFO packaging.py:380 -- Successfully pushed file package 'gcs://_ray_pkg_0a62fe483960e4e9.zip'.
2025-03-18 14:26:15,957	INFO web_log.py:214 -- 127.0.0.1 [18/Mar/2025:14:26:15 +0800] 'PUT /api/packages/gcs/_ray_pkg_0a62fe483960e4e9.zip HTTP/1.1' 200 112 bytes 1189 us '-' 'python-requests/2.32.3'
2025-03-18 14:26:31,531	INFO web_log.py:214 -- 127.0.0.1 [18/Mar/2025:14:26:31 +0800] 'GET /api/grafana_health HTTP/1.1' 500 437 bytes 1986 us '-' 'python-requests/2.32.3'
2025-03-18 14:26:31,534	INFO web_log.py:214 -- 127.0.0.1 [18/Mar/2025:14:26:31 +0800] 'GET /api/prometheus_health HTTP/1.1' 500 439 bytes 1227 us '-' 'python-requests/2.32.3'
2025-03-18 14:27:12,764	INFO usage_stats_head.py:145 -- Usage report request failed. HTTPSConnectionPool(host='usage-stats.ray.io', port=443): Max retries exceeded with url: / (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f43fa58fbd0>, 'Connection to usage-stats.ray.io timed out. (connect timeout=10)'))
2025-03-18 14:27:20,769	INFO web_log.py:214 -- 127.0.0.1 [18/Mar/2025:14:27:20 +0800] 'GET /api/grafana_health HTTP/1.1' 500 437 bytes 1681 us '-' 'python-requests/2.32.3'
2025-03-18 14:27:20,771	INFO web_log.py:214 -- 127.0.0.1 [18/Mar/2025:14:27:20 +0800] 'GET /api/prometheus_health HTTP/1.1' 500 439 bytes 1092 us '-' 'python-requests/2.32.3'
2025-03-18 14:28:00,882	INFO usage_stats_head.py:145 -- Usage report request failed. HTTPSConnectionPool(host='usage-stats.ray.io', port=443): Max retries exceeded with url: / (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f43fa5ac8d0>, 'Connection to usage-stats.ray.io timed out. (connect timeout=10)'))
2025-03-18 14:31:16,345	INFO web_log.py:214 -- 127.0.0.1 [18/Mar/2025:14:26:15 +0800] 'POST /api/jobs/ HTTP/1.1' 500 230 bytes 300386786 us '-' 'python-requests/2.32.3'

As the last line shows, the `POST /api/jobs/` request took about 300 seconds (300386786 µs) before returning HTTP 500. That matches the 5-minute hang and suggests the `asyncio.wait_for` in the handler above hit `WAIT_AVAILABLE_AGENT_TIMEOUT` because no job agent ever became available.
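
The aiohttp access log reports the request duration in microseconds, so the correspondence with the 5-minute hang can be checked directly:

```python
# The access log reports 300386786 us for the failed POST /api/jobs/.
elapsed_us = 300_386_786
elapsed_s = elapsed_us / 1_000_000
print(f"{elapsed_s:.1f} s")  # roughly 300 s, i.e. the observed 5-minute hang
```

This lines up with a 300-second agent-wait timeout (my assumption about the configured value, based on the hang duration) expiring before the 500 is returned.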