1. Severity of the issue: (select one)
None: I’m just curious or want clarification.
Low: Annoying but doesn’t hinder my work.
Medium: Significantly affects my productivity but can find a workaround.
High: Completely blocks me.
2. Environment:
- Ray version: 2.43.0
- Python version: 3.11.11
- OS: CentOS
- Cloud/Infrastructure: None
- Other libs/tools (if relevant): torch
3. Repro steps / sample code: (optional, but helps a lot!)
ray start --head --node-ip-address=127.0.0.1 --port=6379 --dashboard-host=127.0.0.1 --dashboard-port=8265 --num-gpus 8 --temp-dir ~/.cache/ray
ray list nodes
RAY_ADDRESS="http://127.0.0.1:8265" ray job submit --verbose --working-dir test -- python test.py
4. What happened vs. what you expected:
- Expected: The job is submitted successfully.
- Actual: It logs that it is creating a file package for test (packaging.py, line 575) and then hangs for 5 minutes, after which it fails with the error below. I can submit jobs successfully on other machines using the same configuration. I wonder whether this is related to this machine having more network interfaces (many network cards and IB cards). BTW, when I skip ray job submit and instead call ray.init() in the Python code and run the script directly, the job shows up in ray list jobs. I don’t know why the Ray Jobs API doesn’t work here.
Job submission server address: http://127.0.0.1:8265
2025-03-18 13:44:30,466 INFO dashboard_sdk.py:338 -- Uploading package gcs://_ray_pkg_0a62fe483960e4e9.zip.
2025-03-18 13:44:30,466 INFO packaging.py:575 -- Creating a file package for local module 'test'.
Traceback (most recent call last):
File "/root/nfs/anaconda3/envs/lmm-r1/bin/ray", line 8, in <module>
sys.exit(main())
^^^^^^
File "/root/nfs/anaconda3/envs/lmm-r1/lib/python3.11/site-packages/ray/scripts/scripts.py", line 2691, in main
return cli()
^^^^^
File "/root/nfs/anaconda3/envs/lmm-r1/lib/python3.11/site-packages/click/core.py", line 1161, in __call__
return self.main(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/nfs/anaconda3/envs/lmm-r1/lib/python3.11/site-packages/click/core.py", line 1082, in main
rv = self.invoke(ctx)
^^^^^^^^^^^^^^^^
File "/root/nfs/anaconda3/envs/lmm-r1/lib/python3.11/site-packages/click/core.py", line 1697, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/nfs/anaconda3/envs/lmm-r1/lib/python3.11/site-packages/click/core.py", line 1697, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/nfs/anaconda3/envs/lmm-r1/lib/python3.11/site-packages/click/core.py", line 1443, in invoke
return ctx.invoke(self.callback, **ctx.params)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/nfs/anaconda3/envs/lmm-r1/lib/python3.11/site-packages/click/core.py", line 788, in invoke
return __callback(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/nfs/anaconda3/envs/lmm-r1/lib/python3.11/site-packages/ray/dashboard/modules/job/cli_utils.py", line 54, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/root/nfs/anaconda3/envs/lmm-r1/lib/python3.11/site-packages/ray/autoscaler/_private/cli_logger.py", line 823, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/root/nfs/anaconda3/envs/lmm-r1/lib/python3.11/site-packages/ray/dashboard/modules/job/cli.py", line 276, in submit
job_id = client.submit_job(
^^^^^^^^^^^^^^^^^^
File "/root/nfs/anaconda3/envs/lmm-r1/lib/python3.11/site-packages/ray/dashboard/modules/job/sdk.py", line 250, in submit_job
self._raise_error(r)
File "/root/nfs/anaconda3/envs/lmm-r1/lib/python3.11/site-packages/ray/dashboard/modules/dashboard_sdk.py", line 283, in _raise_error
raise RuntimeError(
RuntimeError: Request failed with status code 500: No available agent to submit job, please try again later..
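To help check the network-interface theory, here is a minimal sketch of a diagnostic that could be run on the affected machine. It uses only the Python standard library and no Ray API. The helper names resolve_ipv4 and port_open are hypothetical, chosen for illustration, and the ports 6379 and 8265 are simply the ones from the ray start command above. It only shows which IPv4 addresses the hostname resolves to and whether those two ports accept TCP connections on 127.0.0.1; it does not by itself explain the "No available agent" error.

```python
import socket


def resolve_ipv4(hostname):
    """Return the sorted, de-duplicated IPv4 addresses that `hostname` resolves to."""
    try:
        infos = socket.getaddrinfo(hostname, None, family=socket.AF_INET)
    except socket.gaierror:
        # Resolution failed; report no addresses instead of raising.
        return []
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


def port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    hostname = socket.gethostname()
    print("hostname:", hostname)
    print("hostname resolves to:", resolve_ipv4(hostname))
    # Assumption: these are the GCS and dashboard ports passed to `ray start` above.
    for port in (6379, 8265):
        print(f"127.0.0.1:{port} reachable:", port_open("127.0.0.1", port))
```

If the hostname resolves to an address on one of the extra network cards rather than 127.0.0.1, that would at least be a difference worth comparing against the machines where job submission works.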
- I can provide more information if needed. Thanks.