Hi, I’m trying to launch a Ray cluster, ultimately with one head node and one worker node. For now, I can’t even get it to start.
I launch it with ray start and then run ray status. I’m giving the node 1 core. I assume I need to pass an address to ray status because I’m using a nonstandard temp dir.
This is Ray 2.33 with Python 3.11.9.
(ray) [jlquinn@cccxl005 madlad400-multigpu-translate]$ ray start --head --dashboard-host 0.0.0.0 --temp-dir /tmp/ray/jlquinn
Usage stats collection is enabled. To disable this, add `--disable-usage-stats` to the command that starts the cluster, or run the following command: `ray disable-usage-stats` before starting the cluster. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details.
Local node IP: 9.47.193.65
--------------------
Ray runtime started.
--------------------
Next steps
To add another node to this Ray cluster, run
ray start --address='9.47.193.65:6379'
To connect to this Ray cluster:
import ray
ray.init()
To submit a Ray job using the Ray Jobs CLI:
RAY_ADDRESS='http://9.47.193.65:8265' ray job submit --working-dir . -- python my_script.py
See https://docs.ray.io/en/latest/cluster/running-applications/job-submission/index.html
for more information on submitting Ray jobs to the Ray cluster.
To terminate the Ray runtime, run
ray stop
To view the status of the cluster, use
ray status
To monitor and debug Ray, view the dashboard at
9.47.193.65:8265
If connection to the dashboard fails, check your firewall settings and network configuration.
(ray) [jlquinn@cccxl005 madlad400-multigpu-translate]$ ray status
Traceback (most recent call last):
File "/dccstor/jlquinn01/miniforge3/envs/ray/bin/ray", line 8, in <module>
sys.exit(main())
^^^^^^
File "/dccstor/jlquinn01/miniforge3/envs/ray/lib/python3.11/site-packages/ray/scripts/scripts.py", line 2615, in main
return cli()
^^^^^
File "/dccstor/jlquinn01/miniforge3/envs/ray/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/dccstor/jlquinn01/miniforge3/envs/ray/lib/python3.11/site-packages/click/core.py", line 1078, in main
rv = self.invoke(ctx)
^^^^^^^^^^^^^^^^
File "/dccstor/jlquinn01/miniforge3/envs/ray/lib/python3.11/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/dccstor/jlquinn01/miniforge3/envs/ray/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/dccstor/jlquinn01/miniforge3/envs/ray/lib/python3.11/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/dccstor/jlquinn01/miniforge3/envs/ray/lib/python3.11/site-packages/ray/scripts/scripts.py", line 1995, in status
address = services.canonicalize_bootstrap_address_or_die(address)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/dccstor/jlquinn01/miniforge3/envs/ray/lib/python3.11/site-packages/ray/_private/services.py", line 584, in canonicalize_bootstrap_address_or_die
raise ConnectionError(
ConnectionError: Could not find any running Ray instance. Please specify the one to connect to by setting the `--address` flag or `RAY_ADDRESS` environment variable.
(ray) [jlquinn@cccxl005 madlad400-multigpu-translate]$ ray status --address='9.47.193.65:6379'
No cluster status. It may take a few seconds for the Ray internal services to start up.
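Since the message above says the internal services may need a few seconds, one thing worth trying is polling ray status rather than running it once. Below is a minimal sketch of that idea, not a confirmed fix. The helper names wait_for_output and ray_status_when_ready, the attempt count, and the delay are illustrative assumptions of mine, not part of Ray. The only Ray pieces used are the ray status command and its --address flag, exactly as in the command above.

```python
import subprocess
import time


def wait_for_output(cmd, not_ready_marker, attempts=10, delay=3.0):
    """Re-run cmd until it exits with code 0 and its output lacks not_ready_marker.

    Returns the combined stdout/stderr of the first successful run,
    or None if every attempt failed.
    """
    for attempt in range(attempts):
        result = subprocess.run(cmd, capture_output=True, text=True)
        output = result.stdout + result.stderr
        if result.returncode == 0 and not_ready_marker not in output:
            return output
        # Sleep only between attempts, not after the last one.
        if attempt < attempts - 1:
            time.sleep(delay)
    return None


def ray_status_when_ready(address, attempts=10, delay=3.0):
    """Hypothetical wrapper: poll `ray status --address=<address>`.

    Treats the "No cluster status" text seen in the output above as
    "not ready yet"; that interpretation is an assumption.
    """
    cmd = ["ray", "status", f"--address={address}"]
    return wait_for_output(cmd, "No cluster status", attempts, delay)
```

Calling ray_status_when_ready("9.47.193.65:6379") with the address printed by ray start would retry for roughly half a minute with these defaults. If it still returns None, the problem is probably not just startup latency.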
How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.