Help with starting a local Ray cluster?

Hi, I’m trying to launch a Ray cluster, ultimately with a head node and a worker node. For now, I can’t even get the head node to start.

I launch with ray start, then run ray status. I’m giving the node 1 core. I assume ray status needs an explicit --address because I’m using a nonstandard temp dir.

This is Ray 2.33 with Python 3.11.9.

(ray) [jlquinn@cccxl005 madlad400-multigpu-translate]$ ray start --head --dashboard-host 0.0.0.0 --temp-dir /tmp/ray/jlquinn
Usage stats collection is enabled. To disable this, add `--disable-usage-stats` to the command that starts the cluster, or run the following command: `ray disable-usage-stats` before starting the cluster. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details.

Local node IP: 9.47.193.65

--------------------
Ray runtime started.
--------------------

Next steps
  To add another node to this Ray cluster, run
    ray start --address='9.47.193.65:6379'
  
  To connect to this Ray cluster:
    import ray
    ray.init()
  
  To submit a Ray job using the Ray Jobs CLI:
    RAY_ADDRESS='http://9.47.193.65:8265' ray job submit --working-dir . -- python my_script.py
  
  See https://docs.ray.io/en/latest/cluster/running-applications/job-submission/index.html 
  for more information on submitting Ray jobs to the Ray cluster.
  
  To terminate the Ray runtime, run
    ray stop
  
  To view the status of the cluster, use
    ray status
  
  To monitor and debug Ray, view the dashboard at 
    9.47.193.65:8265
  
  If connection to the dashboard fails, check your firewall settings and network configuration.
(ray) [jlquinn@cccxl005 madlad400-multigpu-translate]$ ray status
Traceback (most recent call last):
  File "/dccstor/jlquinn01/miniforge3/envs/ray/bin/ray", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/dccstor/jlquinn01/miniforge3/envs/ray/lib/python3.11/site-packages/ray/scripts/scripts.py", line 2615, in main
    return cli()
           ^^^^^
  File "/dccstor/jlquinn01/miniforge3/envs/ray/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/dccstor/jlquinn01/miniforge3/envs/ray/lib/python3.11/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/dccstor/jlquinn01/miniforge3/envs/ray/lib/python3.11/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/dccstor/jlquinn01/miniforge3/envs/ray/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/dccstor/jlquinn01/miniforge3/envs/ray/lib/python3.11/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/dccstor/jlquinn01/miniforge3/envs/ray/lib/python3.11/site-packages/ray/scripts/scripts.py", line 1995, in status
    address = services.canonicalize_bootstrap_address_or_die(address)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/dccstor/jlquinn01/miniforge3/envs/ray/lib/python3.11/site-packages/ray/_private/services.py", line 584, in canonicalize_bootstrap_address_or_die
    raise ConnectionError(
ConnectionError: Could not find any running Ray instance. Please specify the one to connect to by setting the `--address` flag or `RAY_ADDRESS` environment variable.
(ray) [jlquinn@cccxl005 madlad400-multigpu-translate]$ ray status --address='9.47.193.65:6379'
No cluster status. It may take a few seconds for the Ray internal services to start up.

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

I haven’t yet gotten to adding the real worker node because I can’t get past this point.

As far as I can tell, running ray status with --address does not produce an error; it just reports that the cluster has no status yet. Try opening the cluster dashboard in a browser, or check whether the ports are reachable with nmap (there’s an example here).
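If nmap isn't installed, a similar reachability check can be sketched with Python's standard socket module. This is only a rough sketch: port_is_open is a hypothetical helper name, and the ports 6379 and 8265 are the ones printed in the ray start output above. A successful TCP connect only shows that something is listening, not that Ray itself is healthy.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection performs the TCP handshake and raises OSError
        # (refused, unreachable, timed out) if it cannot connect.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Example calls (addresses taken from the ray start output above):
#   port_is_open("9.47.193.65", 6379)   # GCS port
#   port_is_open("9.47.193.65", 8265)   # dashboard port
```

Running it from both the head node and the machine that will become the worker shows whether a firewall sits between them.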

It does seem very strange that your local node IP is a public IP address. A Ray cluster should run in a controlled network environment; otherwise anyone who can reach it can execute arbitrary code on it. However, a network misconfiguration could mean that the address the node detects for itself is wrong, so you could try overriding the head node IP address with --node-ip-address 127.0.0.1 and see whether that makes a difference.
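To see how an address such as the one above is classified, here is a minimal sketch using Python's standard ipaddress module. describe_ip is a hypothetical helper, and the labels it returns are my own; the classification itself comes from the module's built-in range tables.

```python
import ipaddress


def describe_ip(addr: str) -> str:
    """Classify an IP address as loopback, private, public, or other."""
    ip = ipaddress.ip_address(addr)
    if ip.is_loopback:
        return "loopback"   # e.g. 127.0.0.1
    if ip.is_private:
        return "private"    # e.g. 10.0.0.0/8, 192.168.0.0/16
    if ip.is_global:
        return "public"     # globally routable range
    return "other"          # multicast, shared address space, and so on


print(describe_ip("9.47.193.65"))
print(describe_ip("127.0.0.1"))
```

A "public" result only says the address lies in a globally routable range; whether the host is actually reachable from outside depends on the surrounding network.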
