Failed to set up Ray cluster

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

I am setting up a Ray cluster on two servers within the same internal network. The IP addresses of the two servers are IP1 and IP2. I start the head node with ray start --head --port=6379 and then start the worker node with ray start --address='IP1:6379'. At first, ray status shows two active nodes. After about a minute, one of the nodes disappears and only the head node remains. What could be causing this? I also noticed that the dashboard shows no cluster node information. How should I troubleshoot this problem?

Do you have an autoscaler set up? Autoscaler YAML files normally set a timeout for inactive nodes with the “idle_timeout_minutes” key. This looks like “intended” behavior, because the node doesn’t show up under “recent failures”.

However, I did not use a YAML file, but simply ran the ray start command. How do I set idle_timeout_minutes in this case?

To set idle_timeout_minutes for the autoscaler, add the --autoscaling-config flag to the ray start commands in the head_start_ray_commands and worker_start_ray_commands sections of your cluster configuration file. The flag should point to a YAML file that contains the autoscaler configuration, including the idle_timeout_minutes setting. For example:

head_start_ray_commands:
  - ray stop
  - ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=/path/to/autoscaler_config.yaml --dashboard-host=0.0.0.0

worker_start_ray_commands:
  - ray stop
  - ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076 --autoscaling-config=/path/to/autoscaler_config.yaml

The autoscaler_config.yaml file might look like this:

idle_timeout_minutes: 10

This configures the autoscaler to remove worker nodes after they have been idle for 10 minutes.

You can also set the autoscale_idle_timeout_minutes setting in the ray_bootstrap_config.yaml file, which is used to configure the Ray cluster launcher. This setting will be used if the --autoscaling-config flag is not specified.

For more information, see the Ray documentation page on configuring autoscaling: "Cluster YAML Configuration Options" (Ray 3.0.0.dev0).
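
To see which node dropped out of the cluster, one option is to query node state from Python on the head machine. The sketch below is only an illustration and not something confirmed in this thread: summarize_nodes and print_cluster_nodes are hypothetical helper names, and it assumes that ray.nodes() returns one dictionary per node containing the NodeManagerAddress and Alive keys.

```python
from typing import Any, Dict, List


def summarize_nodes(nodes: List[Dict[str, Any]]) -> Dict[str, List[str]]:
    """Group node addresses by whether the node is reported as alive."""
    summary: Dict[str, List[str]] = {"alive": [], "dead": []}
    for node in nodes:
        # Assumed keys: "NodeManagerAddress" (node IP) and "Alive" (bool).
        address = node.get("NodeManagerAddress", "unknown")
        key = "alive" if node.get("Alive") else "dead"
        summary[key].append(address)
    return summary


def print_cluster_nodes() -> None:
    """Attach to the running cluster and print the alive/dead summary."""
    import ray  # imported lazily so summarize_nodes works without Ray

    # address="auto" attaches to a cluster already started with `ray start`.
    ray.init(address="auto")
    print(summarize_nodes(ray.nodes()))
```

Calling print_cluster_nodes() on the head node shortly after the worker joins, and again after it disappears from ray status, should show whether the worker's address moved from the alive list to the dead list.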
