How severe does this issue affect your experience of using Ray?
High: It blocks me to complete my task.
I am setting up a Ray cluster on two servers within the same internal network. The IP addresses of the two servers are IP1 and IP2. When I start the head node with the command ray start --head --port=6379 and then start the worker node with ray start --address='IP1:6379', initially, the ray status command shows that there are two active nodes. However, after a minute, one of the nodes disappears and only the head node remains. What could be the cause of this issue? Additionally, I noticed that there is no cluster node information on the dashboard. How should I troubleshoot this problem?
Do you have an autoscaler set up? The cluster YAML files normally set a timeout for inactive nodes with the “idle_timeout_minutes” key. This looks like “intended” behavior, because the node doesn’t show up under “recent failures”.
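To tell whether Ray itself marked the worker as dead, rather than the autoscaler scaling it down, one option is to look at what ray.nodes() reports from the head node. The sketch below is only an illustration: split_by_liveness is a hypothetical helper name, the sample data is made up, and it assumes each entry carries the "Alive" and "NodeManagerAddress" fields that ray.nodes() returns.

```python
from typing import Any, Dict, List


def split_by_liveness(nodes: List[Dict[str, Any]]) -> Dict[str, List[str]]:
    """Group node addresses by the "Alive" flag found in ray.nodes() entries."""
    result: Dict[str, List[str]] = {"alive": [], "dead": []}
    for node in nodes:
        # Fall back to a placeholder if the address field is missing.
        address = node.get("NodeManagerAddress", "<unknown>")
        key = "alive" if node.get("Alive") else "dead"
        result[key].append(address)
    return result


# Hypothetical sample shaped like ray.nodes() output; IP1/IP2 are the
# placeholders from the question, not real addresses.
sample = [
    {"NodeManagerAddress": "IP1", "Alive": True},
    {"NodeManagerAddress": "IP2", "Alive": False},
]
print(split_by_liveness(sample))

# On the head node, with Ray installed, the real data would come from:
#   import ray
#   ray.init(address="auto")
#   print(split_by_liveness(ray.nodes()))
```

If the worker shows up under "dead" about a minute after joining, that points more towards a connectivity or heartbeat problem between the two servers than towards an idle timeout.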
To configure idle_timeout_minutes for the autoscaler through the ray start command, add the --autoscaling-config flag to the ray start line in the head_start_ray_commands or worker_start_ray_commands section of your configuration file. The flag should point to a YAML file that contains the autoscaler configuration, including the idle_timeout_minutes setting. For example:
head_start_ray_commands:
    - ray stop
    - ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=/path/to/autoscaler_config.yaml --dashboard-host=0.0.0.0
worker_start_ray_commands:
    - ray stop
    - ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076 --autoscaling-config=/path/to/autoscaler_config.yaml
The autoscaler_config.yaml file might look like this:
idle_timeout_minutes: 10
This will configure the autoscaler to remove idle worker nodes after they have been idle for 10 minutes.
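As a rough illustration of what an idle timeout means, here is a minimal sketch of the decision rule. It is not Ray's autoscaler code: the function name, the node names, and the timestamps are all hypothetical, and it assumes "idle" simply means "no activity since a recorded timestamp".

```python
from typing import Dict, List


def nodes_past_idle_timeout(
    last_active: Dict[str, float],
    now: float,
    idle_timeout_minutes: float,
) -> List[str]:
    """Return the nodes whose idle time has reached the timeout, sorted by name."""
    timeout_seconds = idle_timeout_minutes * 60
    return sorted(
        node
        for node, timestamp in last_active.items()
        if now - timestamp >= timeout_seconds
    )


# Hypothetical timestamps in seconds: worker-a was last active at t=0,
# worker-b at t=500, and the check runs at t=600 with a 10 minute timeout.
last_active = {"worker-a": 0.0, "worker-b": 500.0}
print(nodes_past_idle_timeout(last_active, now=600.0, idle_timeout_minutes=10))
```

With these numbers only worker-a has been idle for the full 10 minutes, so it is the only candidate for removal. A worker that vanishes after about one minute would need a much shorter timeout than 10 for this rule to explain it.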
If you use the Ray cluster launcher, the same idle_timeout_minutes key goes at the top level of the cluster YAML that you pass to ray up. The launcher copies that file to the head node as ray_bootstrap_config.yaml, which is the file --autoscaling-config typically points to in the launcher's start commands.