How severe does this issue affect your experience of using Ray?
Medium: It contributes to significant difficulty to complete my task, but I can work around it.
Hi,
I am manually setting up a cluster of 4 nodes by running ray start, and all 4 nodes get removed after one node becomes idle. Here is the log:
2023-03-07 06:00:15,234 INFO load_metrics.py:163 -- LoadMetrics: Removed ip: 3750ee9c62631c3efbb48b2a9c52471e7d1f3f36425e922eb2d22a20.
2023-03-07 06:00:15,234 INFO load_metrics.py:163 -- LoadMetrics: Removed ip: 3063e53fed3ac900fbbdd2df89936d6df5eaaf6b1aa424a52c16d44e.
2023-03-07 06:00:15,234 INFO load_metrics.py:163 -- LoadMetrics: Removed ip: 0bd1670240260db79954c8c8d8f15a7fba6b2825be123f9726a17c4c.
2023-03-07 06:00:15,234 INFO load_metrics.py:163 -- LoadMetrics: Removed ip: bbd989be7dfe24ad482333fe669d94c30612629925b8eb174a354bea.
2023-03-07 06:00:15,234 INFO load_metrics.py:169 -- LoadMetrics: Removed 4 stale ip mappings: {'3750ee9c62631c3efbb48b2a9c52471e7d1f3f36425e922eb2d22a20', '3063e53fed3ac900fbbdd2df89936d6df5eaaf6b1aa424a52c16d44e', '0bd1670240260db79954c8c8d8f15a7fba6b2825be123f9726a17c4c', 'bbd989be7dfe24ad482333fe669d94c30612629925b8eb174a354bea'} not in set()
This leads to two questions:
How can I prevent nodes from becoming idle and being removed from the cluster? Or
Is there any configuration I could set, or any guidance I could follow, to build a non-autoscaling cluster (mentioned on the "Configuring Autoscaling" page of the Ray 2.3.0 docs, but without guidance)?
+1 to Jules’ question. In particular, I’m also worried that the cause and effect may be flipped here. The logs you’ve shared show that the autoscaler has acknowledged that the nodes have disappeared; they don’t mean that the autoscaler removed them.
Those logs are admittedly potentially confusing and should be at a debug level instead.
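One way to tell whether the nodes really left the cluster, rather than the autoscaler only dropping its bookkeeping for them, is to ask Ray directly which nodes it considers alive. Below is a minimal sketch, assuming it is run on the head node of the already running cluster and that the records returned by ray.nodes() carry the NodeID, NodeManagerAddress, and Alive fields. The helper names summarize_nodes and check_cluster are made up for illustration and are not part of Ray.

```python
from typing import Any, Dict, List


def summarize_nodes(nodes: List[Dict[str, Any]]) -> Dict[str, List[str]]:
    """Split node records into alive/dead lists of 'shortid@address' labels."""
    summary: Dict[str, List[str]] = {"alive": [], "dead": []}
    for node in nodes:
        # Shorten the node ID to 8 characters so the output stays readable.
        label = f"{str(node.get('NodeID', '?'))[:8]}@{node.get('NodeManagerAddress', '?')}"
        key = "alive" if node.get("Alive") else "dead"
        summary[key].append(label)
    return summary


def check_cluster(address: str = "auto") -> Dict[str, List[str]]:
    """Connect to an existing cluster and report which nodes Ray sees as alive."""
    import ray  # imported here so summarize_nodes stays usable without Ray installed

    ray.init(address=address)  # "auto" attaches to the cluster running on this machine
    try:
        return summarize_nodes(ray.nodes())
    finally:
        ray.shutdown()
```

Calling check_cluster() shortly after the "Removed ip" lines appear should show whether the 4 nodes are still listed as alive. If they are, the log lines are likely just noise. If they are listed as dead, the raylet logs on those machines would be the next place to look.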
I just tried setting idle_timeout_minutes: 999999 in the YAML config and it did not work. All 4 nodes became idle after being inactive for a few hours. I will try @Lars_Simon_Zehnder's suggestion and let you know the result.
Also, to confirm: above you mentioned using ray start to start your nodes, but with this type of config your top-level command should be ray up config.yaml (which will call ray start under the hood).