How severely does this issue affect your experience of using Ray?
- Low: It annoys or frustrates me for a moment.
I'm iterating over different worker-cluster sizes for scaling experiments: I keep the same head node, but spin down worker nodes in bulk and then add new worker nodes again.
`ray.nodes()` shows every node I've ever used during my experiments (>1k entries), and almost all of them are marked `DEAD`.
Also, the python-core-worker logs are extremely spammy; for example, this block is repeated ~1k times on each node:
```
[2024-02-06 02:23:04,390 I 6199 6229] core_worker.cc:318: Node failure from 0bfd7f59b1f71a69ec61149377f62a8580ae61d9a277447f552ec0a0. All objects pinned on that node will be lost if object reconstruction is not enabled.
[2024-02-06 02:23:04,390 W 6199 6229] core_worker.cc:4401: Node change state to DEAD but num_alive_node is 0.
[2024-02-06 02:23:04,390 I 6199 6229] accessor.cc:627: Received notification for node id = 027a19da31942f5ecdc958bcd5422e3530bfdafa2ac3bfd0433e1471, IsAlive = 0
[2024-02-06 02:23:04,390 I 6199 6229] core_worker.cc:318: Node failure from 027a19da31942f5ecdc958bcd5422e3530bfdafa2ac3bfd0433e1471. All objects pinned on that node will be lost if object reconstruction is not enabled.
[2024-02-06 02:23:04,390 W 6199 6229] core_worker.cc:4401: Node change state to DEAD but num_alive_node is 0.
[2024-02-06 02:23:04,390 I 6199 6229] accessor.cc:627: Received notification for node id = b410d01d3613e4e71228d74ae83ccf03d10e4357de21e37684bd234f, IsAlive = 0
```
I wonder whether having this many dead-node entries adds extra overhead to Ray.
My main question: is there a way to clear/prune `DEAD` nodes from Ray?