Clear DEAD nodes?

How severely does this issue affect your experience of using Ray?

  • Low: It annoys or frustrates me for a moment.

I’m iterating over various worker-cluster sizes for scaling experiments.

I keep the same head node, but tear down worker nodes in bulk and then add new worker nodes again.

ray.nodes() shows all nodes I’ve ever used during my experiments (> 1k), almost all are marked DEAD.
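
For reference, this is how I’m counting them (a minimal sketch; it only assumes the documented "Alive" field of the dicts returned by ray.nodes()):

import ray
from collections import Counter

ray.init(address="auto")  # attach to the running cluster

# ray.nodes() returns one dict per node the GCS knows about,
# including every node that has ever joined the cluster.
counts = Counter("ALIVE" if node["Alive"] else "DEAD" for node in ray.nodes())
print(counts)  # on my cluster: far more DEAD entries than ALIVE ones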

Also, the python-core-worker logs are extremely spammy. For example, this block is repeated ~1k times on each node:

[2024-02-06 02:23:04,390 I 6199 6229] core_worker.cc:318: Node failure from 0bfd7f59b1f71a69ec61149377f62a8580ae61d9a277447f552ec0a0. All objects pinned on that node will be lost if object reconstruction is not enabled.
[2024-02-06 02:23:04,390 W 6199 6229] core_worker.cc:4401: Node change state to DEAD but num_alive_node is 0.
[2024-02-06 02:23:04,390 I 6199 6229] accessor.cc:627: Received notification for node id = 027a19da31942f5ecdc958bcd5422e3530bfdafa2ac3bfd0433e1471, IsAlive = 0
[2024-02-06 02:23:04,390 I 6199 6229] core_worker.cc:318: Node failure from 027a19da31942f5ecdc958bcd5422e3530bfdafa2ac3bfd0433e1471. All objects pinned on that node will be lost if object reconstruction is not enabled.
[2024-02-06 02:23:04,390 W 6199 6229] core_worker.cc:4401: Node change state to DEAD but num_alive_node is 0.
[2024-02-06 02:23:04,390 I 6199 6229] accessor.cc:627: Received notification for node id = b410d01d3613e4e71228d74ae83ccf03d10e4357de21e37684bd234f, IsAlive = 0

I wonder whether having this many dead nodes adds extra overhead to Ray.

My main question: Is there a way to clear/prune DEAD nodes from Ray?
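
The closest workaround I’ve found so far is filtering dead nodes out of view rather than actually pruning them; a sketch, assuming the Ray 2.x state API (ray.util.state.list_nodes and its filters parameter; the CLI equivalent would be ray list nodes --filter "state=ALIVE"):

from ray.util.state import list_nodes

# Ask the state API for live nodes only, instead of post-filtering
# the full (mostly dead) list from ray.nodes().
alive_nodes = list_nodes(filters=[("state", "=", "ALIVE")])
print(len(alive_nodes))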

Hi @patflick, it shouldn’t add overhead, aside from some metadata stored in the GCS, which is garbage-collected periodically.
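
If you want to bound that metadata explicitly, the GCS caps how many dead nodes it caches via an internal config; a sketch, assuming the config name maximum_gcs_dead_node_cached_count and the usual RAY_-prefixed environment-variable override (verify both against your Ray version):

# Assumption: this internal config bounds how many dead-node entries
# the GCS retains; it must be set before the head node starts.
RAY_maximum_gcs_dead_node_cached_count=100 ray start --head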

These are just logs; when something bad happens, such as a node dying, we need to notify the workers/raylets.

Ray seems to periodically broadcast the status of all past nodes to all live nodes. Without pruning known dead nodes, that looks like unnecessary overhead. Is there anything I can do to trigger the GC you mentioned for dead nodes? Thanks.

When a node’s status changes, that information is broadcast to all other nodes, which is necessary and important.

So when a node becomes dead, that fact is broadcast to the other nodes. We don’t periodically re-broadcast anything if node status hasn’t changed.

Dead nodes won’t impact cluster performance. We have a release test that kills and adds nodes for 24 hours and sees no issues, and we have Ray Serve tests that kill and add nodes for several days with no issues either.