Simulating preemption while developing?

Yoav · February 21, 2021, 1:18am

The current ActorPool does not support actor preemption well: when a machine is preempted, the actor is killed and the entire job dies.

I would like to develop an ActorPool that is more resilient to preemption: one that detects dying actors, returns their failed jobs to the queue, and instantiates replacement actors. I think I have a general idea of how to approach that, but I lack a good way to simulate preemption, short of the very time-consuming and cumbersome process of stopping the GCP machine. What would be a realistic way to simulate preemption during development?

sangcho · February 21, 2021, 2:05am

When we simulate multi-node clusters, we use ray/cluster_utils.py from the ray-project/ray repository on GitHub (note that this is not a public API). You can simulate preempted nodes like this:

import time

from ray.cluster_utils import Cluster  # not a public API

c = Cluster()
# The first node is always the head node.
c.add_node(num_cpus=4)
# Add 4 worker nodes.
worker_nodes = []
for _ in range(4):
    worker_nodes.append(c.add_node(num_cpus=4))
# Wait until all nodes are ready.
c.wait_for_nodes()

while True:
    time.sleep(1)
    # Simulate preemption: remove the oldest worker node, then add a replacement.
    preempted_node = worker_nodes.pop(0)
    c.remove_node(preempted_node)
    worker_nodes.append(c.add_node(num_cpus=4))

You can then create another script (the driver) that connects to this fake cluster using ray.init(address="auto").
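
For reference, here is a rough sketch of what such a driver could look like. It is only an illustration of the requeue idea from the original question, not an ActorPool replacement: the names run_with_requeue, make_worker, call, died_exc, max_attempts, and the Worker actor are all made up for this example, and it uses a single worker at a time for simplicity. It assumes that calling ray.get on a method of an actor whose node was removed raises ray.exceptions.RayActorError.

```python
from collections import deque


def run_with_requeue(jobs, make_worker, call, died_exc, max_attempts=5):
    """Run every job, requeueing a job whenever its worker dies.

    make_worker() returns a fresh worker handle (hypothetical factory).
    call(worker, job) returns a result, or raises died_exc if the worker is gone.
    Jobs must be hashable because they are used as result keys.
    """
    queue = deque((job, 1) for job in jobs)
    worker = make_worker()
    results = {}
    while queue:
        job, attempt = queue.popleft()
        try:
            results[job] = call(worker, job)
        except died_exc:
            if attempt >= max_attempts:
                raise
            # Return the failed job to the queue and replace the dead worker.
            queue.append((job, attempt + 1))
            worker = make_worker()
    return results


def main():
    # Ray is imported lazily so the helper above can be tested without it.
    import ray
    from ray.exceptions import RayActorError

    # Connect to the already-running fake cluster.
    ray.init(address="auto")

    @ray.remote(num_cpus=1)
    class Worker:  # hypothetical actor, for illustration only
        def work(self, x):
            return x * 2

    results = run_with_requeue(
        jobs=range(100),
        make_worker=Worker.remote,
        call=lambda w, x: ray.get(w.work.remote(x)),
        died_exc=RayActorError,
    )
    print(len(results))
```

Call main() from a second process while the node-removal loop is running. Note that the actor may be scheduled on the head node, which that loop never removes, so you may need several runs (or more actors) before you actually see a simulated preemption.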
