The current ActorPool does not support actor preemption well: when a machine is preempted, the actor is killed and the entire job dies.
I would like to develop an ActorPool that is more resilient to preemption: one that detects dying actors, returns their failed jobs to the queue, and instantiates replacement actors. I have a general idea of how to approach that, but I don't have a good way to simulate preemption, short of the very time-consuming and cumbersome process of stopping the GCP machine. What would be a realistic way to simulate preemption during development?
import time

from ray.cluster_utils import Cluster

c = Cluster()
# The first node added is always the head node.
c.add_node(num_cpus=4)
# Add 4 worker nodes.
worker_nodes = []
for _ in range(4):
    worker_nodes.append(c.add_node(num_cpus=4))
# Wait until all nodes are ready.
c.wait_for_nodes()
while True:
    time.sleep(1)
    # Simulate preemption: kill the oldest worker node, then add a replacement.
    preempted_node = worker_nodes.pop(0)
    c.remove_node(preempted_node)
    worker_nodes.append(c.add_node(num_cpus=4))
You can then run a separate script (the driver) that connects to this fake cluster with ray.init(address='auto').
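On the driver side, the recovery logic from the question (detect a dead actor, requeue its job, start a replacement) can be prototyped before wiring it to Ray. The following is only a rough, Ray-free sketch of that bookkeeping. WorkerDied, FlakyWorker, and run_with_requeue are hypothetical names invented for illustration, not part of Ray's API; in a real driver the exception would be whatever error Ray raises when an actor's node disappears.

```python
import random
from collections import deque


class WorkerDied(Exception):
    """Hypothetical stand-in for the error seen when an actor's node is preempted."""


class FlakyWorker:
    """Hypothetical worker that dies with probability `death_rate` on each job."""

    def __init__(self, rng, death_rate):
        self.rng = rng
        self.death_rate = death_rate

    def run(self, job):
        if self.rng.random() < self.death_rate:
            raise WorkerDied()
        return job * job


def run_with_requeue(jobs, make_worker, pool_size=2):
    """Run every job to completion, replacing workers that die mid-job."""
    queue = deque(jobs)
    workers = [make_worker() for _ in range(pool_size)]
    results = {}
    replacements = 0
    turn = 0
    while queue:
        job = queue.popleft()
        slot = turn % pool_size  # simple round-robin over the pool
        turn += 1
        try:
            results[job] = workers[slot].run(job)
        except WorkerDied:
            queue.append(job)              # return the failed job to the queue
            workers[slot] = make_worker()  # instantiate a replacement worker
            replacements += 1
    return results, replacements


# A seeded RNG keeps the simulated failures reproducible between runs.
rng = random.Random(0)
results, replacements = run_with_requeue(
    range(10), lambda: FlakyWorker(rng, death_rate=0.3)
)
print(sorted(results.items()), replacements)
```

Once this loop behaves as expected, FlakyWorker can be swapped for real actors, with the fake cluster script above supplying the preemptions.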