Simulating preemption while developing?

The current ActorPool does not support actor preemption well: when a machine is preempted, the actor is killed and the entire job dies.

I would like to develop an ActorPool that is more resilient to preemption (identifying actors that die, returning their failed jobs to the queue, and instantiating a replacement actor). I think I have the general idea of how to approach that, but what I don't have is a good way to simulate preemption, short of the very time-consuming and cumbersome process of stopping the GCP machine. What would be a realistic way to simulate preemption during development?

When we simulate multi-node clusters, we use ray/cluster_utils.py from the ray-project/ray repository on GitHub (note that this is not a public API). You can simulate preempted nodes like this:

import time

from ray.cluster_utils import Cluster

c = Cluster()
# The first node is always a head node
c.add_node(num_cpus=4)
# 4 worker nodes
worker_nodes = []
for _ in range(4):
    worker_nodes.append(c.add_node(num_cpus=4))
# Wait until all nodes are ready.
c.wait_for_nodes()

while True:
    time.sleep(1)
    # Simulate preemption: remove the oldest worker node, then add a replacement.
    preempted_node = worker_nodes.pop(0)
    c.remove_node(preempted_node)
    worker_nodes.append(c.add_node(num_cpus=4))

You can then create another script (the driver) that connects to this fake cluster using ray.init(address="auto").
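For the driver side, here is a minimal sketch of the recovery loop described in the question (detect a dead worker, put its job back on the queue, start a replacement). It is plain Python so the control flow can be tried without a cluster. Everything in it is hypothetical: `WorkerDied`, `FlakyWorker`, and `run_with_requeue` are illustration names, not Ray APIs. In a real driver, `WorkerDied` would presumably be `ray.exceptions.RayActorError` (raised by `ray.get` on a call to a dead actor), and `make_worker` would create a new actor.

```python
import collections
import random


class WorkerDied(Exception):
    """Hypothetical stand-in for the error seen when an actor's node is preempted."""


class FlakyWorker:
    """Hypothetical worker that dies permanently after 1 to 3 successful calls."""

    def __init__(self, rng):
        self._calls_left = rng.randint(1, 3)

    def process(self, item):
        if self._calls_left == 0:
            raise WorkerDied()
        self._calls_left -= 1
        return item * item


def run_with_requeue(items, make_worker, num_workers=2):
    """Process every item, requeueing any item whose worker died."""
    queue = collections.deque(items)
    workers = collections.deque(make_worker() for _ in range(num_workers))
    results = {}
    replacements = 0
    while queue:
        item = queue.popleft()
        worker = workers.popleft()
        try:
            results[item] = worker.process(item)
            workers.append(worker)         # healthy worker goes back into rotation
        except WorkerDied:
            queue.append(item)             # return the failed job to the queue
            workers.append(make_worker())  # instantiate a replacement worker
            replacements += 1
    return results, replacements


rng = random.Random(0)
results, replacements = run_with_requeue(range(10), lambda: FlakyWorker(rng))
print(len(results), "items processed,", replacements, "workers replaced")
```

The sketch handles one item at a time in round-robin order to keep the requeue logic visible; a real pool would keep several calls in flight and would probably cap the number of retries per item.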
