- None: Just asking a question out of curiosity
I am trying to understand how Ray signal hooks work.
I have the following code as an example:
#################################################
import os
import sys
import time

import ray

ray.init()


@ray.remote(max_restarts=-1, max_task_retries=1)
class Actor:
    def __init__(self):
        print("constructor", flush=True)
        self.counter = 0

    def increment_and_possibly_fail(self, err_type):
        print("start task counter" + str(self.counter) + "_type" + str(err_type), flush=True)
        time.sleep(3)
        self.counter += 1
        if err_type == 1:
            print("err_type 1", flush=True)
            # Crash only on the first attempt; the directory marks that a crash already happened.
            if not os.path.exists('tempTaskDir'):
                print("create dir and crash", flush=True)
                os.mkdir('tempTaskDir')
                sys.exit()
                # os._exit(0)
            else:
                print("skip crash", flush=True)
        print("End of task counter" + str(self.counter) + "_type" + str(err_type), flush=True)
        return self.counter


actor = Actor.options(max_concurrency=1).remote()
# Remove the marker directory left over from a previous run.
if os.path.exists('tempTaskDir'):
    os.rmdir('tempTaskDir')
task_ref1 = actor.increment_and_possibly_fail.remote(1)
task_ref2 = actor.increment_and_possibly_fail.remote(2)
# Keep the driver alive long enough to see the actor output.
time.sleep(60)
##############################################
What I expect to happen is the following output:
(Actor pid=25607) constructor
(Actor pid=25607) start task counter0_type1
(Actor pid=25607) err_type 1
(Actor pid=25607) create dir and crash
(Actor pid=25842) constructor
(Actor pid=25842) start task counter0_type1
(Actor pid=25842) err_type 1
(Actor pid=25842) skip crash
(Actor pid=25842) End of task counter1_type1
(Actor pid=25842) start task counter1_type2
(Actor pid=25842) End of task counter2_type2
That is: I run my first remote function on the actor, the actor crashes and restarts, the remote function is retried and passes, and then the second remote function runs.
What I get is the following:
(Actor pid=3973) constructor
(Actor pid=3973) start task counter0_type1
(Actor pid=3973) err_type 1
(Actor pid=3973) create dir and crash
(Actor pid=3973) start task counter1_type2
(Actor pid=3973) End of task counter2_type2
(Actor pid=3973)
(Actor pid=4409) constructor
This is essentially: the first remote function, which is supposed to crash the actor, runs but does not crash it yet; the second remote function runs; then the actor crashes and restarts. The retry of the remote function that crashed never runs.
When I remove the time.sleep(3), I get the same start (the first remote function runs without crashing the actor yet, the second remote function runs, then the actor crashes), but this time the retry of the remote function that caused the crash does run.
When I change sys.exit() to os._exit(0), I get what I expected.
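For reference, here is a minimal sketch of the difference between the two calls in plain Python, without Ray. The helper names `run_child` and `exit_is_catchable` are made up for this illustration. It only shows that sys.exit() raises SystemExit (so `finally` blocks run and calling code can catch it), while os._exit() ends the process immediately. Whether Ray's worker intercepts SystemExit in a similar way is an assumption on my part, not something I have confirmed.

```python
import subprocess
import sys

# Child script: exits from inside try/finally so we can see whether cleanup runs.
CHILD = """
import os, sys
try:
    {exit_call}
finally:
    print("cleanup ran", flush=True)
"""


def run_child(exit_call):
    """Run the child with the given exit call; return (return code, stdout)."""
    proc = subprocess.run(
        [sys.executable, "-c", CHILD.format(exit_call=exit_call)],
        capture_output=True,
        text=True,
    )
    return proc.returncode, proc.stdout.strip()


def exit_is_catchable():
    """sys.exit() only raises SystemExit, so surrounding code can intercept it."""
    try:
        sys.exit(3)
    except SystemExit as exc:
        return exc.code


if __name__ == "__main__":
    print(run_child("sys.exit(0)"))   # the finally block runs before the process exits
    print(run_child("os._exit(0)"))   # the process ends at once; finally is skipped
    print(exit_is_catchable())        # the exit request is caught and its code returned
```

With sys.exit() the child still prints "cleanup ran"; with os._exit() it prints nothing. My guess is that something similar lets the actor process keep running after sys.exit(), but that is only a guess.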
So I am wondering what happens inside the worker to produce these results.
Thanks