sys.exit from inside an actor method gives unexpected results

  • None: Just asking a question out of curiosity

I am trying to understand how Ray's signal hooks work.

I have the following code as an example:
#################################################
import os
import sys
import ray
import time

ray.init()

@ray.remote(max_restarts=-1, max_task_retries=1)
class Actor:
    def __init__(self):
        print("constructor", flush=True)
        self.counter = 0

    def increment_and_possibly_fail(self, err_type):
        print("start task counter" + str(self.counter) + "_type" + str(err_type), flush=True)
        time.sleep(3)
        self.counter += 1
        if err_type == 1:
            print("err_type 1", flush=True)
            # Crash only on the first attempt: the directory is a marker
            # that survives the actor restart.
            if not os.path.exists('tempTaskDir'):
                print("create dir and crash", flush=True)
                os.mkdir('tempTaskDir')
                sys.exit()
                #os._exit(0)
            else:
                print("skip crash", flush=True)

        print("End of task counter" + str(self.counter) + "_type" + str(err_type), flush=True)
        return self.counter

actor = Actor.options(max_concurrency=1).remote()

# Remove the marker directory left over from a previous run.
if os.path.exists('tempTaskDir'):
    os.rmdir('tempTaskDir')

task_ref1 = actor.increment_and_possibly_fail.remote(1)
task_ref2 = actor.increment_and_possibly_fail.remote(2)

time.sleep(60)
##############################################
What I expect is the following output:

(Actor pid=25607) constructor
(Actor pid=25607) start task counter0_type1
(Actor pid=25607) err_type 1
(Actor pid=25607) create dir and crash
(Actor pid=25842) constructor
(Actor pid=25842) start task counter0_type1
(Actor pid=25842) err_type 1
(Actor pid=25842) skip crash
(Actor pid=25842) End of task counter1_type1
(Actor pid=25842) start task counter1_type2
(Actor pid=25842) End of task counter2_type2

That is: I call the remote method on the actor, the actor crashes and restarts, the method is retried and succeeds, and then the second remote call runs.

What I get is the following:

(Actor pid=3973) constructor
(Actor pid=3973) start task counter0_type1
(Actor pid=3973) err_type 1
(Actor pid=3973) create dir and crash
(Actor pid=3973) start task counter1_type2
(Actor pid=3973) End of task counter2_type2
(Actor pid=3973)
(Actor pid=4409) constructor

In other words: the first remote call, which is supposed to crash the actor, runs but does not crash it yet; the second remote call runs; and only then does the actor crash. The retry of the call that crashed never runs.

When I remove the time.sleep(3), the sequence starts the same way: the first call runs without crashing the actor yet, the second call runs, and then the actor crashes. This time, however, the retry of the crashing call does run.

When I change sys.exit to os._exit, I get what I expected.

So I am wondering what happens inside the worker to produce these results.

Thanks

sys.exit is a “clean exit”: it raises SystemExit and lets any exit handlers run before the process exits. If the second task is already queued on the actor by the time sys.exit is called, then I believe the actor will try to execute that task before exiting.
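The difference between the two exit calls can be seen without Ray at all. The sketch below is only an illustration of plain Python behavior, not of Ray's worker internals: it starts two child interpreters, one ending with sys.exit(0) and one with os._exit(0), and captures what each prints. The helper name run_child is made up for this example.

```python
import subprocess
import sys
import textwrap


def run_child(exit_call: str) -> str:
    """Run a short child script that ends with `exit_call` and return its stdout."""
    script = textwrap.dedent(f"""
        import atexit, os, sys
        atexit.register(lambda: print("atexit handler ran", flush=True))
        try:
            print("about to exit", flush=True)
            {exit_call}
        finally:
            print("finally block ran", flush=True)
    """)
    result = subprocess.run(
        [sys.executable, "-c", script], capture_output=True, text=True
    )
    return result.stdout


# sys.exit raises SystemExit, so the finally block and atexit handlers still run.
clean = run_child("sys.exit(0)")
# os._exit ends the process immediately, skipping both.
hard = run_child("os._exit(0)")

print("sys.exit output:", clean.splitlines())
print("os._exit output:", hard.splitlines())
```

With sys.exit the child still has a chance to do more work on its way out, which is consistent with the actor picking up an already queued task before the process actually goes away.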

Also, note that Ray does not automatically persist the actor’s state for you. When the actor restarts, the constructor runs again, so you should expect self.counter to be reset to 0.
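If the counter needs to survive a restart, the actor has to save and restore it itself. Here is a minimal sketch of that idea in plain Python, assuming a local file is an acceptable place for the checkpoint. The class name CheckpointedCounter and the file name counter.json are hypothetical; in a Ray actor the same logic would sit in the actor's constructor and methods.

```python
import json
import os
import tempfile


class CheckpointedCounter:
    """Hypothetical counter that restores its value from disk when constructed."""

    def __init__(self, path):
        self.path = path
        self.counter = 0
        # On (re)construction, pick up the last saved value if there is one.
        if os.path.exists(path):
            with open(path) as f:
                self.counter = json.load(f)["counter"]

    def increment(self):
        self.counter += 1
        # Write to a temporary file and rename it, so a crash in the middle
        # of the write cannot leave a half-written checkpoint behind.
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump({"counter": self.counter}, f)
        os.replace(tmp, self.path)
        return self.counter


checkpoint = os.path.join(tempfile.mkdtemp(), "counter.json")

first = CheckpointedCounter(checkpoint)
first.increment()
first.increment()

# Simulate a restart: a fresh object runs the constructor again.
restarted = CheckpointedCounter(checkpoint)
print(restarted.counter)  # 2
```

Note that a retried task would increment the restored value again, so whether this gives the counts you want depends on when the crash happens relative to the checkpoint write.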

Thanks, that explains why the second task would run before the exit.
It doesn’t explain why the retry never happens when I put in the sleep, though.

It seems to be a side effect of the above behavior, where from Ray’s perspective, the two tasks did “finish” successfully before the exit happened. In practice, it’s a bit hard to know exactly which tasks will get re-executed since it depends on whether the client received a reply before the crash, and there is no way to guarantee the ordering of these events during actual failures.
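As a rough way to picture this, here is a toy model of the caller's side. It is not Ray's actual implementation; the function name, the task names, and the retry bookkeeping are all made up for illustration. The only rule it encodes is the one described above: a task is resent only if its reply had not arrived when the crash was observed and it still has retries left.

```python
def tasks_to_resubmit(submitted, replies_before_crash, retries_left):
    """Toy model: list the tasks a caller would resend after seeing a crash."""
    return [
        task
        for task in submitted
        if task not in replies_before_crash and retries_left.get(task, 0) > 0
    ]


submitted = ["task1", "task2"]
retries = {"task1": 1, "task2": 1}

# Scenario A: both replies reached the caller before the process went away,
# so from the caller's point of view there is nothing left to retry.
scenario_a = tasks_to_resubmit(submitted, {"task1", "task2"}, retries)

# Scenario B: the crash was observed before any reply arrived,
# so both tasks are still outstanding and get resent.
scenario_b = tasks_to_resubmit(submitted, set(), retries)

print(scenario_a)  # []
print(scenario_b)  # ['task1', 'task2']
```

Which scenario a real run falls into depends on timing, which is why adding or removing the sleep can change whether the retry shows up.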