Sys.exit from inside actor function gives unexpected results

shiranbi · November 27, 2022, 3:16pm

None: Just asking a question out of curiosity

I am trying to understand how Ray's signal hooks work.

I have the following code as an example:
#################################################
import os
import sys
import ray
import time

ray.init()

@ray.remote(max_restarts=-1, max_task_retries=1)
class Actor:
    def __init__(self):
        print("constructor", flush=True)
        self.counter = 0

    def increment_and_possibly_fail(self, err_type):
        print("start task counter" + str(self.counter) + "_type" + str(err_type), flush=True)
        time.sleep(3)
        self.counter += 1
        if err_type == 1:
            print("err_type 1", flush=True)
            if not os.path.exists('tempTaskDir'):
                print("create dir and crash", flush=True)
                os.mkdir('tempTaskDir')
                sys.exit()
                #os._exit(0)
            else:
                print("skip crash", flush=True)

        print("End of task counter" + str(self.counter) + "_type" + str(err_type), flush=True)
        return self.counter

actor = Actor.options(max_concurrency=1).remote()

if os.path.exists('tempTaskDir'):
    os.rmdir('tempTaskDir')

task_ref1 = actor.increment_and_possibly_fail.remote(1)
task_ref2 = actor.increment_and_possibly_fail.remote(2)

time.sleep(60)
##############################################
What I expect to happen is the following output:

(Actor pid=25607) constructor
(Actor pid=25607) start task counter0_type1
(Actor pid=25607) err_type 1
(Actor pid=25607) create dir and crash
(Actor pid=25842) constructor
(Actor pid=25842) start task counter0_type1
(Actor pid=25842) err_type 1
(Actor pid=25842) skip crash
(Actor pid=25842) End of task counter1_type1
(Actor pid=25842) start task counter1_type2
(Actor pid=25842) End of task counter2_type2

That is: I run my remote function on the actor, the actor crashes, Ray retries the remote function and it passes, and then the second remote function runs.

What I get is the following:

(Actor pid=3973) constructor
(Actor pid=3973) start task counter0_type1
(Actor pid=3973) err_type 1
(Actor pid=3973) create dir and crash
(Actor pid=3973) start task counter1_type2
(Actor pid=3973) End of task counter2_type2
(Actor pid=3973)
(Actor pid=4409) constructor

This is essentially: the first remote function, which is supposed to crash the actor, runs (but the actor doesn’t crash yet); the second remote function runs; then the actor crashes. The retry of the remote function that crashed never runs.

When I remove the sleep(3), I get: the first remote function, which is supposed to crash the actor, runs (but the actor doesn’t crash yet); the second remote function runs; then the actor crashes. After that, the retry of the remote function that caused the crash does run.

When I change sys.exit to os._exit, I get what I expected.

So I am wondering what happens inside the worker to produce these results.

Thanks

Stephanie_Wang · November 28, 2022, 11:17pm

sys.exit is a “clean exit” and will try to run any exit handlers before exiting the process. If the second task is already queued on the actor by the time sys.exit is called, then I believe the actor will try to execute the task before exiting.
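
To make the difference between the two exit calls concrete, here is a minimal sketch in plain Python (no Ray involved, so it says nothing about how Ray's worker queues tasks). It assumes a hypothetical helper, run_child, that launches a short script in a subprocess, registers an atexit handler, and then exits with the given call.

```python
import subprocess
import sys
import textwrap


def run_child(exit_call: str) -> str:
    """Run a short script that registers an atexit handler, then exits via exit_call.

    run_child is a hypothetical helper for this illustration only.
    """
    script = textwrap.dedent(f"""
        import atexit, os, sys
        atexit.register(lambda: print("exit handler ran", flush=True))
        print("before exit", flush=True)
        try:
            {exit_call}
        finally:
            print("finally block ran", flush=True)
    """)
    result = subprocess.run(
        [sys.executable, "-c", script], capture_output=True, text=True
    )
    return result.stdout


# sys.exit raises SystemExit, so finally blocks and atexit handlers still run.
clean_output = run_child("sys.exit(0)")
# os._exit terminates the process immediately, skipping both.
hard_output = run_child("os._exit(0)")

print("sys.exit output:", clean_output.split())
print("os._exit output:", hard_output.split())
```

Because sys.exit only raises an exception, code further up the call stack still gets a chance to run before the process ends, whereas os._exit gives it none.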

Also, just note that Ray does not automatically persist the actor’s state for you. So when the actor restarts, you should expect that self.counter gets reset to 0.
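
If you do want the counter to survive a restart, one option is to checkpoint it yourself. The sketch below is plain Python rather than a Ray actor, and the class name CheckpointedCounter and the file name counter.json are made up for illustration. It assumes the restarted process can read the same path (for example the same node or shared storage); constructing the class a second time stands in for an actor restart.

```python
import json
import os
import tempfile


class CheckpointedCounter:
    """Hypothetical counter that reloads its value from a file on construction."""

    def __init__(self, path):
        self.path = path
        self.counter = 0
        # On a "restart", pick up the last saved value instead of starting at 0.
        if os.path.exists(path):
            with open(path) as f:
                self.counter = json.load(f)["counter"]

    def increment(self):
        self.counter += 1
        # Write to a temporary file and rename it, so a crash mid-write
        # does not leave a half-written checkpoint behind.
        tmp_path = self.path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump({"counter": self.counter}, f)
        os.replace(tmp_path, self.path)
        return self.counter


checkpoint_path = os.path.join(tempfile.mkdtemp(), "counter.json")

first = CheckpointedCounter(checkpoint_path)
first.increment()
first.increment()

# Constructing a new object simulates the constructor running again after a restart.
restarted = CheckpointedCounter(checkpoint_path)
print(restarted.counter)
```

The same constructor logic could go into the actor's own constructor, with the caveat that a task interrupted between the increment and the write would still lose its update.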

shiranbi · November 29, 2022, 5:19am

Thanks, that explains why the second task would run before the exit.
It doesn’t explain why the retry never happens when I put in the sleep, though.

Stephanie_Wang · November 29, 2022, 9:50pm

It seems to be a side effect of the above behavior, where from Ray’s perspective, the two tasks did “finish” successfully before the exit happened. In practice, it’s a bit hard to know exactly which tasks will get re-executed since it depends on whether the client received a reply before the crash, and there is no way to guarantee the ordering of these events during actual failures.
