I have a Ray Task that performs a minute-long computation and then returns a small ~8MB result. In my program I split the full input into chunks and submit each chunk to this Ray Task:
```python
for input_chunk in chunker(all_inputs, chunk_size=100):
    tasks = fancy_ray_function.remote(input_chunk)
    results = ray.get(tasks)
```
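For context, a minimal, self-contained version of what I'm doing looks roughly like this (the body of `fancy_ray_function`, the dummy `all_inputs`, and the `chunker` helper are placeholders standing in for my real code):

```python
import time
import ray

ray.init()

@ray.remote
def fancy_ray_function(input_chunk):
    # Stand-in for the real ~1 minute computation that returns an ~8MB result.
    time.sleep(60)
    return b"x" * (8 * 1024 * 1024)

def chunker(seq, chunk_size):
    # Yield successive fixed-size slices of the input sequence.
    for i in range(0, len(seq), chunk_size):
        yield seq[i:i + chunk_size]

all_inputs = list(range(1_000))

for input_chunk in chunker(all_inputs, chunk_size=100):
    tasks = fancy_ray_function.remote(input_chunk)
    results = ray.get(tasks)  # blocks until this chunk's result is available
```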
A few chunks in, my script hangs waiting for `ray.get()` to return a value. Looking at the logs, I've found that the Task completes successfully, but then an Exit signal removes the worker, which I assume is the cause of the hang:
```
[2021-09-24 13:52:16,354 I 460 460] core_worker.cc:2332: Finished executing task 25de4c7fa0336f7effffffffffffffffffffffff13000000, status=OK
[2021-09-24 13:52:31,084 I 460 476] core_worker.cc:769: Exit signal received, this process will exit after all outstanding tasks have finished, exit_type=IDLE_EXIT
[2021-09-24 13:52:31,085 I 460 460] core_worker.cc:325: Removed worker 8d36544720137d36b4c848a4ef61ec06e4b155f538bf468019523488
[2021-09-24 13:52:31,088 I 460 460] core_worker.cc:197: Destructing CoreWorkerProcess. pid: 460
[2021-09-24 13:52:31,088 I 460 460] io_service_pool.cc:47: IOServicePool is stopped.
```
This feels like a race condition, but I can't tell if it's an actual bug or something wrong with my use of Ray. How can I determine where this Exit signal is coming from?
To work around this intermittent failure, I could set a timeout in `ray.get()` and resubmit the failing input chunks, though I'd still like to get to the root of the problem.
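For reference, that timeout-and-resubmit workaround would look something like the following sketch (it reuses `fancy_ray_function` and `chunker` from above; the timeout value and retry count are arbitrary):

```python
import ray
from ray.exceptions import GetTimeoutError

def run_chunk_with_retry(input_chunk, timeout_s=120, max_attempts=3):
    # Resubmit the chunk whenever ray.get() doesn't return within timeout_s seconds.
    for _ in range(max_attempts):
        task = fancy_ray_function.remote(input_chunk)
        try:
            return ray.get(task, timeout=timeout_s)
        except GetTimeoutError:
            # Give up on the stuck ref and retry with a fresh task.
            ray.cancel(task, force=True)
    raise RuntimeError(f"Chunk still failing after {max_attempts} attempts")

results = [run_chunk_with_retry(chunk)
           for chunk in chunker(all_inputs, chunk_size=100)]
```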