@ray.remote(max_calls=1) executing too early

Erik_Brakkee · December 30, 2020, 4:46pm

Hi,

I have run into a subtle issue with @ray.remote. I am running a Keras model on a server with a single GPU as part of a grid search. I use max_calls=1 with @ray.remote to make sure that all resources are cleaned up between runs; otherwise I get OOM errors. What I am seeing now is that, after setting up the NVIDIA persistence daemon, the second task starts too quickly after the first, and this leads to an OOM error. I can fix the problem by simply sleeping for a few seconds between runs. So it seems that the next task starts while the cleanup of the previous task has not yet finished.

This is what I am doing now:

import time
import ray

@ray.remote(num_gpus=1, max_calls=1)
def ray_train_model(*, model_name, loss_name):
    # Give the previous worker time to exit and release its GPU memory.
    time.sleep(5)
    return train_model(model_name, loss_name)

Note that I sleep at the beginning to wait for the cleanup from the previous task. Sleeping at the end would not work, of course, because the cleanup only happens once the task has returned and the worker exits.

Cheers
Erik

simon-mo · December 30, 2020, 10:38pm

In Ray, max_calls takes effect by making the worker process exit after the task completes. There might be some cleanup that is still in progress before the worker process has properly exited. I think what happened was:

first worker finishes
first worker starts exit sequence
second worker starts
second worker starts executing task
[but first worker still hasn't finished cleaning up (e.g. freeing CUDA memory)]
OOM

I don’t think there’s a direct workaround other than the sleep. Thanks for bringing it up though!
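If a fixed sleep feels too fragile, one variation is to poll until enough GPU memory is free, with a timeout as a safety net. This is only a rough sketch and not something Ray provides: wait_for_free_memory and its get_free_bytes argument are hypothetical names, and get_free_bytes stands for whatever callable you use to read free GPU memory in bytes (for example by querying nvidia-smi).

```python
import time
from typing import Callable


def wait_for_free_memory(
    get_free_bytes: Callable[[], int],  # hypothetical: returns free GPU memory in bytes
    required_bytes: int,
    timeout_s: float = 30.0,
    poll_interval_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until get_free_bytes() reports at least required_bytes.

    Returns True once enough memory is free, or False if timeout_s
    elapses first. The memory reading is always checked at least once.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if get_free_bytes() >= required_bytes:
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(poll_interval_s)
```

The idea would be to call this at the start of the remote function in place of time.sleep(5), and to decide there what to do if it returns False. It is still a workaround for the same race, just one that waits only as long as needed.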
