@ray.remote(max_calls=1) executing too early


I have run into a subtle issue with @ray.remote. I am running a Keras model on a server with a single GPU as part of a grid search. I am using max_calls=1 with @ray.remote to make sure that all resources are cleaned up between runs; otherwise, I get OOM errors. What I am seeing now is that, after setting up the nvidia persistence daemon, the 2nd task starts too quickly after the first, and this leads to an OOM error. I can fix this problem by simply sleeping for a few seconds between runs. So it seems that the next task executes too quickly after the first and starts while the cleanup of the previous task is not yet finished.

This is what I am doing now:

@ray.remote(num_gpus=1, max_calls=1)
def ray_train_model(*, model_name, loss_name):
    import time
    time.sleep(5)  # wait for the previous worker's cleanup (e.g. CUDA memory release)
    return train_model(model_name, loss_name)

Note that I am doing a sleep at the beginning to wait for the previous task's cleanup. Doing a sleep at the end would not work, of course.
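A slightly more robust alternative to a fixed sleep might be to poll until the GPU looks free before starting the run. This is just a sketch, not a Ray feature: `wait_until` and `probe` are hypothetical helpers, and in practice the probe could check free GPU memory (e.g. via NVML's nvmlDeviceGetMemoryInfo) instead of the toy counter used here.

```python
import time

def wait_until(is_free, timeout=30.0, interval=0.5):
    """Poll is_free() until it returns True or the timeout expires.

    Returns True if the condition became true, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if is_free():
            return True
        time.sleep(interval)
    return False

# Toy probe standing in for a real GPU-memory check:
# reports "free" only on the third poll.
state = {"calls": 0}
def probe():
    state["calls"] += 1
    return state["calls"] >= 3

print(wait_until(probe, timeout=5.0, interval=0.01))  # True
```

The advantage over a hard-coded sleep is that the task starts as soon as the cleanup actually finishes, and you get an explicit failure (False) if it never does.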


In Ray, the effect of max_calls is actually manifested as the worker process exiting, and some cleanup may happen before the worker process fully exits. I think what happened was:

first worker finishes
first worker starts its exit sequence
second worker starts
second worker starts executing its task
(but the first worker still hasn't finished cleaning up, e.g. freeing CUDA memory)
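The race above can be sketched with a toy memory pool: the first worker's memory is released asynchronously (standing in for delayed CUDA teardown), so the second worker's immediate allocation fails, while waiting for the cleanup to complete succeeds. `GpuPool` is purely illustrative, not Ray or CUDA code.

```python
import threading

class GpuPool:
    """Toy stand-in for a fixed GPU memory budget (units are arbitrary)."""
    def __init__(self, total):
        self.total = total
        self.used = 0
        self.lock = threading.Lock()

    def alloc(self, n):
        with self.lock:
            if self.used + n > self.total:
                raise MemoryError("OOM: pool exhausted")
            self.used += n

    def free(self, n):
        with self.lock:
            self.used -= n

pool = GpuPool(total=10)
pool.alloc(8)                      # first worker's model occupies the GPU

def delayed_cleanup():
    import time
    time.sleep(0.1)                # cleanup lags behind the worker's "exit"
    pool.free(8)

t = threading.Thread(target=delayed_cleanup)
t.start()

try:
    pool.alloc(8)                  # second worker starts immediately
    outcome = "ok"
except MemoryError:
    outcome = "oom"

t.join()                           # waiting for cleanup (what the sleep approximates)
pool.alloc(8)                      # now the same allocation succeeds
print(outcome)                     # oom
```

The fixed sleep works for the same reason `t.join()` does here: it gives the first worker's teardown time to return the memory before the next task claims it.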

I don’t think there’s a direct workaround other than the sleep. Thanks for bringing it up though!