Hi,
I have run into a subtle issue with @ray.remote. I am running a Keras model on a server with a single GPU as part of a grid search. I use max_calls=1 with @ray.remote so that all resources are cleaned up between runs; otherwise I get OOM errors. What I am seeing now is that, after setting up the NVIDIA persistence daemon, the second task starts too soon after the first and fails with an OOM error. I can work around this by sleeping for a few seconds between runs, so it seems that the next task begins executing while the cleanup of the previous task has not yet finished.
This is what I am doing now:
import time

import ray

@ray.remote(num_gpus=1, max_calls=1)
def ray_train_model(*, model_name, loss_name):
    # Workaround: give the previous worker time to release its GPU memory.
    time.sleep(5)
    return train_model(model_name, loss_name)
Note that the sleep is at the beginning of the task, so that it waits for the cleanup of the previous task. A sleep at the end would of course not work, because the previous worker is only torn down after its task has returned.
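A less arbitrary variant of the same workaround might be to poll the GPU until its memory has actually been released, instead of sleeping for a fixed 5 seconds. The following is only a rough sketch under several assumptions: the names gpu_memory_used_mib and wait_for_gpu_memory, the 500 MiB threshold, and the timeout are all made up for illustration, nvidia-smi is assumed to be on the PATH, and the single GPU is assumed to be index 0.

```python
import subprocess
import time


def gpu_memory_used_mib(gpu_index=0):
    """Return the memory in use on one GPU, in MiB, as reported by nvidia-smi."""
    out = subprocess.check_output(
        [
            "nvidia-smi",
            f"--id={gpu_index}",
            "--query-gpu=memory.used",
            "--format=csv,noheader,nounits",
        ],
        text=True,
    )
    return int(out.strip().splitlines()[0])


def wait_for_gpu_memory(threshold_mib=500, timeout_s=60.0, poll_s=0.5,
                        read_used=gpu_memory_used_mib):
    """Block until used GPU memory drops to threshold_mib or below.

    Returns True once the GPU looks free, or False if timeout_s elapses first.
    read_used is injectable so the loop can be exercised without a GPU.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if read_used() <= threshold_mib:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)
```

In the task above, the call time.sleep(5) would then be replaced by wait_for_gpu_memory(), possibly raising an error if it returns False. The threshold would need tuning, since some memory may stay in use even when no task is running.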
Cheers
Erik