Hi,
I am running a cluster with a head node and continuously autoscaled worker nodes (they connect to the head node via the CLI with ray start params…). The worker nodes run the trials of a tune run that I start on the head node. When I stop the tune run with ctrl+c and restart it with “resume=True”, on basically all the worker nodes the GPU is still occupied by the previous worker process from ray start…, so after resuming the run I immediately get a CUDA error on all of them. I have to either spin up new worker nodes, or kill the processes on the worker nodes with ray stop and start ray again…
I tried “reuse_actors=False”, but that did not seem to have any effect… it is probably unrelated.
Any idea why the GPU is not cleared on the worker nodes after a ctrl+c?
A workaround might be to watch for “CUDA errors” on the worker node, and if I see one, restart ray there with ray stop and ray start etc. Is this possible? Can I watch for a specific error message on the worker node, parse it, and respond in the above way?
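Something like this rough sketch, run on each worker node, is what I have in mind (the log directory, the file pattern, and the ray start flags are just guesses for my setup, nothing I have verified):

```python
# watch_cuda_errors.py -- hypothetical watcher, run on each worker node.
# Assumes Ray's default log location (/tmp/ray/session_latest/logs) and that
# RAY_START_CMD matches the flags normally passed to `ray start` on that node.
import glob
import os
import subprocess
import time

RAY_START_CMD = ["ray", "start", "--address=<head-ip>:6379"]  # adjust to your setup

def logs_contain_cuda_error(log_dir="/tmp/ray/session_latest/logs"):
    """Return True if any worker stderr log mentions a CUDA error."""
    for path in glob.glob(os.path.join(log_dir, "worker-*.err")):
        try:
            with open(path, errors="ignore") as f:
                if "CUDA error" in f.read():
                    return True
        except OSError:
            pass
    return False

while True:
    if logs_contain_cuda_error():
        # Tear down the local Ray worker processes and rejoin the cluster.
        subprocess.run(["ray", "stop"], check=False)
        subprocess.run(RAY_START_CMD, check=False)
    time.sleep(30)
```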
Hey @thoglu, thanks for bringing this to our attention.
I’d expect GPU memory to get cleared after a keyboard interrupt. Could you share a reproduction script with us so we can investigate? I’m not sure why this is happening.
The trainable is just a simple PyTorch training loop, so this should be rather easy to reproduce: take a default trainable with a loop, ctrl+c on the head node, then restart with resume=True. It has to use the GPU though, so that GPU memory actually gets allocated.
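Something along these lines should trigger it, I think (a rough sketch only; the trainable body and the tune.run arguments are placeholders, not my actual setup):

```python
# Rough repro sketch: a trivial GPU-hungry trainable started with tune.run
# on the head node. Ctrl+c once trials are running on the worker nodes,
# then rerun the script with resume=True.
import torch
from ray import tune

def trainable(config):
    device = torch.device("cuda")
    model = torch.nn.Linear(4096, 4096).to(device)
    opt = torch.optim.SGD(model.parameters(), lr=config["lr"])
    while True:
        x = torch.randn(1024, 4096, device=device)
        loss = model(x).pow(2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
        tune.report(loss=loss.item())  # keep the trial alive and reporting

tune.run(
    trainable,
    name="cuda_cleanup_repro",
    config={"lr": 1e-3},
    resources_per_trial={"gpu": 1},
    num_samples=4,   # enough trials to be scheduled onto the worker nodes
    resume=False,    # set resume=True on the second run
)
```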