GPU memory is not freed on cluster after ctrl+c - can I respond to specific errors from within a client node?

Hi,
I am running a cluster with a head node and continuously autoscaled worker nodes, which connect to the head node via the CLI (ray start with the appropriate params). The jobs on the worker nodes come from a tune run that I start on the head node. When I stop the tune run with ctrl+c and restart it with “resume=True”, on basically all the worker nodes the GPU is still occupied by the previous worker process from ray start, so after resuming the run I immediately get a CUDA error on all the worker nodes. I have to either spin up new worker nodes, or kill the processes on the worker nodes with ray stop and start ray again…

I tried “reuse_actors=False”, but that did not seem to have any effect… it is probably unrelated.

  1. Any idea why the GPU memory is not cleared on the worker nodes after ctrl+c?

  2. A possible solution might be to watch for CUDA errors on the client node and, if I see one, restart Ray on that client with ray stop and ray start etc. Is this possible? Can I watch for a specific error message on the client, parse it, and respond in the way sketched below?
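
Roughly what I have in mind is something like this sketch (the head-node address is a placeholder, and matching on the exception text is just a guess on my part):

import subprocess


def restart_local_ray(head_address):
    # Restart the Ray runtime on this node and re-attach it to the head node.
    subprocess.run(["ray", "stop"], check=True)
    subprocess.run(["ray", "start", f"--address={head_address}"], check=True)


def run_with_cuda_watch(train_fn, head_address="10.0.0.1:6379"):
    # Run the training function; if PyTorch surfaces a CUDA error as a
    # RuntimeError, restart Ray on this node before re-raising.
    try:
        train_fn()
    except RuntimeError as err:
        if "CUDA" in str(err):
            restart_local_ray(head_address)
        raise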

Best,
Thorsten

Hey @thoglu, thanks for bringing this to our attention.

I’d expect GPU memory to get cleared after a keyboard interrupt. Could you share a reproduction script with us so we can investigate? I’m not sure why this is happening.

Hey @bveeramani,
I am using a simple tune script. Something like

import os

from ray import tune
from ray.air import FailureConfig, RunConfig
from ray.tune import Tuner, TuneConfig

# args, results_dir, trainable and cfg are defined elsewhere in the script.
if args.resume:
    # Restore the interrupted run from its results directory and continue it.
    tuner = Tuner.restore(
        path=os.path.join(results_dir, "test")
    )
    tuner.fit()
else:
    # Each trial gets 4 CPUs and 1 GPU.
    trainable = tune.with_resources(trainable, resources={"cpu": 4, "gpu": 1})

    failure_config = FailureConfig(max_failures=-1)

    run_config = RunConfig(
        name="test",
        local_dir=results_dir,
        failure_config=failure_config,
        log_to_file=True,
    )

    tune_config = TuneConfig(
        num_samples=1,
        reuse_actors=False,
    )

    tuner = Tuner(
        trainable,
        run_config=run_config,
        tune_config=tune_config,
        param_space=cfg,
    )
    tuner.fit()

In the trainable there is a simple PyTorch training loop. This should be rather easy to reproduce, I guess: take a default trainable with a loop, hit ctrl+c on the head node, and restart with resume. It has to use the GPU though, so that the GPU memory actually gets filled up.
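
Something like this dummy trainable should be enough (the model, tensor sizes, and learning rate are arbitrary placeholders, just there to keep GPU memory occupied):

import torch
from ray.air import session


def dummy_trainable(config):
    # Tiny model in an endless loop, just to occupy GPU memory until interrupted.
    device = torch.device("cuda")
    model = torch.nn.Linear(4096, 4096).to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=config.get("lr", 1e-3))

    while True:
        x = torch.randn(512, 4096, device=device)
        loss = model(x).pow(2).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        session.report({"loss": loss.item()})

Passing that through tune.with_resources(dummy_trainable, resources={"cpu": 4, "gpu": 1}) as in the snippet above should keep one GPU busy per trial until the run is interrupted.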

OK, this spawned an issue, and the solution seems to be updating the NVIDIA driver to something newer (515.xxx)… see e.g. [core] Multi-process(?) / GPU processes do not seem to be freed after ctrl+c on cluster · Issue #31451 · ray-project/ray · GitHub … or wait for a bugfix.