GPU memory is not freed on cluster after ctrl+c - can I respond to specific errors from within a client node?

Hi,
I am running a cluster with a head node and continuously autoscaled worker nodes, which connect to the head node via the CLI (ray start with the appropriate params). The jobs on the worker nodes come from a tune run that I start on the head node. When I stop the tune run with ctrl+c and restart it with “resume=True”, on basically all the worker nodes the GPU is still occupied by the previous worker process from ray start, so after resuming the run I immediately get a CUDA error on all the worker nodes. I have to either spin up new worker nodes, or kill the processes on the worker nodes with ray stop and start ray again…

I tried “reuse_actors=False”, but that did not seem to have any effect… it is probably unrelated.

  1. Any idea why the GPU memory is not cleared on the worker nodes after ctrl+c?

  2. A possible solution might be to watch for CUDA errors on the client node and, if I see one, restart Ray on that client with ray stop and ray start etc. Is this possible? Can I watch for a specific error message on the client, parse it, and respond in the way sketched below?
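
Roughly what I have in mind is something like this sketch (the head-node address is a placeholder, and matching on the exception text is just a guess on my part):

import subprocess


def restart_local_ray(head_address):
    # Restart the Ray runtime on this node and re-attach it to the head node.
    subprocess.run(["ray", "stop"], check=True)
    subprocess.run(["ray", "start", f"--address={head_address}"], check=True)


def run_with_cuda_watch(train_fn, head_address="10.0.0.1:6379"):
    # Run the training function; if PyTorch surfaces a CUDA error as a
    # RuntimeError, restart Ray on this node before re-raising.
    try:
        train_fn()
    except RuntimeError as err:
        if "CUDA" in str(err):
            restart_local_ray(head_address)
        raise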

Best,
Thorsten

Hey @thoglu, thanks for bringing this to our attention.

I’d expect GPU memory to get cleared after a keyboard interrupt. Could you share a reproduction script with us so we can investigate? I’m not sure why this is happening.

Hey @bveeramani,
I am using a simple tune script. Something like

import os

from ray import tune
from ray.air import FailureConfig, RunConfig
from ray.tune import Tuner, TuneConfig

# args, results_dir, trainable and cfg are defined elsewhere in the script.
if args.resume:
    # Restore the interrupted run from its results directory and continue it.
    tuner = Tuner.restore(
        path=os.path.join(results_dir, "test")
    )
    tuner.fit()
else:
    # Each trial gets 4 CPUs and 1 GPU.
    trainable = tune.with_resources(trainable, resources={"cpu": 4, "gpu": 1})

    failure_config = FailureConfig(max_failures=-1)

    run_config = RunConfig(
        name="test",
        local_dir=results_dir,
        failure_config=failure_config,
        log_to_file=True,
    )

    tune_config = TuneConfig(
        num_samples=1,
        reuse_actors=False,
    )

    tuner = Tuner(
        trainable,
        run_config=run_config,
        tune_config=tune_config,
        param_space=cfg,
    )
    tuner.fit()

In the trainable there is a simple PyTorch training loop. This should be rather easy to reproduce, I guess: take a default trainable with a loop, hit ctrl+c on the head node, and restart with resume. It has to use the GPU though, so that the GPU memory actually gets filled up.
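
Something like this dummy trainable should be enough (the model, tensor sizes, and learning rate are arbitrary placeholders, just there to keep GPU memory occupied):

import torch
from ray.air import session


def dummy_trainable(config):
    # Tiny model in an endless loop, just to occupy GPU memory until interrupted.
    device = torch.device("cuda")
    model = torch.nn.Linear(4096, 4096).to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=config.get("lr", 1e-3))

    while True:
        x = torch.randn(512, 4096, device=device)
        loss = model(x).pow(2).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        session.report({"loss": loss.item()})

Passing that through tune.with_resources(dummy_trainable, resources={"cpu": 4, "gpu": 1}) as in the snippet above should keep one GPU busy per trial until the run is interrupted.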

OK, this spawned an issue, and the solution seems to be updating the NVIDIA driver to something newer (515.xxx)… see e.g. [core] Multi-process(?) / GPU processes do not seem to be freed after ctrl+c on cluster · Issue #31451 · ray-project/ray · GitHub … or wait for a bugfix.