RuntimeError: No CUDA GPUs are available

Hi, I want to run a benchmark task with ray.tune. I implemented some very simple logic to run an algorithm with different hyper-parameters and random seeds. The idea of the script is shown below.

import ray
from ray import tune

SEEDS = [...]

def training_function(config):
    setup_seed(config['seed'])
    return training(config)

if __name__ == '__main__':
    ray.init('auto')

    config = {}
    grid_tune = ...
    for k, v in grid_tune.items():
        config[k] = tune.grid_search(v)

    config['seed'] = tune.grid_search(SEEDS)
    
    analysis = tune.run(
        training_function,
        name='benchmark',
        config=config,
        queue_trials=True,
        metric='reward',
        mode='max',
        resources_per_trial={
            "cpu": 1,
            "gpu": 0.5,
        }
    )
       
    upload_result()

In one of my experiments, I am running the algorithm with 16 configurations of hyper-parameters and 3 random seeds, so there are 48 trials in total. I have 3 nodes to run the experiment, with 4 GPUs on each node. Since each trial takes 0.5 GPU, the task needs two rounds to complete, with 24 trials per round.
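As a back-of-the-envelope check of that schedule (a throwaway sketch; the numbers are taken straight from this post):

```python
# Schedule arithmetic using the numbers from this experiment.
nodes = 3
gpus_per_node = 4
gpu_per_trial = 0.5  # matches resources_per_trial in the script above

concurrent = int(nodes * gpus_per_node / gpu_per_trial)  # 24 trials at once
total_trials = 16 * 3       # 16 hyper-parameter configs x 3 seeds = 48
rounds = -(-total_trials // concurrent)                  # ceiling division
```

With 24 concurrent trials and 48 total, `rounds` comes out to 2, matching the two rounds described above.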

In the first round, everything works smoothly. However, in the second round, the trials raise RuntimeError: No CUDA GPUs are available. I have tried sleeping for a short time (30 s) to give Ray more time to clean up the resources, but the error still shows up.

Does anyone know what causes the problem and how to fix it? Thanks in advance.

Can you post the error message that you’re getting?

The error message is

  File "/home/ubuntu/anaconda3/envs/benchmark/lib/python3.7/site-packages/torch/cuda/__init__.py", line 170, in _lazy_init
    torch._C._cuda_init()
RuntimeError: No CUDA GPUs are available

Also, in ray_results/trial/error.txt, there is

Failure # 1 (occurred at 2021-04-19_23-27-45)
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/benchmark/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 519, in _process_trial
    result = self.trial_executor.fetch_result(trial)
  File "/home/ubuntu/anaconda3/envs/benchmark/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 497, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ubuntu/anaconda3/envs/benchmark/lib/python3.7/site-packages/ray/worker.py", line 1381, in get
    raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.

What version of Ray are you running? I’ve never seen this issue before. Can you also try "gpu": 1?

I have tried Ray 1.1, 1.2 and 2.0dev; the error keeps showing up.

Also, I have tried "gpu": 1 before; it doesn’t solve the issue.

Hmm, do you have more info about why this error is showing up? Is it due to CUDA OOM? Or is it that CUDA_VISIBLE_DEVICES is not properly set?

Is it possible that the solution is this ^ as shared here:

Changing the way the device was specified from device = torch.device(0) to device = "cuda:0" as in How to use Tune with PyTorch — Ray v1.2.0 fixed it.

  1. It is not due to CUDA OOM; the trial only requires 2 GB of memory while the GPU has 16 GB.
  2. I have printed os.environ['CUDA_VISIBLE_DEVICES'], and it is correctly set.
  3. I am using device = 'cuda', which in PyTorch should mean the same thing as 'cuda:0' when there is only 1 GPU available. I remember trying device = 'cuda:0' and it didn’t work, but I am not 100% sure about this, and I will try it again.
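To make item 2 above easier to check, here is a small helper (my own naming, not part of Ray or the original script) that distinguishes an unset CUDA_VISIBLE_DEVICES from one that is set but empty — the empty-string case is exactly the state that makes PyTorch report "No CUDA GPUs are available":

```python
import os

def visible_gpu_ids():
    """Parse CUDA_VISIBLE_DEVICES into a list of device ids.

    Returns None when the variable is unset (all GPUs visible),
    an empty list when it is set but empty (no GPUs visible),
    and a list of integer ids otherwise.
    """
    raw = os.environ.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        return None          # unset: the process can see every GPU
    raw = raw.strip()
    if not raw:
        return []            # set but empty: no GPUs visible at all
    return [int(x) for x in raw.split(",")]
```

Printing `visible_gpu_ids()` at the top of the trainable would tell you whether Ray handed the second-round workers an empty device list rather than a wrong one.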

The information I can provide is limited since the error doesn’t tell me more. I think there are some problems in the resource recycling and the initialization of workers, but I am not clear on how these work in Ray.

P.S. I don’t know if it is related, but I have tried another way to launch the experiment: I discard tune and implement the grid search myself, which results in launching 48 workers at the same time. I also use max_calls=1 to let Ray release the resources. The workers on the normal nodes behave correctly, with 2 trials per GPU. However, on the head node, although os.environ['CUDA_VISIBLE_DEVICES'] shows a different value for each worker, all 8 workers run on GPU 0. When the old trials finish, the new trials also raise RuntimeError: No CUDA GPUs are available.

I have tried device = 'cuda:0'; it doesn’t work either. :cry:

I used a smaller cluster which has only 4 GPUs to run the experiment. I notice that not all of the later-round workers raise this error; workers in later rounds have a better chance of working properly.

Hi there, I have hit exactly the same error as you. I have 48 trials to run, but my resources only support 12 trials at a time, leaving 36 trials pending. When the cluster finished the first 12 trials, it raised “No CUDA GPUs are available” when the experiment went into the second round. Did you find a solution?

Yes, I have found a hack to work around this issue. Here is an example:

import torch
from ray import tune

def training_function(config):
    # Fail fast if this worker did not get a visible GPU.
    assert torch.cuda.is_available()
    # do your training here

tune.run(
    training_function,
    max_failures=100,  # set this to a large value; 100 works in my case
    # more parameters for your problem
)

Trials that do not initialize the GPU correctly will fail at the assertion. By setting max_failures to a very large value, Ray will keep relaunching the trial until it runs correctly.