RuntimeError: No CUDA GPUs are available

Hi, I want to run a benchmark task with ray.tune. I implemented some very simple logic to run an algorithm with different hyper-parameters and random seeds. The idea of the script is shown below.

import ray
from ray import tune

SEEDS = [...]

def training_function(config):
    setup_seed(config['seed'])
    return training(config)

if __name__ == '__main__':
    ray.init('auto')

    config = {}
    grid_tune = ...
    for k, v in grid_tune.items():
        config[k] = tune.grid_search(v)

    config['seed'] = tune.grid_search(SEEDS)
    
    analysis = tune.run(
        training_function,
        name='benchmark',
        config=config,
        queue_trials=True,
        metric='reward',
        mode='max',
        resources_per_trial={
            "cpu": 1,
            "gpu": 0.5,
        }
    )
       
    upload_result()

In one of my experiments, I am running the algorithm with 16 configurations of hyper-parameters and 3 random seeds, so there are 48 trials in total. I have 3 nodes to run the experiment, with 4 GPUs on each node. Since each trial takes 0.5 GPU, the task needs two rounds to complete, with 24 trials per round.
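As a back-of-the-envelope check of that schedule (a throwaway sketch; the numbers are taken straight from this post):

```python
# Schedule arithmetic using the numbers from this experiment.
nodes = 3
gpus_per_node = 4
gpu_per_trial = 0.5  # matches resources_per_trial in the script above

concurrent = int(nodes * gpus_per_node / gpu_per_trial)  # 24 trials at once
total_trials = 16 * 3       # 16 hyper-parameter configs x 3 seeds = 48
rounds = -(-total_trials // concurrent)                  # ceiling division
```

With 24 concurrent trials and 48 total, `rounds` comes out to 2, matching the two rounds described above.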

In the first round, everything works smoothly. However, in the second round, the trials raise RuntimeError: No CUDA GPUs are available. I have tried sleeping for a short time (30 s) to give Ray more time to clean up the resources, but the error still shows up.

Does anyone know what causes the problem and how to fix it? Thanks in advance.

Can you post the error message that you’re getting?

The error message is

  File "/home/ubuntu/anaconda3/envs/benchmark/lib/python3.7/site-packages/torch/cuda/__init__.py", line 170, in _lazy_init
    torch._C._cuda_init()
RuntimeError: No CUDA GPUs are available

Also, in ray_results/trial/error.txt, there is

Failure # 1 (occurred at 2021-04-19_23-27-45)
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/benchmark/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 519, in _process_trial
    result = self.trial_executor.fetch_result(trial)
  File "/home/ubuntu/anaconda3/envs/benchmark/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 497, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ubuntu/anaconda3/envs/benchmark/lib/python3.7/site-packages/ray/worker.py", line 1381, in get
    raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.

What version of Ray are you running? I’ve never seen this issue before. Can you also try "gpu": 1?

I have tried Ray 1.1, 1.2 and 2.0dev; the error keeps showing up.

Also, I have tried "gpu": 1 before; it doesn’t solve the issue.

Hmm, do you have more info about why this error is showing up? Is it due to CUDA OOM? Or is it that CUDA_VISIBLE_DEVICES is not properly set?

Is it possible that the solution is this ^ as shared here:

Changing the way the device was specified from device = torch.device(0) to device = "cuda:0" as in How to use Tune with PyTorch — Ray v1.2.0 fixed it.

  1. It is not due to CUDA OOM; the trial only requires 2 GB of memory while the GPU has 16 GB.
  2. I have printed os.environ['CUDA_VISIBLE_DEVICES'], and it is correctly set.
  3. I am using device = 'cuda', which in PyTorch should mean the same thing as 'cuda:0' when there is only 1 GPU available. I remember trying device = 'cuda:0' and it didn’t work, but I am not 100% sure about this, and I will try it again.
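To make item 2 above easier to check, here is a small helper (my own naming, not part of Ray or the original script) that distinguishes an unset CUDA_VISIBLE_DEVICES from one that is set but empty — the empty-string case is exactly the state that makes PyTorch report "No CUDA GPUs are available":

```python
import os

def visible_gpu_ids():
    """Parse CUDA_VISIBLE_DEVICES into a list of device ids.

    Returns None when the variable is unset (all GPUs visible),
    an empty list when it is set but empty (no GPUs visible),
    and a list of integer ids otherwise.
    """
    raw = os.environ.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        return None          # unset: the process can see every GPU
    raw = raw.strip()
    if not raw:
        return []            # set but empty: no GPUs visible at all
    return [int(x) for x in raw.split(",")]
```

Printing `visible_gpu_ids()` at the top of the trainable would tell you whether Ray handed the second-round workers an empty device list rather than a wrong one.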

The information I can provide is limited since the error doesn’t tell me more. I think there are some problems in the resource recycling and the initialization of workers, but I am not clear on how these work in Ray.

P.S. I don’t know if it is related, but I have tried another way to launch the experiment: I discard tune and implement the grid search myself, which results in launching 48 workers at the same time. I also use max_calls=1 to let Ray release the resources. The workers on the normal nodes behave correctly, with 2 trials per GPU. However, on the head node, although os.environ['CUDA_VISIBLE_DEVICES'] shows a different value for each worker, all 8 workers run on GPU 0. When the old trials finish, the new trials also raise RuntimeError: No CUDA GPUs are available.

I have tried device = 'cuda:0'; it doesn’t work either. :cry:

I used a smaller cluster which has only 4 GPUs to run the experiment. I notice that not all of the later-round workers raise this error; workers in later rounds have a better chance of working properly.

Hi there, I have hit exactly the same error as you. I have 48 trials to run, but my resources only support 12 trials at a time, leaving 36 trials pending. When the cluster finished the first 12 trials, it raised “No CUDA GPUs are available” when the experiment went into the second round. Did you find a solution?

Yes, I have found a hack to work around this issue. Here is an example:

import torch
from ray import tune

def training_function(config):
    # Fail fast if this worker did not get a visible GPU.
    assert torch.cuda.is_available()
    # do your training here

tune.run(
    training_function,
    max_failures=100,  # set this to a large value; 100 works in my case
    # more parameters for your problem
)

Trials that do not initialize the GPU correctly will fail at the assertion. By setting max_failures to a very large value, Ray will keep relaunching the trial until it runs correctly.