RuntimeError: No CUDA GPUs are available

Yes, I have found a hack to work around this issue. Here is an example:

import torch
from ray import tune

def training_function(config):
    # Fail fast if this trial's worker did not get a working GPU
    assert torch.cuda.is_available()
    # do your training here

tune.run(
    training_function,
    max_failures=100,  # set this to a large value; 100 works in my case
    # more parameters for your problem
)

For trials that fail to initialize the GPU correctly, the assertion fails immediately. With max_failures set to a large value, Ray keeps relaunching the trial until it starts up correctly.
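The retry-until-healthy idea behind max_failures can be sketched without Ray as well. Note this is just an illustration of the pattern, not part of Ray's API; run_with_retries is a hypothetical helper:

```python
def run_with_retries(fn, max_failures=100):
    # Hypothetical helper: re-run fn until it succeeds or the retry
    # budget is exhausted, mirroring what max_failures does for a trial.
    last_error = None
    for attempt in range(max_failures + 1):
        try:
            return fn()
        except AssertionError as err:
            # The environment check failed (e.g. no GPU visible);
            # "relaunch" by simply trying again.
            last_error = err
    raise RuntimeError("still failing after retries") from last_error
```

The difference is that Ray relaunches the trial in a fresh worker process, which is what actually gives the GPU initialization another chance to succeed, whereas a plain in-process loop like this would keep seeing the same broken environment.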
