Yes, I have found a hack to work around this issue. Here is an example:
import torch
from ray import tune

def training_function(config):
    # fail fast if this trial did not get a working GPU
    assert torch.cuda.is_available()
    # do your training here

tune.run(
    training_function,
    max_failures=100,  # set this to a large value, 100 works in my case
    # more parameters for your problem
)
For trials that do not initialize the GPU correctly, the assertion fails and the trial errors out. By setting max_failures to a very large value, Ray Tune will keep relaunching the trial until it runs correctly.
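For completeness, here is a slightly fuller sketch of the same workaround. The resources_per_trial setting and the device-name print are my own additions for illustration, not part of the original hack; they just make it explicit that each trial requests a GPU and let you see which device it actually received.

import torch
from ray import tune

def training_function(config):
    # fail fast so a bad trial is retried instead of silently running on CPU
    assert torch.cuda.is_available(), "trial started without a usable GPU"
    # optional: log which GPU this trial ended up on
    print("Trial running on:", torch.cuda.get_device_name(0))
    # ... your training loop here ...

tune.run(
    training_function,
    resources_per_trial={"cpu": 1, "gpu": 1},  # ask Ray to schedule one GPU per trial
    max_failures=100,  # retry trials that hit the assertion
)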