Torch cuda not available within Tune Trainable

How severe does this issue affect your experience of using Ray?

  • Medium: It contributes to significant difficulty to complete my task, but I can work around it.

Hi,

I am using Tune to train machine learning models through the Trainable API, but I am having issues with Tune not making CUDA devices available within the Trainable. In the Trainable documentation

https://docs.ray.io/en/latest/tune/api/doc/ray.tune.Trainable.default_resource_request.html#ray.tune.Trainable.default_resource_request

it mentions we should use the default_resource_request method to declare a static resource requirement, but even when specifying {"gpu": 1}, CUDA remains unavailable within the Trainable.step method during training. I also found this short guide

https://docs.ray.io/en/latest/tune/tutorials/tune-resources.html

which suggests wrapping my Trainable callable with tune.with_resources:

trainable_with_gpu = tune.with_resources(trainable, {"gpu": 1})
tuner = tune.Tuner(
    trainable_with_gpu,
    tune_config=tune.TuneConfig(num_samples=10),
)
results = tuner.fit()

Neither of these suggestions makes CUDA devices available within the Trainable.step() method for me. I should mention that torch correctly reports my devices as available outside the tuner.fit() call, and I've tried reinstalling torch and CUDA to no avail. Am I doing something wrong?
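For what it's worth, this is the kind of diagnostic I am printing from inside the trainable to confirm the problem (just a sketch; Ray sets CUDA_VISIBLE_DEVICES per worker based on the GPUs it assigns, so if no GPU was allocated to the trial it is typically empty and torch sees no devices):

```python
import os
import torch


def report_cuda_state():
    # What Ray exposed to this worker process, if anything.
    visible = os.environ.get("CUDA_VISIBLE_DEVICES", "<unset>")
    print(f"CUDA_VISIBLE_DEVICES={visible}")
    # What torch can actually see from inside the trial.
    print(f"torch.cuda.is_available()={torch.cuda.is_available()}")
    print(f"torch.cuda.device_count()={torch.cuda.device_count()}")
```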

I am publicly hosting my project on my GitHub: EricPfleiderer/timeseries-forecasting/

You can check out commit 99ac544 (CUDA issue, temp removing wandb) specifically.

Thanks,
Eric
