How severely does this issue affect your experience of using Ray?
- Medium: It contributes to significant difficulty to complete my task, but I can work around it.
Hi,
I am using Tune to train machine learning models through the Trainable API, and I am having issues with Tune not making CUDA devices available within the Trainable. The Trainable documentation
mentions that we should use the default_resource_request method to provide a static resource requirement, but when I specify {"gpu": 1}, CUDA remains unavailable within the Trainable.step method during training. I also found this short guide,
https://docs.ray.io/en/latest/tune/tutorials/tune-resources.html
which suggests wrapping my Trainable callable with tune.with_resources:

trainable_with_gpu = tune.with_resources(trainable, {"gpu": 1})
tuner = tune.Tuner(
    trainable_with_gpu,
    tune_config=tune.TuneConfig(num_samples=10),
)
results = tuner.fit()
Neither of these suggestions makes CUDA devices available within the Trainable.step() method for me. I should mention that torch correctly reports my devices as available outside the tuner.fit() call, and I have tried reinstalling torch and CUDA to no avail. Am I doing something wrong?
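My understanding is that Ray restricts GPU visibility per trial by setting the CUDA_VISIBLE_DEVICES environment variable in each worker process, so a check like the following (a minimal stdlib-only sketch; the helper name is mine, not Ray API) should show whether the trial process was actually granted a GPU:

```python
import os

def cuda_visibility():
    """Return the CUDA_VISIBLE_DEVICES value seen by the current process.

    Ray is expected to set this per trial when a GPU is allocated; an
    unset or empty value would explain torch reporting no CUDA devices
    inside Trainable.step().
    """
    value = os.environ.get("CUDA_VISIBLE_DEVICES")
    if value is None:
        return "<unset>"
    if value == "":
        return "<empty: no GPUs assigned to this worker>"
    return value
```

Calling this at the top of step() and logging the result should distinguish a scheduling problem (no GPU assigned to the trial) from a torch/driver problem inside the worker.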
I am publicly hosting my project on my GitHub: EricPfleiderer/timeseries-forecasting/
You can check out commit 99ac544 (CUDA issue, temp removing wandb) specifically.
Thanks,
Eric