How severely does this issue affect your experience of using Ray?
High: It blocks me from completing my task.
Hi, I have implemented a custom training flow by subclassing ray.rllib.algorithms.algorithm.Algorithm and overriding training_step(). In the override I am using train_one_step(). I followed the “two trainer workflow” example.
I am using tune.Tuner to train it. I have a GPU in the system and it shows up in ray.cluster_resources(). However, it doesn’t seem to use the GPU for training.
I call ray.init() with num_gpus=1, wrap the Algorithm/trainable in the with_resources() wrapper, and specify CPU and GPU usage like so:
It seems like you have correctly set up the GPU usage for your custom training algorithm with Ray and Tune. However, there are a few things you might want to check:
Check if your custom algorithm is designed to use GPUs: Even if you allocate GPUs to your trials, your custom algorithm needs to be implemented in a way that it can leverage GPU computation. For instance, if you’re using PyTorch, you should ensure that your model and data are moved to the GPU using .to(device) where device is a torch.device object representing your GPU.
Check if CUDA is visible to your trials: Setting the gpu resource in tune.with_resources() should automatically set the CUDA_VISIBLE_DEVICES environment variable for each trial. If gpu is not set, CUDA_VISIBLE_DEVICES will be set to an empty string, which prevents the trial from accessing any GPU. You might want to check that the variable holds the expected device ID inside your trial.
Check for GPU memory issues: As mentioned in the Ray documentation, you may occasionally run into GPU memory issues when running a new trial. This may be due to the previous trial not cleaning up its GPU state fast enough. To avoid this, you can use tune.utils.wait_for_gpu.
Fractional GPUs: If you have multiple trials running concurrently and you want them to share the GPU, you can use fractional GPUs as described in the Ray documentation. This allows each trial to use a fraction of the GPU memory.
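To check the first two points from inside a trial, a small diagnostic along the following lines could be called at the start of training_step() and its result logged. This is only a minimal sketch: describe_gpu_visibility is a hypothetical helper name, not part of Ray, and the torch check assumes you are using the PyTorch framework.

```python
import os


def describe_gpu_visibility(environ=None):
    """Report what the current process can see of the GPU.

    `environ` defaults to os.environ; passing a dict makes the helper
    easy to test outside a Ray trial.
    """
    env = os.environ if environ is None else environ
    raw = env.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        # Variable is unset: CUDA does not restrict the visible devices.
        visible = None
    else:
        # Empty string means no GPU is exposed to this process.
        visible = [d.strip() for d in raw.split(",") if d.strip()]

    report = {"cuda_visible_devices": visible}
    try:
        import torch  # optional; only relevant for the torch framework

        report["torch_cuda_available"] = torch.cuda.is_available()
    except ImportError:
        report["torch_cuda_available"] = None
    return report
```

If cuda_visible_devices comes back as an empty list inside the trial, the GPU was not assigned to that process, and the resource request is the place to look rather than the training code.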
I override training_step() in my custom algo, which has N policies. I handle the batching in training_step(), then call train_one_step(). Let’s say that under the hood I am training a DQN and the custom algo is a wrapper. So I am not moving the data around myself and am leaving that to RLlib.
As train_one_step() is not using the GPU, I need to pinpoint why. Is it because of my custom batching? If RLlib sets the training device from the batch device somewhere in train_one_step(), I can see this happening.
Thank you for your other suggestions, I had already checked those settings and they seem to be correct.
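One way to narrow this down is to look at the device each policy was placed on, rather than at the batch. The sketch below rests on two assumptions that should be verified against the installed Ray version: that the algorithm exposes get_policy(policy_id), and that torch policies carry a device attribute. report_policy_devices is a hypothetical helper name, not an RLlib function. As far as I understand, RLlib torch policies choose their device when they are constructed, based on the num_gpus value in the algorithm config rather than on the incoming batch, so a policy reporting "cpu" here would point at the config and not at the custom batching.

```python
def report_policy_devices(algo, policy_ids):
    """Return {policy_id: device string} for the given policies.

    Assumes `algo.get_policy(pid)` returns a policy object and that
    torch policies expose a `device` attribute; anything else is
    reported as "unknown" instead of raising.
    """
    devices = {}
    for pid in policy_ids:
        policy = algo.get_policy(pid)
        devices[pid] = str(getattr(policy, "device", "unknown"))
    return devices
```

Calling report_policy_devices(self, policy_ids) once at the top of the overridden training_step(), with policy_ids being the IDs of the N policies, and printing the result shows whether any of them ended up on the GPU before train_one_step() runs.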