Ray splits data unevenly across GPUs

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

I have a weird issue that looks like a memory leak, but I am not sure what the source is. For some reason, memory spikes on some GPUs during training, as can be seen in the picture below, and this eventually leads to OOM. The interesting part is that initially all the GPUs have even memory utilization, but over time it increases and some GPUs end up using more than others. Any idea how I can investigate this, find the source, and potentially resolve the issue?
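
One way to narrow this down (a minimal sketch, assuming PyTorch on the training workers; the helper and the tag format are made up for illustration) is to log each worker's GPU memory periodically and watch which rank drifts upward:

```python
import torch


def log_gpu_memory(tag: str) -> None:
    # memory_allocated: memory held by live tensors on the current device;
    # memory_reserved: what the CUDA caching allocator has claimed from the driver.
    allocated_mb = torch.cuda.memory_allocated() / 1024**2
    reserved_mb = torch.cuda.memory_reserved() / 1024**2
    print(f"[{tag}] allocated={allocated_mb:.0f} MiB reserved={reserved_mb:.0f} MiB")


# Call this inside the training loop, e.g. once per epoch or every N steps:
# log_gpu_memory(f"rank={ray.train.get_context().get_world_rank()} epoch={epoch}")
```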

The solution to this problem in our case was to reduce prefetch_batches. I wasn't expecting it to put the data on the GPUs as well, but it seems that is the case. I thought GPU and CPU prefetch were separated after [data] Set the prefetch depth separately for GPU-preloading in iter_batches · Issue #35305 · ray-project/ray · GitHub, but apparently that is not the case in the most recent version.
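
For reference, a minimal sketch of the workaround, assuming the data is consumed via Ray Data's iter_torch_batches inside a Ray Train training loop (the config key below is illustrative):

```python
import ray.train


def train_loop_per_worker(config):
    shard = ray.train.get_dataset_shard("train")
    # A smaller prefetch_batches means fewer batches are staged ahead of the
    # training step, so less extra batch data sits in memory on each worker.
    for batch in shard.iter_torch_batches(
        batch_size=config["batch_size"],  # illustrative config key
        prefetch_batches=1,  # reduced from the previous, larger value
    ):
        ...  # forward/backward pass as usual
```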