Memory exhaustion problem when using Dataset with RLlib

How severe does this issue affect your experience of using Ray?

  • High: It blocks me to complete my task.

I have a moderately sized dataset saved in a number of pickle files, which I load into a Dataset using the .from_items() function. I use a static dataset because generating the environment state on the fly takes a long time. Every agent has access to the same Dataset instance and randomly samples a portion of the main dataset with .random_samples(fraction), so that the dataset is distributed evenly among the agent workers.
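As a minimal, self-contained sketch of the loading step described above (the file names and record layout here are made up for illustration; in practice the pickles hold pre-generated environment states, and the resulting list would back a shared `ray.data.from_items()` Dataset):

```python
import glob
import os
import pickle
import tempfile

# Create a few sample pickle files so the sketch runs standalone;
# these stand in for the pre-generated environment-state files.
tmpdir = tempfile.mkdtemp()
for i in range(3):
    with open(os.path.join(tmpdir, f"part_{i}.pkl"), "wb") as f:
        pickle.dump([{"state": i * 10 + j} for j in range(4)], f)

# Load every pickle file into one flat list of records.
records = []
for path in sorted(glob.glob(os.path.join(tmpdir, "*.pkl"))):
    with open(path, "rb") as f:
        records.extend(pickle.load(f))

# With Ray, this list would then back the shared Dataset, e.g.:
#   ds = ray.data.from_items(records)
print(len(records))
```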

Whenever I run out of samples in the random 'partition', I call .random_samples(fraction) again and work on a new set of samples. This avoids every agent carrying a copy of the whole dataset while training across multiple workers. The problem is that with every call to .random_samples(fraction), the memory occupied by the process increases considerably. The two plots below show the increase in memory utilization (right) coinciding with the spikes in CPU utilization (left) whenever new random samples are drawn from the dataset during training.
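The resample-on-exhaustion pattern can be sketched in plain Python as follows. `random_partition` is a hypothetical stand-in for the Dataset sampling call, so the sketch runs without Ray; the point is the control flow, where each re-draw should release the previous partition (the reported bug is that memory grows instead):

```python
import random

def random_partition(dataset, fraction, rng):
    """Stand-in for the Dataset's fractional sampling: return a new
    random subset holding roughly `fraction` of the records."""
    k = max(1, int(len(dataset) * fraction))
    return rng.sample(dataset, k)

dataset = list(range(1000))   # the shared, static dataset
rng = random.Random(0)
fraction = 0.1                # each partition holds ~10% of the data

partition = iter(random_partition(dataset, fraction, rng))
consumed = 0
for _ in range(350):          # training steps, one sample each
    try:
        sample = next(partition)
    except StopIteration:
        # Partition exhausted: draw a fresh random fraction.
        # The old partition should become garbage here.
        partition = iter(random_partition(dataset, fraction, rng))
        sample = next(partition)
    consumed += 1
```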

This behaviour eventually fills the memory, making it impossible to finish the experiment (the machine I am using has 500 GB of RAM). Any idea how to deal with this?

NOTE: I am using Ray 1.13.0. Should I consider upgrading my library version?

UPDATE: I have tried Ray 2.0.0 and I see the same issue.

Can you please log a GitHub issue with a reproduction script? That would be great :slight_smile:

Thanks for reporting this!