Learning from large static datasets

  • High: It blocks me from completing my task.

Hi everyone,

My use case for RLlib entails learning from large ensemble climate datasets. The observations are images in the 100-pixel range, with a channel dimension equal to the number of climate fields I consider in a given experiment (precipitation, temperature, pressure, etc.). The climate system exogenously forces regional human systems on short time scales, so the climate data can be treated as static apart from the actions and rewards.

Of all of Ray/RLlib's capabilities, I am wondering how best to relax the data-flow constraints of my application. I have encountered significant scaling issues when porting it to HPC resources (4 GPUs × 100+ CPUs). There is probably a way to pass around an object reference or memory map instead of the images themselves, but I haven't been able to make this work effectively. Ideally, I would create the dataset once, store it somewhere I can read efficiently, and pass pointers to the dataset instead of the actual images when needed.
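For concreteness, here is roughly the pattern I have in mind (a minimal sketch; the file name, shapes, and the `EnvWorker` actor are stand-ins, not my actual code):

```python
import numpy as np
import ray

ray.init()

# Put the static observation stack into the object store once.
# File name and shape are placeholders.
climate_stack = np.load("climate_stack.npy")  # e.g. (T, H, W, C)
climate_ref = ray.put(climate_stack)

@ray.remote
class EnvWorker:
    def __init__(self, data):
        # Ray resolves the ObjectRef before __init__ runs; for numpy
        # arrays on the same node this is a zero-copy, read-only view.
        self._data = data

    def observe(self, t):
        # Slicing returns a view; copy only if it must be mutated.
        return self._data[t]

# Every worker shares the same in-store copy instead of holding its own.
workers = [EnvWorker.remote(climate_ref) for _ in range(4)]
obs = ray.get(workers[0].observe.remote(0))
```
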

I have tried sharing a single dataset in the object store, but with on the order of millions of files this becomes difficult. Could input readers or an offline dataset (or something else) help here? Is there a tool specifically designed for the case where part of the observation is static and does not need to be copied?
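One direction I have considered for the millions-of-files problem is consolidating everything into a single memory-mapped array that each worker opens locally (a sketch under the assumption that the fields pack into one contiguous array; path, dtype, and shape below are made up):

```python
import numpy as np

# One-time consolidation: pack the many small files into a single
# contiguous array on shared storage.
T, H, W, C = 1_000_000, 100, 100, 5
out = np.lib.format.open_memmap(
    "/shared/climate_stack.npy", mode="w+",
    dtype=np.float32, shape=(T, H, W, C),
)
# ... fill `out` chunk by chunk from the original files ...
out.flush()

# In each rollout worker / environment: open read-only. The OS pages in
# only the slices actually touched, so nothing is copied up front and
# nothing travels through the object store.
data = np.load("/shared/climate_stack.npy", mmap_mode="r")
obs_t = np.asarray(data[42])  # materializes just this one observation
```
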

Thanks!


Hi @ekblad,

From what you describe, this sounds like an industry-grade application. For such workloads we usually do preprocessing with lazy evaluation at the edge nodes. In that regard I would point you to RayDP: keep your data in a Spark cluster, rely on Spark's lazy evaluation to preprocess it there, and then pull it up into Ray.
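A rough sketch of that flow (executor sizing and the Parquet path are placeholders for your setup; `ray.data.from_spark` is the RayDP-backed converter in Ray Data):

```python
import ray
import raydp

ray.init()

# Start a Spark cluster on top of Ray; sizing here is illustrative.
spark = raydp.init_spark(
    app_name="climate-preprocessing",
    num_executors=4,
    executor_cores=8,
    executor_memory="16GB",
)

# Build a lazy Spark pipeline; nothing is materialized yet.
df = spark.read.parquet("s3://my-bucket/climate-fields/")
df = df.select("time", "precip", "temp", "pressure")

# Convert the (still lazy) DataFrame into a Ray Dataset for training.
ds = ray.data.from_spark(df)
```
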

Hope this helps,
Simon

Hi @ekblad, I was dealing with a classification problem and ran into a similar situation. I found this solution using Dask: Training models when data doesn’t fit in memory | Guilherme’s Blog
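The gist of it (a minimal sketch; the Zarr store and the normalization step are stand-ins for your own data and preprocessing):

```python
import dask.array as da

# Lazily reference a large on-disk array; nothing is read yet.
# The Zarr store name is a placeholder.
x = da.from_zarr("climate_fields.zarr")

# Transformations stay lazy and are evaluated chunk by chunk.
x_norm = (x - x.mean()) / x.std()

# Only the requested sample is pulled into memory.
sample = x_norm[0].compute()
```
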

Let me know how it goes.
Paul
