Learning from large static datasets

ekblad · April 1, 2022, 6:13pm

High: It blocks me from completing my task.

Hi everyone,

My use case for RLlib entails learning from large ensemble climate datasets. The observations are images in the 100-pixel range, with one channel per climate field I consider in a given experiment (precipitation, temperature, pressure, etc.). The climate system exogenously forces regional human systems on short time scales, so this can be considered a static dataset apart from the actions and rewards.

I was wondering which of the capabilities of Ray/RLlib could help relax the data flow constraints of my application. I have encountered significant scaling issues when porting my application to HPC resources (4 GPU x 100+ CPU). There is probably a way to pass around an object reference or memory map instead of the images themselves, but I haven’t been able to make this work effectively. Ideally, I would create the dataset, store it somewhere I can read it efficiently, and pass around pointers to the dataset instead of the actual images when needed.
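To make the memory-map idea concrete, here is a minimal sketch using plain NumPy. It is not an RLlib-specific tool, and `build_dataset`, `StaticFieldEnv`, the array shape, and the random stand-in data are all hypothetical names and values chosen for illustration. The environment holds only a file path and an integer time index, and copies a single frame out of the memory-mapped file when an observation is requested.

```python
import numpy as np

# Hypothetical layout: (time steps, climate fields, height, width).
SHAPE = (32, 3, 100, 100)
DTYPE = np.float32


def build_dataset(path):
    """Write the static fields to one .npy file, once, before training."""
    data = np.lib.format.open_memmap(path, mode="w+", dtype=DTYPE, shape=SHAPE)
    rng = np.random.default_rng(0)
    data[:] = rng.random(SHAPE, dtype=DTYPE)  # stand-in for real climate fields
    data.flush()
    del data  # close the writable map


class StaticFieldEnv:
    """Toy environment that stores an index into the dataset, not the images."""

    def __init__(self, path):
        # mmap_mode="r" maps the file read-only; nothing is loaded up front.
        self.data = np.load(path, mmap_mode="r")
        self.t = 0

    def reset(self):
        self.t = 0
        return self._obs()

    def step(self, action):
        self.t += 1
        done = self.t >= len(self.data) - 1
        reward = 0.0  # placeholder; the real reward depends on the action
        return self._obs(), reward, done, {}

    def _obs(self):
        # Copy only the current frame out of the mapped file.
        return np.array(self.data[self.t])
```

Each worker process would open the same file by path, so the only thing passed between processes is a string and an integer. Whether this scales on a given cluster depends on the shared filesystem, which this sketch does not address.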

I have tried sharing a single dataset in the object store, but with on the order of millions of files this becomes difficult. I would like to know if input readers or an offline dataset (or something else) could help. Is there a tool specifically designed for the case where part of the observation is static and does not need to be copied?

Thanks!

Lars_Simon_Zehnder · April 3, 2022, 9:06am

Hi @ekblad,

From what I hear, this sounds like an industry-grade application. In such cases we usually did the preprocessing with lazy evaluation on the edges (nodes). For this I would point you to RayDP: keep your data in a Spark cluster, rely on Spark's lazy evaluation to preprocess it there, and then pull the result into Ray.

Hope this helps,
Simon

Paul · April 10, 2022, 3:05pm

Hi @ekblad, I was dealing with a classification problem and ran into a similar situation. I found this solution using Dask: "Training models when data doesn’t fit in memory" | Guilherme’s Blog

Let me know how it goes.
Paul
