How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
We are building a distributed training framework on Ray Train. We use the DataParallelTrainer class to run our workload in a data-parallel fashion, passing our training models to the train_loop_per_worker function via the train_loop_config dictionary. However, when we pass a model of about 30 GB through train_loop_config, we get an Out of Memory (OOM) error, even though the machine has 192 GB of RAM.
We have two questions about this issue. First, is train_loop_config expected to handle objects of this size? If not, is there another way to achieve the same result with Ray?
We appreciate your help in resolving this issue. Thank you for your time and consideration.