OOM when Passing Large Object to Ray Trainer Config

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

We are building a distributed training framework using Ray Trainer. We use the DataParallelTrainer class to run our workload in a data-parallel fashion, and we pass our training models to the train_loop_per_worker function via the train_loop_config argument. However, when we try to pass a model of roughly 30 GB through train_loop_config, we get an Out of Memory (OOM) error, even though the machine has 192 GB of RAM.
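Roughly, our setup looks like the sketch below. Here `build_model()` is a placeholder for however the 30 GB model is actually constructed, and the worker count is illustrative:

```python
from ray.train import ScalingConfig
from ray.train.data_parallel_trainer import DataParallelTrainer

def train_loop_per_worker(config):
    # The large model arrives through the config dict.
    model = config["model"]
    # ... training logic ...

# build_model() is a stand-in for however the ~30 GB model is created.
big_model = build_model()

trainer = DataParallelTrainer(
    train_loop_per_worker=train_loop_per_worker,
    train_loop_config={"model": big_model},  # this is where we hit the OOM
    scaling_config=ScalingConfig(num_workers=4),
)
result = trainer.fit()
```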

We have a few questions about this issue. First, is train_loop_config expected to handle objects of this size? If not, is there another way to achieve this with Ray?

We appreciate your help in resolving this issue. Thank you for your time and consideration.

Passing large objects through train_loop_config is not supported. Instead, instantiate the model (or any other large object) inside train_loop_per_worker. You can use e.g. S3 or NFS to ensure all workers have access to the model files.
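A minimal sketch of that pattern, assuming the weights have already been saved to shared storage (the NFS path and the torch.load usage are illustrative; any serialization scheme works):

```python
from ray.train import ScalingConfig
from ray.train.data_parallel_trainer import DataParallelTrainer

def train_loop_per_worker(config):
    import torch
    # Each worker loads the weights itself from shared storage instead of
    # receiving the 30 GB object through the config. If the weights live in
    # S3 rather than NFS, download them to local disk first.
    model = torch.load(config["model_path"], map_location="cpu")
    # ... training logic ...

trainer = DataParallelTrainer(
    train_loop_per_worker=train_loop_per_worker,
    # Only a small path string goes through the config now.
    train_loop_config={"model_path": "/mnt/shared/model.pt"},  # hypothetical NFS mount
    scaling_config=ScalingConfig(num_workers=4),
)
trainer.fit()
```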

Just to add to this, one reason this may run into an OOM is if the workers are scheduled on the same machine. In that case the model is duplicated across all workers. Note that this duplication also happens when the model is loaded inside the train loop rather than passed through the config.

However, passing it through the config incurs additional memory overhead, since at least one copy is stored in the Ray object store, taking up another 30 GB.
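As a rough illustration (assuming 4 co-located workers, which is just an example): each worker holds its own 30 GB copy (~120 GB total), and routing the model through the config adds at least one more 30 GB copy in the object store, which already approaches the machine's 192 GB of RAM before any other training state is counted.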