How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
We are building a distributed training framework on Ray Train. We use the DataParallelTrainer class to run our workload in a data-parallel fashion, passing our training models to the train_loop_per_worker function via the train_loop_config dictionary. However, when we pass a model of about 30 GB through train_loop_config, we get an Out of Memory (OOM) error, even though the machine has 192 GB of RAM.
We have two questions about this issue. First, is train_loop_config expected to handle objects of this size? If not, is there another way to achieve the same result with Ray?
We appreciate your help in resolving this issue. Thank you for your time and consideration.