How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
We are building a distributed training framework on top of Ray Train. We use the
DataParallelTrainer class to run our workload in a data-parallel fashion, passing our training model to the
train_loop_per_worker function through the
train_loop_config dictionary. However, when we pass a model of roughly 30 GB via
train_loop_config, we hit an Out of Memory (OOM) error, even though the machine has 192 GB of RAM.
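For reference, here is a minimal sketch of the pattern that triggers the error, assuming Ray 2.x imports. `load_model()` and the worker body are placeholders; the real model is about 30 GB in memory:

```python
from ray.train import ScalingConfig
from ray.train.data_parallel_trainer import DataParallelTrainer

def train_loop_per_worker(config):
    # Each worker receives a deserialized copy of everything in the config.
    model = config["model"]
    # ... actual training logic elided ...

model = load_model()  # placeholder loader; the real model is ~30 GB

trainer = DataParallelTrainer(
    train_loop_per_worker,
    train_loop_config={"model": model},  # the OOM occurs with the large model here
    scaling_config=ScalingConfig(num_workers=4),
)
trainer.fit()
```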
We have a few questions about this issue. First, is
train_loop_config expected to handle objects of this size? If not, is there another way to achieve the same result with Ray?
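One workaround we have considered, sketched below, is to put the model into the Ray object store once with `ray.put()` and pass only the ObjectRef through train_loop_config, resolving it inside the worker. We are not sure whether this is supported or recommended for objects of this size (`load_model()` is again a placeholder):

```python
import ray
from ray.train import ScalingConfig
from ray.train.data_parallel_trainer import DataParallelTrainer

def train_loop_per_worker(config):
    # Resolve the reference; workers on the same node read from the shared
    # object-store copy instead of each receiving the model in the config.
    model = ray.get(config["model_ref"])
    # ... training logic ...

model_ref = ray.put(load_model())  # store the ~30 GB model once

trainer = DataParallelTrainer(
    train_loop_per_worker,
    train_loop_config={"model_ref": model_ref},  # pass only the reference
    scaling_config=ScalingConfig(num_workers=4),
)
trainer.fit()
```

Alternatively, each worker could load the model from shared storage inside train_loop_per_worker; we would appreciate guidance on which approach is intended.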
We appreciate your help in resolving this issue. Thank you for your time and consideration.