How severe does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
I am trying to build an XGBoost model in a distributed manner using XGBoostTrainer and to tune its hyperparameters with Ray Tune. My cluster consists of 4 r6i.4xlarge nodes, each with 16 CPUs and 128 GB of memory. I am running 2 concurrent trials, with num_workers = 2 for each XGBoostTrainer. My dataset is stored in S3 as Parquet files (200 partitions for the train set and 100 partitions for the validation set), and I am loading it with Ray Dataset. I am hitting an OOM error while training is happening.
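Roughly, my setup looks like the sketch below (the bucket paths, label column, XGBoost params, and search space are placeholders, not my exact code):

```python
import ray
from ray import tune
from ray.air.config import ScalingConfig
from ray.train.xgboost import XGBoostTrainer

# Placeholder S3 paths; my actual dataset has 200 train / 100 val Parquet partitions.
train_ds = ray.data.read_parquet("s3://my-bucket/train/")
val_ds = ray.data.read_parquet("s3://my-bucket/val/")

trainer = XGBoostTrainer(
    scaling_config=ScalingConfig(num_workers=2),
    label_column="target",  # placeholder label column
    params={"objective": "binary:logistic", "tree_method": "approx"},
    datasets={"train": train_ds, "valid": val_ds},
)

tuner = tune.Tuner(
    trainer,
    # Placeholder search space; in practice I tune a few XGBoost params.
    param_space={"params": {"max_depth": tune.randint(3, 10)}},
    tune_config=tune.TuneConfig(num_samples=4, max_concurrent_trials=2),
)
results = tuner.fit()
```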
In the XGBoostTrainer documentation linked below, it is suggested that each node needs about 3x the size of the dataset in memory (I may be mistaken here).
https://docs.ray.io/en/latest/train/gbdt.html?highlight=xgboost#how-to-optimize-xgboost-memory-usage
But in the XGBoostTrainer benchmark linked below, the model was trained on a 100 GB dataset with 10 m5.4xlarge instances, each having only 64 GB of memory, so each node's memory is smaller than the dataset.
https://docs.ray.io/en/latest/ray-air/benchmarks.html#xgboost-training
Am I doing something wrong in selecting the type of nodes?
PS: I also tried restricting the number of CPUs used by each XGBoost actor by setting resources_per_worker = {'CPU': 12}, but it was still using all the cores of each node (see the sketch below).
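Concretely, what I tried looked roughly like this (same placeholder datasets and label column as in the sketch above):

```python
from ray.air.config import ScalingConfig
from ray.train.xgboost import XGBoostTrainer

# A sketch of what I tried: reserve 12 CPUs per XGBoost worker actor.
# (My understanding is that Ray CPU resources are logical reservations,
# which may be why all physical cores were still busy; this is an assumption.)
scaling_config = ScalingConfig(
    num_workers=2,
    resources_per_worker={"CPU": 12},
)

trainer = XGBoostTrainer(
    scaling_config=scaling_config,
    label_column="target",  # placeholder, as above
    params={"objective": "binary:logistic"},
    datasets={"train": train_ds, "valid": val_ds},  # datasets as in the sketch above
)
```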