Cluster specs needed for training XGBoost model using XGBoostTrainer

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

I am trying to build an XGBoost model in a distributed manner using XGBoostTrainer and also tune its hyperparameters with Ray Tune. My cluster consists of 4 r6i.4xlarge nodes, each with 16 CPUs and 128 GB of memory. I am running 2 concurrent trials, with num_workers = 2 for each XGBoostTrainer. My dataset is a set of Parquet files (200 partitions for the train set and 100 partitions for the validation set) residing in S3, which I load with Ray Datasets. I am hitting an OOM issue while training is happening.
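
For reference, here is a stripped-down sketch of my setup (the bucket paths, label column, objective, and search space are placeholders, not my real values):

```python
import ray
from ray import tune
from ray.air.config import ScalingConfig
from ray.train.xgboost import XGBoostTrainer

# Placeholder S3 paths -- the real data is ~200 train / ~100 val Parquet partitions.
train_ds = ray.data.read_parquet("s3://my-bucket/train/")
val_ds = ray.data.read_parquet("s3://my-bucket/val/")

trainer = XGBoostTrainer(
    label_column="label",  # placeholder label column
    params={"objective": "binary:logistic", "tree_method": "hist"},
    scaling_config=ScalingConfig(num_workers=2),
    datasets={"train": train_ds, "valid": val_ds},
)

tuner = tune.Tuner(
    trainer,
    # Placeholder search space; the real one tunes several XGBoost params.
    param_space={"params": {"max_depth": tune.randint(3, 10)}},
    tune_config=tune.TuneConfig(num_samples=4, max_concurrent_trials=2),
)
results = tuner.fit()
```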

The documentation on XGBoostTrainer below suggests that each node needs about 3x the size of the dataset in memory (I may be mistaken here).
https://docs.ray.io/en/latest/train/gbdt.html?highlight=xgboost#how-to-optimize-xgboost-memory-usage
But in the XGBoostTrainer benchmark below, the model was trained on a 100 GB dataset with 10 m5.4xlarge instances, each having only 64 GB of memory, so each node's memory is less than the size of the dataset.
https://docs.ray.io/en/latest/ray-air/benchmarks.html#xgboost-training

Am I doing something wrong in selecting the type of nodes?

PS: I also tried restricting the number of CPUs used by each XGBoost actor by setting resources_per_worker = {'CPU': 12}, but it was still using all the cores of each node.
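
Here is roughly how I set that (again a sketch; the datasets and label column are placeholders). My rough understanding is that Ray's CPU requests are logical rather than hard limits, so maybe XGBoost's own thread count also needs to be capped, but I'm not sure:

```python
import ray
from ray.air.config import ScalingConfig
from ray.train.xgboost import XGBoostTrainer

# Placeholder datasets; the real ones are the S3 Parquet datasets described above.
train_ds = ray.data.read_parquet("s3://my-bucket/train/")
val_ds = ray.data.read_parquet("s3://my-bucket/val/")

trainer = XGBoostTrainer(
    label_column="label",  # placeholder
    params={
        "objective": "binary:logistic",
        "tree_method": "hist",
        # Guess: capping xgboost's own "nthread" as well, since the Ray CPU
        # request alone did not seem to stop it from using all 16 cores.
        "nthread": 12,
    },
    scaling_config=ScalingConfig(
        num_workers=2,
        resources_per_worker={"CPU": 12},  # what I tried to limit each actor to
    ),
    datasets={"train": train_ds, "valid": val_ds},
)
```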