Hello, I am wondering what the impact of extremely wide data is when running xgboost on Ray. We have a dataset with more than 18k columns, and we noticed huge overhead when training on it: the object store usage seems to be 10-30x the size of the data, and loading the data also takes a very long time.
BTW, the xgboost-ray docs suggest there is no point in creating more than one worker on each node because xgboost can multi-thread by itself. However, we noticed that the data loading operation on each worker is single-threaded, so with only one worker per node the data loading takes forever. We found that we had to create many workers on each node to speed up data loading; a rough sketch of how we sized them is below.
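For context, this is roughly how we ended up sizing the actors (the numbers below are placeholders for a hypothetical 4-node cluster with 64 CPUs per node, not our real setup):

```python
from xgboost_ray import RayParams

# Placeholder numbers for a hypothetical 4-node, 64-CPU-per-node cluster.
# One actor per node (num_actors=4) leaves data loading single-threaded on
# each node, so we run many smaller actors per node instead.
ray_params = RayParams(
    num_actors=32,     # 8 actors per node instead of 1
    cpus_per_actor=8,  # 32 * 8 = 256 CPUs total across the cluster
)
```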
The newest release of xgboost-ray will parallelize data loading, which should help. As far as I understand it, Ray Datasets are not optimized for wide datasets (@amogkam may have a better idea here).
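One thing that helps is passing the data as a list of file shards (e.g. Parquet) rather than a single in-memory dataframe, so each training actor can load its own shards instead of everything going through one central load and the object store. A rough sketch with placeholder paths and label column (if I remember correctly, `distributed=True` forces the distributed loading code path, but please double-check against the release you are on):

```python
from xgboost_ray import RayDMatrix

# Placeholder paths: one Parquet shard per file, label column named "label".
data_files = [f"s3://my-bucket/wide_dataset/part-{i:04d}.parquet" for i in range(128)]

dtrain = RayDMatrix(
    data_files,
    label="label",
    distributed=True,  # each training actor loads only its assigned shards
)
```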
Another thing you may want to consider is feature-parallel training with lightgbm-ray; however, we do not support that directly. It should be quite easy to add support if you are interested in that, and we would be happy to help. For reference: Distributed Learning Guide — LightGBM 3.3.5.99 documentation
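For reference, in plain (non-Ray) LightGBM the feature-parallel mode is selected through the `tree_learner` parameter. A minimal sketch of the relevant config, with illustrative values only (the machine list / port setup from the guide above is omitted, and lightgbm-ray does not wire this up today):

```python
# Plain LightGBM distributed training config (illustrative values only).
params = {
    "objective": "binary",
    "tree_learner": "feature",   # feature-parallel: split finding is parallelized over columns
    "num_machines": 4,           # number of distributed workers
    "local_listen_port": 12400,  # base port for worker communication
}
```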
It’s difficult to guess an exact reason for the increased overhead. @mlts would you be able to provide a minimal example of the code that you are using? We may have resolved a similar issue with a different user previously, so I wanted to see if this is potentially related. Thanks!