XGBoost on Ray with extremely wide data

Hello, I am wondering what the impact of extremely wide data is on XGBoost on Ray. We have a dataset with more than 18k columns, and we noticed huge overhead when training on it: the object store grows to 10-30x the data size, and loading the data also takes a very long time.
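For scale, here is a rough back-of-envelope on the raw footprint (the row count and dtype below are illustrative assumptions, not measurements from our cluster):

```python
# Rough back-of-envelope for the raw in-memory size of a wide dataset.
# n_rows and the float64 dtype are assumptions for illustration only.
n_rows = 1_000_000
n_cols = 18_000          # ~18k columns, as in our dataset
bytes_per_value = 8      # assuming float64 features

raw_bytes = n_rows * n_cols * bytes_per_value
raw_gib = raw_bytes / 2**30
print(f"raw data: {raw_gib:.1f} GiB")

# If the object store ends up holding several copies of the data
# (source blocks, repartitioned blocks, per-actor shards, ...),
# the footprint multiplies accordingly:
for copies in (1, 3, 10):
    print(f"{copies} copies -> {copies * raw_gib:.1f} GiB")
```

Even a few intermediate copies would explain a large multiplier at this scale.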

BTW, the xgboost-ray docs suggest there is no point in creating more than one worker per node because XGBoost can multi-thread by itself. However, we noticed that the data loading operation on each worker is single-threaded. With only one worker per node, data loading takes forever, so we found we had to create many workers on each node to speed it up.
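Roughly what our configuration looks like now (the dataset path and actor/CPU counts below are placeholders; the API names are from the xgboost-ray docs):

```python
# Sketch of running more actors than nodes so that data loading is
# parallelized; assumes xgboost-ray is installed. Counts and the
# dataset path are placeholders, not our real values.
from xgboost_ray import RayDMatrix, RayParams, train

ray_params = RayParams(
    num_actors=16,       # e.g. 4 nodes x 4 actors per node
    cpus_per_actor=4,    # leave threads for XGBoost itself
)

train_set = RayDMatrix("s3://bucket/wide_dataset.parquet", label="target")

result = train(
    {"objective": "binary:logistic", "eval_metric": "logloss"},
    train_set,
    ray_params=ray_params,
)
```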

@Yard1 do you have any advice for this XGBoost on Ray workload?

The newest release of xgboost-ray parallelizes data loading, which should help. As far as I understand, Ray Datasets are not optimized for wide datasets (@amogkam may have a better idea here).

Another thing you may want to consider is feature-parallel training with lightgbm-ray; however, we do not support that directly. It should be quite easy to add support, though, and we would be happy to help if you are interested. See the Distributed Learning Guide in the LightGBM documentation.
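For context on what feature-parallel means here: in plain (non-Ray) distributed LightGBM it is selected via the `tree_learner` parameter, per the Distributed Learning Guide. The dict below is only an illustrative fragment with placeholder values, not lightgbm-ray API:

```python
# Illustrative LightGBM parameter fragment for feature-parallel training.
# Values are placeholders; this is not lightgbm-ray configuration.
params = {
    "objective": "binary",
    "tree_learner": "feature",  # each machine holds a subset of the features
    "num_machines": 4,          # placeholder cluster size
}
print(params)
```

Feature-parallel splits the columns rather than the rows across machines, which is why it is attractive for very wide data.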

Thanks a lot for the reply. I am using release 2.3.1. Could you briefly explain why wide data causes more overhead? A short explanation is enough.

I would love to be able to do parallel training with lgbm. Thanks again for the help!

It’s difficult to guess an exact reason for the increased overhead without more detail. @mlts, would you be able to provide a minimal example of the code you are using? We may have resolved a similar issue for another user previously, so I wanted to see if this is related. Thanks!

@mlts let us know if the responses from @Yard1 and @sjl address your concerns.