Hello, I am wondering what the impact of extremely wide data is when running xgboost on Ray. We have a dataset with more than 18k columns, and we noticed huge overhead when training on it: the object store usage seems to be 10-30x the size of the data, and loading the data also takes a very long time.
BTW, the xgboost-ray docs suggest there is no point in creating more than one worker on each node because xgboost can multi-thread by itself. However, we noticed that the data loading operation on each worker is single-threaded, so with only one worker per node the data loading takes forever. We found that we had to create many workers on each node to speed up data loading; a rough sketch of how we sized them is below.
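For context, this is roughly how we ended up sizing the actors (the numbers below are placeholders for a hypothetical 4-node cluster with 64 CPUs per node, not our real setup):

```python
from xgboost_ray import RayParams

# Placeholder numbers for a hypothetical 4-node, 64-CPU-per-node cluster.
# One actor per node (num_actors=4) leaves data loading single-threaded on
# each node, so we run many smaller actors per node instead.
ray_params = RayParams(
    num_actors=32,     # 8 actors per node instead of 1
    cpus_per_actor=8,  # 32 * 8 = 256 CPUs total across the cluster
)
```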
The newest release of xgboost-ray will parallelize data loading, which should help. As far as I understand it, Ray Datasets are not optimized for wide datasets (@amogkam may have a better idea here).
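One thing that helps is passing the data as a list of file shards (e.g. Parquet) rather than a single in-memory dataframe, so each training actor can load its own shards instead of everything going through one central load and the object store. A rough sketch with placeholder paths and label column (if I remember correctly, `distributed=True` forces the distributed loading code path, but please double-check against the release you are on):

```python
from xgboost_ray import RayDMatrix

# Placeholder paths: one Parquet shard per file, label column named "label".
data_files = [f"s3://my-bucket/wide_dataset/part-{i:04d}.parquet" for i in range(128)]

dtrain = RayDMatrix(
    data_files,
    label="label",
    distributed=True,  # each training actor loads only its assigned shards
)
```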
Another thing you may want to consider is feature-parallel training with lightgbm-ray; however, we do not support that directly. It should be quite easy to add support if you are interested in that, and we would be happy to help. For reference: Distributed Learning Guide — LightGBM 3.3.5.99 documentation
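For reference, in plain (non-Ray) LightGBM the feature-parallel mode is selected through the `tree_learner` parameter. A minimal sketch of the relevant config, with illustrative values only (the machine list / port setup from the guide above is omitted, and lightgbm-ray does not wire this up today):

```python
# Plain LightGBM distributed training config (illustrative values only).
params = {
    "objective": "binary",
    "tree_learner": "feature",   # feature-parallel: split finding is parallelized over columns
    "num_machines": 4,           # number of distributed workers
    "local_listen_port": 12400,  # base port for worker communication
}
```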
It’s difficult to guess an exact reason for the increased overhead. @mlts would you be able to provide a minimal example of the code that you are using? We may have resolved a similar issue with a different user previously, so I wanted to see if this is potentially related. Thanks!