Pipelining/streaming data for distributed XGBoostTrainer training/validation

From this tutorial, it looks like we need to have all of the data loaded in memory before training/validating a model.

My data depends on a number of preprocessing steps in an upstream pipeline (using Ray Actors which write to the shared object store) and is also too large to fit into cluster memory. I want to be able to configure the trainer so that it waits for each chunk of training data from upstream and trains on it as soon as it arrives (roughly the pattern sketched below). The same goes for the validation step. How do I do this?
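
To make the question concrete, here is a minimal sketch of the setup I have in mind. The `Preprocessor` actor, its normalization step, and the random shards are all hypothetical stand-ins for my real pipeline; the open question is the training loop at the bottom:

```python
import numpy as np
import ray

ray.init()

@ray.remote
class Preprocessor:
    # Hypothetical upstream actor: preprocesses one shard and returns it,
    # which places the result in the shared object store.
    def process(self, shard: np.ndarray) -> np.ndarray:
        return (shard - shard.mean()) / shard.std()  # stand-in transform

shards = [np.random.rand(1000, 10) for _ in range(4)]  # stand-in raw data
actors = [Preprocessor.remote() for _ in range(4)]
chunk_refs = [a.process.remote(s) for a, s in zip(actors, shards)]

# Consume chunks as they become ready, rather than all at once.
pending = list(chunk_refs)
while pending:
    done, pending = ray.wait(pending, num_returns=1)
    chunk = ray.get(done[0])
    # ...train on `chunk` as soon as it arrives -- this is the part
    # I don't know how to express with XGBoostTrainer.
```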

XGBoost requires all of the training data to be loaded in memory. The algorithm itself doesn't support incremental (per-chunk) learning, so streaming datasets into the trainer is not possible here.
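
For reference, the supported pattern is to fully materialize both splits as Ray Datasets before calling `fit()`. A minimal sketch, assuming Ray 2.x's `ray.train.xgboost.XGBoostTrainer`; the file paths, label column, boosting parameters, and worker count are placeholders:

```python
import ray
from ray.train import ScalingConfig
from ray.train.xgboost import XGBoostTrainer

# Both splits must be loadable by the cluster before training starts.
train_ds = ray.data.read_parquet("s3://my-bucket/train/")  # placeholder path
valid_ds = ray.data.read_parquet("s3://my-bucket/valid/")  # placeholder path

trainer = XGBoostTrainer(
    scaling_config=ScalingConfig(num_workers=4),  # placeholder worker count
    label_column="label",  # placeholder column name
    params={"objective": "binary:logistic", "eval_metric": ["logloss"]},
    datasets={"train": train_ds, "valid": valid_ds},
    num_boost_round=100,
)
result = trainer.fit()
print(result.metrics)
```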