Pipelining/streaming data for distributed XGBoostTrainer training/validation

From this tutorial, it looks like we need to have all of the data loaded in memory before training/validating a model.

My data depends on a number of preprocessing steps in an upstream pipeline (using Ray Actors which write to the shared object store) and is also too large to fit into cluster memory. I want to be able to configure the trainer so that it waits for each chunk of training data from upstream and trains on it as soon as it arrives (roughly the pattern sketched below). The same goes for the validation step. How do I do this?
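
To make the question concrete, here is a minimal sketch of the setup I have in mind. The `Preprocessor` actor, its normalization step, and the random shards are all hypothetical stand-ins for my real pipeline; the open question is the training loop at the bottom:

```python
import numpy as np
import ray

ray.init()

@ray.remote
class Preprocessor:
    # Hypothetical upstream actor: preprocesses one shard and returns it,
    # which places the result in the shared object store.
    def process(self, shard: np.ndarray) -> np.ndarray:
        return (shard - shard.mean()) / shard.std()  # stand-in transform

shards = [np.random.rand(1000, 10) for _ in range(4)]  # stand-in raw data
actors = [Preprocessor.remote() for _ in range(4)]
chunk_refs = [a.process.remote(s) for a, s in zip(actors, shards)]

# Consume chunks as they become ready, rather than all at once.
pending = list(chunk_refs)
while pending:
    done, pending = ray.wait(pending, num_returns=1)
    chunk = ray.get(done[0])
    # ...train on `chunk` as soon as it arrives -- this is the part
    # I don't know how to express with XGBoostTrainer.
```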

XGBoost requires all of the training data to be loaded in memory. The algorithm itself doesn't support incremental (per-chunk) learning, so streaming datasets into the trainer is not possible here.
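
For reference, the supported pattern is to fully materialize both splits as Ray Datasets before calling `fit()`. A minimal sketch, assuming Ray 2.x's `ray.train.xgboost.XGBoostTrainer`; the file paths, label column, boosting parameters, and worker count are placeholders:

```python
import ray
from ray.train import ScalingConfig
from ray.train.xgboost import XGBoostTrainer

# Both splits must be loadable by the cluster before training starts.
train_ds = ray.data.read_parquet("s3://my-bucket/train/")  # placeholder path
valid_ds = ray.data.read_parquet("s3://my-bucket/valid/")  # placeholder path

trainer = XGBoostTrainer(
    scaling_config=ScalingConfig(num_workers=4),  # placeholder worker count
    label_column="label",  # placeholder column name
    params={"objective": "binary:logistic", "eval_metric": ["logloss"]},
    datasets={"train": train_ds, "valid": valid_ds},
    num_boost_round=100,
)
result = trainer.fit()
print(result.metrics)
```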