I have an upstream pipeline (using Ray Workflows) which produces DataFrames and stores them in Ray’s object store memory. I want to use this data for XGBoost training. RayDMatrix expects de-referenced objects as an input (either a DataFrame or a file path). How can I construct RayDMatrix using existing object_store objects?
Hi, thanks for the question.
Ray Workflows is still in an experimental stage. I will try to find out who owns this.
I believe it’s @chengsu, or he can direct us to the right person.
@gjoliver, thanks for the reply. The question is not really about Workflows; it is about XGBoost on Ray. Workflows is only the orchestration layer that fetches the data. The same scenario comes up when a user loads data into the cluster with plain Ray tasks as part of their own custom logic and then wants to use that data for XGBoost training, hence the question.
You should be able to construct a RayDMatrix and pass a list of references to pandas DataFrames in the constructor (provided you have at least as many references as there are XGBoost training workers). See xgboost_ray/simple_objectstore.py at master · ray-project/xgboost_ray · GitHub
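
For illustration, here is a minimal sketch of that approach. It is not copied from the linked example and has not been checked against a specific xgboost_ray version. The helpers `make_shard` and `cap_num_actors` are hypothetical names made up for this post: the first stands in for your upstream task, and the second only enforces the "at least as many references as workers" condition. The shard sizes, column names, and training parameters are likewise assumptions.

```python
import numpy as np
import pandas as pd


def make_shard(seed: int, nrows: int = 1000, nfeatures: int = 4) -> pd.DataFrame:
    """Hypothetical stand-in for the upstream task: one DataFrame shard with a label column."""
    rng = np.random.default_rng(seed)
    df = pd.DataFrame(
        rng.standard_normal((nrows, nfeatures)),
        columns=[f"f{i}" for i in range(nfeatures)],
    )
    df["label"] = rng.integers(0, 2, size=nrows)
    return df


def cap_num_actors(num_refs: int, requested: int) -> int:
    """Hypothetical helper: never use more training actors than there are shards."""
    if num_refs < 1:
        raise ValueError("need at least one DataFrame reference")
    return min(num_refs, requested)


def main() -> None:
    # Imported here so the helpers above can be used without Ray installed.
    import ray
    from xgboost_ray import RayDMatrix, RayParams, train

    ray.init()

    # Each .remote() call returns an ObjectRef; the DataFrames stay in the object store.
    remote_make_shard = ray.remote(make_shard)
    refs = [remote_make_shard.remote(seed) for seed in range(4)]

    num_actors = cap_num_actors(len(refs), requested=2)

    # Pass the list of references as-is, without calling ray.get() on them.
    dtrain = RayDMatrix(refs, label="label")

    evals_result = {}
    train(
        {"objective": "binary:logistic", "eval_metric": ["logloss", "error"]},
        dtrain,
        num_boost_round=10,
        evals=[(dtrain, "train")],
        evals_result=evals_result,
        ray_params=RayParams(num_actors=num_actors, cpus_per_actor=1),
    )
    print("final train error:", evals_result["train"]["error"][-1])


if __name__ == "__main__":
    main()
```

In your case the `refs` list would be whatever ObjectRefs your workflow or Ray tasks already returned, so the `make_shard` part drops out entirely.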