XGBoostTrainer access to indices of data in Ray Dataset

sjhermanek · April 12, 2024, 1:04pm

How severely does this issue affect your experience of using Ray?

High: It blocks me from completing my task.

Hi Ray community!

Background:

I am training an XGBoost model using XGBoostTrainer. I’m fairly familiar with XGBoost (including the xgboost_ray library), less so with XGBoostTrainer, and very new to Ray Datasets.
In my loss function, I need access to a “per-row” weight.
Since I pass the data in as a Ray Dataset, I need a way to find out which rows of the dataset (by index) end up in each actor’s shard.
I’m on Ray 2.7, and it’d be quite painful to upgrade…

Code Sample:

import ray
from ray.train import ScalingConfig
from ray.train.xgboost import XGBoostTrainer

import numpy as np
import xgboost as xgb
from typing import Tuple

# objectstore_refs, XCOLS, and wcol are defined elsewhere (not shown here).
ray_ds = ray.data.from_pandas_refs(objectstore_refs)
ray_ds = ray_ds.select_columns(XCOLS + [wcol] + ['y'])

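def loss(predt: np.ndarray,
         dtrain: xgb.DMatrix) -> Tuple[np.ndarray, np.ndarray]:
    # Squared-error gradient and Hessian, shown as a placeholder (assumption).
    grad = predt - dtrain.get_label()
    hess = np.ones_like(predt)
    # grad_weights: per-row weights for the rows in this worker's shard.
    # This is the piece I don't know how to obtain.
    grad = grad * grad_weights
    return grad, hess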

trainer = XGBoostTrainer(
    scaling_config=ScalingConfig(
        num_workers=2,
        use_gpu=False,
        resources_per_worker={'CPU': 8},
    ),
    label_column='y',
    num_boost_round=20,
    params={
        "objective": "reg:squarederror",
        "eval_metric": ["rmse"],
    },
    datasets={"train": ray_ds},
    obj=loss,
)

result = trainer.fit()
print(result.metrics)
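
To make the requirement concrete, here is a minimal sketch of the kind of objective I have in mind if the weights could travel with the rows instead of being looked up by index. It assumes the weight column ends up registered as the DMatrix weight on each worker (plain xgb.DMatrix takes a weight argument; whether XGBoostTrainer on Ray 2.7 can forward my weight column this way is something I have not verified). The name weighted_squared_error is made up for illustration, and the squared-error gradient is only a placeholder.

```python
import numpy as np


def weighted_squared_error(predt, dtrain):
    """Custom objective: squared-error gradient scaled by per-row weights.

    dtrain is assumed to expose get_label() and get_weight(),
    as xgb.DMatrix does.
    """
    labels = dtrain.get_label()
    weights = dtrain.get_weight()
    if len(weights) == 0:
        # No weights attached to the DMatrix: fall back to weight 1 per row.
        weights = np.ones_like(labels)
    grad = (predt - labels) * weights  # only the gradient is weighted
    hess = np.ones_like(predt)         # Hessian left unweighted, as in my loss
    return grad, hess
```

With plain xgboost this would be passed as obj=weighted_squared_error to xgb.train. Because each worker reads the weights from its own DMatrix, no knowledge of shard indices would be needed.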

How I’d love this to work:

In the xgboost_ray project, there is a private method called _get_sharding_indices, which can be used to identify which rows are placed on which actor.
In my loss function, I’d love to pre-load data per-shard, and then return the correct grad_weights as a function of which actor is calling.
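
Roughly, this is the shape of what I am after. It is only a sketch under assumptions: interleaved_shard_indices is a hypothetical stand-in written for illustration (it is not xgboost_ray's _get_sharding_indices, and I don't know how XGBoostTrainer actually assigns Ray Dataset rows to actors), and make_weighted_loss, all_weights, rank, and num_actors are made-up names.

```python
import numpy as np


def interleaved_shard_indices(rank, num_actors, n_rows):
    """Hypothetical sharding rule: actor `rank` gets rows rank, rank + num_actors, ..."""
    return np.arange(rank, n_rows, num_actors)


def make_weighted_loss(all_weights, rank, num_actors):
    """Build an objective that closes over this actor's slice of the weights."""
    idx = interleaved_shard_indices(rank, num_actors, len(all_weights))
    shard_weights = np.asarray(all_weights, dtype=float)[idx]

    def loss(predt, dtrain):
        # Placeholder squared-error gradient, scaled by the shard's weights.
        grad = (predt - dtrain.get_label()) * shard_weights
        hess = np.ones_like(predt)
        return grad, hess

    return loss
```

The missing piece is a supported way to learn, inside each worker, its rank and the row indices it actually received, so that shard_weights lines up with the rows in dtrain.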

Thank you so much in advance for any and all help!
