XGBoostTrainer access to indices of data in Ray Dataset

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

Hi Ray community!

Background:

  • I am training an XGBoost model using XGBoostTrainer. I’m fairly familiar with XGBoost (including the xgboost_ray library), less so with XGBoostTrainer, and very new to Ray Datasets.

  • In my loss function, I need access to a “per-row” weight.

  • Since I pass the data in as a Ray Dataset, I need a way to find out which rows of the dataset end up in each actor’s shard.

  • I’m on Ray 2.7, and it’d be quite painful to upgrade…

Code Sample:

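import ray
from ray.train import ScalingConfig
from ray.train.xgboost import XGBoostTrainer

import numpy as np
import xgboost as xgb
from typing import Tuple

# objectstore_refs, XCOLS, and wcol are defined elsewhere: object-store refs to
# pandas DataFrames, the feature column names, and the weight column name.
ray_ds = ray.data.from_pandas_refs(objectstore_refs)
ray_ds = ray_ds.select_columns(XCOLS + [wcol] + ['y'])

def loss(predt: np.ndarray,
         dtrain: xgb.DMatrix) -> Tuple[np.ndarray, np.ndarray]:
    # Squared-error gradient and Hessian (assumed here, to match "reg:squarederror" below).
    grad = predt - dtrain.get_label()
    hess = np.ones_like(predt)
    # grad_weights is the missing piece: the per-row weights (from wcol)
    # for the rows in this worker's shard. It is not defined anywhere yet.
    grad = grad * grad_weights
    return grad, hess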

trainer = XGBoostTrainer(
    scaling_config=ScalingConfig(
        num_workers=2,
        use_gpu=False,
        resources_per_worker={'CPU': 8},
    ),
    label_column='y',
    num_boost_round=20,
    params={
        "objective": "reg:squarederror",
        "eval_metric": ["rmse"],
    },
    datasets={"train": ray_ds},
    obj=loss,
)

result = trainer.fit()
print(result.metrics)
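
For concreteness, here is a minimal, self-contained sketch of the kind of weighted objective I have in mind. It assumes a squared-error loss and, purely for illustration, that the per-row weights are reachable through `dtrain.get_weight()`; whether XGBoostTrainer can populate that for each shard on Ray 2.7 is exactly what I don't know. `weighted_squared_error` and `FakeDMatrix` are hypothetical names, and `FakeDMatrix` is only a stand-in so the snippet runs without xgboost or Ray.

```python
from typing import Tuple

import numpy as np


class FakeDMatrix:
    """Hypothetical stand-in exposing the two xgb.DMatrix getters used below."""

    def __init__(self, label, weight) -> None:
        self._label = np.asarray(label, dtype=float)
        self._weight = np.asarray(weight, dtype=float)

    def get_label(self) -> np.ndarray:
        return self._label

    def get_weight(self) -> np.ndarray:
        return self._weight


def weighted_squared_error(predt: np.ndarray, dtrain) -> Tuple[np.ndarray, np.ndarray]:
    """Custom objective: squared-error gradient scaled by a per-row weight."""
    y = dtrain.get_label()
    # Assumption: the weights travel with the shard and line up row-for-row.
    grad_weights = dtrain.get_weight()
    grad = (predt - y) * grad_weights
    # The Hessian is left unweighted, as in my loss above.
    hess = np.ones_like(predt)
    return grad, hess


dtrain = FakeDMatrix(label=[1.0, 2.0, 3.0], weight=[0.5, 1.0, 2.0])
grad, hess = weighted_squared_error(np.array([2.0, 2.0, 2.0]), dtrain)
```

If the weights cannot travel with the shard like this, the loss would need some other way to tell which rows it is looking at.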

How I’d love this to work:

  • In the xgboost_ray project, there is a private method called _get_sharding_indices, which can be used to identify which rows are located on which actor.
  • In my loss function, I’d love to preload data per shard and then return the correct grad_weights depending on which actor is calling.
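
To make the idea concrete, here is a rough sketch of the lookup I'm imagining. It is my own illustration, not the xgboost_ray implementation: `shard_indices` is a hypothetical helper that assumes rows are dealt out to actors either round-robin ("interleaved") or in contiguous chunks ("batch"), and `weights_for_actor` is likewise hypothetical. I don't know whether XGBoostTrainer shards a Ray Dataset this way.

```python
from typing import List, Sequence


def shard_indices(rank: int, num_actors: int, n_rows: int,
                  mode: str = "interleaved") -> List[int]:
    """Return the row indices assumed to live on actor `rank` (hypothetical scheme)."""
    if not 0 <= rank < num_actors:
        raise ValueError("rank must be in [0, num_actors)")
    if mode == "interleaved":
        # Rows are dealt out round-robin: rank, rank + num_actors, ...
        return list(range(rank, n_rows, num_actors))
    if mode == "batch":
        # Contiguous chunks; the first `extra` actors get one additional row.
        base, extra = divmod(n_rows, num_actors)
        start = rank * base + min(rank, extra)
        stop = start + base + (1 if rank < extra else 0)
        return list(range(start, stop))
    raise ValueError(f"unknown mode: {mode!r}")


def weights_for_actor(all_weights: Sequence[float], rank: int, num_actors: int,
                      mode: str = "interleaved") -> List[float]:
    """Pick out the per-row weights for the rows held by actor `rank`."""
    idx = shard_indices(rank, num_actors, len(all_weights), mode)
    return [all_weights[i] for i in idx]


all_weights = [0.1, 0.2, 0.3, 0.4, 0.5]
shard0 = weights_for_actor(all_weights, rank=0, num_actors=2)
shard1 = weights_for_actor(all_weights, rank=1, num_actors=2)
```

Inside the loss, each actor would still need to know its own rank and the sharding scheme actually in use, which is the part I can't figure out how to get from XGBoostTrainer.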

Thank you so much in advance for any and all help!