How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
Hi Ray community!
Background:
- I am training an XGBoost model using XGBoostTrainer. I’m fairly familiar with XGBoost (including the xgboost_ray library), less so with XGBoostTrainer, and very new to Ray Datasets.
- In my loss function, I need access to a “per-row” weight.
- Since I pass in data as a Ray Dataset, I need a way to get the indices of the rows that end up in each actor’s shard of the Ray Dataset.
- I’m on Ray 2.7, and it’d be quite painful to upgrade…
Code Sample:
from typing import Tuple

import numpy as np
import ray
import xgboost as xgb
from ray.train import ScalingConfig
from ray.train.xgboost import XGBoostTrainer
ray_ds = ray.data.from_pandas_refs(objectstore_refs)
ray_ds = ray_ds.select_columns(XCOLS + [wcol] + ['y'])
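def loss(predt: np.ndarray,
         dtrain: xgb.DMatrix) -> Tuple[np.ndarray, np.ndarray]:
    # Unweighted gradient and Hessian; squared error is assumed here to match the objective in params.
    grad = predt - dtrain.get_label()
    hess = np.ones_like(predt)
    # grad_weights: per-row weights for the rows in this actor's shard (how to get these is the question).
    grad = grad * grad_weights
    return grad, hess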
trainer = XGBoostTrainer(
    scaling_config=ScalingConfig(
        num_workers=2,
        use_gpu=False,
        resources_per_worker={"CPU": 8},
    ),
    label_column="y",
    num_boost_round=20,
    params={
        "objective": "reg:squarederror",
        "eval_metric": ["rmse"],
    },
    datasets={"train": ray_ds},
    obj=loss,
)
result = trainer.fit()
print(result.metrics)
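For reference, here is a minimal sketch of a per-row weighted objective, assuming the weights travel with the data and can be read back through `get_weight()` (a real method on `xgb.DMatrix`). `FakeDMatrix` is a hypothetical stand-in so the snippet runs without xgboost, and the squared-error form is an assumption chosen to match `reg:squarederror`. Whether XGBoostTrainer on Ray 2.7 forwards a weight column to each worker's matrix is not confirmed here and would need checking.

```python
from typing import Optional, Tuple

import numpy as np


def weighted_squared_error(predt: np.ndarray, labels: np.ndarray,
                           weights: np.ndarray) -> Tuple[np.ndarray, np.ndarray]:
    """Gradient and Hessian of 0.5 * w_i * (predt_i - y_i)^2 with respect to predt_i."""
    grad = weights * (predt - labels)
    hess = weights * np.ones_like(predt)
    return grad, hess


def loss(predt: np.ndarray, dtrain) -> Tuple[np.ndarray, np.ndarray]:
    """Custom objective with the (predt, dtrain) signature that xgboost expects."""
    labels = dtrain.get_label()
    weights = dtrain.get_weight()
    # Assumption: a missing or empty weight array means "no weights attached".
    if weights is None or len(weights) == 0:
        weights = np.ones_like(predt)
    return weighted_squared_error(predt, labels, weights)


class FakeDMatrix:
    """Hypothetical stand-in exposing only the two DMatrix getters used above."""

    def __init__(self, label: np.ndarray, weight: Optional[np.ndarray] = None):
        self._label = label
        self._weight = weight

    def get_label(self) -> np.ndarray:
        return self._label

    def get_weight(self) -> Optional[np.ndarray]:
        return self._weight


shard = FakeDMatrix(label=np.array([1.0, 2.0, 3.0]),
                    weight=np.array([1.0, 0.5, 2.0]))
grad, hess = loss(np.array([2.0, 2.0, 1.0]), shard)
print(grad, hess)
```

If the weights ride along with the rows like this, the loss does not need to know which actor is calling or which indices that actor holds. Note that this sketch scales both the gradient and the Hessian, whereas the code sample above scales only the gradient.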
How I’d love this to work:
- In the xgboost_ray project, there is a private method called _get_sharding_indices, which can be used to identify which data is located on which actor.
- In my loss function, I’d love to pre-load data per shard, and then return the correct grad_weights as a function of which actor is calling.
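The idea in the two bullets above could be sketched roughly as follows. Everything here is illustrative: `shard_indices` is a hypothetical helper that only mimics the general idea of splitting row indices across actors (it is not the private xgboost_ray method), `get_rank` is a placeholder callable for "which actor is calling", and `LabelShard` is a stand-in for the per-actor DMatrix. The sketch also assumes rows are assigned to actors by index, which may not hold when the input is a Ray Dataset.

```python
from typing import Callable, Tuple

import numpy as np


def shard_indices(rank: int, num_actors: int, n_rows: int,
                  interleaved: bool = True) -> np.ndarray:
    """Hypothetical helper: row indices assigned to the actor with this rank."""
    if interleaved:
        # Rows rank, rank + num_actors, rank + 2 * num_actors, ...
        return np.arange(rank, n_rows, num_actors)
    # Contiguous blocks of rows, one block per actor.
    return np.array_split(np.arange(n_rows), num_actors)[rank]


def make_loss(all_weights: np.ndarray, num_actors: int,
              get_rank: Callable[[], int]):
    """Pre-compute weights per shard and return an objective that looks them up by rank."""
    n_rows = len(all_weights)
    weights_by_rank = {
        rank: all_weights[shard_indices(rank, num_actors, n_rows)]
        for rank in range(num_actors)
    }

    def loss(predt: np.ndarray, dtrain) -> Tuple[np.ndarray, np.ndarray]:
        grad_weights = weights_by_rank[get_rank()]
        # Squared-error gradient (an assumption), scaled per row as in the code sample.
        grad = (predt - dtrain.get_label()) * grad_weights
        hess = np.ones_like(predt)
        return grad, hess

    return loss


class LabelShard:
    """Hypothetical stand-in for the DMatrix holding one actor's rows."""

    def __init__(self, label: np.ndarray):
        self._label = label

    def get_label(self) -> np.ndarray:
        return self._label


weights = np.arange(6, dtype=float)
loss_for_rank_1 = make_loss(weights, num_actors=2, get_rank=lambda: 1)
grad, hess = loss_for_rank_1(np.ones(3), LabelShard(np.zeros(3)))
print(grad, hess)
```

In a real run, `get_rank` would have to come from the training session. xgboost_ray documents a `get_actor_rank()` function in `xgboost_ray.session`, but whether it can be called from inside a custom objective under XGBoostTrainer is an assumption that would need to be verified.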
Thank you so much in advance for any and all help!