XGBoostTrainer -- Distributed Weights Not Working?

When trying to use the weights for XGBoostTrainer, like:

from ray.train import ScalingConfig
from ray.train.xgboost import XGBoostTrainer

train_weights_ds = train_set.select_columns(['weight'])

trainer = XGBoostTrainer(
    scaling_config=ScalingConfig(
        num_workers=16,
        use_gpu=True,
    ),
    early_stopping_rounds=10,
    dmatrix_params={"train": {'weight': train_weights_ds}, },
    ...
)

I am met with data size mismatches, suggesting that the weights are not being sharded in line with the data shards sent to each worker. Is it possible to attach the matching slice of weights to each worker?

Check failed: weights_.Size() == num_row_ (92711999 vs. 3862999) : Size of weights must equal to number of rows.

Can you use the weight column name instead?

-    dmatrix_params={"train": {'weight': train_weights_ds}, },
+    dmatrix_params={"train": {'weight': 'weight'}, },
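
For reference, the full call would look roughly like this (a sketch, assuming 'weight' is kept as a column of the train_set dataset passed to the trainer via datasets, rather than selected out into its own dataset):

trainer = XGBoostTrainer(
    scaling_config=ScalingConfig(
        num_workers=16,
        use_gpu=True,
    ),
    datasets={"train": train_set},  # 'weight' stays a column of train_set
    early_stopping_rounds=10,
    # The column name is resolved per shard, so each worker's DMatrix
    # gets weights that match its own row count.
    dmatrix_params={"train": {"weight": "weight"}},
    ...
)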

Works great, thanks!

Hi Matt, I tested this and got a message saying that dmatrix_params is deprecated, suggesting I use dataset_config instead to customize Ray Dataset ingestion. Can you advise how to use dataset_config to assign the weight column?

Hey @Will1, thanks for pointing this out. It looks like this was accidentally removed.

I’ve put up a PR that will add this back.

A workaround for current Ray versions would be to use the V2 API, which lets you customize your XGBoost training code more flexibly!

Hi @matthewdeng, thanks for your response and for putting up the PR for a future release. Do you mind giving me an example of how to use the V2 API to correctly set the observation weight column?

Hey @Will1, following up on this: it turns out my PR actually would not solve the problem, but the V2 API would.

The way to do so is to update these lines to pass in the weight parameter where the DMatrix is constructed; at that point, the weights should be a column in train_df or eval_df. A rough sketch is below.
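
Untested sketch of what that per-worker training function could look like. The "target" and "weight" column names, the params, and num_boost_round are placeholders; train_set is the dataset from your original snippet:

import xgboost

import ray.train
from ray.train import ScalingConfig
from ray.train.xgboost import RayTrainReportCallback, XGBoostTrainer


def train_fn_per_worker(config: dict):
    # Each worker receives only its own shard of the "train" dataset.
    train_ds = ray.train.get_dataset_shard("train")
    train_df = train_ds.materialize().to_pandas()

    # Pull the label and weight columns out of this shard's DataFrame.
    # "target" and "weight" are placeholder column names.
    train_y = train_df.pop("target")
    train_w = train_df.pop("weight")

    # Pass the per-shard weights directly to the DMatrix, so the number
    # of weights always matches the number of rows on this worker.
    dtrain = xgboost.DMatrix(train_df, label=train_y, weight=train_w)

    xgboost.train(
        config["params"],
        dtrain=dtrain,
        num_boost_round=config["num_boost_round"],
        # Report metrics and checkpoints back to Ray Train.
        callbacks=[RayTrainReportCallback()],
    )


trainer = XGBoostTrainer(
    train_fn_per_worker,
    train_loop_config={
        # "device": "cuda" assumes XGBoost >= 2.0 for GPU training.
        "params": {"objective": "binary:logistic", "device": "cuda"},
        "num_boost_round": 100,
    },
    scaling_config=ScalingConfig(num_workers=16, use_gpu=True),
    datasets={"train": train_set},  # "weight" stays a column of train_set
)
result = trainer.fit()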

Thanks for the follow-up. This is so helpful!