Does XGBoost-Ray support multi-output models (multiple y label columns)?

I’m getting a similar error.

config = {
        'max_depth': 6,
        'gamma': 5,
        'reg_alpha': 5,
        'reg_lambda': 1,
        'colsample_bytree': 0.7,
        'min_child_weight': 2.1,
        'eval_metric': 'rmse',
        'n_estimators': 180,
        'early_stopping_rounds': 30
}

class ParquetToRayDMatrixDataLoader:
    def load_data(self, path: str, label_prefix: str = DEFAULT_LABEL_PREFIX) -> RayDMatrix:
        logger.info(f"Loading data from parquet to RayDMatrix: {path}")
        # This is required since we need to give label column name to RayDMatrix
        ds = ...  # right-hand side elided in the post
        label_cols = self.get_labels_by_prefix(columns=ds.schema().names, label_prefix=label_prefix)
        logger.info(f"Found label columns: {label_cols}, type of label_cols: {type(label_cols)}, length: {len(label_cols)}")
        return RayDMatrix(path, label=label_cols, filetype=RayFileType.PARQUET)

    @staticmethod
    def get_labels_by_prefix(columns: list, label_prefix: str) -> list:
        return [col for col in columns if col.startswith(label_prefix)]

data = ParquetToRayDMatrixDataLoader().load_data("my_gcp_path", label_prefix='residual')

result = {}
model = train(params=config, dtrain=data, early_stopping_rounds=int(config['early_stopping_rounds']),
                                   num_boost_round=int(config['n_estimators']), evals=[(data, 'train')],
                                   evals_result=result, verbose_eval=True, ray_params=RayParams(num_actors=2, cpus_per_actor=1))

And I’m getting:

INFO:ray_job_submitter.data_loaders.data_loader:Found label columns: ['residual'], type of label_cols: <class 'list'>, length: 1
IndexError: positional indexers are out-of-bounds

Right now I only have one label column, so label_cols is just ['residual']. The error goes away if I hard-code the label as a single string instead of a list. Does that mean XGBoost-Ray doesn't support multi-output models?
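The workaround mentioned above (passing a string rather than a list) can be sketched as a small helper. This is only an illustration: `normalize_label` is a hypothetical name, not part of xgboost_ray, and it only covers the single-label case.

```python
from typing import List


def normalize_label(label_cols: List[str]) -> str:
    """Return the single label column name as a plain string.

    Hypothetical helper, not part of xgboost_ray. In the case described
    above, a string label worked while a one-element list failed.
    """
    if len(label_cols) != 1:
        raise ValueError(
            f"Expected exactly one label column, got {len(label_cols)}: {label_cols}"
        )
    return label_cols[0]


label = normalize_label(["residual"])
print(label)
```

The result would then be passed as `label=` in the `RayDMatrix(path, label=..., filetype=RayFileType.PARQUET)` call from the loader. With more than one label column the helper raises, since a list of labels is exactly the case that fails here.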

This seems related to: Xgboost_ray crashes when used for multiclass text classification - #5 by Y_C

After debugging, it looks like multi-output is not supported by design.

_split_dataframe expects the label to be a pandas Series rather than a DataFrame:

    def _split_dataframe(
        self, local_data: pd.DataFrame, data_source: Type[DataSource]
    ) -> Tuple[
        return (

And when the label data is loaded, a label given as a list is itself converted to a pandas Series, instead of being treated as a list of column names:

    @classmethod
    def get_column(
        cls, data: pd.DataFrame, column: Any
    ) -> Tuple[pd.Series, Optional[str]]:
        """Helper method wrapping around convert to series.

        This method should usually not be overwritten.
        """
        if isinstance(column, str):
            return data[column], column
        elif column is not None:
            return cls.convert_to_series(column), None
        return column, None
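To make the difference between the two branches concrete, here is a minimal pandas-only sketch. It does not call xgboost_ray. `get_column_sketch` is a hypothetical stand-in, and it assumes that `convert_to_series` behaves like `pd.Series(column)` when given a list. The data frame is made up for illustration.

```python
import pandas as pd

# Toy data standing in for one shard of the parquet file.
df = pd.DataFrame({"feature": [1.0, 2.0, 3.0], "residual": [0.1, 0.2, 0.3]})


def get_column_sketch(data, column):
    # Same branch structure as the get_column method quoted above.
    if isinstance(column, str):
        return data[column], column
    elif column is not None:
        # Assumption: convert_to_series(column) acts like pd.Series(column).
        return pd.Series(column), None
    return column, None


as_string, _ = get_column_sketch(df, "residual")
as_list, _ = get_column_sketch(df, ["residual"])

print(len(as_string))  # one label value per row
print(len(as_list))    # one entry per column *name*, not per row

# Selecting the shard's row positions from the short Series fails.
error_message = None
try:
    as_list.iloc[[0, 1, 2]]
except IndexError as exc:
    error_message = str(exc)
print(error_message)
```

With a string, the label Series has one value per row. With a list, the Series holds the column names themselves, so any row-wise positional indexing runs past its end. That is a plausible source of the IndexError reported above, though the exact call site inside xgboost_ray is not shown here.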

Multi-output support was added to XGBoost in version 1.6, about three months ago. We haven't yet had the chance to implement it in XGBoost-Ray.

It would be great if you could file an issue on the XGBoost-Ray repository. Also, if you'd be up for it, you could contribute the feature! Like many open source projects, Ray and XGBoost-Ray benefit hugely from community contributions.

We’re very happy to help you in the process. Let me know!

Yes, filed: "Add support for multi-output prediction", Issue #286 in ray-project/xgboost_ray on GitHub. Thanks, I'll try to look into it.