[Ray Train] XGBoostTrainer crashes with ActorDiedError when using num_workers > 1 and use_gpu=False

Severity of the issue: High

Environment:

  • Ray version: 2.44.1
  • Python version: 3.10.16
  • OS: WSL (Windows Subsystem for Linux)
  • Cloud/Infrastructure: none
  • Other libs/tools (if relevant):
    xgboost 3.0.0

Ray resources:

  • 0.0/28.0 CPU
  • 0.0/1.0 GPU
  • 0B/11.98GiB memory
  • 0B/5.13GiB object_store_memory

Issue:
I'm running into a frustrating issue when using XGBoostTrainer in Ray Train. I originally followed the official PyTorch training guide, "Get Started with Distributed Training using PyTorch — Ray 2.46.0".
When I used:

num_workers=1
use_gpu=True

it worked fine. But the moment I set:

use_gpu=False
num_workers=2  # or any value > 1

the whole thing dies with this error:
The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file.
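
For reference, the two settings above correspond roughly to these ScalingConfig calls (use_gpu defaults to False, so the failing case matches the full repro further down):

# Works:
ray.train.ScalingConfig(num_workers=1, use_gpu=True)

# Dies with the ActorDiedError above:
ray.train.ScalingConfig(num_workers=2, use_gpu=False)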
It fails both with my original PyTorch training code and with the minimal XGBoostTrainer repro below, which uses the Iris dataset and follows this doc: Get Started with Distributed Training using XGBoost — Ray 2.46.0.

import ray
import ray.train
from ray.train.xgboost import XGBoostTrainer, RayTrainReportCallback

from sklearn.datasets import load_iris

import xgboost

if ray.is_initialized():
    ray.shutdown()

ray.init()

iris_data = load_iris(as_frame=True)

ray_iris = ray.data.from_pandas(iris_data['frame'])
print(ray_iris.schema())  # quick sanity check on the dataset schema

training_split, eval_split = ray_iris.train_test_split(test_size=0.2)

def train_func():
    # 2. Load your data shard as an `xgboost.DMatrix`.

    # Get dataset shards for this worker
    train_shard = ray.train.get_dataset_shard("train")
    eval_shard = ray.train.get_dataset_shard("eval")

    # Convert shards to pandas DataFrames
    train_df = train_shard.materialize().to_pandas()
    eval_df = eval_shard.materialize().to_pandas()

    train_X = train_df.drop("target", axis=1)
    train_y = train_df["target"]
    eval_X = eval_df.drop("target", axis=1)
    eval_y = eval_df["target"]

    dtrain = xgboost.DMatrix(train_X, label=train_y)
    deval = xgboost.DMatrix(eval_X, label=eval_y)

    # 3. Define your xgboost model training parameters.
    params = {
        "tree_method": "approx",
        "objective": "reg:squarederror",
        "eta": 1e-4,
        "subsample": 0.5,
        "max_depth": 2,
    }

    # 4. Do distributed data-parallel training.
    # Ray Train sets up the necessary coordinator processes and
    # environment variables for your workers to communicate with each other.
    bst = xgboost.train(
        params,
        dtrain=dtrain,
        evals=[(deval, "validation")],
        num_boost_round=10,
        # Optional: Use the `RayTrainReportCallback` to save and report checkpoints.
        callbacks=[RayTrainReportCallback()],
    )

# 5. Configure scaling and resource requirements.
scaling_config = ray.train.ScalingConfig(num_workers=2, resources_per_worker={"CPU": 2})

# 6. Launch distributed training job.
trainer = XGBoostTrainer(
    train_func,
    scaling_config=scaling_config,
    datasets={"train": training_split, "eval": eval_split},
)
result = trainer.fit()
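# Note: with num_workers=2 and CPU-only training, the ActorDiedError quoted at
# the top of this report is raised here, during trainer.fit().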

I want to understand why this breaks with multiple workers on CPU, even on a dataset as small as Iris, and more importantly, how to make it work. If there is something obvious I'm missing about a CPU-only Ray setup, or some configuration needed to share the DMatrix across worker processes, please tell me.
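
In case it helps with diagnosing this, the only workaround direction I can think of (purely a guess on my part, not something the docs call for) is to make the CPU-only setup fully explicit and keep XGBoost's own threading within the per-worker CPU allocation, roughly like this:

# Guesswork, not a confirmed fix: pin use_gpu=False explicitly and give each
# worker a single CPU so two workers plus the driver fit comfortably within
# the 28 CPUs / ~12 GiB of memory shown above.
scaling_config = ray.train.ScalingConfig(
    num_workers=2,
    use_gpu=False,
    resources_per_worker={"CPU": 1},
)

# Inside train_func, cap XGBoost's thread count to match that allocation.
params = {
    "tree_method": "approx",
    "objective": "reg:squarederror",
    "eta": 1e-4,
    "subsample": 0.5,
    "max_depth": 2,
    "nthread": 1,  # guess: keep XGBoost threads within the 1 CPU per worker
}

Even if that happened to work, I'd still like to understand why the default configuration crashes in the first place.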

Appreciate any pointers.