[Ray Train] XGBoostTrainer crashes with ActorDiedError when using num_workers > 1 and use_gpu=False

Severity of the issue: High

Environment:

  • Ray version: 2.44.1
  • Python version: 3.10.16
  • OS: WSL (Windows Subsystem for Linux)
  • Cloud/Infrastructure: none
  • Other libs/tools (if relevant):
    xgboost 3.0.0

Ray resources:

  • 0.0/28.0 CPU
  • 0.0/1.0 GPU
  • 0B/11.98GiB memory
  • 0B/5.13GiB object_store_memory

Issue:
I'm running into a frustrating issue when using XGBoostTrainer in Ray Train. I originally followed the official PyTorch training guide, "Get Started with Distributed Training using PyTorch — Ray 2.46.0".
When I used:

num_workers=1
use_gpu=True

it worked fine. But the moment I set:

use_gpu=False
num_workers=2  # or any value > 1

the whole thing dies with this error:
The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file.
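
For reference, the two settings above correspond roughly to these ScalingConfig calls (use_gpu defaults to False, so the failing case matches the full repro further down):

# Works:
ray.train.ScalingConfig(num_workers=1, use_gpu=True)

# Dies with the ActorDiedError above:
ray.train.ScalingConfig(num_workers=2, use_gpu=False)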
It fails both with my original PyTorch training code and with the minimal XGBoostTrainer repro below, which uses the Iris dataset and follows this doc: Get Started with Distributed Training using XGBoost — Ray 2.46.0.

import ray
import ray.train
from ray.train.xgboost import XGBoostTrainer, RayTrainReportCallback

from sklearn.datasets import load_iris

import xgboost

if ray.is_initialized():
    ray.shutdown()

ray.init()

iris_data = load_iris(as_frame=True)

ray_iris = ray.data.from_pandas(iris_data['frame'])
print(ray_iris.schema())  # quick sanity check on the dataset schema

training_split, eval_split = ray_iris.train_test_split(test_size=0.2)

def train_func():
    # 2. Load your data shard as an `xgboost.DMatrix`.

    # Get dataset shards for this worker
    train_shard = ray.train.get_dataset_shard("train")
    eval_shard = ray.train.get_dataset_shard("eval")

    # Convert shards to pandas DataFrames
    train_df = train_shard.materialize().to_pandas()
    eval_df = eval_shard.materialize().to_pandas()

    train_X = train_df.drop("target", axis=1)
    train_y = train_df["target"]
    eval_X = eval_df.drop("target", axis=1)
    eval_y = eval_df["target"]

    dtrain = xgboost.DMatrix(train_X, label=train_y)
    deval = xgboost.DMatrix(eval_X, label=eval_y)

    # 3. Define your xgboost model training parameters.
    params = {
        "tree_method": "approx",
        "objective": "reg:squarederror",
        "eta": 1e-4,
        "subsample": 0.5,
        "max_depth": 2,
    }

    # 4. Do distributed data-parallel training.
    # Ray Train sets up the necessary coordinator processes and
    # environment variables for your workers to communicate with each other.
    bst = xgboost.train(
        params,
        dtrain=dtrain,
        evals=[(deval, "validation")],
        num_boost_round=10,
        # Optional: Use the `RayTrainReportCallback` to save and report checkpoints.
        callbacks=[RayTrainReportCallback()],
    )

# 5. Configure scaling and resource requirements.
scaling_config = ray.train.ScalingConfig(num_workers=2, resources_per_worker={"CPU": 2})

# 6. Launch distributed training job.
trainer = XGBoostTrainer(
    train_func,
    scaling_config=scaling_config,
    datasets={"train": training_split, "eval": eval_split},
)
result = trainer.fit()
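# Note: with num_workers=2 and CPU-only training, the ActorDiedError quoted at
# the top of this report is raised here, during trainer.fit().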

I want to understand why this breaks with multiple workers on CPU, even on a dataset as small as Iris, and more importantly, how to make it work. If there is something obvious I'm missing about a CPU-only Ray setup, or some configuration needed to share the DMatrix across worker processes, please tell me.
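
In case it helps with diagnosing this, the only workaround direction I can think of (purely a guess on my part, not something the docs call for) is to make the CPU-only setup fully explicit and keep XGBoost's own threading within the per-worker CPU allocation, roughly like this:

# Guesswork, not a confirmed fix: pin use_gpu=False explicitly and give each
# worker a single CPU so two workers plus the driver fit comfortably within
# the 28 CPUs / ~12 GiB of memory shown above.
scaling_config = ray.train.ScalingConfig(
    num_workers=2,
    use_gpu=False,
    resources_per_worker={"CPU": 1},
)

# Inside train_func, cap XGBoost's thread count to match that allocation.
params = {
    "tree_method": "approx",
    "objective": "reg:squarederror",
    "eta": 1e-4,
    "subsample": 0.5,
    "max_depth": 2,
    "nthread": 1,  # guess: keep XGBoost threads within the 1 CPU per worker
}

Even if that happened to work, I'd still like to understand why the default configuration crashes in the first place.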

Appreciate any pointers.