Severity of the issue: High
Environment:
- Ray version: 2.44.1
- Python version: 3.10.16
- OS: WSL
- Cloud/Infrastructure: -
- Other libs/tools (if relevant): xgboost 3.0.0
Resources shown in `ray status`:
- 0.0/28.0 CPU
- 0.0/1.0 GPU
- 0B/11.98GiB memory
- 0B/5.13GiB object_store_memory
Issue:
I was following the exact code provided in the docs (Get Started with Distributed Training using PyTorch — Ray 2.46.0), where the example sets scaling_config = ray.train.ScalingConfig(num_workers=2, use_gpu=True). It only worked for me when the scaling config was set to:
num_workers=1
use_gpu=True
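In code form, the only variant that ran successfully (everything else unchanged from the docs example):

import ray.train

# Working case on my machine: a single GPU worker
scaling_config = ray.train.ScalingConfig(num_workers=1, use_gpu=True)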
However, I ran into problems when I tried to run on multiple CPU workers with the GPU turned off:
num_workers=4
use_gpu=False
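And the failing variant in code form, again with the rest of the script unchanged:

import ray.train

# Failing case: multiple CPU-only workers, no GPU
scaling_config = ray.train.ScalingConfig(num_workers=4, use_gpu=False)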
I got the error below repeatedly, regardless of which value I set for num_workers:
The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors
Thinking that the issue might be with PyTorch, I tried a simpler example, training XGBoost on the iris dataset from scikit-learn by following the doc Get Started with Distributed Training using XGBoost — Ray 2.46.0. The code:
import ray
import ray.train
from ray.train.xgboost import XGBoostTrainer, RayTrainReportCallback
from ray.data.preprocessors import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from sklearn.datasets import load_iris
import xgboost

if ray.is_initialized():
    ray.shutdown()
ray.init()

iris_data = load_iris(as_frame=True)
ray_iris = ray.data.from_pandas(iris_data['frame'])
training_split, eval_split = ray_iris.train_test_split(test_size=0.2)

def train_func():
    # 2. Load your data shard as an `xgboost.DMatrix`.
    # Get dataset shards for this worker
    train_shard = ray.train.get_dataset_shard("train")
    eval_shard = ray.train.get_dataset_shard("eval")

    # Convert shards to pandas DataFrames
    train_df = train_shard.materialize().to_pandas()
    eval_df = eval_shard.materialize().to_pandas()

    train_X = train_df.drop("target", axis=1)
    train_y = train_df["target"]
    eval_X = eval_df.drop("target", axis=1)
    eval_y = eval_df["target"]

    dtrain = xgboost.DMatrix(train_X, label=train_y)
    deval = xgboost.DMatrix(eval_X, label=eval_y)

    # 3. Define your xgboost model training parameters.
    params = {
        "tree_method": "approx",
        "objective": "reg:squarederror",
        "eta": 1e-4,
        "subsample": 0.5,
        "max_depth": 2,
    }

    # 4. Do distributed data-parallel training.
    # Ray Train sets up the necessary coordinator processes and
    # environment variables for your workers to communicate with each other.
    bst = xgboost.train(
        params,
        dtrain=dtrain,
        evals=[(deval, "validation")],
        num_boost_round=10,
        # Optional: Use the `RayTrainReportCallback` to save and report checkpoints.
        callbacks=[RayTrainReportCallback()],
    )

# 5. Configure scaling and resource requirements.
scaling_config = ray.train.ScalingConfig(num_workers=2, resources_per_worker={"CPU": 2})

# 6. Launch distributed training job.
trainer = XGBoostTrainer(
    train_func,
    scaling_config=scaling_config,
    datasets={"train": training_split, "eval": eval_split},
    # If running in a multi-node cluster, this is where you
    # should configure the run's persistent storage that is accessible
    # across all worker nodes.
    # run_config=ray.train.RunConfig(storage_path="s3://..."),
)
result = trainer.fit()
and got this same error:
...
The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
...
TrainingFailedError: The Ray Train run failed. Please inspect the previous error messages for a cause. After fixing the issue (assuming that the error is not caused by your own application logic, but rather an error such as OOM), you can restart the run from scratch or continue this run.
To continue this run, you can use: `trainer = XGBoostTrainer.restore("/home/linux_ubuntu/ray_results/XGBoostTrainer_2025-05-18_18-46-29")`.
To start a new run that will retry on training failures, set `train.RunConfig(failure_config=train.FailureConfig(max_failures))` in the Trainer's `run_config` with `max_failures > 0`, or `max_failures = -1` for unlimited retries.
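For reference, the retry configuration that this message points to would look roughly like the sketch below (based only on the error's own suggestion; I have not verified that it changes anything in my case):

import ray.train
from ray.train.xgboost import XGBoostTrainer

# Sketch: enable automatic retries on training failures, as the error message suggests.
run_config = ray.train.RunConfig(
    failure_config=ray.train.FailureConfig(max_failures=3),  # or -1 for unlimited retries
)
trainer = XGBoostTrainer(
    train_func,
    scaling_config=scaling_config,
    datasets={"train": training_split, "eval": eval_split},
    run_config=run_config,
)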
The XGBoost training didn't work even with num_workers=1.
I have tried everything from upgrading/downgrading the Python version, to trying the latest Ray version, to setting a maximum memory for Ray at initialization (roughly along the lines of the sketch below), etc., but nothing seems to work. I'm frustrated at this point because I've been trying to solve this issue for 6 days now…
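For completeness, the memory-capped initialization I experimented with was roughly of this shape (the exact numbers varied between attempts; num_cpus and object_store_memory are standard ray.init() arguments), followed by a quick resource sanity check:

import ray

# Roughly the shape of the capped init I tried; exact values varied between attempts.
ray.init(
    num_cpus=8,
    object_store_memory=2 * 1024**3,  # cap the object store at ~2 GiB
)

# Quick sanity check of what the cluster actually sees before calling trainer.fit()
print(ray.cluster_resources())
print(ray.available_resources())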