Severity of the issue: High
Environment:
- Ray version: 2.44.1
- Python version: 3.10.16
- OS: WSL
- Cloud/Infrastructure: -
- Other libs/tools (if relevant): xgboost 3.0.0
Resources shown in `ray status`:
- 0.0/28.0 CPU
- 0.0/1.0 GPU
- 0B/11.98GiB memory
- 0B/5.13GiB object_store_memory
Issue:
I was following the exact code provided in the docs (Get Started with Distributed Training using PyTorch — Ray 2.46.0), where the example sets scaling_config = ray.train.ScalingConfig(num_workers=2, use_gpu=True). It only worked for me when the scaling config was set to:
num_workers=1
use_gpu=True
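In code form, the only variant that ran successfully (everything else unchanged from the docs example):

import ray.train

# Working case on my machine: a single GPU worker
scaling_config = ray.train.ScalingConfig(num_workers=1, use_gpu=True)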
However, I ran into problems when I tried to run on multiple CPU workers with the GPU turned off:
num_workers=4
use_gpu=False
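And the failing variant in code form, again with the rest of the script unchanged:

import ray.train

# Failing case: multiple CPU-only workers, no GPU
scaling_config = ray.train.ScalingConfig(num_workers=4, use_gpu=False)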
I got the error below repeatedly, regardless of which value I set for num_workers:
The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors
Thinking that the issue might be with PyTorch, I tried a simpler example, training XGBoost on the iris dataset from scikit-learn by following the doc Get Started with Distributed Training using XGBoost — Ray 2.46.0. The code:
import ray
import ray.train
from ray.train.xgboost import XGBoostTrainer, RayTrainReportCallback
from ray.data.preprocessors import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from sklearn.datasets import load_iris
import xgboost

if ray.is_initialized():
    ray.shutdown()
ray.init()

iris_data = load_iris(as_frame=True)
ray_iris = ray.data.from_pandas(iris_data['frame'])
training_split, eval_split = ray_iris.train_test_split(test_size=0.2)

def train_func():
    # 2. Load your data shard as an `xgboost.DMatrix`.
    # Get dataset shards for this worker
    train_shard = ray.train.get_dataset_shard("train")
    eval_shard = ray.train.get_dataset_shard("eval")

    # Convert shards to pandas DataFrames
    train_df = train_shard.materialize().to_pandas()
    eval_df = eval_shard.materialize().to_pandas()

    train_X = train_df.drop("target", axis=1)
    train_y = train_df["target"]
    eval_X = eval_df.drop("target", axis=1)
    eval_y = eval_df["target"]

    dtrain = xgboost.DMatrix(train_X, label=train_y)
    deval = xgboost.DMatrix(eval_X, label=eval_y)

    # 3. Define your xgboost model training parameters.
    params = {
        "tree_method": "approx",
        "objective": "reg:squarederror",
        "eta": 1e-4,
        "subsample": 0.5,
        "max_depth": 2,
    }

    # 4. Do distributed data-parallel training.
    # Ray Train sets up the necessary coordinator processes and
    # environment variables for your workers to communicate with each other.
    bst = xgboost.train(
        params,
        dtrain=dtrain,
        evals=[(deval, "validation")],
        num_boost_round=10,
        # Optional: Use the `RayTrainReportCallback` to save and report checkpoints.
        callbacks=[RayTrainReportCallback()],
    )

# 5. Configure scaling and resource requirements.
scaling_config = ray.train.ScalingConfig(num_workers=2, resources_per_worker={"CPU": 2})

# 6. Launch distributed training job.
trainer = XGBoostTrainer(
    train_func,
    scaling_config=scaling_config,
    datasets={"train": training_split, "eval": eval_split},
    # If running in a multi-node cluster, this is where you
    # should configure the run's persistent storage that is accessible
    # across all worker nodes.
    # run_config=ray.train.RunConfig(storage_path="s3://..."),
)
result = trainer.fit()
and got this same error:
...
The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
...
TrainingFailedError: The Ray Train run failed. Please inspect the previous error messages for a cause. After fixing the issue (assuming that the error is not caused by your own application logic, but rather an error such as OOM), you can restart the run from scratch or continue this run.
To continue this run, you can use: `trainer = XGBoostTrainer.restore("/home/linux_ubuntu/ray_results/XGBoostTrainer_2025-05-18_18-46-29")`.
To start a new run that will retry on training failures, set `train.RunConfig(failure_config=train.FailureConfig(max_failures))` in the Trainer's `run_config` with `max_failures > 0`, or `max_failures = -1` for unlimited retries.
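For reference, the retry configuration that this message points to would look roughly like the sketch below (based only on the error's own suggestion; I have not verified that it changes anything in my case):

import ray.train
from ray.train.xgboost import XGBoostTrainer

# Sketch: enable automatic retries on training failures, as the error message suggests.
run_config = ray.train.RunConfig(
    failure_config=ray.train.FailureConfig(max_failures=3),  # or -1 for unlimited retries
)
trainer = XGBoostTrainer(
    train_func,
    scaling_config=scaling_config,
    datasets={"train": training_split, "eval": eval_split},
    run_config=run_config,
)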
The XGBoost training didn't work even with num_workers=1.
I have tried everything from upgrading/downgrading the Python version, to trying the latest Ray version, to setting a maximum memory for Ray at initialization (roughly along the lines of the sketch below), etc., but nothing seems to work. I'm frustrated at this point because I've been trying to solve this issue for 6 days now…
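For completeness, the memory-capped initialization I experimented with was roughly of this shape (the exact numbers varied between attempts; num_cpus and object_store_memory are standard ray.init() arguments), followed by a quick resource sanity check:

import ray

# Roughly the shape of the capped init I tried; exact values varied between attempts.
ray.init(
    num_cpus=8,
    object_store_memory=2 * 1024**3,  # cap the object store at ~2 GiB
)

# Quick sanity check of what the cluster actually sees before calling trainer.fit()
print(ray.cluster_resources())
print(ray.available_resources())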