Race condition in xgboost_ray training

Hello, we are running into a race condition when training xgboost_ray models. We found that when we submit multiple training jobs, the jobs can deadlock each other even though each job by itself could be satisfied by the cluster's resources. Say each job needs 10 actors, each actor needs 10 CPUs, and the cluster has 100 CPUs in total: we observed that two jobs might each create 5 actors and then wait forever for their remaining 5. This surprised us, because we expected Ray to create all the actors for the first job and make the second job wait, rather than giving each job only half of its actors. Can someone confirm this behavior? Is there any way to resolve it? Thanks in advance.
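
For context, each job is launched roughly like this (a minimal sketch with placeholder data and xgboost parameters, not our actual job code):

```python
import numpy as np
from xgboost_ray import RayDMatrix, RayParams, train

# Placeholder dataset; the real jobs load data from storage.
X = np.random.rand(100_000, 20)
y = np.random.randint(0, 2, size=100_000)
dtrain = RayDMatrix(X, y)

# Each job requests 10 actors x 10 CPUs = 100 CPUs in total.
ray_params = RayParams(num_actors=10, cpus_per_actor=10)

bst = train(
    {"objective": "binary:logistic", "eval_metric": "logloss"},
    dtrain,
    ray_params=ray_params,
)
```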

Can you try allocating slightly more CPU? The Ray head node has some overhead, which is why you can hit this deadlock. The Ray logs should show this; do they report that there aren't enough resources available on the cluster?

E.g., test with 120 CPUs for a job that needs 10 actors of 10 CPUs each.
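
You can also check what the scheduler sees directly, for example with something like this (a sketch; it assumes you run it from a node that can connect to the existing cluster):

```python
import ray

# Connect to the running cluster rather than starting a new one.
ray.init(address="auto")

# Total resources Ray knows about vs. what is currently unallocated;
# available CPU drops as each job's actors are placed.
print(ray.cluster_resources())
print(ray.available_resources())
```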