Hello, we are running into a race condition when training xgboost_ray models. We found that when we submit multiple training jobs, the jobs can deadlock each other even though each job on its own could be satisfied by the cluster's resources. Say each job needs 10 actors with 10 CPUs per actor, and the cluster has 100 CPUs in total: we observed that two jobs might each create 5 actors and then wait forever for the remaining 5. This surprised us, because we expected Ray to create all actors for the first job and make the second job wait, instead of creating only half of each. Can someone confirm this behavior? Is there any way to resolve it? Thanks in advance.
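For context, each job's training call looks roughly like the sketch below. This is a simplified reconstruction using the standard xgboost_ray API (`RayParams`, `RayDMatrix`, `train`); the toy dataset just keeps the snippet self-contained, our real jobs load much larger data.

```python
import ray
from sklearn.datasets import load_breast_cancer
from xgboost_ray import RayDMatrix, RayParams, train

ray.init(address="auto")  # connect to the existing cluster

# Toy data so the snippet runs on its own; real jobs load larger datasets.
X, y = load_breast_cancer(return_X_y=True)
dtrain = RayDMatrix(X, y)

# Each job requests 10 actors x 10 CPUs per actor = 100 CPUs total.
ray_params = RayParams(num_actors=10, cpus_per_actor=10)

bst = train(
    {"objective": "binary:logistic", "eval_metric": "logloss"},
    dtrain,
    ray_params=ray_params,
)
```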
Can you try allocating slightly more? The Ray head node has some overhead, which is likely why you hit the deadlock. The Ray logs should show this; do they say that there aren't enough resources available on the cluster?
E.g., test with 120 CPUs for 10 jobs of 10 CPUs each.
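As a sanity check, you can also print what the scheduler actually sees versus what is currently free; `ray.cluster_resources()` and `ray.available_resources()` are standard Ray calls. If the totals are lower than what you provisioned, or available CPUs are stuck at a fraction of the total while jobs are pending, that points to the overhead/partial-allocation issue:

```python
import ray

ray.init(address="auto")

# Total CPUs the cluster advertises vs. CPUs currently schedulable.
# If two jobs each grabbed half their actors, "Available" will sit near 0
# while both jobs remain stuck waiting for more actors.
print("Total CPUs:    ", ray.cluster_resources().get("CPU"))
print("Available CPUs:", ray.available_resources().get("CPU"))
```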