XGBoost on Ray fails when training with 60 or more workers

Hello, we encountered a strange issue when training XGBoost on a Ray GPU cluster. The error occurs whenever we run with 60 or more workers, each worker using 4 GPUs; any smaller cluster trains without problems.
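For context, the training job looks roughly like the sketch below. This is a minimal reproduction written against the xgboost_ray API as an assumption about our setup; the synthetic dataset, hyperparameters, and cpus_per_actor value are placeholders rather than our real job. Only num_actors=60 and gpus_per_actor=4 match the failing configuration.

```python
# Minimal sketch of the failing setup (illustrative, not the actual job).
import numpy as np
from xgboost_ray import RayDMatrix, RayParams, train

# Synthetic data as a stand-in for the real dataset.
X = np.random.rand(100_000, 32)
y = (np.random.rand(100_000) > 0.5).astype(int)
dtrain = RayDMatrix(X, y)

ray_params = RayParams(
    num_actors=60,      # failures start at 60 or more workers
    gpus_per_actor=4,   # each worker uses 4 GPUs
    cpus_per_actor=4,   # assumed value, not from the original report
)

bst = train(
    {"objective": "binary:logistic", "tree_method": "gpu_hist"},
    dtrain,
    num_boost_round=100,
    ray_params=ray_params,
)
```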

The error message shown in the notebook is:
WARNING worker.py:2019 – A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker.
RayTask ID: ffffffffffffffff51e00cfba6650631154a8d4f01000000
Worker ID: c9e799d83ba5d15f5bdc6fe4d470771a8dc5aa1c8597a06e5cc4c157
Node ID: 442fae2da1339bea3fc6e8499e2b95041d9b26e7d48165ae2cdd1f3b
Worker IP address: xxx
Worker port: 10005
Worker PID: 1287
Worker exit type: SYSTEM_ERROR
Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.

The error message from the head node process is:
Some Ray subprocesses exited unexpectedly:

  • dashboard [exit code=255]
    Remaining processes will be killed.