Medium: It makes completing my task significantly harder, but I can work around it.
About 30% of the time (anecdotally), when I launch a new Ray cluster on AWS, the head node gets stuck in the ray-node-status "setting-up" stage and the "ray up" command just hangs (I've waited an hour, when setup usually takes about 5 minutes). My workaround is to kill the cluster and try again, which usually works. I am having a hard time finding more logs that could point to an error.
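For reference, my kill-and-retry workaround is essentially the following sketch. It assumes GNU coreutils "timeout" (on macOS: install coreutils and use gtimeout); cluster.yaml, the retry count, and the 15-minute budget are placeholders for my actual values.

```shell
# Retry "ray up", tearing the cluster down between attempts.
# Assumes GNU coreutils "timeout" (gtimeout on macOS with coreutils).
ray_up_with_retries() {
  local config="$1" max_tries="${2:-3}" i
  for i in $(seq 1 "$max_tries"); do
    # A healthy launch takes ~5 minutes, so treat >15 minutes as hung.
    if timeout 900 ray up -y "$config"; then
      echo "cluster up after $i attempt(s)"
      return 0
    fi
    echo "attempt $i hung or failed; tearing down and retrying" >&2
    ray down -y "$config"
  done
  return 1
}
```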
- I am using ray 2.3.1, launching from macOS 12, on an AWS Ubuntu 18 image (ami-0dd6adfad4ad37eec). I am using Docker with the base image rayproject/ray:2.3.1-py38-cpu.
- I run the "ray up" command and get no error message, but the output hangs right after printing "Local node IP: XXX.XX.XX.XXX". I looked at the ray up script code, and very little happens between printing the local node IP and the end of the script, which doesn't give me much to look into.
- I compared the logs between a successful launch and a launch that hangs and couldn’t discern any important differences.
- When the process hangs, pressing Ctrl-C does nothing on my Mac. But if I manually terminate the AWS instance via the console UI, the hung command prints "Shared connection to XX.XXX.XX.XXX closed." followed by:
New status: update-failed
!!!
SSH command failed.
!!!
Failed to setup head node.
- I do, however, see big differences in the Ray-related processes running on the head node. On a hanging machine I see the following processes:
bash --login -c -i true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (docker exec -it blue-whale-ray-cpu /bin/bash -c 'bash --login -c -i '"'"'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (export RAY_USAGE_STATS_ENABLED=1;export RAY_OV
docker exec -it blue-whale-ray-cpu /bin/bash -c bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (export RAY_USAGE_STATS_ENABLED=1;export RAY_OVERRIDE_RESOURCES='"'"'{"CPU":2}'"'"';ray start --head --port=6379 --object-manager-port=8076 --autoscali
bash --login -c -i true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (export RAY_USAGE_STATS_ENABLED=1;export RAY_OVERRIDE_RESOURCES='{"CPU":2}';ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=/home/ray/ray_bootstrap_config.yaml)
bash --login -c -i true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (export RAY_USAGE_STATS_ENABLED=1;export RAY_OVERRIDE_RESOURCES='{"CPU":2}';ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=/home/ray/ray_bootstrap_config.yaml)
/home/ray/anaconda3/bin/python /home/ray/anaconda3/bin/ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=/home/ray/ray_bootstrap_config.yaml
/home/ray/anaconda3/lib/python3.8/site-packages/ray/core/src/ray/gcs/gcs_server --log_dir=/tmp/ray/session_2023-04-25_11-35-04_569925_175/logs --config_list=XXXX
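For reference, this is roughly how I pulled the session logs off the head node for the log comparison mentioned above; cluster.yaml and the destination directory are placeholders for my actual values.

```shell
# Copy Ray's session logs off the head node so a good launch and a hung
# launch can be diffed locally. /tmp/ray/session_latest/logs is Ray's
# default log directory; cluster.yaml and the destination are placeholders.
fetch_head_logs() {
  local config="$1" dest="${2:-./head-logs}"
  mkdir -p "$dest"
  ray rsync-down "$config" /tmp/ray/session_latest/logs/ "$dest"
}
```

"ray monitor cluster.yaml" also tails the autoscaler output live, but during the hang it showed me nothing beyond what is pasted above.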
My cluster config is here: