The Ray cluster started by ray up cluster.yaml usually gets stuck while starting the worker nodes, with only the head node coming up. Occasionally the cluster does come up with all nodes started successfully, but only after it is restarted, either with ray down followed by ray up, or with ray up cluster.yaml --restart-only.
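For clarity, these are the two restart paths mentioned above that occasionally bring all nodes up (a sketch assuming the config file is named cluster.yaml):

# Full teardown, then a fresh start: sometimes starts both workers.
ray down cluster.yaml
ray up cluster.yaml

# Restart the Ray processes without re-running setup: also sometimes works.
ray up cluster.yaml --restart-only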
Below is the monitor.out log:
Loaded cached provider configuration
If you experience issues with the cloud provider, try re-running the command with --no-config-cache.
2023-05-05 18:16:16,262 INFO node_provider.py:54 -- ClusterState: Loaded cluster state: ['10.60.62.65', '172.16.30.136', '172.16.30.130']
Fetched IP: 172.16.30.130
Warning: Permanently added '172.16.30.130' (ECDSA) to the list of known hosts.
==> /tmp/ray/session_latest/logs/monitor.err <==
==> /tmp/ray/session_latest/logs/monitor.log <==
2023-05-05 18:15:10,981 INFO monitor.py:651 -- Starting monitor using ray installation: /usr/local/lib/python3.8/dist-packages/ray/__init__.py
2023-05-05 18:15:10,981 INFO monitor.py:652 -- Ray version: 2.3.0
2023-05-05 18:15:10,981 INFO monitor.py:653 -- Ray commit: cf7a56b4b0b648c324722df7c99c168e92ff0b45
2023-05-05 18:15:10,981 INFO monitor.py:654 -- Monitor started with command: ['/usr/local/lib/python3.8/dist-packages/ray/autoscaler/_private/monitor.py', '--logs-dir=/tmp/ray/session_2023-05-05_18-15-09_523603_127/logs', '--logging-rotate-bytes=536870912', '--logging-rotate-backup-count=5', '--gcs-address=172.16.30.130:1234', '--autoscaling-config=/root/ray_bootstrap_config.yaml', '--monitor-ip=172.16.30.130']
2023-05-05 18:15:10,984 INFO monitor.py:167 -- session_name: session_2023-05-05_18-15-09_523603_127
2023-05-05 18:15:10,985 INFO monitor.py:198 -- Starting autoscaler metrics server on port 44217
2023-05-05 18:15:10,986 INFO monitor.py:218 -- Monitor: Started
2023-05-05 18:15:11,007 INFO node_provider.py:53 -- ClusterState: Loaded cluster state: ['10.60.62.65', '172.16.30.136', '172.16.30.130']
2023-05-05 18:15:11,007 INFO autoscaler.py:276 -- disable_node_updaters:False
2023-05-05 18:15:11,007 INFO autoscaler.py:284 -- disable_launch_config_check:False
2023-05-05 18:15:11,007 INFO autoscaler.py:296 -- foreground_node_launch:False
2023-05-05 18:15:11,007 INFO autoscaler.py:306 -- worker_liveness_check:True
2023-05-05 18:15:11,007 INFO autoscaler.py:314 -- worker_rpc_drain:True
2023-05-05 18:15:11,008 INFO autoscaler.py:364 -- StandardAutoscaler: {'cluster_name': 'default', 'auth': {'ssh_user': 'car'}, 'upscaling_speed': 1.0, 'idle_timeout_minutes': 5, 'docker': {'image': 'test', 'container_name': 'ray_container', 'pull_before_run': False, 'run_options': ['--ulimit nofile=65536:65536', '--shm-size=11gb', '--device=/dev/dri:/dev/dri', '--env="DISPLAY"', '--expose 22', '--expose 8265', '--cap-add SYS_PTRACE', '--env-file=$HOME/.env', "$(nvidia-smi >> null && echo --gpus all || echo '')", '--volume=$HOME/.ssh/:/root/.ssh', '--volume=$SHARED_VOLUME:$SHARED_VOLUME', '--volume=/tmp/ray_logs:/tmp', '--volume=$HOME/car_logs:/logs']}, 'initialization_commands': [], 'setup_commands': [], 'head_setup_commands': [], 'worker_setup_commands': [], 'head_start_ray_commands': ['ray stop', 'ulimit -c unlimited && ray start --head --dashboard-host=0.0.0.0 --port=1234 --autoscaling-config=~/ray_bootstrap_config.yaml', "ln -sfn $(readlink -f /tmp/ray/session_latest | cut -d'/' -f4-) /tmp/ray/session_latest", '/prometheus-2.42.0.linux-amd64/prometheus --config.file=/tmp/ray/session_latest/metrics/prometheus/prometheus.yml &', 'grafana-server -homepath /usr/share/grafana --config /tmp/ray/session_latest/metrics/grafana/grafana.ini web &'], 'worker_start_ray_commands': ['ray stop', 'ray start --address=$RAY_HEAD_IP:1234 --resources=\'{"work\'${CAR}${NODEID}\'":\'${WORKRES}\',"det\'${CAR}${NODEID}\'":\'${DETRES}\',"feat\'${CAR}${NODEID}\'":\'${FEATRES}\'}\'', "ln -sfn $(readlink -f /tmp/ray/session_latest | cut -d'/' -f4-) /tmp/ray/session_latest"], 'file_mounts': {}, 'cluster_synced_files': [], 'file_mounts_sync_continuously': False, 'rsync_exclude': ['**/.git', '**/.git/**'], 'rsync_filter': ['.gitignore'], 'provider': {'type': 'local', 'head_ip': '172.16.30.130', 'worker_ips': ['10.60.62.65', '172.16.30.136']}, 'max_workers': 2, 'available_node_types': {'local.cluster.node': {'node_config': {}, 'resources': {}, 'min_workers': 2, 'max_workers': 2}}, 'head_node_type': 'local.cluster.node', 'no_restart': False}
2023-05-05 18:15:11,009 INFO monitor.py:388 -- Autoscaler has not yet received load metrics. Waiting.
2023-05-05 18:15:16,030 INFO autoscaler.py:143 -- The autoscaler took 0.001 seconds to fetch the list of non-terminated nodes.
2023-05-05 18:15:16,031 INFO autoscaler.py:419 --
======== Autoscaler status: 2023-05-05 18:15:16.031852 ========
Node status
---------------------------------------------------------------
Healthy:
1 local.cluster.node
Pending:
10.60.62.65: local.cluster.node, setting-up
172.16.30.136: local.cluster.node, setting-up
Recent failures:
(no failures)
Resources
---------------------------------------------------------------
Usage:
0.0/8.0 CPU
0.0/1.0 GPU
0.0/1.0 accelerator_type:GTX
0.00/9.160 GiB memory
0.00/4.580 GiB object_store_memory
Demands:
(no resource demands)
2023-05-05 18:15:16,033 INFO autoscaler.py:586 -- StandardAutoscaler: Terminating the node with id 10.60.62.65 and ip 10.60.62.65. (outdated)
2023-05-05 18:15:16,034 INFO autoscaler.py:586 -- StandardAutoscaler: Terminating the node with id 172.16.30.136 and ip 172.16.30.136. (outdated)
2023-05-05 18:15:16,034 INFO node_provider.py:172 -- NodeProvider: 10.60.62.65: Terminating node
2023-05-05 18:15:16,034 INFO node_provider.py:114 -- ClusterState: Writing cluster state: ['10.60.62.65', '172.16.30.136', '172.16.30.130']
2023-05-05 18:15:16,034 INFO node_provider.py:172 -- NodeProvider: 172.16.30.136: Terminating node
2023-05-05 18:15:16,037 INFO node_provider.py:114 -- ClusterState: Writing cluster state: ['10.60.62.65', '172.16.30.136', '172.16.30.130']
2023-05-05 18:15:16,038 INFO autoscaler.py:1366 -- StandardAutoscaler: Queue 2 new nodes for launch
2023-05-05 18:15:16,038 INFO autoscaler.py:462 -- The autoscaler took 0.008 seconds to complete the update iteration.
2023-05-05 18:15:16,038 INFO node_launcher.py:166 -- NodeLauncher0: Got 2 nodes to launch.
2023-05-05 18:15:16,039 INFO monitor.py:428 -- :event_summary:Resized to 8 CPUs, 1 GPUs.
2023-05-05 18:15:16,039 INFO monitor.py:428 -- :event_summary:Removing 2 nodes of type local.cluster.node (outdated).
2023-05-05 18:15:16,090 INFO node_provider.py:114 -- ClusterState: Writing cluster state: ['10.60.62.65', '172.16.30.136', '172.16.30.130']
2023-05-05 18:15:16,093 INFO node_provider.py:114 -- ClusterState: Writing cluster state: ['10.60.62.65', '172.16.30.136', '172.16.30.130']
2023-05-05 18:15:16,094 INFO node_launcher.py:166 -- NodeLauncher0: Launching 2 nodes, type local.cluster.node.
2023-05-05 18:15:21,061 INFO autoscaler.py:143 -- The autoscaler took 0.001 seconds to fetch the list of non-terminated nodes.
2023-05-05 18:15:21,062 INFO autoscaler.py:419 --
======== Autoscaler status: 2023-05-05 18:15:21.062314 ========
Node status
---------------------------------------------------------------
Healthy:
1 local.cluster.node
Pending:
10.60.62.65: local.cluster.node, uninitialized
172.16.30.136: local.cluster.node, uninitialized
Recent failures:
(no failures)
Resources
---------------------------------------------------------------
Usage:
0.0/8.0 CPU
0.0/1.0 GPU
0.0/1.0 accelerator_type:GTX
0.00/9.160 GiB memory
0.00/4.580 GiB object_store_memory
Demands:
(no resource demands)
2023-05-05 18:15:21,065 INFO autoscaler.py:1314 -- Creating new (spawn_updater) updater thread for node 10.60.62.65.
2023-05-05 18:15:21,066 INFO autoscaler.py:1314 -- Creating new (spawn_updater) updater thread for node 172.16.30.136.
==> /tmp/ray/session_latest/logs/monitor.out <==
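Since the worker start command in the StandardAutoscaler config line above is heavily escaped, here is the same worker_start_ray_commands sequence written out as plain shell for readability. RAY_HEAD_IP is injected by the autoscaler when it runs these commands on a worker; CAR, NODEID, WORKRES, DETRES and FEATRES are assumed to come from each worker's environment (e.g. via the --env-file=$HOME/.env passed to Docker), since they are not defined anywhere in this log.

# Stop any previously running Ray processes on the worker.
ray stop

# Join the head node and register per-node custom resources,
# e.g. {"work<CAR><NODEID>": <WORKRES>, "det<CAR><NODEID>": <DETRES>, "feat<CAR><NODEID>": <FEATRES>}.
ray start --address="$RAY_HEAD_IP:1234" \
  --resources='{"work'"${CAR}${NODEID}"'": '"${WORKRES}"', "det'"${CAR}${NODEID}"'": '"${DETRES}"', "feat'"${CAR}${NODEID}"'": '"${FEATRES}"'}'

# Re-point the session_latest symlink inside the container.
ln -sfn $(readlink -f /tmp/ray/session_latest | cut -d'/' -f4-) /tmp/ray/session_latest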