How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
We are trying to use the Ray autoscaler to start a Ray cluster on an on-prem cluster. We are facing the following two issues with different Ray versions.
We are using the following autoscaler YAML file:
autoscaler.yaml
auth:
  ssh_user: root
cluster_name: default
cluster_synced_files: []
file_mounts: {}
file_mounts_sync_continuously: false
head_setup_commands: []
head_start_ray_commands:
- ray stop
- ulimit -c unlimited && RAY_DASHBOARD_AGENT_CHECK_PARENT_INTERVAL_S=10000000000 ray start --head --port=6379 --num-cpus=10
idle_timeout_minutes: 10000
initialization_commands: []
max_workers: 19
min_workers: 19
provider:
  head_ip: 192.168.123.1
  type: local
  worker_ips:
  - 192.168.123.2
  - 192.168.123.3
  - 192.168.123.4
  - 192.168.123.5
  - 192.168.123.6
  - 192.168.123.7
  - 192.168.123.8
  - 192.168.123.9
  - 192.168.123.10
  - 192.168.123.11
  - 192.168.123.12
  - 192.168.123.13
  - 192.168.123.14
  - 192.168.123.15
  - 192.168.123.16
  - 192.168.123.17
  - 192.168.123.18
  - 192.168.123.19
  - 192.168.123.20
rsync_exclude:
- '**/.git'
- '**/.git/**'
rsync_filter:
- .gitignore
setup_commands: []
upscaling_speed: 1.0
worker_setup_commands: []
worker_start_ray_commands:
- ray stop
- RAY_DASHBOARD_AGENT_CHECK_PARENT_INTERVAL_S=10000000000 ray start --address=$RAY_HEAD_IP:6379
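For reference, we bring the cluster up with the standard cluster launcher commands (the config filename below is ours; adjust paths to your setup):

# Start the head node; the autoscaler then provisions the workers over SSH.
ray up -y autoscaler.yaml
# On the head node, inspect the autoscaler's view of the cluster.
ray status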
With ray==2.2.0
Ray starts on the head node, but the autoscaler then gets stuck and does not start Ray on any of the worker nodes.
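To confirm the stuck state, the autoscaler status can be polled on the head node; in our case the cluster never grows past the head's 10 CPUs and all 19 workers stay pending (the polling loop is only for illustration):

# Watch the autoscaler status from the head node every 30 seconds.
while true; do ray status; sleep 30; done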
This is the monitor.log file:
monitor.log
2023-07-02 23:34:53,273 INFO monitor.py:605 -- Starting monitor using ray installation: /usr/local/lib/python3.8/dist-packages/ray/__init__.py
2023-07-02 23:34:53,273 INFO monitor.py:606 -- Ray version: 2.2.0
2023-07-02 23:34:53,273 INFO monitor.py:607 -- Ray commit: b6af0887ee5f2e460202133791ad941a41f15beb
2023-07-02 23:34:53,274 INFO monitor.py:608 -- Monitor started with command: ['/usr/local/lib/python3.8/dist-packages/ray/autoscaler/_private/monitor.py', '--logs-dir=/tmp/ray/session_2023-07-02_23-34-51_852257_34327/logs', '--logging-rotate-bytes=536870912', '--logging-rotate-backup-count=5', '--gcs-address=172.29.216.17:6379', '--autoscaling-config=~/ray_bootstrap_config.yaml', '--monitor-ip=172.29.216.17']
2023-07-02 23:34:53,278 INFO monitor.py:196 -- Starting autoscaler metrics server on port 44217
2023-07-02 23:34:53,279 INFO monitor.py:216 -- Monitor: Started
2023-07-02 23:34:53,287 INFO node_provider.py:53 -- ClusterState: Loaded cluster state: ['192.168.123.2', '192.168.123.3', '192.168.123.4', '192.168.123.5', '192.168.123.6', '192.168.123.7', '192.168.123.8', '192.168.123.9', '192.168.123.10', '192.168.123.11', '192.168.123.12', '192.168.123.13', '192.168.123.14', '192.168.123.15', '192.168.123.16', '192.168.123.17', '192.168.123.18', '192.168.123.19', '192.168.123.20', '192.168.123.1']
2023-07-02 23:34:53,288 INFO autoscaler.py:269 -- disable_node_updaters:False
2023-07-02 23:34:53,288 INFO autoscaler.py:277 -- disable_launch_config_check:False
2023-07-02 23:34:53,288 INFO autoscaler.py:289 -- foreground_node_launch:False
2023-07-02 23:34:53,288 INFO autoscaler.py:299 -- worker_liveness_check:True
2023-07-02 23:34:53,288 INFO autoscaler.py:307 -- worker_rpc_drain:True
2023-07-02 23:34:53,288 INFO autoscaler.py:355 -- StandardAutoscaler: {'cluster_name': 'default', 'auth': {'ssh_user': 'root'}, 'upscaling_speed': 1.0, 'idle_timeout_minutes': 10000, 'docker': {}, 'initialization_commands': [], 'setup_commands': [], 'head_setup_commands': [], 'worker_setup_commands': [], 'head_start_ray_commands': ['ray stop', 'ulimit -c unlimited && RAY_DASHBOARD_AGENT_CHECK_PARENT_INTERVAL_S=10000000000 ray start --head --port=6379 --num-cpus=10 --autoscaling-config=~/ray_bootstrap_config.yaml --system-config=\'{"raylet_heartbeat_period_milliseconds":100000,"num_heartbeats_timeout":20000000,"worker_register_timeout_seconds":500}\''], 'worker_start_ray_commands': ['ray stop', 'RAY_DASHBOARD_AGENT_CHECK_PARENT_INTERVAL_S=10000000000 ray start --address=$RAY_HEAD_IP:6379'], 'file_mounts': {}, 'cluster_synced_files': [], 'file_mounts_sync_continuously': False, 'rsync_exclude': ['**/.git', '**/.git/**'], 'rsync_filter': ['.gitignore'], 'max_workers': 19, 'provider': {'head_ip': '192.168.123.1', 'type': 'local', 'worker_ips': ['192.168.123.2', '192.168.123.3', '192.168.123.4', '192.168.123.5', '192.168.123.6', '192.168.123.7', '192.168.123.8', '192.168.123.9', '192.168.123.10', '192.168.123.11', '192.168.123.12', '192.168.123.13', '192.168.123.14', '192.168.123.15', '192.168.123.16', '192.168.123.17', '192.168.123.18', '192.168.123.19', '192.168.123.20']}, 'available_node_types': {'local.cluster.node': {'node_config': {}, 'resources': {}, 'min_workers': 19, 'max_workers': 19}}, 'head_node_type': 'local.cluster.node', 'no_restart': False}
2023-07-02 23:34:53,290 INFO monitor.py:363 -- Autoscaler has not yet received load metrics. Waiting.
2023-07-02 23:34:58,301 INFO autoscaler.py:141 -- The autoscaler took 0.0 seconds to fetch the list of non-terminated nodes.
2023-07-02 23:34:58,301 INFO load_metrics.py:163 -- LoadMetrics: Removed ip: 172.29.216.17.
2023-07-02 23:34:58,301 INFO load_metrics.py:166 -- LoadMetrics: Removed 1 stale ip mappings: {'172.29.216.17'} not in {'192.168.123.1'}
2023-07-02 23:34:58,302 INFO autoscaler.py:409 --
======== Autoscaler status: 2023-07-02 23:34:58.301998 ========
Node status
---------------------------------------------------------------
Healthy:
Pending:
(no pending nodes)
Recent failures:
(no failures)
Resources
---------------------------------------------------------------
Usage:
Demands:
(no resource demands)
2023-07-02 23:34:58,303 INFO autoscaler.py:1356 -- StandardAutoscaler: Queue 19 new nodes for launch
2023-07-02 23:34:58,303 INFO autoscaler.py:452 -- The autoscaler took 0.002 seconds to complete the update iteration.
2023-07-02 23:34:58,303 INFO node_launcher.py:164 -- NodeLauncher1: Got 5 nodes to launch.
2023-07-02 23:34:58,303 INFO node_launcher.py:164 -- NodeLauncher0: Got 5 nodes to launch.
2023-07-02 23:34:58,304 INFO monitor.py:382 -- :event_summary:Resized to 10 CPUs.
2023-07-02 23:34:58,306 INFO node_provider.py:114 -- ClusterState: Writing cluster state: ['192.168.123.2', '192.168.123.3', '192.168.123.4', '192.168.123.5', '192.168.123.6', '192.168.123.7', '192.168.123.8', '192.168.123.9', '192.168.123.10', '192.168.123.11', '192.168.123.12', '192.168.123.13', '192.168.123.14', '192.168.123.15', '192.168.123.16', '192.168.123.17', '192.168.123.18', '192.168.123.19', '192.168.123.20', '192.168.123.1']
2023-07-02 23:34:58,307 INFO node_provider.py:114 -- ClusterState: Writing cluster state: ['192.168.123.2', '192.168.123.3', '192.168.123.4', '192.168.123.5', '192.168.123.6', '192.168.123.7', '192.168.123.8', '192.168.123.9', '192.168.123.10', '192.168.123.11', '192.168.123.12', '192.168.123.13', '192.168.123.14', '192.168.123.15', '192.168.123.16', '192.168.123.17', '192.168.123.18', '192.168.123.19', '192.168.123.20', '192.168.123.1']
2023-07-02 23:34:58,308 INFO node_provider.py:114 -- ClusterState: Writing cluster state: ['192.168.123.2', '192.168.123.3', '192.168.123.4', '192.168.123.5', '192.168.123.6', '192.168.123.7', '192.168.123.8', '192.168.123.9', '192.168.123.10', '192.168.123.11', '192.168.123.12', '192.168.123.13', '192.168.123.14', '192.168.123.15', '192.168.123.16', '192.168.123.17', '192.168.123.18', '192.168.123.19', '192.168.123.20', '192.168.123.1']
2023-07-02 23:34:58,309 INFO node_provider.py:114 -- ClusterState: Writing cluster state: ['192.168.123.2', '192.168.123.3', '192.168.123.4', '192.168.123.5', '192.168.123.6', '192.168.123.7', '192.168.123.8', '192.168.123.9', '192.168.123.10', '192.168.123.11', '192.168.123.12', '192.168.123.13', '192.168.123.14', '192.168.123.15', '192.168.123.16', '192.168.123.17', '192.168.123.18', '192.168.123.19', '192.168.123.20', '192.168.123.1']
2023-07-02 23:34:58,310 INFO node_provider.py:114 -- ClusterState: Writing cluster state: ['192.168.123.2', '192.168.123.3', '192.168.123.4', '192.168.123.5', '192.168.123.6', '192.168.123.7', '192.168.123.8', '192.168.123.9', '192.168.123.10', '192.168.123.11', '192.168.123.12', '192.168.123.13', '192.168.123.14', '192.168.123.15', '192.168.123.16', '192.168.123.17', '192.168.123.18', '192.168.123.19', '192.168.123.20', '192.168.123.1']
2023-07-02 23:34:58,311 INFO node_launcher.py:164 -- NodeLauncher0: Launching 5 nodes, type local.cluster.node.
2023-07-02 23:34:58,311 INFO node_launcher.py:164 -- NodeLauncher0: Got 5 nodes to launch.
2023-07-02 23:34:58,311 INFO node_provider.py:114 -- ClusterState: Writing cluster state: ['192.168.123.2', '192.168.123.3', '192.168.123.4', '192.168.123.5', '192.168.123.6', '192.168.123.7', '192.168.123.8', '192.168.123.9', '192.168.123.10', '192.168.123.11', '192.168.123.12', '192.168.123.13', '192.168.123.14', '192.168.123.15', '192.168.123.16', '192.168.123.17', '192.168.123.18', '192.168.123.19', '192.168.123.20', '192.168.123.1']
2023-07-02 23:34:58,313 INFO node_provider.py:114 -- ClusterState: Writing cluster state: ['192.168.123.2', '192.168.123.3', '192.168.123.4', '192.168.123.5', '192.168.123.6', '192.168.123.7', '192.168.123.8', '192.168.123.9', '192.168.123.10', '192.168.123.11', '192.168.123.12', '192.168.123.13', '192.168.123.14', '192.168.123.15', '192.168.123.16', '192.168.123.17', '192.168.123.18', '192.168.123.19', '192.168.123.20', '192.168.123.1']
2023-07-02 23:34:58,314 INFO node_provider.py:114 -- ClusterState: Writing cluster state: ['192.168.123.2', '192.168.123.3', '192.168.123.4', '192.168.123.5', '192.168.123.6', '192.168.123.7', '192.168.123.8', '192.168.123.9', '192.168.123.10', '192.168.123.11', '192.168.123.12', '192.168.123.13', '192.168.123.14', '192.168.123.15', '192.168.123.16', '192.168.123.17', '192.168.123.18', '192.168.123.19', '192.168.123.20', '192.168.123.1']
2023-07-02 23:34:58,315 INFO node_provider.py:114 -- ClusterState: Writing cluster state: ['192.168.123.2', '192.168.123.3', '192.168.123.4', '192.168.123.5', '192.168.123.6', '192.168.123.7', '192.168.123.8', '192.168.123.9', '192.168.123.10', '192.168.123.11', '192.168.123.12', '192.168.123.13', '192.168.123.14', '192.168.123.15', '192.168.123.16', '192.168.123.17', '192.168.123.18', '192.168.123.19', '192.168.123.20', '192.168.123.1']
2023-07-02 23:34:58,316 INFO node_provider.py:114 -- ClusterState: Writing cluster state: ['192.168.123.2', '192.168.123.3', '192.168.123.4', '192.168.123.5', '192.168.123.6', '192.168.123.7', '192.168.123.8', '192.168.123.9', '192.168.123.10', '192.168.123.11', '192.168.123.12', '192.168.123.13', '192.168.123.14', '192.168.123.15', '192.168.123.16', '192.168.123.17', '192.168.123.18', '192.168.123.19', '192.168.123.20', '192.168.123.1']
2023-07-02 23:34:58,316 INFO node_launcher.py:164 -- NodeLauncher0: Launching 5 nodes, type local.cluster.node.
2023-07-02 23:34:58,316 INFO node_launcher.py:164 -- NodeLauncher0: Got 4 nodes to launch.
2023-07-02 23:34:58,317 INFO node_provider.py:114 -- ClusterState: Writing cluster state: ['192.168.123.2', '192.168.123.3', '192.168.123.4', '192.168.123.5', '192.168.123.6', '192.168.123.7', '192.168.123.8', '192.168.123.9', '192.168.123.10', '192.168.123.11', '192.168.123.12', '192.168.123.13', '192.168.123.14', '192.168.123.15', '192.168.123.16', '192.168.123.17', '192.168.123.18', '192.168.123.19', '192.168.123.20', '192.168.123.1']
2023-07-02 23:34:58,318 INFO node_provider.py:114 -- ClusterState: Writing cluster state: ['192.168.123.2', '192.168.123.3', '192.168.123.4', '192.168.123.5', '192.168.123.6', '192.168.123.7', '192.168.123.8', '192.168.123.9', '192.168.123.10', '192.168.123.11', '192.168.123.12', '192.168.123.13', '192.168.123.14', '192.168.123.15', '192.168.123.16', '192.168.123.17', '192.168.123.18', '192.168.123.19', '192.168.123.20', '192.168.123.1']
2023-07-02 23:34:58,319 INFO node_provider.py:114 -- ClusterState: Writing cluster state: ['192.168.123.2', '192.168.123.3', '192.168.123.4', '192.168.123.5', '192.168.123.6', '192.168.123.7', '192.168.123.8', '192.168.123.9', '192.168.123.10', '192.168.123.11', '192.168.123.12', '192.168.123.13', '192.168.123.14', '192.168.123.15', '192.168.123.16', '192.168.123.17', '192.168.123.18', '192.168.123.19', '192.168.123.20', '192.168.123.1']
2023-07-02 23:34:58,320 INFO node_provider.py:114 -- ClusterState: Writing cluster state: ['192.168.123.2', '192.168.123.3', '192.168.123.4', '192.168.123.5', '192.168.123.6', '192.168.123.7', '192.168.123.8', '192.168.123.9', '192.168.123.10', '192.168.123.11', '192.168.123.12', '192.168.123.13', '192.168.123.14', '192.168.123.15', '192.168.123.16', '192.168.123.17', '192.168.123.18', '192.168.123.19', '192.168.123.20', '192.168.123.1']
2023-07-02 23:34:58,321 INFO node_launcher.py:164 -- NodeLauncher0: Launching 4 nodes, type local.cluster.node.
2023-07-02 23:34:58,354 INFO node_provider.py:114 -- ClusterState: Writing cluster state: ['192.168.123.2', '192.168.123.3', '192.168.123.4', '192.168.123.5', '192.168.123.6', '192.168.123.7', '192.168.123.8', '192.168.123.9', '192.168.123.10', '192.168.123.11', '192.168.123.12', '192.168.123.13', '192.168.123.14', '192.168.123.15', '192.168.123.16', '192.168.123.17', '192.168.123.18', '192.168.123.19', '192.168.123.20', '192.168.123.1']
2023-07-02 23:34:58,356 INFO node_provider.py:114 -- ClusterState: Writing cluster state: ['192.168.123.2', '192.168.123.3', '192.168.123.4', '192.168.123.5', '192.168.123.6', '192.168.123.7', '192.168.123.8', '192.168.123.9', '192.168.123.10', '192.168.123.11', '192.168.123.12', '192.168.123.13', '192.168.123.14', '192.168.123.15', '192.168.123.16', '192.168.123.17', '192.168.123.18', '192.168.123.19', '192.168.123.20', '192.168.123.1']
2023-07-02 23:34:58,357 INFO node_provider.py:114 -- ClusterState: Writing cluster state: ['192.168.123.2', '192.168.123.3', '192.168.123.4', '192.168.123.5', '192.168.123.6', '192.168.123.7', '192.168.123.8', '192.168.123.9', '192.168.123.10', '192.168.123.11', '192.168.123.12', '192.168.123.13', '192.168.123.14', '192.168.123.15', '192.168.123.16', '192.168.123.17', '192.168.123.18', '192.168.123.19', '192.168.123.20', '192.168.123.1']
2023-07-02 23:34:58,358 INFO node_provider.py:114 -- ClusterState: Writing cluster state: ['192.168.123.2', '192.168.123.3', '192.168.123.4', '192.168.123.5', '192.168.123.6', '192.168.123.7', '192.168.123.8', '192.168.123.9', '192.168.123.10', '192.168.123.11', '192.168.123.12', '192.168.123.13', '192.168.123.14', '192.168.123.15', '192.168.123.16', '192.168.123.17', '192.168.123.18', '192.168.123.19', '192.168.123.20', '192.168.123.1']
2023-07-02 23:34:58,359 INFO node_provider.py:114 -- ClusterState: Writing cluster state: ['192.168.123.2', '192.168.123.3', '192.168.123.4', '192.168.123.5', '192.168.123.6', '192.168.123.7', '192.168.123.8', '192.168.123.9', '192.168.123.10', '192.168.123.11', '192.168.123.12', '192.168.123.13', '192.168.123.14', '192.168.123.15', '192.168.123.16', '192.168.123.17', '192.168.123.18', '192.168.123.19', '192.168.123.20', '192.168.123.1']
2023-07-02 23:34:58,359 INFO node_launcher.py:164 -- NodeLauncher1: Launching 5 nodes, type local.cluster.node.
2023-07-02 23:35:03,321 INFO autoscaler.py:141 -- The autoscaler took 0.003 seconds to fetch the list of non-terminated nodes.
2023-07-02 23:35:03,321 INFO load_metrics.py:163 -- LoadMetrics: Removed ip: 172.29.216.17.
2023-07-02 23:35:03,321 INFO load_metrics.py:166 -- LoadMetrics: Removed 1 stale ip mappings: {'172.29.216.17'} not in {'192.168.123.6', '192.168.123.10', '192.168.123.3', '192.168.123.9', '192.168.123.19', '192.168.123.8', '192.168.123.18', '192.168.123.14', '192.168.123.15', '192.168.123.4', '192.168.123.16', '192.168.123.5', '192.168.123.17', '192.168.123.20', '192.168.123.1', '192.168.123.12', '192.168.123.2', '192.168.123.13', '192.168.123.7', '192.168.123.11'}
2023-07-02 23:35:03,324 INFO autoscaler.py:409 --
======== Autoscaler status: 2023-07-02 23:35:03.323964 ========
Node status
---------------------------------------------------------------
Healthy:
Pending:
192.168.123.2: local.cluster.node, uninitialized
192.168.123.3: local.cluster.node, uninitialized
192.168.123.4: local.cluster.node, uninitialized
192.168.123.5: local.cluster.node, uninitialized
192.168.123.6: local.cluster.node, uninitialized
192.168.123.7: local.cluster.node, uninitialized
192.168.123.8: local.cluster.node, uninitialized
192.168.123.9: local.cluster.node, uninitialized
192.168.123.10: local.cluster.node, uninitialized
192.168.123.11: local.cluster.node, uninitialized
192.168.123.12: local.cluster.node, uninitialized
192.168.123.13: local.cluster.node, uninitialized
192.168.123.14: local.cluster.node, uninitialized
192.168.123.15: local.cluster.node, uninitialized
192.168.123.16: local.cluster.node, uninitialized
192.168.123.17: local.cluster.node, uninitialized
192.168.123.18: local.cluster.node, uninitialized
192.168.123.19: local.cluster.node, uninitialized
192.168.123.20: local.cluster.node, uninitialized
Recent failures:
(no failures)
Resources
---------------------------------------------------------------
Usage:
Demands:
(no resource demands)
2023-07-02 23:35:03,342 INFO autoscaler.py:1304 -- Creating new (spawn_updater) updater thread for node 192.168.123.2.
2023-07-02 23:35:03,342 INFO autoscaler.py:1304 -- Creating new (spawn_updater) updater thread for node 192.168.123.3.
2023-07-02 23:35:03,342 INFO autoscaler.py:1304 -- Creating new (spawn_updater) updater thread for node 192.168.123.4.
2023-07-02 23:35:03,342 INFO autoscaler.py:1304 -- Creating new (spawn_updater) updater thread for node 192.168.123.5.
2023-07-02 23:35:03,343 INFO autoscaler.py:1304 -- Creating new (spawn_updater) updater thread for node 192.168.123.6.
2023-07-02 23:35:03,343 INFO autoscaler.py:1304 -- Creating new (spawn_updater) updater thread for node 192.168.123.7.
2023-07-02 23:35:03,343 INFO autoscaler.py:1304 -- Creating new (spawn_updater) updater thread for node 192.168.123.8.
2023-07-02 23:35:03,344 INFO autoscaler.py:1304 -- Creating new (spawn_updater) updater thread for node 192.168.123.9.
2023-07-02 23:35:03,344 INFO autoscaler.py:1304 -- Creating new (spawn_updater) updater thread for node 192.168.123.10.
2023-07-02 23:35:03,344 INFO autoscaler.py:1304 -- Creating new (spawn_updater) updater thread for node 192.168.123.11.
2023-07-02 23:35:03,345 INFO autoscaler.py:1304 -- Creating new (spawn_updater) updater thread for node 192.168.123.12.
2023-07-02 23:35:03,346 INFO autoscaler.py:1304 -- Creating new (spawn_updater) updater thread for node 192.168.123.13.
2023-07-02 23:35:03,346 INFO autoscaler.py:1304 -- Creating new (spawn_updater) updater thread for node 192.168.123.14.
2023-07-02 23:35:03,346 INFO autoscaler.py:1304 -- Creating new (spawn_updater) updater thread for node 192.168.123.15.
2023-07-02 23:35:03,347 INFO autoscaler.py:1304 -- Creating new (spawn_updater) updater thread for node 192.168.123.16.
2023-07-02 23:35:03,347 INFO autoscaler.py:1304 -- Creating new (spawn_updater) updater thread for node 192.168.123.17.
2023-07-02 23:35:03,347 INFO autoscaler.py:1304 -- Creating new (spawn_updater) updater thread for node 192.168.123.18.
2023-07-02 23:35:03,348 INFO autoscaler.py:1304 -- Creating new (spawn_updater) updater thread for node 192.168.123.19.
2023-07-02 23:35:03,348 INFO autoscaler.py:1304 -- Creating new (spawn_updater) updater thread for node 192.168.123.20.
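The updater threads are created, but every worker remains uninitialized indefinitely. To rule out connectivity problems, the SSH path the node updater would use can be tested by hand (the key path and worker IP below are examples from our setup):

# Manually test the SSH connection the node updater relies on.
ssh -i ~/.ssh/id_rsa root@192.168.123.2 'ray --version'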
With ray==2.5.0
The ray command itself fails with the following error:
Error
root@worker1:~# ray start
Traceback (most recent call last):
  File "/usr/local/bin/ray", line 5, in <module>
    from ray.scripts.scripts import main
  File "/usr/local/lib/python3.8/dist-packages/ray/scripts/scripts.py", line 2430, in <module>
    from ray.util.state.state_cli import (
  File "/usr/local/lib/python3.8/dist-packages/ray/util/state/__init__.py", line 1, in <module>
    from ray.util.state.api import (
  File "/usr/local/lib/python3.8/dist-packages/ray/util/state/api.py", line 17, in <module>
    from ray.util.state.common import (
  File "/usr/local/lib/python3.8/dist-packages/ray/util/state/common.py", line 120, in <module>
    @dataclass(init=True)
  File "/usr/local/lib/python3.8/dist-packages/pydantic/dataclasses.py", line 139, in dataclass
    assert init is False, 'pydantic.dataclasses.dataclass only supports init=False'
AssertionError: pydantic.dataclasses.dataclass only supports init=False
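The assertion is raised inside pydantic, so we suspect a version mismatch between ray 2.5.0 and the pydantic package installed on our nodes (our guess, not a confirmed diagnosis). For reference, this is how the installed version can be checked:

# Print the pydantic version seen by the interpreter that runs `ray`.
python3 -c 'import pydantic; print(pydantic.VERSION)'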
Any suggestions regarding these issues would be appreciated.