The autoscaler tried to take down one idle node (a false positive, though: the node was running at 100% CPU) but ended up killing every node because of an internal KeyError. It seems to get confused with the mapping. This is a serious issue, as all my progress was lost. I was using 16 placement groups (one machine each).
2021-02-17 14:55:34,817 INFO monitor.py:207 -- :event_summary:Removing 1 nodes of type cpu_48_spot (idle).
2021-02-17 14:55:34,817 INFO monitor.py:207 -- :event_summary:Adding 1 nodes of type cpu_48_spot.
2021-02-17 14:55:40,430 INFO load_metrics.py:102 -- LoadMetrics: Removed mapping: 172.31.23.116 - 1613573430.7000167
2021-02-17 14:55:40,430 INFO load_metrics.py:109 -- LoadMetrics: Removed 1 stale ip mappings: {'172.31.23.116'} not in {'172.31.16.240', '172.31.27.173', '172.31.26.163', '172.31.20.177', '172.31.25.79', '172.31.28.159', '172.31.21.227', '172.31.24.131', '172.31.31.164', '172.31.22.24', '172.31.26.41', '172.31.19.126', '172.31.22.66', '172.31.26.13', '172.31.30.105', '172.31.25.157', '172.31.27.26'}
2021-02-17 14:55:40,744 ERROR autoscaler.py:266 -- StandardAutoscaler: i-02b77234ffad2072c: Terminating failed to setup/initialize node.
2021-02-17 14:55:40,744 ERROR autoscaler.py:139 -- StandardAutoscaler: Error during autoscaling.
Traceback (most recent call last):
File "/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 137, in update
self._update()
File "/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 270, in _update
self._get_node_type(node_id) + " (launch failed).",
File "/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 598, in _get_node_type
node_tags = self.provider.node_tags(node_id)
File "/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/aws/node_provider.py", line 170, in node_tags
d1 = self.tag_cache[node_id]
KeyError: 'i-02b77234ffad2072c'
2021-02-17 14:55:46,909 ERROR autoscaler.py:266 -- StandardAutoscaler: i-02b77234ffad2072c: Terminating failed to setup/initialize node.
2021-02-17 14:55:46,909 ERROR autoscaler.py:139 -- StandardAutoscaler: Error during autoscaling.
Traceback (most recent call last):
File "/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 137, in update
self._update()
File "/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 270, in _update
self._get_node_type(node_id) + " (launch failed).",
File "/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 598, in _get_node_type
node_tags = self.provider.node_tags(node_id)
File "/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/aws/node_provider.py", line 170, in node_tags
d1 = self.tag_cache[node_id]
KeyError: 'i-02b77234ffad2072c'
2021-02-17 14:55:47,082 INFO monitor.py:207 -- :event_summary:Resized to 724 CPUs.
2021-02-17 14:55:52,997 ERROR autoscaler.py:266 -- StandardAutoscaler: i-02b77234ffad2072c: Terminating failed to setup/initialize node.
2021-02-17 14:55:52,998 ERROR autoscaler.py:139 -- StandardAutoscaler: Error during autoscaling.
Traceback (most recent call last):
File "/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 137, in update
self._update()
File "/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 270, in _update
self._get_node_type(node_id) + " (launch failed).",
File "/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 598, in _get_node_type
node_tags = self.provider.node_tags(node_id)
File "/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/aws/node_provider.py", line 170, in node_tags
d1 = self.tag_cache[node_id]
KeyError: 'i-02b77234ffad2072c'
2021-02-17 14:55:58,965 ERROR autoscaler.py:266 -- StandardAutoscaler: i-02b77234ffad2072c: Terminating failed to setup/initialize node.
2021-02-17 14:55:58,965 ERROR autoscaler.py:139 -- StandardAutoscaler: Error during autoscaling.
Traceback (most recent call last):
File "/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 137, in update
self._update()
File "/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 270, in _update
self._get_node_type(node_id) + " (launch failed).",
File "/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 598, in _get_node_type
node_tags = self.provider.node_tags(node_id)
File "/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/aws/node_provider.py", line 170, in node_tags
d1 = self.tag_cache[node_id]
KeyError: 'i-02b77234ffad2072c'
2021-02-17 14:56:05,002 ERROR autoscaler.py:266 -- StandardAutoscaler: i-02b77234ffad2072c: Terminating failed to setup/initialize node.
2021-02-17 14:56:05,003 ERROR autoscaler.py:139 -- StandardAutoscaler: Error during autoscaling.
Traceback (most recent call last):
File "/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 137, in update
self._update()
File "/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 270, in _update
self._get_node_type(node_id) + " (launch failed).",
File "/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 598, in _get_node_type
node_tags = self.provider.node_tags(node_id)
File "/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/aws/node_provider.py", line 170, in node_tags
d1 = self.tag_cache[node_id]
KeyError: 'i-02b77234ffad2072c'
2021-02-17 14:56:10,999 ERROR autoscaler.py:266 -- StandardAutoscaler: i-02b77234ffad2072c: Terminating failed to setup/initialize node.
2021-02-17 14:56:11,000 ERROR autoscaler.py:139 -- StandardAutoscaler: Error during autoscaling.
Traceback (most recent call last):
File "/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 137, in update
self._update()
File "/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 270, in _update
self._get_node_type(node_id) + " (launch failed).",
File "/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 598, in _get_node_type
node_tags = self.provider.node_tags(node_id)
File "/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/aws/node_provider.py", line 170, in node_tags
d1 = self.tag_cache[node_id]
KeyError: 'i-02b77234ffad2072c'
2021-02-17 14:56:11,001 CRITICAL autoscaler.py:152 -- StandardAutoscaler: Too many errors, abort.
2021-02-17 14:56:11,001 ERROR monitor.py:271 -- Error in monitor loop
Traceback (most recent call last):
File "/home/centos/.local/lib/python3.7/site-packages/ray/monitor.py", line 269, in run
self._run()
File "/home/centos/.local/lib/python3.7/site-packages/ray/monitor.py", line 202, in _run
self.autoscaler.update()
File "/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 154, in update
raise e
File "/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 137, in update
self._update()
File "/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 270, in _update
self._get_node_type(node_id) + " (launch failed).",
File "/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 598, in _get_node_type
node_tags = self.provider.node_tags(node_id)
File "/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/aws/node_provider.py", line 170, in node_tags
d1 = self.tag_cache[node_id]
KeyError: 'i-02b77234ffad2072c'
2021-02-17 14:56:11,002 ERROR autoscaler.py:724 -- StandardAutoscaler: kill_workers triggered
2021-02-17 14:56:11,453 ERROR autoscaler.py:729 -- StandardAutoscaler: terminated 16 node(s)
2021-02-17 14:56:11,453 INFO monitor.py:250 -- Monitor: Exception caught. Taking down workers...
2021-02-17 14:56:11,680 INFO monitor.py:262 -- Monitor: Workers taken down.
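
To illustrate what I think the traceback shows, here is a minimal sketch of the failure mode. It is not Ray's code: `FakeProvider`, `get_node_type`, and the `"node-type"` tag key are hypothetical names I made up for the example. The only thing taken from the log is the plain dict lookup `self.tag_cache[node_id]`, which raises `KeyError` when the node ID is not in the cache. The sketch also shows one possible guard (returning a default instead of raising), purely as an assumption about how the lookup could be made tolerant.

```python
class FakeProvider:
    """Hypothetical stand-in for a node provider that keeps a tag cache."""

    def __init__(self, tag_cache):
        self.tag_cache = dict(tag_cache)

    def node_tags(self, node_id):
        # Same shape as the failing line in the traceback: a plain dict
        # lookup, which raises KeyError for an ID missing from the cache.
        return self.tag_cache[node_id]


def get_node_type(provider, node_id, default="unknown"):
    """Return the node type, tolerating a node that is missing from the cache.

    The "node-type" key is a placeholder for this example, not Ray's tag name.
    """
    try:
        tags = provider.node_tags(node_id)
    except KeyError:
        # The node is not in the cache (for example, it was just terminated).
        return default
    return tags.get("node-type", default)


provider = FakeProvider({"i-aaa": {"node-type": "cpu_48_spot"}})
print(get_node_type(provider, "i-aaa"))
print(get_node_type(provider, "i-02b77234ffad2072c"))
```

With a guard like this, a single missing cache entry would be reported as an unknown type rather than raising on every update cycle. Whether that is the right fix inside the autoscaler itself, I can't say.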