Autoscaler (with placement_group) on AWS kills every node while taking down one idle node

The autoscaler tries to take down one idle node (a false positive, though: the node was running at 100% CPU) but ends up killing every node because of an internal KeyError. It seems to get confused about the mapping. This is a serious issue, as all my progress gets lost. I was using 16 placement groups (one machine each).

2021-02-17 14:55:34,817 INFO monitor.py:207 – :event_summary:Removing 1 nodes of type cpu_48_spot (idle).
2021-02-17 14:55:34,817 INFO monitor.py:207 – :event_summary:Adding 1 nodes of type cpu_48_spot.
2021-02-17 14:55:40,430 INFO load_metrics.py:102 – LoadMetrics: Removed mapping: 172.31.23.116 - 1613573430.7000167
2021-02-17 14:55:40,430 INFO load_metrics.py:109 – LoadMetrics: Removed 1 stale ip mappings: {'172.31.23.116'} not in {'172.31.16.240', '172.31.27.173', '172.31.26.163', '172.31.20.177', '172.31.25.79', '172.31.28.159', '172.31.21.227', '172.31.24.131', '172.31.31.164', '172.31.22.24', '172.31.26.41', '172.31.19.126', '172.31.22.66', '172.31.26.13', '172.31.30.105', '172.31.25.157', '172.31.27.26'}
2021-02-17 14:55:40,744 ERROR autoscaler.py:266 – StandardAutoscaler: i-02b77234ffad2072c: Terminating failed to setup/initialize node.
2021-02-17 14:55:40,744 ERROR autoscaler.py:139 – StandardAutoscaler: Error during autoscaling.
Traceback (most recent call last):
  File "/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 137, in update
    self._update()
  File "/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 270, in _update
    self._get_node_type(node_id) + " (launch failed).",
  File "/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 598, in _get_node_type
    node_tags = self.provider.node_tags(node_id)
  File "/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/aws/node_provider.py", line 170, in node_tags
    d1 = self.tag_cache[node_id]
KeyError: 'i-02b77234ffad2072c'
2021-02-17 14:55:46,909 ERROR autoscaler.py:266 – StandardAutoscaler: i-02b77234ffad2072c: Terminating failed to setup/initialize node.
2021-02-17 14:55:46,909 ERROR autoscaler.py:139 – StandardAutoscaler: Error during autoscaling.
Traceback (most recent call last):
  File "/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 137, in update
    self._update()
  File "/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 270, in _update
    self._get_node_type(node_id) + " (launch failed).",
  File "/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 598, in _get_node_type
    node_tags = self.provider.node_tags(node_id)
  File "/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/aws/node_provider.py", line 170, in node_tags
    d1 = self.tag_cache[node_id]
KeyError: 'i-02b77234ffad2072c'
2021-02-17 14:55:47,082 INFO monitor.py:207 – :event_summary:Resized to 724 CPUs.
2021-02-17 14:55:52,997 ERROR autoscaler.py:266 – StandardAutoscaler: i-02b77234ffad2072c: Terminating failed to setup/initialize node.
2021-02-17 14:55:52,998 ERROR autoscaler.py:139 – StandardAutoscaler: Error during autoscaling.
Traceback (most recent call last):
  File "/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 137, in update
    self._update()
  File "/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 270, in _update
    self._get_node_type(node_id) + " (launch failed).",
  File "/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 598, in _get_node_type
    node_tags = self.provider.node_tags(node_id)
  File "/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/aws/node_provider.py", line 170, in node_tags
    d1 = self.tag_cache[node_id]
KeyError: 'i-02b77234ffad2072c'
2021-02-17 14:55:58,965 ERROR autoscaler.py:266 – StandardAutoscaler: i-02b77234ffad2072c: Terminating failed to setup/initialize node.
2021-02-17 14:55:58,965 ERROR autoscaler.py:139 – StandardAutoscaler: Error during autoscaling.
Traceback (most recent call last):
  File "/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 137, in update
    self._update()
  File "/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 270, in _update
    self._get_node_type(node_id) + " (launch failed).",
  File "/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 598, in _get_node_type
    node_tags = self.provider.node_tags(node_id)
  File "/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/aws/node_provider.py", line 170, in node_tags
    d1 = self.tag_cache[node_id]
KeyError: 'i-02b77234ffad2072c'
2021-02-17 14:56:05,002 ERROR autoscaler.py:266 – StandardAutoscaler: i-02b77234ffad2072c: Terminating failed to setup/initialize node.
2021-02-17 14:56:05,003 ERROR autoscaler.py:139 – StandardAutoscaler: Error during autoscaling.
Traceback (most recent call last):
  File "/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 137, in update
    self._update()
  File "/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 270, in _update
    self._get_node_type(node_id) + " (launch failed).",
  File "/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 598, in _get_node_type
    node_tags = self.provider.node_tags(node_id)
  File "/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/aws/node_provider.py", line 170, in node_tags
    d1 = self.tag_cache[node_id]
KeyError: 'i-02b77234ffad2072c'
2021-02-17 14:56:10,999 ERROR autoscaler.py:266 – StandardAutoscaler: i-02b77234ffad2072c: Terminating failed to setup/initialize node.
2021-02-17 14:56:11,000 ERROR autoscaler.py:139 – StandardAutoscaler: Error during autoscaling.
Traceback (most recent call last):
  File "/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 137, in update
    self._update()
  File "/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 270, in _update
    self._get_node_type(node_id) + " (launch failed).",
  File "/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 598, in _get_node_type
    node_tags = self.provider.node_tags(node_id)
  File "/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/aws/node_provider.py", line 170, in node_tags
    d1 = self.tag_cache[node_id]
KeyError: 'i-02b77234ffad2072c'
2021-02-17 14:56:11,001 CRITICAL autoscaler.py:152 – StandardAutoscaler: Too many errors, abort.
2021-02-17 14:56:11,001 ERROR monitor.py:271 – Error in monitor loop
Traceback (most recent call last):
  File "/home/centos/.local/lib/python3.7/site-packages/ray/monitor.py", line 269, in run
    self._run()
  File "/home/centos/.local/lib/python3.7/site-packages/ray/monitor.py", line 202, in _run
    self.autoscaler.update()
  File "/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 154, in update
    raise e
  File "/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 137, in update
    self._update()
  File "/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 270, in _update
    self._get_node_type(node_id) + " (launch failed).",
  File "/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 598, in _get_node_type
    node_tags = self.provider.node_tags(node_id)
  File "/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/aws/node_provider.py", line 170, in node_tags
    d1 = self.tag_cache[node_id]
KeyError: 'i-02b77234ffad2072c'
2021-02-17 14:56:11,002 ERROR autoscaler.py:724 – StandardAutoscaler: kill_workers triggered
2021-02-17 14:56:11,453 ERROR autoscaler.py:729 – StandardAutoscaler: terminated 16 node(s)
2021-02-17 14:56:11,453 INFO monitor.py:250 – Monitor: Exception caught. Taking down workers…
2021-02-17 14:56:11,680 INFO monitor.py:262 – Monitor: Workers taken down.
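
To make the pattern in the log easier to follow, here is a minimal toy sketch of what seems to be happening. This is not Ray's code: `ToyProvider`, `ToyAutoscaler`, `run_monitor`, the node ids, and the threshold of 5 are all hypothetical, chosen only so the sketch mirrors the six tracebacks above. The same unguarded cache lookup fails on every update, the error counter crosses its limit, and the monitor then takes down all workers.

```python
# Toy reproduction of the failure pattern in the log above.
# All class and function names here are hypothetical, not Ray's API.


class ToyProvider:
    def __init__(self, tags_by_node):
        self.tag_cache = dict(tags_by_node)

    def node_tags(self, node_id):
        # Unguarded lookup: raises KeyError for an id missing from the cache.
        return self.tag_cache[node_id]

    def terminate_all(self):
        # Stand-in for the "kill_workers" step: drop every known node.
        terminated = list(self.tag_cache)
        self.tag_cache.clear()
        return terminated


class ToyAutoscaler:
    def __init__(self, provider, failed_node_id, max_failures):
        self.provider = provider
        self.failed_node_id = failed_node_id
        self.max_failures = max_failures  # assumed threshold, not Ray's default
        self.num_failures = 0

    def update(self):
        try:
            # The same lookup is retried on every pass, so it never recovers.
            self.provider.node_tags(self.failed_node_id)
        except KeyError:
            self.num_failures += 1
            if self.num_failures > self.max_failures:
                raise  # "Too many errors, abort."


def run_monitor(autoscaler, max_iterations=100):
    for _ in range(max_iterations):
        try:
            autoscaler.update()
        except KeyError:
            # The monitor reacts to the abort by taking down all workers.
            return autoscaler.provider.terminate_all()
    return []


provider = ToyProvider(
    {f"worker-{i}": {"node-type": "cpu_48_spot"} for i in range(16)}
)
autoscaler = ToyAutoscaler(provider, failed_node_id="i-missing", max_failures=5)
killed = run_monitor(autoscaler)
print(f"terminated {len(killed)} node(s)")  # terminated 16 node(s)
```

In this sketch a single stale node id is enough to lose the whole cluster, which matches what I observed: one bad lookup, repeated until the abort, then all 16 workers terminated.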

@Alex
@ericl
Should we not auto-terminate failed nodes and cap the number of failures to 10 by default?

This looks like a bug; I don't see how it's related to our policy around terminating failed nodes. Can we track this on GitHub?
