I am running a script with 960 workers (across ten 96-core machines) and many tasks on Google Cloud, submitted to a cluster via ray submit.
Things seem to be progressing nicely, but I also keep getting the following exception:
2021-02-15 09:30:58,848 WARNING worker.py:1107 -- The agent on node ray-clusty-worker-bf715e14 failed with the following error:
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/new_dashboard/agent.py", line 323, in <module>
    loop.run_until_complete(agent.run())
  File "/home/ray/anaconda3/lib/python3.7/asyncio/base_events.py", line 587, in run_until_complete
    return future.result()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/new_dashboard/agent.py", line 138, in run
    modules = self._load_modules()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/new_dashboard/agent.py", line 92, in _load_modules
    c = cls(self)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/new_dashboard/modules/reporter/reporter_agent.py", line 72, in __init__
    self._metrics_agent = MetricsAgent(dashboard_agent.metrics_export_port)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/metrics_agent.py", line 76, in __init__
    namespace="ray", port=metrics_export_port)))
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/prometheus_exporter.py", line 334, in new_stats_exporter
    options=option, gatherer=option.registry, collector=collector)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/prometheus_exporter.py", line 266, in __init__
    self.serve_http()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/prometheus_exporter.py", line 321, in serve_http
    port=self.options.port, addr=str(self.options.address))
  File "/home/ray/anaconda3/lib/python3.7/site-packages/prometheus_client/exposition.py", line 79, in start_wsgi_server
    httpd = make_server(addr, port, app, ThreadingWSGIServer, handler_class=_SilentHandler)
  File "/home/ray/anaconda3/lib/python3.7/wsgiref/simple_server.py", line 153, in make_server
    server = server_class((host, port), handler_class)
  File "/home/ray/anaconda3/lib/python3.7/socketserver.py", line 452, in __init__
    self.server_bind()
  File "/home/ray/anaconda3/lib/python3.7/wsgiref/simple_server.py", line 50, in server_bind
    HTTPServer.server_bind(self)
  File "/home/ray/anaconda3/lib/python3.7/http/server.py", line 137, in server_bind
    socketserver.TCPServer.server_bind(self)
  File "/home/ray/anaconda3/lib/python3.7/socketserver.py", line 466, in server_bind
    self.socket.bind(self.server_address)
OSError: [Errno 98] Address already in use
This warning consistently comes from the same worker node. I can SSH into that node, and it appears to be using all of its cores and performing tasks. The other nodes are fine. Is this something I should be concerned about? Is there something I can do to prevent it?
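
For anyone trying to reproduce this, the traceback ends in a plain socket bind failing with Errno 98, so a check like the one below could be run on the affected node to see whether a given port is already taken. This is only a minimal sketch using Python's standard socket module. The helper name port_is_free is hypothetical, and the port to test is an assumption: the traceback does not show which metrics export port the agent tried to bind.

```python
import errno
import socket


def port_is_free(port, host="0.0.0.0"):
    """Return True if a TCP socket can bind to (host, port), False if it is in use.

    Hypothetical helper for illustration; it is not part of Ray.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError as exc:
            # EADDRINUSE is the "Address already in use" error from the traceback
            # (Errno 98 on Linux).
            if exc.errno == errno.EADDRINUSE:
                return False
            raise
        return True


if __name__ == "__main__":
    # Self-contained demonstration: occupy an OS-assigned port, then probe it.
    holder = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    holder.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
    holder.listen(1)
    busy_port = holder.getsockname()[1]
    print("port", busy_port, "free?", port_is_free(busy_port, host="127.0.0.1"))
    holder.close()
```

On the worker node itself, one would call port_is_free with the metrics export port actually configured for the agent (not shown in the log above) instead of the demonstration port.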