Large cluster worker node - dashboard failing with "Address already in use"?

I am running a script with 960 workers (across ten 96-core machines) and a large number of tasks on Google Cloud (submitted to the cluster via ray submit).

Things seem to be progressing nicely, but I also keep getting the following exception:

2021-02-15 09:30:58,848	WARNING worker.py:1107 -- The agent on node ray-clusty-worker-bf715e14 failed with the following error:
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/new_dashboard/agent.py", line 323, in <module>
    loop.run_until_complete(agent.run())
  File "/home/ray/anaconda3/lib/python3.7/asyncio/base_events.py", line 587, in run_until_complete
    return future.result()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/new_dashboard/agent.py", line 138, in run
    modules = self._load_modules()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/new_dashboard/agent.py", line 92, in _load_modules
    c = cls(self)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/new_dashboard/modules/reporter/reporter_agent.py", line 72, in __init__
    self._metrics_agent = MetricsAgent(dashboard_agent.metrics_export_port)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/metrics_agent.py", line 76, in __init__
    namespace="ray", port=metrics_export_port)))
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/prometheus_exporter.py", line 334, in new_stats_exporter
    options=option, gatherer=option.registry, collector=collector)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/prometheus_exporter.py", line 266, in __init__
    self.serve_http()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/prometheus_exporter.py", line 321, in serve_http
    port=self.options.port, addr=str(self.options.address))
  File "/home/ray/anaconda3/lib/python3.7/site-packages/prometheus_client/exposition.py", line 79, in start_wsgi_server
    httpd = make_server(addr, port, app, ThreadingWSGIServer, handler_class=_SilentHandler)
  File "/home/ray/anaconda3/lib/python3.7/wsgiref/simple_server.py", line 153, in make_server
    server = server_class((host, port), handler_class)
  File "/home/ray/anaconda3/lib/python3.7/socketserver.py", line 452, in __init__
    self.server_bind()
  File "/home/ray/anaconda3/lib/python3.7/wsgiref/simple_server.py", line 50, in server_bind
    HTTPServer.server_bind(self)
  File "/home/ray/anaconda3/lib/python3.7/http/server.py", line 137, in server_bind
    socketserver.TCPServer.server_bind(self)
  File "/home/ray/anaconda3/lib/python3.7/socketserver.py", line 466, in server_bind
    self.socket.bind(self.server_address)
OSError: [Errno 98] Address already in use

The warning consistently comes from the same worker node. I can ssh into it, and it seems to be working fine: all cores are busy performing tasks. The other nodes are fine. Is this something I should be concerned about? Is there something I can do to prevent it?

Hey @Yoav, I think this is harmless; it just indicates that there might be a leaked dashboard process.

cc @eoakes, can you take a look? It seems like it'd be nice to disable this warning (or disable the dashboard on the worker?)


Each Ray node has a component called the dashboard agent, and it looks like the dashboard agent on that node hit a port conflict (so it was not started). This won't impact your main application at all, but the dashboard might not display information for the particular node whose dashboard agent failed.
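If you want to double-check on the affected node, here is a minimal sketch of what you could look at over ssh. The session log directory and the dashboard_agent.log file name are assumptions based on Ray's default temp-dir layout and may differ by version:

    # Look at the dashboard agent's own log on the affected worker node
    # (file name assumed; adjust to whatever is present in the logs directory).
    tail -n 50 /tmp/ray/session_latest/logs/dashboard_agent.log

    # See which process is already bound to the port the agent tried to use
    # (replace <port> with the metrics export port for this node).
    sudo ss -ltnp | grep <port>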

Can you try specifying the port numbers described in this doc: Configuring Ray — Ray v1.1.0? That should reduce the probability of port conflicts. In the meantime, we can figure out how to reduce the likelihood of this port conflict on our side.


Sure, which of these ports should be specified?

I recommend specifying all of the ports, but these two:

--min-worker-port: Minimum port number a worker can be bound to. Default: 10000.

--max-worker-port: Maximum port number a worker can be bound to. Default: 10999.

will probably be the most useful for reducing port conflicts (see the sketch below for one way to pass them).
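Here is a minimal sketch of how those flags could be passed to ray start on the worker nodes (for example in the worker_start_ray_commands section of the autoscaler YAML). The exact command and the --metrics-export-port flag for pinning the dashboard agent's Prometheus exporter port are assumptions meant to illustrate the idea, so please verify them against the Configuring Ray page for your version:

    # Worker-side ray start command with explicit ports (sketch, not a verbatim config).
    # The worker port range should be at least as large as the number of worker
    # processes per node; the default 10000-10999 range easily covers 96 workers.
    ray start --address=$RAY_HEAD_IP:6379 \
        --min-worker-port=10000 --max-worker-port=10999 \
        --metrics-export-port=8090  # assumed flag for fixing the metrics exporter port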

Also, @eoakes, is this a known issue that dashboard agents crash on port conflicts and surface misleading errors to drivers?

Ok, got it, thanks.
BTW, I wasn't running anything else on these machines, so I am not sure what could have caused the conflicts. I am running with a few n1-standard-96 nodes instead of many n1-standard-2 nodes; I don't know if that might be related.

Yep, I think it is unlikely to be an issue on your end. Our port assignment for the dashboard agent probably isn't robust enough, so in some cases it can pick a port that is already taken (causing a conflict on one of many nodes with some low probability). If you see this often, please create an issue and tag me there (@rkooo567). I will try to handle it asap.
