Large cluster worker node - dashboard failing with "Address already in use"?

I am running a script with 960 workers (across ten 96-core machines) and a large number of tasks on Google Cloud (submitted to the cluster via ray submit).

Things seem to be progressing nicely, but I also keep getting the following exception:

2021-02-15 09:30:58,848	WARNING worker.py:1107 -- The agent on node ray-clusty-worker-bf715e14 failed with the following error:
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/new_dashboard/agent.py", line 323, in <module>
    loop.run_until_complete(agent.run())
  File "/home/ray/anaconda3/lib/python3.7/asyncio/base_events.py", line 587, in run_until_complete
    return future.result()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/new_dashboard/agent.py", line 138, in run
    modules = self._load_modules()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/new_dashboard/agent.py", line 92, in _load_modules
    c = cls(self)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/new_dashboard/modules/reporter/reporter_agent.py", line 72, in __init__
    self._metrics_agent = MetricsAgent(dashboard_agent.metrics_export_port)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/metrics_agent.py", line 76, in __init__
    namespace="ray", port=metrics_export_port)))
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/prometheus_exporter.py", line 334, in new_stats_exporter
    options=option, gatherer=option.registry, collector=collector)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/prometheus_exporter.py", line 266, in __init__
    self.serve_http()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/prometheus_exporter.py", line 321, in serve_http
    port=self.options.port, addr=str(self.options.address))
  File "/home/ray/anaconda3/lib/python3.7/site-packages/prometheus_client/exposition.py", line 79, in start_wsgi_server
    httpd = make_server(addr, port, app, ThreadingWSGIServer, handler_class=_SilentHandler)
  File "/home/ray/anaconda3/lib/python3.7/wsgiref/simple_server.py", line 153, in make_server
    server = server_class((host, port), handler_class)
  File "/home/ray/anaconda3/lib/python3.7/socketserver.py", line 452, in __init__
    self.server_bind()
  File "/home/ray/anaconda3/lib/python3.7/wsgiref/simple_server.py", line 50, in server_bind
    HTTPServer.server_bind(self)
  File "/home/ray/anaconda3/lib/python3.7/http/server.py", line 137, in server_bind
    socketserver.TCPServer.server_bind(self)
  File "/home/ray/anaconda3/lib/python3.7/socketserver.py", line 466, in server_bind
    self.socket.bind(self.server_address)
OSError: [Errno 98] Address already in use

The warning consistently comes from the same worker node. I can ssh into it, and it seems to be working fine: all cores are busy performing tasks. The other nodes are fine. Is this something I should be concerned about? Is there something I can do to prevent it?

Hey @Yoav, I think this is harmless; it just indicates that there might be a leaked dashboard process.

cc @eoakes, can you take a look? It seems like it'd be nice to disable this warning (or disable the dashboard on the worker?)


Each Ray node has a component called the dashboard agent, and it looks like the dashboard agent on that node hit a port conflict (so it was not started). This won't impact your main application at all, but the dashboard might not display information for the particular node whose dashboard agent failed.
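If you want to double-check on the affected node, here is a minimal sketch of what you could look at over ssh. The session log directory and the dashboard_agent.log file name are assumptions based on Ray's default temp-dir layout and may differ by version:

    # Look at the dashboard agent's own log on the affected worker node
    # (file name assumed; adjust to whatever is present in the logs directory).
    tail -n 50 /tmp/ray/session_latest/logs/dashboard_agent.log

    # See which process is already bound to the port the agent tried to use
    # (replace <port> with the metrics export port for this node).
    sudo ss -ltnp | grep <port>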

Can you try specifying the port numbers described in this doc: Configuring Ray — Ray v1.1.0? That should reduce the probability of port conflicts. In the meantime, we can figure out how to reduce the likelihood of this port conflict on our side.


Sure, which of these ports should be specified?

I recommend specifying all of the ports, but these two:

--min-worker-port: Minimum port number a worker can be bound to. Default: 10000.

--max-worker-port: Maximum port number a worker can be bound to. Default: 10999.

will probably be the most useful for reducing port conflicts (see the sketch below for one way to pass them).
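Here is a minimal sketch of how those flags could be passed to ray start on the worker nodes (for example in the worker_start_ray_commands section of the autoscaler YAML). The exact command and the --metrics-export-port flag for pinning the dashboard agent's Prometheus exporter port are assumptions meant to illustrate the idea, so please verify them against the Configuring Ray page for your version:

    # Worker-side ray start command with explicit ports (sketch, not a verbatim config).
    # The worker port range should be at least as large as the number of worker
    # processes per node; the default 10000-10999 range easily covers 96 workers.
    ray start --address=$RAY_HEAD_IP:6379 \
        --min-worker-port=10000 --max-worker-port=10999 \
        --metrics-export-port=8090  # assumed flag for fixing the metrics exporter port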

Also, @eoakes, is this a known issue that dashboard agents crash on port conflicts and surface misleading errors to drivers?

Ok, got it, thanks.
BTW, I wasn't running anything else on these machines, so I am not sure what could have caused the conflicts. I am running with a few n1-standard-96 nodes instead of many n1-standard-2 nodes; I don't know if that might be related.

Yep, I think it is unlikely to be an issue on your end. Our port assignment for the dashboard agent probably isn't robust enough, so in some cases it can pick a port that is already taken (causing a conflict on one of many nodes with some low probability). If you see this often, please create an issue and tag me there (@rkooo567). I will try to handle it asap.
