RLlib evaluation rollout: socket.gaierror [Errno -2] Name or service not known

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

Stacktrace:

(raylet) Traceback (most recent call last):
(raylet)   File "/usr/local/lib/python3.6/dist-packages/ray/dashboard/agent.py", line 391, in <module>
(raylet)     loop.run_until_complete(agent.run())
(raylet)   File "/usr/lib/python3.6/asyncio/base_events.py", line 484, in run_until_complete
(raylet)     return future.result()
(raylet)   File "/usr/local/lib/python3.6/dist-packages/ray/dashboard/agent.py", line 178, in run
(raylet)     modules = self._load_modules()
(raylet)   File "/usr/local/lib/python3.6/dist-packages/ray/dashboard/agent.py", line 120, in _load_modules
(raylet)     c = cls(self)
(raylet)   File "/usr/local/lib/python3.6/dist-packages/ray/dashboard/modules/reporter/reporter_agent.py", line 163, in __init__
(raylet)     dashboard_agent.metrics_export_port)
(raylet)   File "/usr/local/lib/python3.6/dist-packages/ray/_private/metrics_agent.py", line 79, in __init__
(raylet)     address=metrics_export_address)))
(raylet)   File "/usr/local/lib/python3.6/dist-packages/ray/_private/prometheus_exporter.py", line 333, in new_stats_exporter
(raylet)     options=option, gatherer=option.registry, collector=collector)
(raylet)   File "/usr/local/lib/python3.6/dist-packages/ray/_private/prometheus_exporter.py", line 265, in __init__
(raylet)     self.serve_http()
(raylet)   File "/usr/local/lib/python3.6/dist-packages/ray/_private/prometheus_exporter.py", line 320, in serve_http
(raylet)     port=self.options.port, addr=str(self.options.address))
(raylet)   File "/usr/local/lib/python3.6/dist-packages/prometheus_client/exposition.py", line 168, in start_wsgi_server
(raylet)     TmpServer.address_family, addr = _get_best_family(addr, port)
(raylet)   File "/usr/local/lib/python3.6/dist-packages/prometheus_client/exposition.py", line 157, in _get_best_family
(raylet)     infos = socket.getaddrinfo(address, port)
(raylet)   File "/usr/lib/python3.6/socket.py", line 745, in getaddrinfo
(raylet)     for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
(raylet) socket.gaierror: [Errno -2] Name or service not known
(raylet) 
(raylet) During handling of the above exception, another exception occurred:
(raylet) 
(raylet) Traceback (most recent call last):
(raylet)   File "/usr/local/lib/python3.6/dist-packages/ray/dashboard/agent.py", line 407, in <module>
(raylet)     gcs_publisher = GcsPublisher(args.gcs_address)
(raylet) TypeError: __init__() takes 1 positional argument but 2 were given

This happens when I use RLlib’s evaluate.py script (ray/evaluate.py at master · ray-project/ray · GitHub) to perform rollouts and save their results.
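Since the first traceback ends inside socket.getaddrinfo, one thing that can be checked outside of Ray is whether name resolution works on the machine at all. Below is a minimal sketch of such a check. The helper name can_resolve, the list of candidate hosts, and the port 8080 are my own choices for illustration; the traceback does not show which address Ray actually passed in.

```python
import socket


def can_resolve(host, port):
    """Return True if socket.getaddrinfo can resolve (host, port)."""
    try:
        socket.getaddrinfo(host, port)
        return True
    except socket.gaierror as exc:
        # Same exception type as in the traceback above.
        print(f"cannot resolve {host!r}: {exc}")
        return False


if __name__ == "__main__":
    # Candidate hosts are guesses, not values taken from the traceback.
    for host in ("127.0.0.1", "localhost", socket.gethostname()):
        print(host, can_resolve(host, 8080))
```

If the machine's own hostname fails here, that would at least point at the host's name resolution setup rather than at RLlib.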

Let test.py be a modified version of evaluate.py that registers a custom environment. When I run ./test.py --run PPO --env "pheromone_env" --episodes 10 --out rollout.pkl, my rollout reader script:

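import ray.cloudpickle as cloudpickle

# Read every object that was pickled back to back into rollout.pkl.
objects = []
with open("rollout.pkl", "rb") as openfile:
    while True:
        try:
            objects.append(cloudpickle.load(openfile))
        except EOFError:
            # End of file reached: no more pickled objects.
            break

print(objects)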

prints [[[]]] (empty). Do I have to modify the script for it to actually save something? Is the error above related to this?
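To rule out the reader itself, the same load loop can be tried on a file written with known dummy data. This is only a sketch: it uses the standard-library pickle module in place of ray.cloudpickle (I am assuming the two behave the same on the loading side), and the helper names and the dummy steps are made up for illustration.

```python
import os
import pickle
import tempfile


def load_all(path):
    """Load every object pickled back to back into one file."""
    objects = []
    with open(path, "rb") as f:
        while True:
            try:
                objects.append(pickle.load(f))
            except EOFError:
                break
    return objects


def round_trip(rollouts):
    """Dump each rollout separately to a temp file, then read them all back."""
    fd, path = tempfile.mkstemp(suffix=".pkl")
    try:
        with os.fdopen(fd, "wb") as f:
            for rollout in rollouts:
                pickle.dump(rollout, f)
        return load_all(path)
    finally:
        os.remove(path)


# Made-up data: two episodes of [obs, action, reward] steps.
dummy = [[[0, 1, 0.5], [1, 0, 1.0]], [[2, 1, 0.0]]]
restored = round_trip(dummy)
print(restored == dummy)
```

If the dummy data comes back intact, the reader loop is fine and the empty result would have to come from what was written to rollout.pkl.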

Thank you for any help in advance.


We’ve just started seeing this too. It is not specific to RLlib; it happens when launching Ray in general on Ubuntu.

Was there any feedback or responses from Ray committers?

I should add: we see this exception trace too when we try to use memory placement groups. We first saw it with Ray 1.11 on Ubuntu, then tried upgrading to Ray 1.13, but still get the same error.

PS: does this have any relation to recent build errors that require a protobuf rollback to v3.20? That’s the only dependency in our build which appeared to change, concurrent with this error.
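For comparing environments, a small sketch that prints the installed versions of the packages mentioned in this thread. It relies on importlib.metadata, which needs Python 3.8 or newer (so not the Python 3.6 from the trace above), and the package list is just my guess at what is relevant.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Packages named in this thread or in the traceback.
for name in ("ray", "protobuf", "prometheus-client"):
    print(name, installed_version(name))
```

Running this in a working and a failing environment would show whether protobuf really is the only version that differs.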