Ray monitoring fails when binding to empty address

How severely does this issue affect your experience of using Ray?

  • Medium: It makes completing my task significantly more difficult, but I can work around it.

I’m learning to use RLlib. I’ve been running it in my debugger on an example script, and it works, but for some reason I get an error message about the monitoring service failing. This is the traceback:

File "/home/ramrachum/.venvs/ray_env/lib/python3.10/site-packages/ray/autoscaler/_private/monitor.py", line 600, in <module>
  monitor = Monitor(
File "/home/ramrachum/.venvs/ray_env/lib/python3.10/site-packages/ray/autoscaler/_private/monitor.py", line 205, in __init__
  logger.exception(
File "/usr/lib/python3.10/logging/__init__.py", line 1512, in exception
  self.error(msg, *args, exc_info=exc_info, **kwargs)
File "/usr/lib/python3.10/logging/__init__.py", line 70, in error
File "/usr/lib/python3.10/logging/__init__.py", line 1911, in _LogErrorReplacement
  msg,
File "/home/ramrachum/.venvs/ray_env/lib/python3.10/site-packages/ray/autoscaler/_private/monitor.py", line 199, in __init__
  prometheus_client.start_http_server(
File "/home/ramrachum/.venvs/ray_env/lib/python3.10/site-packages/prometheus_client/exposition.py", line 168, in start_wsgi_server
  TmpServer.address_family, addr = _get_best_family(addr, port)
File "/home/ramrachum/.venvs/ray_env/lib/python3.10/site-packages/prometheus_client/exposition.py", line 157, in _get_best_family
  infos = socket.getaddrinfo(address, port)
File "/usr/lib/python3.10/socket.py", line 955, in getaddrinfo
  for res in _socket.getaddrinfo(host, port, family, type, proto, flags):

socket.gaierror: [Errno -5] No address associated with hostname

I’m trying to understand why this bug is happening and how I can fix it. The hostname it’s trying to use is '', which sounds like something that shouldn’t work. Working my way up the traceback, I see that in ray/autoscaler/_private/monitor.py line 201, there’s this logic:

addr="127.0.0.1" if head_node_ip == "127.0.0.1" else "",

Since head_node_ip is '192.168.1.116' in my case, the else branch is taken and an empty address is passed on to getaddrinfo.

I’m not sure what the logic of this code is. Can getaddrinfo even work with an empty string? How does this service work for people normally? How do I make it not fail?
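One way to check how getaddrinfo treats an empty host on a particular machine is a small probe like the one below. This is only a sketch: probe_host is a hypothetical helper name, port 8000 is an arbitrary choice, and the result for '' can differ between platforms and C library versions, so no output is assumed here.

```python
import socket


def probe_host(host, port=8000):
    """Return ("ok", first_address) or ("error", message) for the given host.

    `probe_host` and the default port are illustrative choices, not part of
    Ray or prometheus_client.
    """
    try:
        infos = socket.getaddrinfo(host, port)
    except socket.gaierror as exc:
        # Same exception type as in the traceback above.
        return ("error", str(exc))
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the resolved address string.
    return ("ok", infos[0][4][0])


if __name__ == "__main__":
    # Compare a literal loopback address, the empty string, and None.
    for candidate in ("127.0.0.1", "", None):
        print(repr(candidate), "->", probe_host(candidate))
```

If the '' case reports an error while '127.0.0.1' succeeds, that matches the failure in the traceback.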

cc @architkulkarni ?

I think this might be the same problem described here: https://github.com/ray-project/ray/pull/23766

Can you check what version of prometheus-client you’re using (pip show prometheus_client) where you’re seeing this error?

Here are all the versions of everything:

absl-py==1.2.0
aiosignal==1.2.0
argon2-cffi==21.3.0
argon2-cffi-bindings==21.2.0
asttokens==2.0.8
astunparse==1.6.3
attrs==22.1.0
backcall==0.2.0
beautifulsoup4==4.11.1
bleach==5.0.1
cachetools==5.2.0
certifi==2022.9.14
cffi==1.15.1
charset-normalizer==2.1.1
click==8.0.4
cloudpickle==2.2.0
contourpy==1.0.5
cycler==0.11.0
debugpy==1.6.3
decorator==5.1.1
defusedxml==0.7.1
distlib==0.3.6
dm-tree==0.1.7
entrypoints==0.4
executing==1.0.0
fastjsonschema==2.16.1
filelock==3.8.0
flatbuffers==2.0.7
fonttools==4.37.2
frozenlist==1.3.1
gast==0.4.0
google-auth==2.11.0
google-auth-oauthlib==0.4.6
google-pasta==0.2.0
grpcio==1.43.0
gym==0.23.1
gym-notices==0.0.8
h5py==3.7.0
idna==3.4
imageio==2.21.3
ipykernel==6.15.3
ipython==8.5.0
ipython-genutils==0.2.0
ipywidgets==8.0.2
jedi==0.18.1
Jinja2==3.1.2
jsonschema==4.16.0
jupyter==1.0.0
jupyter-console==6.4.4
jupyter-core==4.11.1
jupyter_client==7.3.5
jupyterlab-pygments==0.2.2
jupyterlab-widgets==3.0.3
keras==2.10.0
Keras-Preprocessing==1.1.2
kiwisolver==1.4.4
libclang==14.0.6
lxml==4.9.1
lz4==4.0.2
Markdown==3.4.1
MarkupSafe==2.1.1
matplotlib==3.6.0
matplotlib-inline==0.1.6
mistune==2.0.4
msgpack==1.0.4
nbclient==0.6.8
nbconvert==7.0.0
nbformat==5.5.0
nest-asyncio==1.5.5
networkx==2.8.6
notebook==6.4.12
numpy==1.23.3
oauthlib==3.2.1
opt-einsum==3.3.0
packaging==21.3
pandas==1.4.4
pandocfilters==1.5.0
parso==0.8.3
pexpect==4.8.0
pickleshare==0.7.5
Pillow==9.2.0
platformdirs==2.5.2
prometheus-client==0.14.1
prompt-toolkit==3.0.31
protobuf==3.19.5
psutil==5.9.2
ptyprocess==0.7.0
pure-eval==0.2.2
pyasn1==0.4.8
pyasn1-modules==0.2.8
pycparser==2.21
Pygments==2.13.0
pyparsing==3.0.9
pyrsistent==0.18.1
python-dateutil==2.8.2
pytz==2022.2.1
PyWavelets==1.4.0
PyYAML==6.0
pyzmq==24.0.0
qtconsole==5.3.2
QtPy==2.2.0
ray==2.0.0
requests==2.28.1
requests-oauthlib==1.3.1
rsa==4.9
scikit-image==0.19.3
scipy==1.9.1
Send2Trash==1.8.0
six==1.16.0
soupsieve==2.3.2.post1
stack-data==0.5.0
tabulate==0.8.10
tensorboard==2.10.0
tensorboard-data-server==0.6.1
tensorboard-plugin-wit==1.8.1
tensorboardX==2.5.1
tensorflow==2.10.0
tensorflow-estimator==2.10.0
tensorflow-io-gcs-filesystem==0.27.0
termcolor==2.0.1
terminado==0.15.0
tifffile==2022.8.12
tinycss2==1.1.1
tornado==6.2
traitlets==5.4.0
typing_extensions==4.3.0
urllib3==1.26.12
virtualenv==20.16.5
wcwidth==0.2.5
webencodings==0.5.1
Werkzeug==2.2.2
widgetsnbextension==4.0.3
wrapt==1.14.1

Looks like you’re on 0.14.1, which has this breaking change. You could either downgrade prometheus-client with pip install prometheus-client==0.13, or use a Ray nightly wheel, which should include the compatibility fix.
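To check an environment for this quickly, something like the sketch below could work. The helper names are hypothetical, and the 0.14 threshold is an assumption inferred from this thread (0.14.1 is reported as affected and 0.13 is suggested as the workaround); it is not taken from the prometheus-client changelog.

```python
from importlib import metadata


def version_tuple(version):
    """Turn '0.14.1' into (0, 14, 1), stopping at the first non-numeric part."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def has_breaking_change(version):
    # Assumption: releases from 0.14 onward are affected. Only 0.14.1 is
    # confirmed in this thread.
    return version_tuple(version) >= (0, 14)


def installed_prometheus_client():
    """Return the installed prometheus-client version string, or None."""
    try:
        return metadata.version("prometheus-client")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_prometheus_client()
    if found is None:
        print("prometheus-client is not installed")
    elif has_breaking_change(found):
        print(f"prometheus-client {found}: consider pinning to 0.13 or using a Ray nightly")
    else:
        print(f"prometheus-client {found}: predates the version reported as affected here")
```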

I didn’t know it was a known bug. Where is the issue for this bug?

Oops, I’m tired and I missed a message. I just saw that the issue link was posted above. Sorry.