ray status cannot find the Ray cluster even though the cluster is working

  • Low: It annoys or frustrates me for a moment.

Greetings! I’m still testing the stability of a Ray cluster that I set up on a few computers connected over a local network. I start the cluster by running ray start, and I monitor it from the head node using both the dashboard and ray status.

The issue I ran into is that ray status worked right after the cluster was started, but later it began failing with "Ray cluster is not found at 192.168.1.100:6379", caused by a gRPC "Deadline Exceeded" error. Interestingly, the cluster itself was still running: existing and new jobs still finished, and the dashboard kept updating to reflect the workload. It would be nice for ray status to work reliably, because it shows the number of pending tasks, which is not available on the dashboard.

Error message:

Traceback (most recent call last):
  File "/home/xyz/.local/lib/python3.10/site-packages/ray/_private/gcs_utils.py", line 120, in check_health
    resp = stub.CheckAlive(req, timeout=timeout)
  File "/home/xyz/.local/lib/python3.10/site-packages/grpc/_channel.py", line 946, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/home/xyz/.local/lib/python3.10/site-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
    raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
        status = StatusCode.DEADLINE_EXCEEDED
        details = "Deadline Exceeded"
        debug_error_string = "{"created":"@1666548607.243184041","description":"Deadline Exceeded","file":"src/core/ext/filters/deadline/deadline_filter.cc","file_line":81,"grpc_status":4}"
>
Ray cluster is not found at 192.168.1.100:6379

I’m using Ray 2.0.0.
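
The traceback above shows a gRPC DEADLINE_EXCEEDED status rather than a refused connection, so one thing worth separating out is whether the port on the head node is reachable at all when ray status fails. Below is a minimal sketch that uses only the Python standard library. It is not part of Ray's API, and the names probe_tcp and probe_repeatedly are made up for illustration. It only checks that a TCP connection can be opened; it does not confirm that the GCS answers health checks.

```python
import socket
import time


def probe_tcp(host: str, port: int, timeout: float = 2.0) -> tuple[bool, float]:
    """Try to open a TCP connection; return (reachable, elapsed_seconds)."""
    start = time.monotonic()
    try:
        # create_connection raises OSError (including timeouts) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True, time.monotonic() - start
    except OSError:
        return False, time.monotonic() - start


def probe_repeatedly(host: str, port: int, attempts: int = 5,
                     delay: float = 1.0) -> list[tuple[bool, float]]:
    """Probe several times, sleeping `delay` seconds between attempts."""
    results = []
    for i in range(attempts):
        results.append(probe_tcp(host, port))
        if i < attempts - 1:
            time.sleep(delay)
    return results
```

For example, calling probe_repeatedly("192.168.1.100", 6379) on the head node while ray status is failing would show whether connections succeed and how long each one takes. If the port is reachable but ray status still times out, that would suggest the server behind the port is slow to respond rather than unreachable, which may be a useful detail to include in a bug report.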

Hi @zzb3886, sorry you’re running into this. It is interesting that the dashboard is working but the status API is not. Could you please file an issue on GitHub with any more details you have about this (e.g., when does it happen, and how often)?

Thank you. I have created the issue "[Ray Cluster] Cannot connect to head after submitting jobs" (Issue #29696 in ray-project/ray on GitHub).
