Having trouble connecting to head node

I have two machines on my local network. I have already set up Windows network sharing, and the two machines are discoverable and accessible to each other. Both machines have the same virtual environments, Python version, and Ray version installed.

I can successfully initialize the head node via ray start --head. The output tells me to use ray start --address='127.0.0.1:6379' to connect from another node.

When I run this command on the other machine, I receive this error message:
Unable to connect to GCS at 127.0.0.1:6379. Check that (1) Ray GCS with matching version started successfully at the specified address, and (2) there is no firewall setting preventing access.

I have already confirmed that the versions match, and I do not see any potential firewall issues, although I do not know which common firewall settings to check. I do know that network sharing is open: I can navigate into and access the other computer.
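One way to rule the firewall in or out is to test whether the head node's port accepts TCP connections from the second machine. The following is a minimal sketch using only the standard library; the function name can_connect is made up for this illustration, and it only checks that something is listening on the port, not that it is a matching Ray GCS.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within `timeout`."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable networks.
        return False
```

Running can_connect with the head node's LAN address and port 6379 from the second machine should return True once the head is reachable. A False result points at the address, the port, or a firewall rule rather than at a Ray version mismatch.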

On the first machine, you’re running ray start --head.

On the second machine, you’re running ray start --address='127.0.0.1:6379'.

I believe the issue is that 127.0.0.1 is incorrect (the fact that we’re printing that address looks like our fault!). The second machine needs the address of the first machine, but 127.0.0.1 always refers to the local machine.
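To make the loopback point concrete, here is a small sketch that flags an address that can never work from a different machine. The helper name is_loopback_address is hypothetical, and it assumes an IPv4 "ip:port" string or a bare IP literal (a hostname would raise ValueError).

```python
import ipaddress


def is_loopback_address(address: str) -> bool:
    """Return True if the host part of an IPv4 'ip:port' (or bare IP) string is loopback."""
    # Drop the ':port' suffix if present; assumes IPv4, so at most one colon.
    host = address.rsplit(":", 1)[0] if ":" in address else address
    return ipaddress.ip_address(host).is_loopback


print(is_loopback_address("127.0.0.1:6379"))        # loopback, only valid on the head itself
print(is_loopback_address("192.168.254.123:6379"))  # a LAN address another machine can use
```

Any address in 127.0.0.0/8 refers to the machine that is making the connection, so a worker given such an address will look for the head on itself.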

Could you figure out the IP address of the first machine and replace 127.0.0.1 with that? You might be able to do this by running something like

import socket
print(socket.gethostbyname(socket.gethostname()))

on the first machine, though I’m not 100% sure.

Let me know if this works or if it makes sense.

Thank you for responding! I found the address, let's say 192.168.x.x, and ran the following (with the real IP address substituted for the 'x' placeholders):

ray start --head --address=192.168.x.x

Usage stats collection will be enabled by default in the next release. See https://github.com/ray-project/ray/issues/20857 for more details.
Specifying --address for external Redis address is deprecated. Please specify environment variable RAY_REDIS_ADDRESS=192.168.x.x instead.
Will use `192.168.x.x` as external Redis server address(es). If the primary one is not reachable, we starts new one(s) with `--port` in local.
The primary external redis server `192.168.x.x` is not reachable. Will starts new one(s) with `--port` in local.
Local node IP: 127.0.0.1

So when I specify that address explicitly, it reverts to the local address. Also, the 192.168.x.x address can only be seen by machines on the network, correct? If so, that’s the setup I am looking for. However, it would be useful to know whether it is possible to add worker nodes via the CLI using an address external to the network.

Thanks for the response. I found the IP, but when I start Ray with the 192.168.x.x address, it appears to error out and fall back to the local IP:

ray start --head --address=192.168.254.123

Specifying --address for external Redis address is deprecated. Please specify environment variable RAY_REDIS_ADDRESS=192.168.254.123 instead.
Will use 192.168.254.123 as external Redis server address(es). If the primary one is not reachable, we starts new one(s) with --port in local.
The primary external redis server 192.168.254.123 is not reachable. Will starts new one(s) with --port in local.
Local node IP: 127.0.0.1

The --address=... should be passed into ray start on the second node. From your comment, it looks like you used it on the first node?

Yeah, when I don’t pass the --address argument on the first node (the head node), it sets itself up on a local IP (127.0.0.1), which makes it impossible to connect the second node. When I call ray start --head, it tells me to invoke ray start --address 127.0.0.1 on subsequent nodes, which does not work because the other nodes are on the same network but are different machines.

Can you share more about the operating system that you are on?

What happens when you run the following? (copied and pasted from Ray)

import errno, socket

def node_ip_address_from_perspective(address):
    """IP address by which the local node can be reached *from* the `address`.

    Args:
        address (str): The IP address and port of any known live service on the
            network you care about.

    Returns:
        The IP address by which the local node can be reached from the address.
    """
    ip_address, port = address.split(":")
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # This command will raise an exception if there is no internet
        # connection.
        s.connect((ip_address, int(port)))
        node_ip_address = s.getsockname()[0]
    except OSError as e:
        node_ip_address = "127.0.0.1"
        # [Errno 101] Network is unreachable
        if e.errno == errno.ENETUNREACH:
            try:
                # try get node ip address from host name
                host_name = socket.getfqdn(socket.gethostname())
                node_ip_address = socket.gethostbyname(host_name)
            except Exception:
                pass
    finally:
        s.close()

    return node_ip_address

print(node_ip_address_from_perspective("8.8.8.8:53"))

Separately, have you tried running ray start --head on the first machine and then running ray start --address=192.168.254.123:6379 on the second machine (replacing the IP address with the IP address of the first machine if the one above is not correct)? Can you try that?

Thanks, I have tried both. The code snippet appeared to run without hitting the exception handler. As for connecting the second machine, it will not connect when I start the head on machine 1 and run that command with 192.168.x.x (the correct address and port for machine 1 on the network).

I am running Windows 10 OS.

So you are creating a cluster of two Windows 10 machines, is that right?

Correct, and Ray always initializes the head node using the local 127.0.0.1 address. The second machine cannot connect even when using the head node machine’s network address, 192.168.x.x. They are on the same network, and network sharing is enabled and working in Windows Explorer between the two.

In almost all examples of Ray online, ray start --head reports that it has initialized the node on a network address like 192.168.x.x, so I’m not sure why it’s defaulting to local.

I will also note that the two machines have the same Python environments, and I installed Ray via pip (pip install ray and then pip install ray[default]). Despite running pip install ray[default], when I tried to start Ray from the command line on both machines, it told me redis needed to be installed (even though other dependencies had been installed automatically with Ray). So I installed it manually via pip install redis.

Hi @Chrome,

For Windows (and Mac OSX), we actually use 127.0.0.1 by default to avoid security popups (Listen to 127.0.0.1 by default on mac osx by jjyao · Pull Request #18904 · ray-project/ray · GitHub). To work around that, you can use ray start --head --node-ip-address=192.168.x.x to override the default IP.

Thank you, this worked. In this case there is no major security concern, correct? Because this address is only accessible from within the network? It would only be an issue if the address were external?

So I can now create the cluster and verify the connection, and the dashboard shows both machines as active. However, when I try to use Ray in Python via Pool() (from ray.util.multiprocessing import Pool), I get an error.

from ray.util.multiprocessing import Pool
_address = "auto"
_address = "192.168.x.x:8888"  # This doesn't work either ('x' replaced with actual IP)
Pool(ray_address=_address)

Stdout shows the following error message:
[2022-04-27 04:20:25,784 E 62524 97184] core_worker_process.cc:223: Failed to get the system config from raylet because it is dead. Worker will terminate. Status: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details: .Please see raylet.out for more details.

I believe I got it working, but I had to use --node-ip-address on the second machine along with the --address argument to connect it to the head node. The second node was connecting, but it was also using a local IP, which made it unreachable when running the pool.

I did notice that when I use Pool() in Python connected to Ray, it initializes with a warning that one of the nodes is using the local 127.0.0.1 address (even though both nodes are now manually assigned 192.168.x.x addresses). The pool appears to work, but the warning still appears.
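For reference, the combination of flags that ended up working in this thread can be sketched as two small command builders. The function names and the worker IP below are made up for illustration. The flags themselves (--head, --address, --node-ip-address) are the ones discussed above, and 6379 is the port from the original ray start output.

```python
from typing import List


def head_command(head_ip: str) -> List[str]:
    """Argument list for starting the head node bound to its LAN address."""
    return ["ray", "start", "--head", f"--node-ip-address={head_ip}"]


def worker_command(head_ip: str, worker_ip: str, port: int = 6379) -> List[str]:
    """Argument list for joining a worker, advertising the worker's own LAN address."""
    return [
        "ray",
        "start",
        f"--address={head_ip}:{port}",
        f"--node-ip-address={worker_ip}",
    ]


# 192.168.254.123 is the head IP quoted earlier; the worker IP is a hypothetical placeholder.
print(" ".join(head_command("192.168.254.123")))
print(" ".join(worker_command("192.168.254.123", "192.168.254.124")))
```

Each list could be passed to subprocess.run(cmd, check=True) on the respective machine, or the printed command can simply be typed into a terminal.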
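A quick way to find which node still reports a loopback address is to scan the cluster's node records. The sketch below is a pure function over a list of dictionaries. It assumes records shaped like the ones ray.nodes() returns, with "NodeManagerAddress" and "Alive" keys; treat that shape as an assumption and check it against your Ray version. The function name is hypothetical.

```python
import ipaddress
from typing import Dict, List


def nodes_on_loopback(nodes: List[Dict]) -> List[str]:
    """Return the addresses of alive nodes that registered with a loopback IP."""
    flagged = []
    for node in nodes:
        addr = node.get("NodeManagerAddress", "")
        # Ignore dead nodes and records without an address.
        if not node.get("Alive", False) or not addr:
            continue
        try:
            if ipaddress.ip_address(addr).is_loopback:
                flagged.append(addr)
        except ValueError:
            # Not an IP literal (for example a hostname); skip it.
            continue
    return flagged
```

After connecting with ray.init(address="auto"), calling nodes_on_loopback(ray.nodes()) would list any live node that is still advertising 127.0.0.1. A stale record from an earlier start of the same machine is one possible source of such a warning, though the thread does not confirm the cause.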

Could you share the warning message you saw?

Glad it’s working now. We will improve the UX on that (at least make it clear that Windows and Mac use localhost by default).