Ray status can connect but client code cannot

I am trying to run a Ray Tune example on a single SLURM worker with multiple GPUs.

My SLURM job launches a Ray head and a Ray worker, and then runs a Python script that tries to launch Tune. I cannot get the client code to connect to the cluster, but a ray status run from that same container does connect and report back.

The client shows the following in the Ray driver logs under /tmp:

root@worker1:/code# cat /tmp/ray/session_latest/logs/python-core-driver-03000000ffffffffffffffffffffffffffffffffffffffffffffffff_99685.log 
[2021-07-27 13:02:13,074 I 99685 99685] core_worker.cc:139: Constructing CoreWorkerProcess. pid: 99685
[2021-07-27 13:02:13,084 I 99685 99685] core_worker.cc:361: Constructing CoreWorker, worker_id: 03000000ffffffffffffffffffffffffffffffffffffffffffffffff
[2021-07-27 13:02:14,085 I 99685 99685] client_connection.cc:53: Retrying to connect to socket for endpoint /tmp/ray/session_2021-07-26_21-16-40_313157_7330/sockets/raylet (num_attempts = 1, num_retries = 10)
[2021-07-27 13:02:15,085 I 99685 99685] client_connection.cc:53: Retrying to connect to socket for endpoint /tmp/ray/session_2021-07-26_21-16-40_313157_7330/sockets/raylet (num_attempts = 2, num_retries = 10)
[2021-07-27 13:02:16,085 I 99685 99685] client_connection.cc:53: Retrying to connect to socket for endpoint /tmp/ray/session_2021-07-26_21-16-40_313157_7330/sockets/raylet (num_attempts = 3, num_retries = 10)
[2021-07-27 13:02:17,086 I 99685 99685] client_connection.cc:53: Retrying to connect to socket for endpoint /tmp/ray/session_2021-07-26_21-16-40_313157_7330/sockets/raylet (num_attempts = 4, num_retries = 10)
[2021-07-27 13:02:18,086 I 99685 99685] client_connection.cc:53: Retrying to connect to socket for endpoint /tmp/ray/session_2021-07-26_21-16-40_313157_7330/sockets/raylet (num_attempts = 5, num_retries = 10)
[2021-07-27 13:02:19,086 I 99685 99685] client_connection.cc:53: Retrying to connect to socket for endpoint /tmp/ray/session_2021-07-26_21-16-40_313157_7330/sockets/raylet (num_attempts = 6, num_retries = 10)
[2021-07-27 13:02:20,087 I 99685 99685] client_connection.cc:53: Retrying to connect to socket for endpoint /tmp/ray/session_2021-07-26_21-16-40_313157_7330/sockets/raylet (num_attempts = 7, num_retries = 10)
[2021-07-27 13:02:21,088 I 99685 99685] client_connection.cc:53: Retrying to connect to socket for endpoint /tmp/ray/session_2021-07-26_21-16-40_313157_7330/sockets/raylet (num_attempts = 8, num_retries = 10)
[2021-07-27 13:02:22,088 I 99685 99685] client_connection.cc:53: Retrying to connect to socket for endpoint /tmp/ray/session_2021-07-26_21-16-40_313157_7330/sockets/raylet (num_attempts = 9, num_retries = 10)
[2021-07-27 13:02:23,088 C 99685 99685] raylet_client.cc:57: Could not connect to socket /tmp/ray/session_2021-07-26_21-16-40_313157_7330/sockets/raylet
[2021-07-27 13:02:23,088 E 99685 99685] logging.cc:441: *** Aborted at 1627390943 (unix time) try "date -d @1627390943" if you are using GNU date ***
[2021-07-27 13:02:23,092 E 99685 99685] logging.cc:441: PC: @                0x0 (unknown)
[2021-07-27 13:02:23,092 E 99685 99685] logging.cc:441: *** SIGABRT (@0x18565) received by PID 99685 (TID 0x7fee4326c740) from PID 99685; stack trace: ***
[2021-07-27 13:02:23,094 E 99685 99685] logging.cc:441:     @     0x7fed83644b8f google::(anonymous namespace)::FailureSignalHandler()
[2021-07-27 13:02:23,097 E 99685 99685] logging.cc:441:     @     0x7fee435dd3c0 (unknown)
[2021-07-27 13:02:23,100 E 99685 99685] logging.cc:441:     @     0x7fee432b518b gsignal
[2021-07-27 13:02:23,103 E 99685 99685] logging.cc:441:     @     0x7fee43294859 abort
[2021-07-27 13:02:23,105 E 99685 99685] logging.cc:441:     @     0x7fed836342ce ray::SpdLogMessage::Flush()
[2021-07-27 13:02:23,108 E 99685 99685] logging.cc:441:     @     0x7fed8363439d ray::RayLog::~RayLog()
[2021-07-27 13:02:23,110 E 99685 99685] logging.cc:441:     @     0x7fed8329b085 ray::raylet::RayletConnection::RayletConnection()
[2021-07-27 13:02:23,112 E 99685 99685] logging.cc:441:     @     0x7fed8329b292 ray::raylet::RayletClient::RayletClient()
[2021-07-27 13:02:23,115 E 99685 99685] logging.cc:441:     @     0x7fed8321ba2c ray::CoreWorker::CoreWorker()
[2021-07-27 13:02:23,118 E 99685 99685] logging.cc:441:     @     0x7fed8321f137 ray::CoreWorkerProcess::CreateWorker()
[2021-07-27 13:02:23,121 E 99685 99685] logging.cc:441:     @     0x7fed83232578 ray::CoreWorkerProcess::CoreWorkerProcess()
[2021-07-27 13:02:23,123 E 99685 99685] logging.cc:441:     @     0x7fed832334ee ray::CoreWorkerProcess::Initialize()
[2021-07-27 13:02:23,125 E 99685 99685] logging.cc:441:     @     0x7fed831155e9 __pyx_pw_3ray_7_raylet_10CoreWorker_1__cinit__()
[2021-07-27 13:02:23,126 E 99685 99685] logging.cc:441:     @     0x7fed83116d8f __pyx_tp_new_3ray_7_raylet_CoreWorker()
[2021-07-27 13:02:23,128 E 99685 99685] logging.cc:441:     @     0x560be3699d1c _PyObject_MakeTpCall
[2021-07-27 13:02:23,130 E 99685 99685] logging.cc:441:     @     0x560be373b986 _PyEval_EvalFrameDefault
[2021-07-27 13:02:23,132 E 99685 99685] logging.cc:441:     @     0x560be371d433 _PyEval_EvalCodeWithName
[2021-07-27 13:02:23,134 E 99685 99685] logging.cc:441:     @     0x560be371e818 _PyFunction_Vectorcall
[2021-07-27 13:02:23,137 E 99685 99685] logging.cc:441:     @     0x560be3737eb2 _PyEval_EvalFrameDefault
[2021-07-27 13:02:23,138 E 99685 99685] logging.cc:441:     @     0x560be371d433 _PyEval_EvalCodeWithName
[2021-07-27 13:02:23,140 E 99685 99685] logging.cc:441:     @     0x560be371e818 _PyFunction_Vectorcall
[2021-07-27 13:02:23,142 E 99685 99685] logging.cc:441:     @     0x560be3688b6e PyObject_Call
[2021-07-27 13:02:23,144 E 99685 99685] logging.cc:441:     @     0x560be373884f _PyEval_EvalFrameDefault
[2021-07-27 13:02:23,146 E 99685 99685] logging.cc:441:     @     0x560be371d433 _PyEval_EvalCodeWithName
[2021-07-27 13:02:23,148 E 99685 99685] logging.cc:441:     @     0x560be371e818 _PyFunction_Vectorcall
[2021-07-27 13:02:23,150 E 99685 99685] logging.cc:441:     @     0x560be3737eb2 _PyEval_EvalFrameDefault
[2021-07-27 13:02:23,152 E 99685 99685] logging.cc:441:     @     0x560be371d433 _PyEval_EvalCodeWithName
[2021-07-27 13:02:23,154 E 99685 99685] logging.cc:441:     @     0x560be371e499 PyEval_EvalCodeEx
[2021-07-27 13:02:23,156 E 99685 99685] logging.cc:441:     @     0x560be37b9ecb PyEval_EvalCode
[2021-07-27 13:02:23,156 E 99685 99685] logging.cc:441:     @     0x560be37b9f63 run_eval_code_obj
[2021-07-27 13:02:23,156 E 99685 99685] logging.cc:441:     @     0x560be37d6033 run_mod
[2021-07-27 13:02:23,157 E 99685 99685] logging.cc:441:     @     0x560be37db022 pyrun_file

Yet the output of ray status looks like this:

root@worker1:/code# ray status --address '10.199.200.100:6379' --redis_password 'fake-c22c-4d26-893c-ca61ae6ed410'
======== Cluster status: 2021-07-27 13:52:23.810234 ========
Node status
------------------------------------------------------------
 1 node(s) with resources: {'GPU': 1.0, 'CPU': 2.0, 'node:10.199.200.100': 1.0, 'accelerator_type:T4': 1.0, 'object_store_memory': 29850713702.0, 'memory': 59701427406.0}

Resources
------------------------------------------------------------
Usage:
 0.0/2.0 CPU
 0.0/1.0 GPU
 0.0/1.0 accelerator_type:T4
 0.00/55.601 GiB memory
 0.00/27.801 GiB object_store_memory

Demands:
 (no resource demands)

The relevant part of the client code, which never makes it past ray.init(), looks like this:

import logging
import os

import ray

if __name__ == "__main__":
    ray.init(
        address='10.199.200.100:6379',
        _redis_password=os.environ["redis_password"],
        log_to_driver=True,
        configure_logging=True,
        logging_level=logging.DEBUG,
        dashboard_host="0.0.0.0",
    )

Please advise…

Do you get any error message at all when calling ray.init() (other than the ones in the logs under /tmp)?

Did you validate that os.environ["redis_password"] is actually set to the correct value?
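
If you want to double-check the password outside of Ray, a direct Redis ping should tell you whether the address and password are both accepted. A rough sketch, assuming the redis-py package that Ray pulls in is importable:

import os
import redis

# Ping the head node's Redis directly, bypassing Ray entirely.
# A wrong password raises redis.exceptions.ResponseError;
# an unreachable address raises redis.exceptions.ConnectionError.
r = redis.Redis(host="10.199.200.100", port=6379,
                password=os.environ["redis_password"])
print(r.ping())  # True if both address and password are correct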

No other error messages; I just see the SIGABRT show up in the SLURM output. If I use a bogus password, I immediately get redis.exceptions.ResponseError: WRONGPASS invalid username-password pair, so the password itself seems correct.

So it hangs on ray.init()? Can you try py-spy dump --pid [pid], using the PID of the Python process calling ray.init()? Then we might get an idea of where it hangs (though it will probably look similar to the log output you already posted).

Here it is:

Process 266469: python examples/mnist_pytorch_trainable.py
Python v3.8.10 (/opt/conda/bin/python3.8)

Thread 266469 (idle): "MainThread"
    connect (ray/worker.py:1257)
    init (ray/worker.py:851)
    wrapper (ray/_private/client_mode_hook.py:62)
    <module> (mnist_pytorch_trainable.py:72)

If I drop into a debugger at ray/worker.py:1257, I see:

(Pdb) pp vars()
{'driver_name': 'examples/mnist_pytorch_trainable.py',
 'driver_object_store_memory': None,
 'error_message': 'Perhaps you called ray.init twice by accident?',
 'gcs_options': <ray._raylet.GcsClientOptions object at 0x7fe254d0ac50>,
 'job_config': <ray.job_config.JobConfig object at 0x7fe254c498b0>,
 'job_id': JobID(08000000),
 'log_stderr_file_path': '',
 'log_stdout_file_path': '',
 'log_to_driver': True,
 'main': <module '__main__' from 'examples/mnist_pytorch_trainable.py'>,
 'mode': 0,
 'namespace': None,
 'node': <ray.node.Node object at 0x7fe254c49610>,
 'o': 0,
 'pdb': <module 'pdb' from '/opt/conda/lib/python3.8/pdb.py'>,
 'redis_address': '10.199.200.100',
 'redis_port': '6379',
 'runtime_env_hash': 0,
 'serialized_job_config': b'\x10\x01*\x04\n\x02{}2$b3842203-e78b-49b8-84ff-29'
                          b'8780c0c689:\x02{}',
 'worker': <ray.worker.Worker object at 0x7fe25d51d580>}
(Pdb) pp vars(node)
{'_config': {},
 '_incremental_dict': defaultdict(<function Node._init_temp.<locals>.<lambda> at 0x7fe254c4be50>, {}),
 '_localhost': '127.0.0.1',
 '_logs_dir': '/tmp/ray/session_2021-07-27_16-56-53_761787_122883/logs',
 '_metrics_export_port': 53028,
 '_node_ip_address': '10.199.200.100',
 '_plasma_store_socket_name': '/tmp/ray/session_2021-07-27_16-56-53_761787_122883/sockets/plasma_store',
 '_ray_params': <ray._private.parameter.RayParams object at 0x7fe254c495e0>,
 '_raylet_ip_address': '10.199.200.100',
 '_raylet_socket_name': '/tmp/ray/session_2021-07-27_16-56-53_761787_122883/sockets/raylet',
 '_redis_address': '10.199.200.100:6379',
 '_resource_dir': '/tmp/ray/session_2021-07-27_16-56-53_761787_122883/runtime_resources',
 '_resource_spec': None,
 '_session_dir': '/tmp/ray/session_2021-07-27_16-56-53_761787_122883',
 '_sockets_dir': '/tmp/ray/session_2021-07-27_16-56-53_761787_122883/sockets',
 '_temp_dir': '/tmp/ray',
 '_webui_url': '127.0.0.1:8265',
 'all_processes': {},
 'backup_count': 5,
 'head': False,
 'kernel_fate_share': False,
 'max_bytes': 536870912,
 'metrics_agent_port': 59078,
 'removal_lock': <unlocked _thread.lock object at 0x7fe254c49690>,
 'session_name': 'session_2021-07-27_16-56-53_761787_122883'}
(Pdb) pp vars(worker)
{'_load_code_from_local': False,
 '_session_index': 0,
 'actors': {},
 'cached_functions_to_run': [],
 'debugger_breakpoint': b'',
 'debugger_get_breakpoint': b'',
 'filter_logs_by_job': True,
 'function_actor_manager': <ray._private.function_manager.FunctionActorManager object at 0x7fe25d51d490>,
 'lock': <unlocked _thread.RLock object owner=0 count=0 at 0x7fe254c49870>,
 'memory_monitor': <ray._private.memory_monitor.MemoryMonitor object at 0x7fe25c5982e0>,
 'mode': 0,
 'node': <ray.node.Node object at 0x7fe254c49610>,
 'original_gpu_ids': ['0'],
 'redis_client': Redis<ConnectionPool<Connection<host=10.199.200.100,port=6379,db=0>>>,
 'serialization_context_map': {},
 'threads_stopped': <threading.Event object at 0x7fe25c598ac0>}

@kai Is there anything else I can provide?

OK, so it seems that ray status is not using ray.init() to connect to the cluster; rather, it connects directly to Redis and fetches the autoscaler status.

This means that your Redis is reachable. That’s good news.

My main question now is where you’re running the client code. Are you running this on the head node, or is this some local terminal?

ray.init() adds your local machine to the cluster, so this could also be the source of the problem. Could you try using Ray Client instead?
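
If I remember the defaults correctly, the Ray Client server listens on port 10001 on the head node, so the connection would look roughly like this (a sketch, assuming your Ray version accepts ray:// addresses in ray.init):

import ray

# Connect through the Ray Client server on the head node instead of
# joining the cluster as a local driver; 10001 is the default client port.
ray.init("ray://10.199.200.100:10001")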

I am trying to run a Ray Tune example on a single SLURM worker with multiple GPUs. That is, everything (the head and worker processes, and the client Python script) runs on the same machine.

Here's what the Ray Client ray.init looks like:

>>> import ray
>>> ray.init("ray://10.199.200.100:10001")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/conda/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 62, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/ray/worker.py", line 727, in init
    redis_address, _, _ = services.validate_redis_address(address)
  File "/opt/conda/lib/python3.8/site-packages/ray/_private/services.py", line 362, in validate_redis_address
    redis_address = address_to_ip(address)
  File "/opt/conda/lib/python3.8/site-packages/ray/_private/services.py", line 394, in address_to_ip
    ip_address = socket.gethostbyname(address_parts[0])
socket.gaierror: [Errno -2] Name or service not known

Not positive, but that seems like it might mean it doesn’t like the “ray://” part of that URL.
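
My understanding is that older Ray releases exposed Ray Client through ray.util.connect rather than a ray:// address, so maybe that is what my version expects. Something like this is what I would try next (just a guess on my part):

import ray.util

# Possible fallback for Ray versions whose ray.init() does not
# understand the ray:// scheme: connect to the client server directly.
ray.util.connect("10.199.200.100:10001")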

Just to confirm, there is a tcp6 listener on port 10001.

root@worker1:/workspace# netstat -an | grep 10001
tcp6       0      0 :::10001                :::*                    LISTEN

And it’s reachable:

root@worker1:/workspace# host=10.199.200.100 &&  port=10001 && echo > /dev/tcp/$host/$port &>/dev/null && echo "it's up" || echo "it's down"
it's up
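
The same reachability check from Python, in case the /dev/tcp shell trick behaves oddly inside the container (quick sketch):

import socket

# Succeeds only if something accepts a TCP connection on the client
# server port within the timeout; raises OSError otherwise.
with socket.create_connection(("10.199.200.100", 10001), timeout=5):
    print("it's up")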

@kai Is there perhaps a separate install needed for the client?

Usually not; I'm really not sure what is happening here. Are there any networking rules/iptables in place on the SLURM cluster?

I’ll cc @thomasdesr who might have an idea here

It's all on a single node, and I confirmed the host firewall is not running:

[worker1 ~]$ firewall-cmd --state
not running

Thanks…