I am trying to run a Ray Tune example on a single SLURM worker with multiple GPUs.
My SLURM job launches a Ray head and a Ray worker, and then runs a Python script that tries to launch Tune. The client code cannot connect to the cluster, BUT a ray status run from that same container does connect and report back.
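For context, the launch sequence inside the job looks roughly like this. This is a minimal sketch, not my exact script: the resource directives, the hard-coded head address, and the script name tune_example.py are placeholders, and the password actually comes from the job environment.

```shell
#!/bin/bash
#SBATCH --job-name=ray-tune
#SBATCH --nodes=1
#SBATCH --gres=gpu:2

# Start the head process on this node.
ray start --head --port=6379 \
    --redis-password="$redis_password"

# Start a worker process that joins the head on the same node.
ray start --address='10.199.200.100:6379' \
    --redis-password="$redis_password"

# Finally run the Tune driver script (name is a placeholder).
python tune_example.py
```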
The client's driver log under /tmp shows the following:
root@worker1:/code# cat /tmp/ray/session_latest/logs/python-core-driver-03000000ffffffffffffffffffffffffffffffffffffffffffffffff_99685.log
[2021-07-27 13:02:13,074 I 99685 99685] core_worker.cc:139: Constructing CoreWorkerProcess. pid: 99685
[2021-07-27 13:02:13,084 I 99685 99685] core_worker.cc:361: Constructing CoreWorker, worker_id: 03000000ffffffffffffffffffffffffffffffffffffffffffffffff
[2021-07-27 13:02:14,085 I 99685 99685] client_connection.cc:53: Retrying to connect to socket for endpoint /tmp/ray/session_2021-07-26_21-16-40_313157_7330/sockets/raylet (num_attempts = 1, num_retries = 10)
[2021-07-27 13:02:15,085 I 99685 99685] client_connection.cc:53: Retrying to connect to socket for endpoint /tmp/ray/session_2021-07-26_21-16-40_313157_7330/sockets/raylet (num_attempts = 2, num_retries = 10)
[2021-07-27 13:02:16,085 I 99685 99685] client_connection.cc:53: Retrying to connect to socket for endpoint /tmp/ray/session_2021-07-26_21-16-40_313157_7330/sockets/raylet (num_attempts = 3, num_retries = 10)
[2021-07-27 13:02:17,086 I 99685 99685] client_connection.cc:53: Retrying to connect to socket for endpoint /tmp/ray/session_2021-07-26_21-16-40_313157_7330/sockets/raylet (num_attempts = 4, num_retries = 10)
[2021-07-27 13:02:18,086 I 99685 99685] client_connection.cc:53: Retrying to connect to socket for endpoint /tmp/ray/session_2021-07-26_21-16-40_313157_7330/sockets/raylet (num_attempts = 5, num_retries = 10)
[2021-07-27 13:02:19,086 I 99685 99685] client_connection.cc:53: Retrying to connect to socket for endpoint /tmp/ray/session_2021-07-26_21-16-40_313157_7330/sockets/raylet (num_attempts = 6, num_retries = 10)
[2021-07-27 13:02:20,087 I 99685 99685] client_connection.cc:53: Retrying to connect to socket for endpoint /tmp/ray/session_2021-07-26_21-16-40_313157_7330/sockets/raylet (num_attempts = 7, num_retries = 10)
[2021-07-27 13:02:21,088 I 99685 99685] client_connection.cc:53: Retrying to connect to socket for endpoint /tmp/ray/session_2021-07-26_21-16-40_313157_7330/sockets/raylet (num_attempts = 8, num_retries = 10)
[2021-07-27 13:02:22,088 I 99685 99685] client_connection.cc:53: Retrying to connect to socket for endpoint /tmp/ray/session_2021-07-26_21-16-40_313157_7330/sockets/raylet (num_attempts = 9, num_retries = 10)
[2021-07-27 13:02:23,088 C 99685 99685] raylet_client.cc:57: Could not connect to socket /tmp/ray/session_2021-07-26_21-16-40_313157_7330/sockets/raylet
[2021-07-27 13:02:23,088 E 99685 99685] logging.cc:441: *** Aborted at 1627390943 (unix time) try "date -d @1627390943" if you are using GNU date ***
[2021-07-27 13:02:23,092 E 99685 99685] logging.cc:441: PC: @ 0x0 (unknown)
[2021-07-27 13:02:23,092 E 99685 99685] logging.cc:441: *** SIGABRT (@0x18565) received by PID 99685 (TID 0x7fee4326c740) from PID 99685; stack trace: ***
[2021-07-27 13:02:23,094 E 99685 99685] logging.cc:441: @ 0x7fed83644b8f google::(anonymous namespace)::FailureSignalHandler()
[2021-07-27 13:02:23,097 E 99685 99685] logging.cc:441: @ 0x7fee435dd3c0 (unknown)
[2021-07-27 13:02:23,100 E 99685 99685] logging.cc:441: @ 0x7fee432b518b gsignal
[2021-07-27 13:02:23,103 E 99685 99685] logging.cc:441: @ 0x7fee43294859 abort
[2021-07-27 13:02:23,105 E 99685 99685] logging.cc:441: @ 0x7fed836342ce ray::SpdLogMessage::Flush()
[2021-07-27 13:02:23,108 E 99685 99685] logging.cc:441: @ 0x7fed8363439d ray::RayLog::~RayLog()
[2021-07-27 13:02:23,110 E 99685 99685] logging.cc:441: @ 0x7fed8329b085 ray::raylet::RayletConnection::RayletConnection()
[2021-07-27 13:02:23,112 E 99685 99685] logging.cc:441: @ 0x7fed8329b292 ray::raylet::RayletClient::RayletClient()
[2021-07-27 13:02:23,115 E 99685 99685] logging.cc:441: @ 0x7fed8321ba2c ray::CoreWorker::CoreWorker()
[2021-07-27 13:02:23,118 E 99685 99685] logging.cc:441: @ 0x7fed8321f137 ray::CoreWorkerProcess::CreateWorker()
[2021-07-27 13:02:23,121 E 99685 99685] logging.cc:441: @ 0x7fed83232578 ray::CoreWorkerProcess::CoreWorkerProcess()
[2021-07-27 13:02:23,123 E 99685 99685] logging.cc:441: @ 0x7fed832334ee ray::CoreWorkerProcess::Initialize()
[2021-07-27 13:02:23,125 E 99685 99685] logging.cc:441: @ 0x7fed831155e9 __pyx_pw_3ray_7_raylet_10CoreWorker_1__cinit__()
[2021-07-27 13:02:23,126 E 99685 99685] logging.cc:441: @ 0x7fed83116d8f __pyx_tp_new_3ray_7_raylet_CoreWorker()
[2021-07-27 13:02:23,128 E 99685 99685] logging.cc:441: @ 0x560be3699d1c _PyObject_MakeTpCall
[2021-07-27 13:02:23,130 E 99685 99685] logging.cc:441: @ 0x560be373b986 _PyEval_EvalFrameDefault
[2021-07-27 13:02:23,132 E 99685 99685] logging.cc:441: @ 0x560be371d433 _PyEval_EvalCodeWithName
[2021-07-27 13:02:23,134 E 99685 99685] logging.cc:441: @ 0x560be371e818 _PyFunction_Vectorcall
[2021-07-27 13:02:23,137 E 99685 99685] logging.cc:441: @ 0x560be3737eb2 _PyEval_EvalFrameDefault
[2021-07-27 13:02:23,138 E 99685 99685] logging.cc:441: @ 0x560be371d433 _PyEval_EvalCodeWithName
[2021-07-27 13:02:23,140 E 99685 99685] logging.cc:441: @ 0x560be371e818 _PyFunction_Vectorcall
[2021-07-27 13:02:23,142 E 99685 99685] logging.cc:441: @ 0x560be3688b6e PyObject_Call
[2021-07-27 13:02:23,144 E 99685 99685] logging.cc:441: @ 0x560be373884f _PyEval_EvalFrameDefault
[2021-07-27 13:02:23,146 E 99685 99685] logging.cc:441: @ 0x560be371d433 _PyEval_EvalCodeWithName
[2021-07-27 13:02:23,148 E 99685 99685] logging.cc:441: @ 0x560be371e818 _PyFunction_Vectorcall
[2021-07-27 13:02:23,150 E 99685 99685] logging.cc:441: @ 0x560be3737eb2 _PyEval_EvalFrameDefault
[2021-07-27 13:02:23,152 E 99685 99685] logging.cc:441: @ 0x560be371d433 _PyEval_EvalCodeWithName
[2021-07-27 13:02:23,154 E 99685 99685] logging.cc:441: @ 0x560be371e499 PyEval_EvalCodeEx
[2021-07-27 13:02:23,156 E 99685 99685] logging.cc:441: @ 0x560be37b9ecb PyEval_EvalCode
[2021-07-27 13:02:23,156 E 99685 99685] logging.cc:441: @ 0x560be37b9f63 run_eval_code_obj
[2021-07-27 13:02:23,156 E 99685 99685] logging.cc:441: @ 0x560be37d6033 run_mod
[2021-07-27 13:02:23,157 E 99685 99685] logging.cc:441: @ 0x560be37db022 pyrun_file
Yet the ray status output looks like this:
root@worker1:/code# ray status --address '10.199.200.100:6379' --redis-password 'fake-c22c-4d26-893c-ca61ae6ed410'
======== Cluster status: 2021-07-27 13:52:23.810234 ========
Node status
------------------------------------------------------------
1 node(s) with resources: {'GPU': 1.0, 'CPU': 2.0, 'node:10.199.200.100': 1.0, 'accelerator_type:T4': 1.0, 'object_store_memory': 29850713702.0, 'memory': 59701427406.0}
Resources
------------------------------------------------------------
Usage:
0.0/2.0 CPU
0.0/1.0 GPU
0.0/1.0 accelerator_type:T4
0.00/55.601 GiB memory
0.00/27.801 GiB object_store_memory
Demands:
(no resource demands)
The relevant part of the client code, which never makes it past ray.init(), looks like this:
if __name__ == "__main__":
    import logging
    import os

    import ray

    ray.init(
        address='10.199.200.100:6379',
        _redis_password=os.environ["redis_password"],
        log_to_driver=True,
        configure_logging=True,
        logging_level=logging.DEBUG,
        dashboard_host="0.0.0.0",
    )
Please advise…