Hi,
I’m using Ray 1.0.0 installed with pip, redis 3.4.1, and Python 3.8.5 (all in a conda env).
I have an Ubuntu 20.04.1 LTS server on GCS. I’m connected to the server through a VPN (openvpn).
I start Ray on the server as the head node, and I start my own machine as a local node.
My VPN client redirects all traffic (all ports) to the server IP through the VPN. The server is pingable and I can reach the dashboard.
Locally, I run ray.init(address='auto', _redis_password='XXXXXX') from a Jupyter notebook. The kernel always dies with the message below.
THIS SETUP WAS WORKING LAST FRIDAY.
I have not changed a single line of my config files and startup code. Nor updated any dependencies.
I always get the following message when I run ray.init, or even a simple command such as ray memory, from my local machine:
~$ ray memory
2020-12-21 17:07:11,690 INFO scripts.py:1317 -- Connecting to Ray instance at XX.XX.XX.XX:6379.
2020-12-21 17:07:11,691 INFO worker.py:633 -- Connecting to existing Ray cluster at address: XX.XX.XX.XX:6379
F1221 17:07:19.591818 7837 7837 service_based_gcs_client.cc:207] Couldn't reconnect to GCS server. The last attempted GCS server address was XX.XX.XX.XX:41419
*** Check failure stack trace: ***
@ 0x7fbb1900d6ed google::LogMessage::Fail()
@ 0x7fbb1900e84c google::LogMessage::SendToLog()
@ 0x7fbb1900d3c9 google::LogMessage::Flush()
@ 0x7fbb1900d5e1 google::LogMessage::~LogMessage()
@ 0x7fbb18fc4789 ray::RayLog::~RayLog()
@ 0x7fbb18d081ea ray::gcs::ServiceBasedGcsClient::ReconnectGcsServer()
@ 0x7fbb18d082ef ray::gcs::ServiceBasedGcsClient::GcsServiceFailureDetected()
@ 0x7fbb18d08491 ray::gcs::ServiceBasedGcsClient::PeriodicallyCheckGcsServerAddress()
@ 0x7fbb18d0a801 ray::gcs::ServiceBasedGcsClient::Connect()
@ 0x7fbb18c8bed6 ray::CoreWorker::CoreWorker()
@ 0x7fbb18c8fc14 ray::CoreWorkerProcess::CreateWorker()
@ 0x7fbb18c90e82 ray::CoreWorkerProcess::CoreWorkerProcess()
@ 0x7fbb18c9184b ray::CoreWorkerProcess::Initialize()
@ 0x7fbb18bcf448 __pyx_pw_3ray_7_raylet_10CoreWorker_1__cinit__()
@ 0x7fbb18bd0ba5 __pyx_tp_new_3ray_7_raylet_CoreWorker()
@ 0x55948e1fb6cd _PyObject_MakeTpCall
@ 0x55948e282f56 _PyEval_EvalFrameDefault
@ 0x55948e248f9f _PyEval_EvalCodeWithName
@ 0x55948e249943 _PyFunction_Vectorcall.localalias.355
@ 0x55948e1be11a _PyEval_EvalFrameDefault.cold.2790
@ 0x55948e248f9f _PyEval_EvalCodeWithName
@ 0x55948e249943 _PyFunction_Vectorcall.localalias.355
@ 0x55948e1be11a _PyEval_EvalFrameDefault.cold.2790
@ 0x55948e248a92 _PyEval_EvalCodeWithName
@ 0x55948e249943 _PyFunction_Vectorcall.localalias.355
@ 0x55948e1fb041 PyVectorcall_Call
@ 0x55948e28099b _PyEval_EvalFrameDefault
@ 0x55948e248a92 _PyEval_EvalCodeWithName
@ 0x55948e249943 _PyFunction_Vectorcall.localalias.355
@ 0x55948e249e79 method_vectorcall
@ 0x55948e1fb041 PyVectorcall_Call
@ 0x55948e28099b _PyEval_EvalFrameDefault
Aborted (core dumped)
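The fatal line above names a second port (41419) in addition to 6379. In case it helps with diagnosis, here is a minimal sketch of how I could check from my local machine whether each of those ports accepts a TCP connection through the VPN. The helper name port_is_open is my own (hypothetical, not part of Ray), and XX.XX.XX.XX is the same redacted head address as in the log, to be replaced with the real one.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


if __name__ == "__main__":
    head_ip = "XX.XX.XX.XX"  # placeholder: the redacted head node address
    # 6379 and 41419 are the two ports that appear in the log above.
    for port in (6379, 41419):
        state = "reachable" if port_is_open(head_ip, port) else "NOT reachable"
        print(f"{head_ip}:{port} is {state}")
```

This only tests plain TCP reachability. A port that reports reachable can still fail at the application level, so the result narrows the problem down rather than proving the connection works.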
I do not think the issue is caused by Ray directly (which is why I did not post it as a GH issue), but I’d need some guidance, please.
Why is Ray even calling a GCS-specific function from service_based_gcs_client.cc? I am NOT using Ray code to start up a cluster. I am manually configuring my own Linux server and manually adding my machine as a local node. I can’t see how Ray “knows” that I am using GCS as the server provider, nor why it should use a GCS-specific function to connect to a (virtually-)local node.
Many thanks in advance