Connect multiple jobs to the same Ray cluster

I want to create one big Ray cluster and then, whenever I like, start an RLlib training run with the current version of my code. This means that many independent RLlib runs can be active at the same time on the same Ray cluster. I do not want to use Tune, though.

Right now, it seems that if I connect more than 2 or 3 clients to the Ray cluster (that is, I call ray.init(address=...) 2 or 3 times), I get this error:

2020-12-14 12:01:08,014 WARNING -- Some processes that the driver needs to connect to have not registered with Redis,
 so retrying. Have you run 'ray start' on this node?

After more attempts, I noticed that it was probably just chance that I got 2 or 3 clients running; sometimes not even one connects. Another error message that I get at random:

$ python -c "import ray; ray.init(address='auto', _redis_password='5241590000000000')"
2020-12-14 12:23:35,196 INFO -- Connecting to existing Ray cluster at address:
F1214 12:23:45.306064 33170 33170] Could not connect to socket /tmp/ray/session_2020-12-14_11-27-25_600210_12536/sockets/raylet.7
*** Check failure stack trace: ***
    @     0x15167b5d6cdd  google::LogMessage::Fail()
    @     0x15167b5dabee  google::LogMessage::SendToLog()
    @     0x15167b5d69ae  google::LogMessage::Flush()
    @     0x15167b5d6bc2  google::LogMessage::~LogMessage()
    @     0x15167b581d99  ray::RayLog::~RayLog()
    @     0x15167b1ee02f  ray::raylet::RayletConnection::RayletConnection()
    @     0x15167b1ee23c  ray::raylet::RayletClient::RayletClient()
    @     0x15167b18d5aa  ray::CoreWorker::CoreWorker()
    @     0x15167b190ee8  ray::CoreWorkerProcess::CreateWorker()
    @     0x15167b1913a0  ray::CoreWorkerProcess::CoreWorkerProcess()
    @     0x15167b1923ae  ray::CoreWorkerProcess::Initialize()
    @     0x15167b0a3ef5  __pyx_pf_3ray_7_raylet_10CoreWorker___cinit__()
    @     0x15167b0a4cd5  __pyx_tp_new_3ray_7_raylet_CoreWorker()
    @     0x5628da2bfb70  _PyObject_FastCallKeywords
    @     0x5628da2c0a79  call_function
    @     0x5628da33374a  _PyEval_EvalFrameDefault
    @     0x5628da272932  _PyEval_EvalCodeWithName
    @     0x5628da2be700  _PyFunction_FastCallKeywords
    @     0x5628da2c08e8  call_function
    @     0x5628da32ff95  _PyEval_EvalFrameDefault
    @     0x5628da272932  _PyEval_EvalCodeWithName
    @     0x5628da2be700  _PyFunction_FastCallKeywords
    @     0x5628da2c08e8  call_function
    @     0x5628da32ff95  _PyEval_EvalFrameDefault
    @     0x5628da272932  _PyEval_EvalCodeWithName
    @     0x5628da273b49  PyEval_EvalCodeEx
    @     0x5628da353c2b  PyEval_EvalCode
    @     0x5628da3c03ff  run_mod
    @     0x5628da3ca27d  PyRun_StringFlags
    @     0x5628da3ca2dd  PyRun_SimpleStringFlags
    @     0x5628da3caea7  pymain_main
    @     0x5628da3cb27c  _Py_UnixMain
Aborted (core dumped)

Which OS are you using? Also, did you run ray start?

It’s Ubuntu 18.04. I tried several Ray versions (0.8.7, 1.0.1, and nightly). I did run ray start properly, which is why the first few ray.init(address=xxx) calls succeed.

Could you share a script that reproduces this? Also, could you create an issue on Ray’s GitHub page and post the reproduction script there?

You can cc @rkooo567 on the issue.
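
For anyone who hits the same problem, a reproduction script along these lines might be a reasonable starting point. It is only a sketch, not the script from this thread: it assumes that each "client" is a separate Python process that does nothing but call ray.init(address='auto'), as in the one-liner from the original post (without the _redis_password argument, which you would need to add if your cluster uses one). The names launch_drivers and DEFAULT_DRIVER_CMD are made up for illustration.

```python
import subprocess
import sys

# Assumption: a "client" is a fresh Python process that only connects to the
# already running cluster, mirroring the one-liner from the original post.
DEFAULT_DRIVER_CMD = [sys.executable, "-c", "import ray; ray.init(address='auto')"]


def launch_drivers(num_drivers, cmd=None, timeout_s=60):
    """Start num_drivers processes at once; return (exit_code, output) per process."""
    cmd = list(cmd or DEFAULT_DRIVER_CMD)
    # Start all processes before waiting on any, so the connection attempts overlap.
    procs = [
        subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
        for _ in range(num_drivers)
    ]
    results = []
    for proc in procs:
        try:
            output, _ = proc.communicate(timeout=timeout_s)
        except subprocess.TimeoutExpired:
            # A driver that hangs while connecting is killed and reported as failed.
            proc.kill()
            output, _ = proc.communicate()
        results.append((proc.returncode, output.decode(errors="replace")))
    return results
```

Calling launch_drivers(5) on a node where ray start has been run and printing every entry with a non-zero exit code would show which connection attempts failed and with what output. Running each driver in its own process matters here because the "Could not connect to socket" failure aborts the interpreter, so it cannot be caught with try/except inside a single script.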

Hi @sangcho,
I tried to reproduce the issue, but this time the error no longer occurred (with any of the Ray versions mentioned above). I have reinstalled Ray in the meantime, so maybe that is the reason.
I’ll let you know as soon as I can reproduce it again.
Thanks for looking into this!

Sounds good. Feel free to follow up if you have any issues :)!