I want to create one big Ray cluster and then, whenever I feel like it, start an RLlib training run with the current version of my code. This means there can be many independent RLlib runs on the same Ray cluster at the same time. I do not want to use Tune, though.
Right now, it seems that if I connect more than 2 or 3 clients to the Ray cluster (that is, I call ray.init(address=...) 2 or 3 times), I get the error:
2020-12-14 12:01:08,014 WARNING services.py:202 -- Some processes that the driver needs to connect to have not registered with Redis,
so retrying. Have you run 'ray start' on this node?
After more attempts, I noticed that it was probably just by chance that I got 2 or 3 clients running. Sometimes I don’t even get 1. Another error message that I randomly get:
It’s Ubuntu 18.04. I tried several different Ray versions (0.8.7, 1.0.1, and nightly). I did run ray start properly, which is why the first few ray.init(address=xxx) calls work.
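For reference, this is roughly how each driver could wrap its connection attempt so that an intermittent failure is retried instead of killing the run. It is only a minimal sketch: connect_with_retry is a hypothetical helper, not part of Ray, and it assumes that a failed ray.init(address=...) eventually raises an exception rather than hanging. It does not fix the underlying registration problem.

```python
import time


def connect_with_retry(connect, attempts=3, delay_s=2.0, sleep=time.sleep):
    """Call `connect()` until it succeeds or `attempts` is exhausted.

    `connect` is any zero-argument callable, for example
    `lambda: ray.init(address="auto")`. Hypothetical helper, not a Ray API.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return connect()
        except Exception as err:  # assumption: a failed connection raises
            last_error = err
            if attempt < attempts:
                sleep(delay_s)  # wait before the next attempt
    raise RuntimeError(
        f"could not connect after {attempts} attempts"
    ) from last_error


# Usage in a driver script (needs Ray installed and a cluster started
# with `ray start` on this node):
#   import ray
#   connect_with_retry(lambda: ray.init(address="auto"))
```

Each training run would call this once at startup and then build its RLlib trainer as usual. The original exception is kept as the cause of the final RuntimeError, so the real error message is still visible.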
Hi @sangcho,
I tried to reproduce the issue, but this time the error no longer occurred (with any of the Ray versions mentioned above). I have reinstalled Ray in the meantime, so maybe that is the reason.
I’ll let you know as soon as I can reproduce it again.
Thanks for looking into this!