I have been using RLlib (with PPO) successfully with my custom environment for a while. So far, I have run RLlib either locally or by installing it and starting separate experiments/sessions manually on each VM.
I would now like to run a single session on a cluster across multiple private machines (not public cloud). I tried following the documentation here and the example cluster config here, but it doesn’t work and I’m struggling to understand how it should.
I tried taking and adjusting the example config, saving it to cluster.yaml, and then running the following on the machine that’s supposed to be my cluster head:
ray up cluster.yaml
This initially failed saying that some fields were unexpected and didn’t match the JSON schema (e.g.,
After commenting out these lines in the config and running the command again, it now fails with:
2021-02-01 16:09:45,451 INFO command_runner.py:542 -- NodeUpdater: ray01.css.upb.de: Running ssh -tt -i ~/.ssh/id_rsa -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_2e970e822e/070dd72385/%C -o ControlPersist=10s -o ConnectTimeout=120s firstname.lastname@example.org bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (ray stop)'
Command 'ray' not found, did you mean:
  command 'raw' from deb util-linux (2.34-0.1ubuntu9.1)
  command 'rar' from deb rar (2:5.5.0-1build1)
  command 'say' from deb gnustep-gui-runtime (0.27.0-5build2)
  command 'ra6' from deb ipv6toolkit (2.0-1)
  command 'ra' from deb argus-client (1:220.127.116.11-5ubuntu1)
Try: sudo apt install <deb name>
Shared connection to 127.0.1.1 closed.
How is this supposed to work? I have ray installed locally in a virtualenv both on my head node machine (which I called ray01) and on the other machine (ray03). Why does it say that ray is not installed?
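My guess is that the virtualenv is not active in the shell that ray up opens over SSH, so the ray executable is not on PATH there. A minimal sketch to check this, assuming it is run with Python inside the same kind of login shell on the node; the function name describe_ray_on_path is only illustrative and not part of Ray:

```python
import shutil
import sys


def describe_ray_on_path() -> str:
    """Report where (if anywhere) a `ray` executable is found on PATH."""
    ray_path = shutil.which("ray")  # returns None if no executable is found
    if ray_path is None:
        # Also show the interpreter, which hints at whether a virtualenv is active.
        return "ray not found on PATH; interpreter: " + sys.executable
    return "ray found at " + ray_path


print(describe_ray_on_path())
```

If this reports that ray is not found over SSH but finds it in my normal terminal, the virtualenv activation is presumably missing from the commands that the cluster launcher runs.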
I commented out the docker field in cluster.yaml since I don’t have docker installed yet and don’t necessarily want to install it. Do I have to?
As I understand it, the ray up cluster.yaml command would just start the cluster but not run anything on it yet, right? How would I then run my custom environment? Could I just attach to the cluster and run the command that I typically run locally, e.g., myenv --arg1 1 --arg2 2, and have it run across both machines in the cluster?
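To make the question concrete, here is a rough sketch of what I imagine my script would have to do on the head node. It assumes the script connects to the already running cluster via ray.init(address="auto") and trains through tune.run("PPO", ...) as in Ray 1.x; the helper names build_ppo_config and train_on_cluster and the environment name "MyEnv-v0" are placeholders of mine, and the environment is assumed to be registered with RLlib beforehand:

```python
from typing import Any, Dict


def build_ppo_config(env_name: str, num_workers: int) -> Dict[str, Any]:
    """Assemble a minimal RLlib config dict for PPO."""
    return {
        "env": env_name,             # assumed to be registered with RLlib already
        "num_workers": num_workers,  # rollout workers that Ray may place on any node
    }


def train_on_cluster(env_name: str, num_workers: int, iterations: int = 10):
    # Imported here so the config helper above works without Ray installed.
    import ray
    from ray import tune

    # "auto" connects to an already running cluster instead of starting a local one.
    ray.init(address="auto")
    return tune.run(
        "PPO",
        config=build_ppo_config(env_name, num_workers),
        stop={"training_iteration": iterations},
    )
```

I would then call something like train_on_cluster("MyEnv-v0", num_workers=4) on the head node and hope that the workers get spread over ray01 and ray03. Is that the right idea?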
Is there any example for running RLlib in a (private) cluster?