I have been using RLlib (with PPO) successfully with my custom environment for a while. So far, I have run RLlib either locally or by installing it and starting separate experiments/sessions manually on each VM.
I would now like to run a single session on a cluster across multiple private machines (not public cloud). I tried following the documentation here and the example cluster config here, but it doesn’t work and I’m struggling to understand how it should.
I tried taking and adjusting the example config, saving it to cluster.yaml, and then running on the machine that’s supposed to be my cluster head:
ray up cluster.yaml
This initially failed, saying that some fields were unexpected and didn’t match the JSON schema (e.g., rsync_filter).
After commenting out these lines in the config and running the command again, it now errors with:
2021-02-01 16:09:45,451 INFO command_runner.py:542 -- NodeUpdater: ray01.css.upb.de: Running ssh -tt -i ~/.ssh/id_rsa -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_2e970e822e/070dd72385/%C -o ControlPersist=10s -o ConnectTimeout=120s stefan@127.0.1.1 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (ray stop)'
Command 'ray' not found, did you mean:
command 'raw' from deb util-linux (2.34-0.1ubuntu9.1)
command 'rar' from deb rar (2:5.5.0-1build1)
command 'say' from deb gnustep-gui-runtime (0.27.0-5build2)
command 'ra6' from deb ipv6toolkit (2.0-1)
command 'ra' from deb argus-client (1:3.0.8.2-5ubuntu1)
Try: sudo apt install <deb name>
Shared connection to 127.0.1.1 closed.
How is this supposed to work? I have ray installed locally in a virtualenv both on my head node machine (which I called ray01) and on the other machine (ray03). Why does it say that ray is not installed?
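My guess is that the SSH session started by ray up never activates my virtualenv, so the ray executable is simply not on its PATH. Below is a minimal sketch of how I could check this on each node. The helper name find_command and the virtualenv location ~/venv are my own assumptions for illustration, not anything from Ray itself.

```python
import os
import shutil
from typing import Optional, Sequence


def find_command(cmd: str, extra_dirs: Sequence[str] = ()) -> Optional[str]:
    """Return the full path of `cmd` if it is found on PATH, else None.

    Directories in `extra_dirs` are searched before the regular PATH,
    which mimics what activating a virtualenv does.
    """
    search_path = os.pathsep.join([*extra_dirs, os.environ.get("PATH", "")])
    return shutil.which(cmd, path=search_path)


if __name__ == "__main__":
    # Hypothetical virtualenv location; adjust to wherever the venv really lives.
    venv_bin = os.path.expanduser("~/venv/bin")
    print("ray on plain PATH:    ", find_command("ray"))
    print("ray with venv on PATH:", find_command("ray", [venv_bin]))
```

If the first line prints None and the second prints a path, the command that ray up runs over SSH would presumably need to activate the virtualenv (or otherwise extend PATH) before calling ray.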
I commented out the docker field in cluster.yaml since I don’t have docker installed yet and don’t necessarily want to install it. Do I have to?
As I understand it, the ray up cluster.yaml command would just start the cluster, but not run anything on it yet, right? How would I then run my custom environment? Could I just attach to the cluster and run the command that I typically run locally, e.g., myenv --arg1 1 --arg2 2, and it would run it across both machines in the cluster?
Is there any example for running RLlib in a (private) cluster?