Getting started with RLlib on a private cluster

I have been using RLlib (using PPO) successfully with my custom environment for a while. So far, I ran RLlib locally or by installing and starting different experiments/sessions manually on each VM.

I would now like to run a single session on a cluster across multiple private machines (not public cloud). I tried following the documentation here and the example cluster config here, but it doesn’t work and I’m struggling to understand how it should.

I tried taking and adjusting the example config, saving it to config.yaml, and then running on the machine that’s supposed to be my cluster head:

ray up cluster.yaml

This initially failed saying that some fields were unexpected and didn’t match the JSON schema (e.g., rsync_filter).

After commenting these lines in the config and running the command again, it now errors with:

2021-02-01 16:09:45,451 INFO -- NodeUpdater: Running ssh -tt -i ~/.ssh/id_rsa -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_2e970e822e/070dd72385/%C -o ControlPersist=10s -o ConnectTimeout=120s stefan@ bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (ray stop)'

Command 'ray' not found, did you mean:

  command 'raw' from deb util-linux (2.34-0.1ubuntu9.1)
  command 'rar' from deb rar (2:5.5.0-1build1)
  command 'say' from deb gnustep-gui-runtime (0.27.0-5build2)
  command 'ra6' from deb ipv6toolkit (2.0-1)
  command 'ra' from deb argus-client (1:

Try: sudo apt install <deb name>

Shared connection to closed.

How is this supposed to work? I have ray installed locally in a virtualenv both on my head node machine (which I called ray01) and on the other machine (ray03). Why does it say, ray is not installed?

I commented the docker field in cluster.yaml since I don’t have docker installed yet and don’t necessarily want to. Do I have to?

As I understand it, the ray up cluster.yaml command would just start the cluster, but not run anything on it yet, right? How would I then run my custom environment? Could I just attach to the cluster and run the command that I typically run locally, eg, myenv --arg1 1 --arg2 2, and it would run it across both machines in the cluster?

Is there any example for running RLlib in a (private) cluster?

Hey @stefanbschneider , thanks for asking this here!
I re-labelled this question “ray core”, b/c I think it’s more of a “starting-a-ray-cluster”-problem.

1 Like

Hey @stefanbschneider is the ray command available on your worker machines by default (i.e. if you naively ssh into them, can you run ray start)?

You mentioned using a conda environment, so I’m wondering if the issue is that the conda environment isn’t activated.

No, it’s not. The ray command is only available inside the virtualenv that I installed on each machine.

I ran the ray up cluster.yaml command from the head node within the activated virtualenv, so the ray command is available there. But I guess the problem is that, when ssh into the worker machine, it is not natively available?

Any way to specify the virtualenv to use on the worker machines inside cluster.yaml? Or do I really have to install ray system-wide with sudo pip install ray? Or is using the Docker command an alternative here (how would that work with a custom env?)?

Sorry, I’m a bit lost here. Thanks for the help!

No worries. So the fundamental issue is that the ray program is not on your $PATH, so it can’t be found. You can solve this in a few ways based on your personal preference.

  1. Before each of your setup and start commands first run source <path to venv>/bin/activate; <your original command here>

  2. Just manually specify the full path to the ray executable. You can find this location by manually sourcing your venv then running which ray. Once you know the path, just replace every ray call (should just be ray start and ray stop) with the full path.

  3. Add source <venv>/bin/activate to your bashrc file to automatically source it. Note that this will now happen every time you log into that machine as that user, but it’s by far the most convenient.

1 Like

@Alex Thanks for the help! I’ve finally gotten around to testing this again.

Activating the virtualenv withing .bashrc on both the head node and worker node (did I even need the worker nodes too?) solves the problem and the cluster starts as it should.

I still have a few follow-up questions/issues:

  • I run ray up cluster.yaml on my head node (after connecting via ssh), which starts the cluster. Is there any (simple) way to access the Ray Dashboard on my local laptop, which is not part of the cluster? It seems the dashboard runs on localhost and I couldn’t find an option to change this (to
  • What’s the Docker configuration in the cluster.yaml for? Does that allow running ray directly from a Docker container that’s pulled and started automatically on new nodes? Can I ignore and disable this if I ensure that ray is installed (and activated) on each of my worker nodes?
  • How do I now start training on my custom RLlib environment? Typically, I ssh into the machine and run myenv --arg1 2 --arg2 5. Can I now attach from my laptop to the cluster with ray attach cluster.yaml and run the same command and it will be executed automatically on all nodes of the cluster? What about generated result files; how do I collect them after training/testing?

Thanks again for the help!

When trying to attach from my local laptop with ray attach cluster.yaml, I get RuntimeError: Head node of cluster (ray) not found! (Similar to this issue.) Why is that? I can ssh from my laptop to the head node.

Or am I supposed to run ray attach cluster.yaml on the head node itself? If I do that, I don’t get an error but I also have no clue if it works. The console still shows the head nodes hostname; before and after attach.

If I then run myenv --arg1 2 ... on the (attached?) head node, the program runs as usual. Do I have any option of detaching again? Or verifying if the command is really running on the whole cluster? According to htop, a few things are running on the worker VM but I don’t think it’s any of the training. CPU isn’t utilized at all (but memory is).

Another potential problem is that I set the number of workers manually in my RLlib env. My understand was that I can use this to tune on how many CPU cores to run the training. But in the description of Ray cluster, the recommendation is to use one worker per machine, not per core.

Note: you don’t have to run ray up from the head node of the cluster, most people run it from their laptop. On that note, this is also how people typically access the dashboard. From your laptop, you can run ray dashboard cluster.yaml which will setup the port forwarding so the dashboard appears on localhost:8265 on your laptop.

Yup, the docker setup is if you want to use docker. I believe we have it off by default, but you can explicitly specify docker: {} if you want to be extra sure (if you don’t see it pulling a giant docker image, it’s probably off though).

To run things, you can either ray attach or ray exec to run a single command. You can grab the results via ray rsync-down.

Can you try running ray up from your laptop and see if you still have the issue? Local ray clusters are a little finicky because everyone’s network setup is slightly different.

Unfortunately, there’s no detached mode for executing things. You best bet is probably to execute things inside a tmux session (you can probably do some bash magic to get this to work).

1 Like

Ah, thanks for the hint! I thought I had to run ray up cluster.yaml on the cluster’s head node.

Running it locally on my laptop (from where I can ssh into all cluster nodes), ray up cluster.yaml seems to work fine, but I still cannot see the dashboard at localhost:8265, even though that should be the right port according to the logged messages when running ray up. Any idea what could be the issue here?
I’m running ray up inside WSL on my Windows laptop, but I don’t think this should be the issue since localhost in WSL should be accessible inside Windows (and is when testing it with other apps). Trying to run ray up directly on Windows gets stuck at 2021-02-09 18:05:02,656 INFO -- Using cached config at C:\Users\Stefan\AppData\Local\Temp\ray-config-b7989cc5fb42c62911ecb459b35cb58378ee13bb for some reason.

Attaching from the local laptop to the cluster with ray attach cluster.yaml does work. It opens a terminal on the cluster head node and I can run my custom environment as usual with myenv --arg1 ....
I’m not 100% sure, but it seems like this performs training on the cluster as it should. htop shows that, on the head node, ray::RolloutWorker is active and, on the worker node, ray::PPO:train() is running. So only the worker node performs training and the head node manages the execution but doesn’t train itself, right?

I typically monitor training progress with TensorBoard, running the tensorboard command locally on the node that’s currently training. I guess the best way to do this when training on the cluster is just running the same command on the cluster’s head node?

One last thing: I’m still unsure how to set RLlib’s config["num_workers"] when running on a cluster.
If I don’t set this explicitly and leave the default, htop and the printed logs show that only a small fraction of the cluster’s resources are used: Resources requested: 3/44 CPUs, 0/0 GPUs, 0.0/136.43 GiB heap, 0.0/42.77 GiB objects.
Is there a way to tell RLlib to use all available resources on the cluster to train as fast as possible? @sven1977 Maybe that’s more a question to you?
Otherwise, I’ll just have to determine the number of cores in the cluster and set the value accordingly. That also works since I don’t need auto scaling.

For some reason, I can’t even get the cluster up and running properly anymore. It seems like the cluster only consists of my cluster head and the worker nodes are not recognized.
They are not shown on the dashboard. And when running RLlib via ray attach cluster.yaml, it is only executed on the head node - not on the workers.
RLlib even only prints Resources requested: ../22 CPUs, ..., which are only the CPUs of the head not, ie, the worker CPUs are missing.

I have no clue why this is happening/not working. I have ray installed and accessible on all nodes. The head node can ssh into all worker nodes.
I did update to ray 1.2.0, but I didn’t/don’t expect this to cause any issues.

I tried documenting all my steps so far in a blog post: Scaling Deep Reinforcement Learning to Training on a Private Cluster | Stefan’s Blog
Once I figure out how to successfully run RLlib with a custom env on a private cluster, I’ll update the post. Then it’ll hopefully be useful also to others.

Hmmm can you share any relevant logs from /tmp/ray/session_latest/logs/monitor.* from the head node?

Sure. Seems like the nodes can’t reach each other:

 (no resource demands)
2021-03-01 22:16:59,961 ERROR -- Could not reach the head node.
Traceback (most recent call last):
  File "/home/stefan/DeepCoMP/venv/lib/python3.8/site-packages/ray/autoscaler/_private/", line 281, in _infer_legacy_node_resources_if_needed
IndexError: list index out of range
2021-03-01 22:16:59,964 INFO --

Not sure what that means. This error is repeatedly in the logs. Why would the logs on the head node say “Could not reach head node”?

Later, at the end of the logs, there’s another error:

Did not find any active Ray processes.
2021-03-01 22:08:58,531 VINFO -- Running `export RAY_HEAD_IP=; ray start --address=$RAY_HEAD_IP:6379`
2021-03-01 22:08:58,531 VVINFO -- Full command is `ssh -tt -i ~/ray_bootstrap_key.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_2e970e822e/6c9d58deea/%C -o ControlPersist=10s -o ConnectTimeout=120s stefan@ bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (export RAY_HEAD_IP=; ray start --address=$RAY_HEAD_IP:6379)'`
Traceback (most recent call last):
  File "/home/stefan/DeepCoMP/venv/bin/ray", line 8, in <module>
  File "/home/stefan/DeepCoMP/venv/lib/python3.8/site-packages/ray/scripts/", line 1519, in main
    return cli()
  File "/home/stefan/DeepCoMP/venv/lib/python3.8/site-packages/click/", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/stefan/DeepCoMP/venv/lib/python3.8/site-packages/click/", line 782, in main
    rv = self.invoke(ctx)
  File "/home/stefan/DeepCoMP/venv/lib/python3.8/site-packages/click/", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/stefan/DeepCoMP/venv/lib/python3.8/site-packages/click/", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/stefan/DeepCoMP/venv/lib/python3.8/site-packages/click/", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/stefan/DeepCoMP/venv/lib/python3.8/site-packages/ray/scripts/", line 637, in start
  File "/home/stefan/DeepCoMP/venv/lib/python3.8/site-packages/ray/_private/", line 734, in check_version_info
    raise RuntimeError(error_message)
RuntimeError: Version mismatch: The cluster was started with:
    Ray: 1.2.0
    Python: 3.8.6
This process on node was started with:
    Ray: 1.2.0
    Python: 3.8.5

2021-03-01 22:08:58,895 INFO -- NodeUpdater: Ray start commands failed [LogTimer=760ms]
2021-03-01 22:08:58,895 INFO -- NodeUpdater: Applied config 14d9b98e9d30570a7eed1f7c122db44a359b6626  [LogTimer=1416ms]
2021-03-01 22:08:58,897 ERR -- New status: update-failed
2021-03-01 22:08:58,897 ERR -- !!!
2021-03-01 22:08:58,897 VERR -- {'message': 'SSH command failed.'}
2021-03-01 22:08:58,897 ERR -- SSH command failed.
2021-03-01 22:08:58,897 ERR -- !!!

So it’s a problem if my head and worker nodes have slightly different Python versions, here 3.8.5 and 3.8.6? Is the SSH command failed error at the end just a consequence of the different Python versions or is that something else?
The different Python versions should be simple to fix. Thanks for pointing me to the logs. I wasn’t sure how to debug this.

I am testing this with two local clusters. On the other local/private cluster, the logs show first some cleanup exception (repeated many times) and then

2021-03-02 19:04:15,541 INFO -- Destroying cluster. Confirm [y/N]: y [automatic, due to --yes]
2021-03-02 19:04:15,541 INFO -- 2 random worker nodes will not be shut down. (due to --keep-min-workers)

Also repeated many times.

On both clusters, the problem is the same. The worker nodes are not recognized and training is only done on the head node.

The first error is definitely weird. @Ameer_Haj_Ali do you have any insights?

The second error is because we are a little over cautious with cloud pickle. Any chance you’re able to get the python versions to line up exactly (maybe with conda or something)? The version mismatch should also explain why the worker nodes aren’t able to connect.

@Dmitri, Can you please take a look?

Not a particularly useful insight, but the “cannot reach head node” error suggests an internal failure of LocalNodeProvider.

The autoscaler’s internal representation of the collection of nodes in the cluster is wrong - it looks for a head node and gets nothing.

Any idea why the representation could be wrong? Do I have an error in my cluster config?

It’s most likely an issue on our side – we’ll try to look into it soon.

1 Like

Any chance you’re able to get the python versions to line up exactly (maybe with conda or something)?

I tried it on a different cluster with exactly the same Python and ray versions. It’s a bit strange. I got it to work in one constellation of a head node with one worker node. In this case, both head node and worker were displayed in the Ray Dashboard, the CPUs of the worker were recognized and displayed when starting training, if I selected enough RLlib workers (=CPUs), training was distributed on both head and worker.

Unfortunately, this only worked in one constellation (VM 2 and 3), it didn’t with VM 1 and 2.
It also broke when I tried to add another VM as second worker (ie, VM 2 head, VM 3 and 4 workers) by adding the IP of the extra worker and setting min_workers = max_workers = 2 in the cluster config.
After adding the second worker and restarting the cluster, only the cluster head was shown and recognized and none of the workers… No idea why.

I listed some errors I commonly came across here (mostly unresolved; not sure if they are all relevant): Scaling Deep Reinforcement Learning to a Private Cluster | Stefan’s Blog

Hello, I have the same problem as you, I want to configure .yaml to achieve disk sharing in a cluster, but .yaml cannot start the worker nodes.

I’m curious about the content of your blog, but the link doesn’t open