Running RLlib with Ray Tune on GCP

Hi all,

I want to understand how RLlib can be used in larger-scale experiments, and I am therefore trying to get an overview of how to run an experiment with RLlib in the cloud - more precisely, on GCP.

I found a helpful introduction to deploying a Ray cluster in the cloud under Launching Cloud Clusters in the docs. I set up a new project on GCP, ran the example code, and that works fine. Now I want to understand how to set up larger experiments with RLlib on GCP. Here I am a little stuck, as I do not understand how best to organize my code, so I hope to get some best-practice guidelines from experienced RLlib/Ray Tune users (@kai, @mannyv, @rliaw :fox_face:).


RLlib’s DQN agent example
As an example, let us use vanilla DQN :space_invader: together with Ray Tune to run an experiment with two different learning rates.

I set up my cluster with ray up -y cluster.yaml. What now?
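For reference, the skeleton of a GCP cluster.yaml follows example-full.yaml and looks roughly like this (a trimmed-down sketch; project_id, zone, and machine types are placeholders, and the disk/network settings of the full example are omitted):

    # Trimmed-down sketch of a GCP cluster config (see example-full.yaml for
    # the complete version - disk and network settings are omitted here).
    cluster_name: rllib-dqn
    min_workers: 1
    max_workers: 2

    provider:
        type: gcp
        region: us-west1
        availability_zone: us-west1-a
        project_id: my-gcp-project  # placeholder - your GCP project id

    auth:
        ssh_user: ubuntu

    head_node:
        machineType: n1-standard-4

    worker_nodes:
        machineType: n1-standard-4

    setup_commands:
        - pip install -U "ray[rllib]"

    head_start_ray_commands:
        - ray stop
        - ray start --head --port=6379 --autoscaling-config=~/ray_bootstrap_config.yaml

    worker_start_ray_commands:
        - ray stop
        - ray start --address=$RAY_HEAD_IP:6379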

  1. How can I submit a tuning job with the DQN agent on an environment, writing results out to GCS? (See the sketch after this list for the kind of script I have in mind.)
  2. Do you also use a .yaml file for the Trainer configuration (I saw one in this file from @sven1977)?
  3. Does anyone have a full example to learn from (i.e. cluster.yaml, config.yaml, etc.)?
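To make question 1 concrete, here is the kind of script I have in mind (a sketch using tune.run; the bucket name and stopping criteria are placeholders, and it assumes a Ray version where tune.SyncConfig is available):

    # tune_dqn.py - a sketch of the tuning job I have in mind.
    import ray
    from ray import tune

    ray.init(address="auto")  # attach to the cluster started by `ray up`

    tune.run(
        "DQN",  # RLlib's DQN trainer, registered under this name
        name="dqn_lr_sweep",
        stop={"timesteps_total": 100000},
        config={
            "env": "CartPole-v0",
            "lr": tune.grid_search([1e-3, 1e-4]),  # the two learning rates
            "num_workers": 2,
        },
        # Sync trial results to GCS so they survive cluster teardown
        # (the bucket name is a placeholder).
        sync_config=tune.SyncConfig(upload_dir="gs://my-bucket/ray-results"),
    )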

Custom example
Now a more custom example. Let us assume the code is distributed among several files:

my_code/
  |__ my_env.py        (environment definition)
  |__ my_policy.py     (policy definition)
  |__ my_utilities.py  (utility functions)
  |__ main.py          (main script to be executed)
  1. How can the code be sent to the head node on the GCP cluster to be executed? Do I need to use something like:
ray rsync_up cluster.yaml 'local/my_code' 'cluster/my_code'

and then

ray exec cluster.yaml 'python main.py'?
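For completeness, here is roughly what main.py could look like for the layout above (a sketch; MyEnv and the env_config handling are my assumptions):

    # main.py - a sketch of how the pieces above could fit together.
    import ray
    from ray import tune
    from ray.tune.registry import register_env

    from my_env import MyEnv  # environment definition from my_env.py

    ray.init(address="auto")  # attach to the cluster started by `ray up`

    # Register the custom environment under a name RLlib can look up.
    register_env("my_env", lambda env_config: MyEnv(env_config))

    tune.run(
        "DQN",
        config={
            "env": "my_env",
            "num_workers": 2,
        },
        stop={"training_iteration": 10},
    )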

Many thanks to everyone who tries to help here. I just want to dive deeper and see how I could really use RLlib in projects :male_detective: . I also hope to produce a starting point for everyone who wants to run RLlib on GCP :raised_hands: .

Alright, over the last few weeks I found out how to set up an experiment using DQN on GCP:

RLlib DQN example
Using the example-full.yaml, I had to make a few changes; here is where I got with the three questions above:

  1. See issue #3858 in #ray-clusters, where I also posted the solution.
  2. About the .yaml file for the Trainer configuration I have no news yet … it is lower priority. (See the sketch after this list for the kind of file I mean.)
  3. So a full example is still missing because of point 2, but the example in issue #3858 should run for anyone who wants to try this out.
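For anyone wondering what I mean in point 2: RLlib's tuned_examples use YAML files in roughly this format, which can be run with rllib train -f config.yaml (a minimal sketch; all values are placeholders):

    # config.yaml - a minimal sketch in the tuned_examples format.
    cartpole-dqn:
        env: CartPole-v0
        run: DQN
        stop:
            episode_reward_mean: 150
        config:
            lr: 0.0005
            num_workers: 2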

At this point, many thanks to @asawari for the awesome work and for providing so many examples: setting up the cluster and running the scripts works amazingly smoothly!!

Custom example
The custom example runs similarly and in the way I expected above:

  1. The code is sent to the head node using ray rsync_up as shown above, uploading all necessary files to the cluster.
  2. To run main.py I used ray exec as shown above, and the code ran without errors.

Hope this helps others who are at the same point in their projects.