Running RLlib with Ray Tune on GCP

Hi all,

I want to understand how RLlib can be used in larger-scale experiments, and I am therefore trying to get an overview of how to run an experiment with RLlib in the cloud - more precisely, on GCP.

I found a helpful introduction to deploying a Ray cluster in the cloud under Launching Cloud Clusters in the docs. I set up a new project on GCP, ran the example code, and that works fine. Now I want to understand how to set up larger experiments with RLlib on GCP. Here I am a little stuck, as I do not understand how best to organize my code, so I hope to get some best-practice guidelines from experienced RLlib/Ray Tune users (@kai, @mannyv, @rliaw :fox_face:).


RLlib’s DQN agent example
As an example, let us use vanilla DQN :space_invader: together with Ray Tune to run an experiment with two different learning rates.

I set up my cluster with ray up -y cluster.yaml. What now?
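For reference, the skeleton of a GCP cluster.yaml follows example-full.yaml and looks roughly like this (a trimmed-down sketch; project_id, zone, and machine types are placeholders, and the disk/network settings of the full example are omitted):

    # Trimmed-down sketch of a GCP cluster config (see example-full.yaml for
    # the complete version - disk and network settings are omitted here).
    cluster_name: rllib-dqn
    min_workers: 1
    max_workers: 2

    provider:
        type: gcp
        region: us-west1
        availability_zone: us-west1-a
        project_id: my-gcp-project  # placeholder - your GCP project id

    auth:
        ssh_user: ubuntu

    head_node:
        machineType: n1-standard-4

    worker_nodes:
        machineType: n1-standard-4

    setup_commands:
        - pip install -U "ray[rllib]"

    head_start_ray_commands:
        - ray stop
        - ray start --head --port=6379 --autoscaling-config=~/ray_bootstrap_config.yaml

    worker_start_ray_commands:
        - ray stop
        - ray start --address=$RAY_HEAD_IP:6379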

  1. How can I submit a tuning job with the DQN agent on an environment, writing results out to GCS? (See the sketch after this list for the kind of script I have in mind.)
  2. Do you also use a .yaml file for the Trainer configuration (I saw one in this file from @sven1977)?
  3. Does anyone have a full example to learn from (i.e. cluster.yaml, config.yaml, etc.)?
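To make question 1 concrete, here is the kind of script I have in mind (a sketch using tune.run; the bucket name and stopping criteria are placeholders, and it assumes a Ray version where tune.SyncConfig is available):

    # tune_dqn.py - a sketch of the tuning job I have in mind.
    import ray
    from ray import tune

    ray.init(address="auto")  # attach to the cluster started by `ray up`

    tune.run(
        "DQN",  # RLlib's DQN trainer, registered under this name
        name="dqn_lr_sweep",
        stop={"timesteps_total": 100000},
        config={
            "env": "CartPole-v0",
            "lr": tune.grid_search([1e-3, 1e-4]),  # the two learning rates
            "num_workers": 2,
        },
        # Sync trial results to GCS so they survive cluster teardown
        # (the bucket name is a placeholder).
        sync_config=tune.SyncConfig(upload_dir="gs://my-bucket/ray-results"),
    )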

Custom example
Now a more custom example. Let us assume the code is distributed among several files:

my_code/
  |__ my_env.py        (environment definition)
  |__ my_policy.py     (policy definition)
  |__ my_utilities.py  (utility functions)
  |__ main.py          (main script to be executed)
  1. How can the code be sent to the head node on the GCP cluster to be executed? Do I need to use something like:
ray rsync_up cluster.yaml 'local/my_code' 'cluster/my_code'

and then

ray exec cluster.yaml 'python main.py'?
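For completeness, here is roughly what main.py could look like for the layout above (a sketch; MyEnv and the env_config handling are my assumptions):

    # main.py - a sketch of how the pieces above could fit together.
    import ray
    from ray import tune
    from ray.tune.registry import register_env

    from my_env import MyEnv  # environment definition from my_env.py

    ray.init(address="auto")  # attach to the cluster started by `ray up`

    # Register the custom environment under a name RLlib can look up.
    register_env("my_env", lambda env_config: MyEnv(env_config))

    tune.run(
        "DQN",
        config={
            "env": "my_env",
            "num_workers": 2,
        },
        stop={"training_iteration": 10},
    )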

Many thanks to everyone who tries to help here. I just want to dive deeper and see how I could really use RLlib in projects :male_detective: . I also hope to produce a starting point for everyone who wants to run RLlib on GCP :raised_hands: .

Alright, over the last few weeks I found out how to set up an experiment using DQN on GCP:

RLlib DQN example
Using the example-full.yaml, I had to make a few changes; here is where I got with the three questions above:

  1. See issue #3858 in #ray-clusters, where I also posted the solution.
  2. About the .yaml file for the Trainer configuration I have no news yet … it is lower priority. (See the sketch after this list for the kind of file I mean.)
  3. So a full example is still missing because of point 2, but the example in issue #3858 should run for anyone who wants to try this out.
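For anyone wondering what I mean in point 2: RLlib's tuned_examples use YAML files in roughly this format, which can be run with rllib train -f config.yaml (a minimal sketch; all values are placeholders):

    # config.yaml - a minimal sketch in the tuned_examples format.
    cartpole-dqn:
        env: CartPole-v0
        run: DQN
        stop:
            episode_reward_mean: 150
        config:
            lr: 0.0005
            num_workers: 2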

At this point, many thanks to @asawari for the awesome work and for providing so many examples: setting up the cluster and running the scripts works amazingly smoothly!!

Custom example
The custom example runs similarly and in the way I expected above:

  1. The code is sent to the head node using ray rsync_up as shown above, uploading all necessary files to the cluster.
  2. To run main.py I used ray exec as shown above, and the code ran without errors.

Hope this helps others who are at the same point in their projects.