I want to understand how RLlib can be used in larger-scale experiments, so I am trying to get an overview of how to run an RLlib experiment in the cloud, more precisely on GCP.
I found a helpful introduction to deploying a Ray cluster in the cloud under Launching Cloud Clusters in the docs. I set up a new project on GCP and ran the example code, and that works fine. Now I want to understand how to set up larger experiments with RLlib on GCP. Here I am a little stuck, as I do not understand how best to organize my code, so I hope to get some best-practice guidelines from experienced RLlib/Ray Tune users (@kai, @mannyv, @rliaw).
RLlib’s DQN agent example
As an example, let us use vanilla DQN together with Ray Tune, running an experiment with two different learning rates.
I set up my cluster with ray up -y cluster.yaml. What now?
- How can I submit a tuning job with the DQN agent on an environment, writing its results to GCS?
- Do you also use a .yaml file for the Trainer configuration (I saw one in this file from @sven1977)?
- Does anyone have a full example to learn from (i.e. cluster.yaml, config.yaml, etc.)?
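To make the first question concrete, here is a rough sketch of what I imagine the tuning script could look like. It is only a guess under several assumptions: the environment name CartPole-v0, the bucket path gs://my-hypothetical-bucket/dqn, and the stopping criterion are placeholders of mine, and passing upload_dir through tune.SyncConfig matches the Ray 1.x API as far as I know (newer releases seem to configure storage differently). The helper names build_dqn_config and run_experiment are made up for this post.

```python
def build_dqn_config(learning_rates, env="CartPole-v0"):
    """Build a Tune search space for DQN with one trial per learning rate.

    The env name is a placeholder; swap in your own environment.
    """
    return {
        "env": env,
        # This is the same dict that tune.grid_search(values) returns,
        # written out so the config can be built without importing Ray.
        "lr": {"grid_search": list(learning_rates)},
    }


def run_experiment(upload_dir="gs://my-hypothetical-bucket/dqn"):
    """Run the two-learning-rate DQN experiment on the running cluster.

    Assumes Ray 1.x-style APIs; upload_dir is a hypothetical bucket path.
    """
    # Imported lazily so the config helper above works without Ray installed.
    import ray
    from ray import tune

    # Connect to the cluster started by `ray up` instead of a local instance.
    ray.init(address="auto")

    return tune.run(
        "DQN",
        config=build_dqn_config([1e-3, 1e-4]),
        stop={"timesteps_total": 100_000},  # arbitrary stopping criterion
        # Assumption: results are synced to cloud storage via SyncConfig.
        sync_config=tune.SyncConfig(upload_dir=upload_dir),
    )
```

The idea would be to call run_experiment() from a script executed on the head node. Whether the nodes need extra credentials or packages to write to GCS is exactly the kind of detail I am unsure about.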
Now a more custom example. Let us assume the code is distributed among several files:
-- my code
   |__ my_env.py (containing the environment definition)
   |__ my_policy.py (containing the policy definition)
   |__ my_utilities.py (containing utility functions)
   |__ main.py (main script that can be executed)
- How can the code be sent to the head node on the GCP cluster to be executed? Do I need to use something like:
ray rsync_up cluster.yaml 'cluster/my_code' 'local/my_code'
ray exec cluster.yaml 'python main.py'?
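Spelled out as a small helper, my current guess at the workflow looks like the sketch below. It only assembles the two CLI calls and optionally runs them. Assumptions on my side: going by the CLI help, ray rsync_up takes the local source first and the cluster target second; the remote directory ~/my_code is a placeholder; paths contain no spaces; and the function names build_commands and deploy are invented for illustration.

```python
import subprocess


def build_commands(cluster_yaml, local_dir, remote_dir, entrypoint="main.py"):
    """Return the upload and run commands as argument lists.

    Paths are assumed to contain no spaces, because the remote command
    is passed to `ray exec` as a single shell string.
    """
    # Trailing slashes make rsync copy the directory contents.
    upload = [
        "ray", "rsync_up", cluster_yaml,
        local_dir.rstrip("/") + "/",
        remote_dir.rstrip("/") + "/",
    ]
    # Change into the uploaded directory so local imports resolve.
    run = [
        "ray", "exec", cluster_yaml,
        f"cd {remote_dir} && python {entrypoint}",
    ]
    return [upload, run]


def deploy(cluster_yaml, local_dir, remote_dir, dry_run=True):
    """Print the commands, and execute them unless dry_run is set."""
    for cmd in build_commands(cluster_yaml, local_dir, remote_dir):
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops at the first failing command.
            subprocess.run(cmd, check=True)
```

Calling deploy("cluster.yaml", "local/my_code", "~/my_code") only prints the two commands; with dry_run=False it would run them against the cluster. I also noticed the file_mounts key in cluster.yaml, which might be the more standard way to ship a code directory, but I have not tried it.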
Many thanks to everyone who tries to help here. I just want to dive deeper and see how I could really use RLlib in projects. I also hope to produce a starting point for everyone who wants to run RLlib on GCP.