Trying to set up external RL environment and having trouble


My thesis project involves using an RL policy to manage a hyperparameter tuning setup. This is roughly the reverse of what people usually do, which is to use Ray Tune to tune RLlib hyperparameters. At this point I’m quite familiar with Ray Tune, but I’m having trouble figuring out how to integrate RLlib. I’ve spent a long time reading the documentation and some source code, but I’m still very confused.

What I’ve tried so far is to implement a Ray Tune Scheduler which takes the policy, action space, and observation space (in this case the action space is the set of hyperparameter values) and works similarly to the implementations of PBT or PB2. However, it seems like that isn’t how I should be doing this.

Now, I’m trying to implement an ExternalEnv which contains the state-action space, where the run() method would contain all of my training/tuning code. The environment is then passed into the Scheduler, which will internally call start_episode(), get_action(), etc. However, I’m confused about where the Policy and Trainer fit into this. I’m assuming the environment is passed into the Trainer, but what about the Policy?

I have a dummy Policy implemented which just changes the hyperparameters based on simple logic but I don’t know where to fit it in.

I’m very confused and if someone could help me out that would be great.

Just to give some extra info, here’s a pseudocode overview of what I’m trying to do:

# Ray Tune code

class RLScheduler(FIFOScheduler):
    def __init__(self, env: TuningEnv, ...):
        self.env = env

    def on_trial_result(self, trial, result, ...):
        action = self.env.get_action(result)
        self._assign_new_hparams(trial, action)

# RLLib code
class TuningEnv(ExternalEnv):

    def __init__(self, config, experiment, ...):
        action_space = spaces.Dict(...)
        observation_space = spaces.Dict(...)
        self._experiment = experiment
        # policy????

        super().__init__(action_space=action_space, observation_space=observation_space)

    def run(self):
        run_args = self._experiment.run_args()**asdict(run_args))

Also, if I want to do multiple concurrent tuning trials, would I want to use a multi-agent env?

I’m working on a project that is sort of similar to this, although I didn’t end up using the ExternalEnv API so I’m not an expert. The way to implement this if you want to use an ExternalEnv is to use a PolicyClient and PolicyServerInput. This (RLlib Environments — Ray v1.6.0) points to a simple example that uses them.

(The server script: ray/ at master · ray-project/ray · GitHub and the client script: ray/ at master · ray-project/ray · GitHub)

The basic idea here is that you need to run the two scripts at the same time, which in your case translates to having to run a script that trains your RL agent, and a script that runs your tune experiments for which you want to optimize the hyperparams. The server script from the example can mostly be left as-is, the client script is where you need to make most of the changes.

You want to make the client script run your hypertuning code, and you’ll want to make a custom scheduler like you said. The PolicyClient needs to be wrapped by that scheduler, so you can report the hypertuning results to the server script, and also so you can easily poll your agent for actions (i.e. new hyperparams to train on)
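
To make the "scheduler wraps the PolicyClient" idea concrete, here is a minimal sketch of the glue logic. It is duck-typed on purpose: it only assumes the client has start_episode(), get_action(), log_returns() and end_episode(), which is the interface PolicyClient exposes. The class name TrialPolicyBridge, the make_obs / make_reward callables, and the choice of one episode per tuning trial (one possible way to handle several concurrent trials without a multi-agent env) are my own assumptions, not something taken from the RLlib examples.

```python
from typing import Any, Callable, Dict


class TrialPolicyBridge:
    """Maps each tuning trial to one RL episode on a PolicyClient-like object.

    `client` only needs start_episode(), get_action(), log_returns() and
    end_episode(). Everything else in this class is hypothetical.
    """

    def __init__(self, client: Any,
                 make_obs: Callable[[Dict], Any],
                 make_reward: Callable[[Dict], float]):
        self.client = client
        self.make_obs = make_obs        # hypothetical: result dict -> observation
        self.make_reward = make_reward  # hypothetical: result dict -> scalar reward
        self._episodes: Dict[str, str] = {}  # trial id -> episode id

    def on_result(self, trial_id: str, result: Dict) -> Any:
        """Report a trial result and return the next action (new hyperparameters)."""
        if trial_id in self._episodes:
            # This reward is credited to the action chosen on the previous call.
            self.client.log_returns(self._episodes[trial_id],
                                    self.make_reward(result))
        else:
            # First result for this trial: open a new episode for it.
            self._episodes[trial_id] = self.client.start_episode(
                training_enabled=True)
        return self.client.get_action(self._episodes[trial_id],
                                      self.make_obs(result))

    def on_complete(self, trial_id: str, result: Dict) -> None:
        """Log the final reward and close the trial's episode."""
        episode_id = self._episodes.pop(trial_id, None)
        if episode_id is not None:
            self.client.log_returns(episode_id, self.make_reward(result))
            self.client.end_episode(episode_id, self.make_obs(result))
```

In a custom scheduler, on_trial_result() would then call something like bridge.on_result(trial.trial_id, result) and apply the returned action to the trial's config. How the new hyperparameters are actually applied is left out here, since that depends on your scheduler.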

There are quite a few moving parts in this setup so I might have missed something, but I hope that helps. If not, feel free to post a follow-up :slight_smile:


@import-antigravity the docs are a little confusing, it doesn’t look like this approach uses the ExternalEnv. It should work just fine though.


Hey @import-antigravity , thanks for the question and @RickDW , thanks for your answer :slight_smile: This is great. Yeah, you don’t have to explicitly use ExternalEnv. Your env will be wrapped as one automatically. Just using the PolicyClient on the client side and setting up your input on your server side will be enough. RLlib automatically generates a dummy-RandomEnv on the server side to fix the observation and action spaces (which you have to define in your config).

Basically, this is the important bit here (server side):

    # `InputReader` generator (returns None if no input reader is needed on
    # the respective worker).
    def _input(ioctx):
        # We are remote worker or we are local worker with num_workers=0:
        # Create a PolicyServerInput.
        if ioctx.worker_index > 0 or ioctx.worker.num_workers == 0:
            return PolicyServerInput(
                ioctx, SERVER_ADDRESS, SERVER_BASE_PORT + ioctx.worker_index -
                (1 if ioctx.worker_index > 0 else 0))
        # No InputReader (PolicyServerInput) needed.
        else:
            return None

    # Trainer config. Note that this config is sent to the client only in case
    # the client needs to create its own policy copy for local inference.
    config = {
        # Indicate that the Trainer we setup here doesn't need an actual env.
        # Allow spaces to be determined by user (see below).
        "env": None,

        # TODO: (sven) make these settings unnecessary and get the information
        #  about the env spaces from the client.
        "observation_space": gym.spaces.Box(
            float("-inf"), float("inf"), (4, )),
        "action_space": gym.spaces.Discrete(2),

        # Use the `PolicyServerInput` to generate experiences.
        "input": _input,
        # Use n worker processes to listen on different ports.
        "num_workers": args.num_workers,
        # Disable OPE, since the rollouts are coming from online clients.
        "input_evaluation": [],
        # Create a "chatty" client/server or not.
        "callbacks": MyCallbacks if args.chatty_callbacks else None,

The example scripts are located in rllib/examples/serving/cartpole_client|
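
The port arithmetic inside `_input` is easy to misread, so here is a small standalone sketch of it. The helper names server_port and listening_ports are hypothetical, and 9900 below is just an example base port. The point is that with num_workers=0 the local worker (index 0) listens on the base port, while with remote workers the local worker gets no server and worker i listens on base + i - 1.

```python
from typing import List


def server_port(base_port: int, worker_index: int) -> int:
    """Port for one worker, mirroring the expression used in `_input`."""
    return base_port + worker_index - (1 if worker_index > 0 else 0)


def listening_ports(base_port: int, num_workers: int) -> List[int]:
    """All ports on which a PolicyServerInput would be created."""
    if num_workers == 0:
        # Only the local worker exists, so it hosts the server itself.
        return [server_port(base_port, 0)]
    # Remote workers 1..num_workers each host one server.
    return [server_port(base_port, i) for i in range(1, num_workers + 1)]


print(listening_ports(9900, 0))
print(listening_ports(9900, 2))
```

So a client that should talk to the second of two workers would connect to the base port plus one.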

Thanks for posting this! Sorry for not responding sooner, I had sort of given up hope that someone would respond :sweat_smile:

Is it required to do the server-client setup? Based on this figure from the documentation, it seems like what I would want to do is the second image, because the “environment” is itself a Ray job:

I’m honestly not sure, maybe @sven1977 can answer your question. My intuition tells me that there isn’t a big difference between using an ExternalEnv explicitly and using the PolicyClient / PolicyServerInput approach. While your Tune process probably runs on the same Ray cluster as your RL training process, it looks like all communication between the two is neatly handled by the PolicyClient. The only thing that might be worth looking into is the resources requested by the RL and hypertuning processes, so that both can run at the same time.

Yeah, it looks like this is probably easier because I can also leave my hyperparameter tuning code as-is and simply write the RL training code for the server side. How would I run both at once? Would I treat it like a regular Ray actor and do policy_server().remote() and then run the client code?

No you don’t need to execute any calls remotely, what you need to do is run the client and server scripts at the same time. You might want to give the example scripts another look, everything is in there since you can reuse most of the code as-is :slight_smile:

If you run them at the same time, like with python & python then how will they connect to the same ray runtime? I looked at the example scripts and couldn’t tell how that was supposed to work

I still don’t understand what you mean by “run them at the same time,” could you clarify please? Thanks

Hey, sorry for taking a while to reply. All you need to do is run the two separate python scripts. I’m fairly certain that is the only thing you need to do, but the best thing to do is to just try this and see what happens :smile:
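
In case it helps, this is roughly what "run both scripts" could look like from a single launcher: start the server as a background process, run the client in the foreground, and stop the server when the client finishes. As far as I understand, the client reaches the server over HTTP at the address you give it, so the two processes don't have to share a Ray runtime. This is only a sketch; the function names and the fixed startup delay are my own assumptions, and the script names in the comment are placeholders.

```python
import subprocess
import sys
import time
from typing import List, Sequence


def python_cmd(*args: str) -> List[str]:
    """Build a command that runs the current Python interpreter with `args`."""
    return [sys.executable, *args]


def run_server_and_client(server_cmd: Sequence[str],
                          client_cmd: Sequence[str],
                          startup_delay: float = 0.0) -> int:
    """Start the server in the background, run the client, then stop the server.

    Returns the client's exit code.
    """
    server = subprocess.Popen(list(server_cmd))
    try:
        # Crude wait so the server can start listening before the client connects.
        time.sleep(startup_delay)
        if server.poll() is not None:
            raise RuntimeError(
                "server exited early with code %s" % server.returncode)
        return subprocess.run(list(client_cmd)).returncode
    finally:
        # Stop the server if it is still running, then reap the process.
        if server.poll() is None:
            server.terminate()
        server.wait()


# Hypothetical usage (placeholder script names):
# run_server_and_client(python_cmd("server.py"), python_cmd("client.py"),
#                       startup_delay=10.0)
```

A fixed delay is the simplest option; polling the server port until it accepts connections would be more robust.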

No worries. I guess the confusion is that I’m trying to run both of these from the same SLURM script, and you can’t run two blocking Python scripts at the same time. Is there a way to, like, run the server script on the head node and run the client scripts on the worker nodes?

I only know about SLURM from reading some documentation, I don’t have any experience with it at all. That being said, I remember reading somewhere that you can run different scripts on different SLURM nodes in a job allocation. So it should be possible, but you should ask someone else if you need help with it.

@import-antigravity Take a look at my post here, which includes an example slurm script. You can use the same slurm script to launch the trainer and the worker nodes. Please read the full post because the original slurm script is not correct and I had to make some changes based on the discussion I had with Sven.