Switching exploration through action subspaces

Hey team

I am using an MDP in which a discrete action space can be divided into two sub-spaces based on some parameters. Can I create an exploration class that forces the learner to do the following (a toy sketch of the pattern is below the list)?

  • in the current call for getting an action, sample an action from the first subspace
  • take that action
  • in the next call for getting an action, sample an action from the second subspace
  • take that action
  • keep switching between the sub-spaces from call to call
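
Roughly, the pattern I want, as a toy sketch in plain Python (no RLlib; the subspace index lists are just placeholders):

import random

first_subspace = [0, 1, 2]     # placeholder split of the discrete space
second_subspace = [3, 4, 5]    # the real split comes from the MDP's parameters

calls = 0

def sample_action():
    global calls
    # Alternate the pool we sample from on every call.
    pool = first_subspace if calls % 2 == 0 else second_subspace
    calls += 1
    return random.choice(pool)

print([sample_action() for _ in range(6)])   # e.g. [2, 4, 0, 5, 1, 3]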

@sven1977

cc: @RickLan , @mannyv , @arturn , @RickDW , @rusu24edward , @gjoliver

Not sure how to do this through exploration. However, you can easily do this by casting it as a multi-agent problem. On the RLlib side, you’ll have two policies, and you’ll alternate which policy you use at each step. You can drive this alternation in the simulation by switching which “agent” outputs at each step: in the first step, you output data for policy 1, in the next step for policy 2, then back to 1, then 2, and so on until the end of the episode. The actions that come in at each step will be keyed by the policy id, but you can just take and use the value, and it should run the same as your current setup.

@Saurabh_Arora I was going to suggest the same setup as @rusu24edward described.

You are going to have trouble with the exploration approach during the learning phase. At that point the exploration object is usually not used, and all the steps are computed together in a batch, so you will have to add some custom bookkeeping to know which branch should be trained and which should not for each sample in the batch.
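
For example, one way to do that bookkeeping (a rough sketch; the "subspace_id" field is something you would record yourself, not an existing RLlib key):

import numpy as np

# During the rollout, tag every step with the subspace its action came from.
batch = {
    "actions":     np.array([3, 7, 1, 9]),
    "subspace_id": np.array([0, 1, 0, 1]),   # recorded alongside each sample
}

for sid in (0, 1):
    mask = batch["subspace_id"] == sid
    sub_actions = batch["actions"][mask]
    # ...compute the loss for this branch only on sub_actions...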

Thanks
@rusu24edward

But both policies would share the same action space, so both agents would be trained equally towards all actions. Why, then, would one agent prefer a specific subspace? Can you share a minimal set of bulleted steps for this idea?

In parallel, I am looking at the exploration option. How do I get values from the TensorType timestep and the List[TensorType] action_distribution.inputs?

def get_exploration_action(
    self,
    *,
    action_distribution: ActionDistribution,
    timestep: Optional[Union[int, TensorType]] = None,
    explore: bool = True
):

@mannyv, you mentioned that exploration is not used in the learning phase. RL exploration is supposed to happen while learning, so I am confused about what you meant. Can you please clarify?

The action space from the first policy should be the first subspace, and the action space from the second policy should be the second subspace.

Here’s some pseudocode for setting this up:

class MySim(MultiAgentEnv):
    def reset(self):
        # Returning an obs only for 'policy_1' tells RLlib to generate
        # the first action using policy 1.
        return {'policy_1': first_obs}

    def step(self, action_dict):
        # action_dict is keyed by the id of the agent that acted this step.
        action = next(iter(action_dict.values()))
        # process the action
        if next(iter(action_dict.keys())) == 'policy_1':
            key_off = 'policy_2'
        else:
            key_off = 'policy_1'
        # Returning data only under key_off tells RLlib to generate the
        # next action using whichever policy key_off names.
        return (
            {key_off: next_obs},
            {key_off: reward},
            {key_off: done_status, '__all__': done_status},
            {key_off: info},
        )
Hi @Saurabh_Arora,

I am not sure exactly which algorithm you are using, but in general there are two main phases in a call to algorithm.train().

The first phase is the rollout phase. During this phase, copies of the environment interact with the policy to generate new samples, and the exploration object is used to sample actions. These samples can be deterministic or probabilistic depending on the config settings.

The second phase is the policy update phase. During this phase, previously collected samples are used to update the policy according to the algorithm's loss function. For most algorithms in RLlib, no actions are generated during this phase, so the exploration object is not used.

Thanks for the clear explanation, @mannyv. I was able to implement an exploration class that achieves the training I want for the learner. Can you please look into my follow-up question here?
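
For reference, the class looks roughly like the sketch below (simplified and untested as pasted here; it assumes the torch framework, a categorical action distribution whose inputs are per-action logits, and that the first_subspace / second_subspace index lists are supplied through the exploration config):

import torch
from ray.rllib.utils.exploration.exploration import Exploration

class AlternatingSubspaceExploration(Exploration):
    def __init__(self, action_space, *, framework, first_subspace, second_subspace, **kwargs):
        super().__init__(action_space, framework=framework, **kwargs)
        # Index lists that partition the discrete action space.
        self.first_subspace = first_subspace
        self.second_subspace = second_subspace
        self._num_calls = 0

    def get_exploration_action(self, *, action_distribution, timestep=None, explore=True):
        logits = action_distribution.inputs  # [batch, num_actions] for a categorical dist
        # Alternate the allowed subspace on every call.
        allowed = self.first_subspace if self._num_calls % 2 == 0 else self.second_subspace
        self._num_calls += 1
        # Mask out every action outside the allowed subspace.
        mask = torch.full_like(logits, float("-inf"))
        mask[:, allowed] = 0.0
        masked_dist = torch.distributions.Categorical(logits=logits + mask)
        action = masked_dist.sample() if explore else (logits + mask).argmax(dim=-1)
        logp = masked_dist.log_prob(action)
        return action, logp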

@rusu24edward, if I switch to a multi-agent setting, I will have to change a lot of code in the rest of my codebase that uses the learned policy, so I think I should give the exploration approach a shot first. Can you please look into my follow-up question here: Using different get_exploration_action method pre and post training?