How to make an agent learn some actions more (earlier) than others

How severely does this issue affect your experience of using Ray?

  • None: Just asking a question out of curiosity
    I wanted to make my PPO agent learn some actions more (earlier) than others in a multi-action env.
    To do so, I created a "freeze" key in my observation Dict, where passing 1 means detach and 0 means do not detach. My env has an if-statement that switches freeze every 4000 steps. Then I wrote some code like this in my forward method:
        # an all-ones "freeze" flag in the obs means: block gradients through this head
        if torch.equal(freeze, torch.ones_like(freeze)):
            model_out3 = model_out3.detach()
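
    In context, the forward pass would look roughly like this sketch (plain PyTorch rather than the actual RLlib model class; the head names, sizes, and obs-dict layout are just placeholders):

        import torch
        import torch.nn as nn

        class ThreeHeadModel(nn.Module):
            """Sketch: shared trunk with three action heads; the third head can be frozen."""

            def __init__(self, obs_dim, hidden=64, outputs_per_head=5):
                super().__init__()
                self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
                self.head1 = nn.Linear(hidden, outputs_per_head)
                self.head2 = nn.Linear(hidden, outputs_per_head)
                self.head3 = nn.Linear(hidden, outputs_per_head)

            def forward(self, obs_dict):
                # obs_dict["obs"] is the regular observation, obs_dict["freeze"] the 0/1 flag.
                features = self.trunk(obs_dict["obs"])
                out1 = self.head1(features)
                out2 = self.head2(features)
                out3 = self.head3(features)
                # When the env sets freeze to all ones, block gradients through the third head.
                if torch.equal(obs_dict["freeze"], torch.ones_like(obs_dict["freeze"])):
                    out3 = out3.detach()
                # The three logits are then concatenated for a multi-action distribution.
                return torch.cat([out1, out2, out3], dim=-1)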

The idea is like this picture: [model diagram]

First: do you think this method is legit?
Second: can I implement this with RLlib's curriculum learning? If so, how?

Hi @hossein836 ,

There are two examples to look at:

Is that enough to get you started?

It might be good to separate the layers a bit before the action outputs, though. Are you using a couple of CNN layers with fully connected layers on top? If so, I would intuitively say it might be a good idea to split off the freezing/non-freezing action heads a layer or two before the final layer.
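
For example, a minimal sketch of such a branching (plain PyTorch, made-up layer sizes; in RLlib this would live inside a custom model):

    import torch
    import torch.nn as nn

    class BranchedHeads(nn.Module):
        """Sketch: shared CNN trunk, then a separate small MLP per action head,
        so freezing one head leaves the other heads' private layers untouched."""

        def __init__(self, in_channels=3, outputs_per_head=5):
            super().__init__()
            self.trunk = nn.Sequential(
                nn.Conv2d(in_channels, 16, kernel_size=3, stride=2), nn.ReLU(),
                nn.Conv2d(16, 32, kernel_size=3, stride=2), nn.ReLU(),
                nn.Flatten(),
                nn.LazyLinear(128), nn.ReLU(),
            )

            def head():
                # Each head branches off one or two layers before the final outputs.
                return nn.Sequential(nn.Linear(128, 64), nn.ReLU(),
                                     nn.Linear(64, outputs_per_head))

            self.head1, self.head2, self.head3 = head(), head(), head()

        def forward(self, x, freeze_head3=False):
            features = self.trunk(x)
            out1, out2 = self.head1(features), self.head2(features)
            out3 = self.head3(features)
            if freeze_head3:
                # Blocks head 3's gradient contribution; heads 1 and 2 (and the
                # shared trunk through them) keep learning.
                out3 = out3.detach()
            return out1, out2, out3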

Cheers

Thanks, I looked at them. The thing is, in the curriculum_learning method, as I understand it, we change how the env works, not the model etc. For example: get a reward for touching a ball in the first task and for catching the ball in the second task. In that example (catching balls) we don't change the model, we only change the reward shaping. In my case, however, it's not about the env, it's about which actions get learned, so I can't use curriculum learning. Maybe I could handle it in the policy and say "only consider the first 2 actions for a while in your loss function", but how can I do that, especially when the policy doesn't connect with the env?
I thought maybe I could use self.global_timestep to change the learning regime from time to time in a customized policy loss function. Is that correct?
My first approach (the freeze/detach) is also working, so I'm just asking out of curiosity :grin:

Hi @hossein836 ,

Sure. I think policy.global_timestep would be a valid way to go here.
The way I understand curriculum learning is consistent with the examples here.
What you are trying to do, as far as I can see, intuitively makes sense together with curriculum learning, since it's about having a task at hand and at first optimizing for it with only a subset of the actions, later adding the full set of actions.
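
As a rough illustration of that idea (plain PyTorch, not a drop-in RLlib loss; the warmup threshold and the way per-action log-probs are combined are assumptions):

    import torch

    def combined_logp(logp1, logp2, logp3, global_timestep, warmup_steps=500_000):
        """Sketch: combine per-action log-probs of a multi-action distribution,
        but stop gradients through the third action's term early in training.

        global_timestep would come from e.g. policy.global_timestep inside a
        customized loss function.
        """
        if global_timestep < warmup_steps:
            # Early on, the third action contributes no gradient signal,
            # so the policy effectively only learns actions 1 and 2.
            logp3 = logp3.detach()
        return logp1 + logp2 + logp3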

Will you freeze the first learned actions later? What’s the environment?

I use a custom env where I must take 3 actions every step. I wanted to first learn the first two actions and then all three actions together, so you are right again :smile:, I could use curriculum learning this way. The pseudocode would then be something like:

from random import randint

# during the first curriculum task, ignore the policy's third action
# and replace it with a random one
if task == 1:
    action1, action2, _ = action
    action3 = randint(0, 5)
else:
    action1, action2, action3 = action

where action is the tuple passed into the step method.
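Inside the env, that could look roughly like this sketch (gymnasium-style, with placeholder spaces, reward, and termination; the task would typically be switched from outside, e.g. via RLlib's curriculum/task-settable env mechanism):

    from random import randint

    import gymnasium as gym
    from gymnasium import spaces

    class ThreeActionEnv(gym.Env):
        """Sketch: in curriculum task 1, the policy's third action is ignored
        and replaced by a random one; in task 2 all three actions are used."""

        def __init__(self, config=None):
            self.action_space = spaces.Tuple(
                (spaces.Discrete(6), spaces.Discrete(6), spaces.Discrete(6)))
            self.observation_space = spaces.Discrete(10)  # placeholder
            self.task = 1  # switched externally by the curriculum logic
            self.t = 0

        def set_task(self, task):
            self.task = task

        def reset(self, *, seed=None, options=None):
            super().reset(seed=seed)
            self.t = 0
            return 0, {}

        def step(self, action):
            if self.task == 1:
                action1, action2, _ = action
                action3 = randint(0, 5)  # third action is random during task 1
            else:
                action1, action2, action3 = action
            self.t += 1
            reward = float(action1 + action2 + action3)  # placeholder reward
            terminated = self.t >= 100
            return 0, reward, terminated, False, {}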
Note that the env requires some action3 every step, it's not optional. The small problem with this method is that action3 is still part of the gradient calculation: even though the executed action3 is random, there is a chance the policy's action3 accidentally drifts away from complete randomness. Imagine the first two steps get good rewards and action3 happened to be 2; then the policy's action3 for the next steps is not totally random anymore. It becomes random again if we continue, but for some steps it isn't. It's not a big problem though.
But the detach (freeze) approach was more intuitive for me in the first place, and a little safer.

Sure! I think this very much depends on the dynamics of your environment.
If you can imagine a policy learning actions 1 and 2 while being hardened against variations in action 3, then this might actually work. I'm curious to see the results, especially a comparison to learning all actions at once! Would you post them here, if possible and not too much of a hassle? Cheers

Currently I'm changing my model, so it will take some time, but I will share the results whenever it's finished. :+1:
