TorchMultiCategorical with logits calculated in the constructor

iamhatesz · October 4, 2021, 3:02pm

Hi!

I am experimenting with autoregressive action distributions - I used RLlib examples as a starter kit.

I am trying to solve a dummy environment that outputs a random number (e.g. 5). The goal is to provide two numbers which, when added to the observation, give a target value (e.g. 10). You can find the complete definition of the environment here.
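For readers without access to the linked definition, here is a minimal sketch of what such an environment could look like. The class name `SumToTargetEnv`, the 0/1 reward, the value ranges and the single-step episodes are all assumptions made for illustration; the actual environment in the linked repository may differ.

```python
import random


class SumToTargetEnv:
    """Hypothetical stand-in for the linked environment (not the original code)."""

    def __init__(self, target=10, seed=None):
        self.target = target
        self.rng = random.Random(seed)
        self.obs = None

    def reset(self):
        # The observation is a random integer in [0, target].
        self.obs = self.rng.randint(0, self.target)
        return self.obs

    def step(self, action):
        # The action is a pair (a, b); the episode ends after one step.
        a, b = action
        reward = 1.0 if self.obs + a + b == self.target else 0.0  # assumed reward
        return self.obs, reward, True, {}


env = SumToTargetEnv(seed=0)
obs = env.reset()
_, reward, done, _ = env.step((env.target - obs, 0))
```

The second action component only matters in combination with the first, which is what makes the task a natural test bed for autoregressive action heads.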

While working on this topic, I found that AR models work much worse than a naive approach, which is completely counter-intuitive. I started digging deeper and created a FakeTorchMultiCategorical action distribution, which mimics the behavior of TorchMultiCategorical, but instead of accepting the concatenated logits of both actions, it accepts the internal features of the model, and the logits are calculated and concatenated inside the constructor (see here). I also verified that the algorithm I am using, PPO, isn’t doing anything strange between model inference and action distribution instantiation. So, inside ppo_torch_policy.py I found:

logits, state = model(train_batch)
curr_action_dist = dist_class(logits, model)

which looks fine. I’ve just moved logits calculation from model to dist_class, in some sense similar to one of your examples.
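To make the two setups concrete, here is a toy sketch in plain Python (no RLlib or PyTorch). The names `ToyModel`, `LogitsDist` and `FeaturesDist` are hypothetical; they only mirror the `dist_class(inputs, model)` call shape shown above. As long as the model parameters are the same at the moment the distribution is built, both routes give identical logits.

```python
def linear(weights, x):
    """Bias-free linear layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


class ToyModel:
    """Hypothetical model with a shared trunk and two action heads."""

    def __init__(self):
        self.head_a = [[0.5, -0.2], [0.1, 0.3]]
        self.head_b = [[-0.4, 0.6], [0.2, 0.2]]

    def features(self, obs):
        return [obs, 1.0]

    def logits(self, obs):
        f = self.features(obs)
        return linear(self.head_a, f) + linear(self.head_b, f)


class LogitsDist:
    """Baseline style: the model already produced the concatenated logits."""

    def __init__(self, inputs, model):
        self.logits = inputs


class FeaturesDist:
    """'Fake' style: the constructor turns features into logits via the model."""

    def __init__(self, inputs, model):
        self.logits = linear(model.head_a, inputs) + linear(model.head_b, inputs)


model = ToyModel()
baseline = LogitsDist(model.logits(0.5), model)
fake = FeaturesDist(model.features(0.5), model)
same_logits = baseline.logits == fake.logits
```

The equivalence relies on the heads not changing between the forward pass and the constructor call, which holds at sampling time but not necessarily when a distribution is rebuilt later from stored inputs.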

This, however, completely breaks the training, and the agent can no longer achieve decent performance. You can see the training plots here. The plots (e.g. losses and entropy) of the baseline run are much better than the remaining ones. The question is: why is there a difference between the baseline run and the fake_multicategorical run?

Do you have any idea what I am doing wrong?

iamhatesz · October 5, 2021, 5:36pm

One idea I see is that when we construct:

prev_action_dist = dist_class(train_batch[SampleBatch.ACTION_DIST_INPUTS], model)

after the first SGD iteration, we are actually calculating the final logits using the updated model, but with the old hidden state (features/context). However, as far as I can tell, this is not used in training, just to log KL metrics:

action_kl = prev_action_dist.kl(curr_action_dist)
mean_kl = reduce_mean_valid(action_kl)
# ...
policy._mean_kl = mean_kl
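A toy illustration of this effect, in plain Python with made-up numbers. To isolate the role of the action head, the sketch assumes the feature trunk is unchanged by the update, so the stored features equal the current ones; `head_old` and `head_new` are hypothetical weights before and after one SGD step.

```python
import math


def linear(weights, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q):
    """KL(p || q) for two categorical distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


features = [2.0, 0.0]                  # what gets stored as the dist inputs
head_old = [[1.0, 0.0], [0.0, 1.0]]    # head weights at sampling time
head_new = [[0.0, 1.0], [1.0, 0.0]]    # head weights after an SGD step

true_prev = softmax(linear(head_old, features))     # policy that sampled the action
prev_rebuilt = softmax(linear(head_new, features))  # rebuilt with the updated head
curr = softmax(linear(head_new, features))          # current policy

kl_logged = kl(prev_rebuilt, curr)  # what a rebuilt "previous" dist reports
kl_true = kl(true_prev, curr)       # divergence from the real old policy
```

In this setup the rebuilt distribution coincides with the current one, so the reported KL is zero even though the policy has changed substantially.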

mannyv · October 6, 2021, 12:49pm

@iamhatesz,

This part seems off:

https://github.com/iamhatesz/rl-ar/blob/6bd888e1821a43ce81224bfc70610d36eb95c135/rl_ar/models/baseline.py#L59-L60

          logits_a = self.forward_action_a(features)
          logits_b = self.forward_action_b(features)

You pass the features through the autoregressive action layers in the forward call of the model. Then rllib will end up passing those outputs back through again inside the action distribution.

iamhatesz · October 6, 2021, 12:54pm

No, only the BaselineModel has these passes in its forward method, and this model is supposed to work with the built-in TorchMultiCategorical distribution. All the remaining models (including the one for fake_multicategorical) override the forward method and calculate only the features/context there (see here).

In short: there is my custom implementation (FakeTorchMultiCategorical) of the built-in TorchMultiCategorical action distribution. The only difference between the two is that the former expects the model to output context features and calculates the final logits in the constructor, while the latter has the logits calculation done in the model. In theory, they should produce exactly the same output. But they don’t, and I can’t understand why.

mannyv · October 6, 2021, 1:18pm

@iamhatesz,

Sorry, my fault, I misread.

BTW, if your kl_coeff is > 0, then the action_kl is used in the loss here:

https://github.com/ray-project/ray/blob/234b015b426274d461a15345a4d4724a08bc5289/rllib/agents/ppo/ppo_torch_policy.py#L104
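As a rough sketch of why this matters, the per-sample PPO objective with a KL penalty usually has the shape below. The function and argument names are illustrative rather than RLlib's exact code; the linked line is the authoritative version.

```python
def ppo_total_loss(surrogate, action_kl, vf_loss, entropy,
                   kl_coeff, vf_loss_coeff, entropy_coeff):
    """Simplified per-sample PPO loss with a KL penalty term (illustrative)."""
    return (
        -surrogate
        + kl_coeff * action_kl          # vanishes only when kl_coeff == 0
        + vf_loss_coeff * vf_loss
        - entropy_coeff * entropy
    )


common = dict(surrogate=1.0, action_kl=0.5, vf_loss=0.0, entropy=0.0,
              vf_loss_coeff=1.0, entropy_coeff=0.0)
loss_with_kl = ppo_total_loss(kl_coeff=0.2, **common)
loss_without_kl = ppo_total_loss(kl_coeff=0.0, **common)
```

With a nonzero `kl_coeff`, any error in `action_kl` feeds directly into the gradient, so an inaccurate `prev_action_dist` affects training and not only the logged metric.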

iamhatesz · October 6, 2021, 2:03pm

Thanks @mannyv, you’re right. After zeroing kl_coeff I get exactly the same plots for both runs. The performance dropped drastically, but hopefully this can be restored by tuning clip_param :).
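The workaround amounts to a config override along these lines. `kl_coeff` and `clip_param` are PPO config keys in RLlib; the `clip_param` value here is only a placeholder to tune, not a recommendation from this thread.

```python
# Hypothetical PPO config overrides for the workaround discussed above.
ppo_overrides = {
    "kl_coeff": 0.0,    # disable the KL penalty so action_kl no longer enters the loss
    "clip_param": 0.2,  # placeholder value; tune to recover performance
}
```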

Nevertheless, it would be worth fixing prev_action_dist, as even the official example probably suffers from this issue. Ideally, the old model should be available in ppo_surrogate_loss, or do you have a better idea? I am happy to prepare a PR.

mannyv · October 6, 2021, 4:21pm

@iamhatesz

There is an open issue for this but it has not gotten much attention yet.

github.com/ray-project/ray

[rllib] Bug of AutoRegressiveDistribution: the old policy still can have access to current model parameters when using PPO

opened 04:10AM - 15 Jul 21 UTC

yangysc

bug triage rllib

### What is the problem?

For example, when initializing a **TorchBinaryAutoregressiveDistribution** instance, the `_a1_distribution` and `_a2_distribution` need to call `self.model.action_module`, which holds the model parameters defined in the policy network.

https://github.com/ray-project/ray/blob/86d0159c0af5cc613cf91005d61edff8b761f84b/rllib/examples/models/autoregressive_action_dist.py#L132-L143

When using PPO, we need to calculate the difference between the old policy and the current policy:

https://github.com/ray-project/ray/blob/ac54164e73cf554ea51452ba3fb4ece4f7017623/rllib/agents/ppo/ppo_torch_policy.py#L45-L46
https://github.com/ray-project/ray/blob/ac54164e73cf554ea51452ba3fb4ece4f7017623/rllib/agents/ppo/ppo_torch_policy.py#L67-L68
https://github.com/ray-project/ray/blob/ac54164e73cf554ea51452ba3fb4ece4f7017623/rllib/agents/ppo/ppo_torch_policy.py#L73

The possible problem is that `prev_action_dist` **uses the current model instead of the old policy**, so the `kl` between the two policies is not accurate.

- **How can we back up an old model before the current training `K` epochs?**
- Another solution is to save all logits into SampleBatch and use them in calculating the PPO loss.
- Another solution is to approximate the kl between two distributions using sampled actions: `kl(q, p) = r - 1 - log r` with `r = p(x)/q(x)`.

*Ray version and other system information (Python version, TensorFlow version, OS):*
- Python 3.8
- Pytorch 1.8
- Ubuntu

### Reproduction (REQUIRED)

Please provide a short code snippet (less than 50 lines if possible) that can be copy-pasted to reproduce the issue. The snippet should have **no external library dependencies** (i.e., use fake or mock data / environments):

If the code snippet cannot be run by itself, the issue will be closed with "needs-repro-script".

- [X] I have verified my script runs in a clean environment and reproduces the issue.
- [X] I have verified the issue also occurs with the [latest wheels](https://docs.ray.io/en/master/installation.html).
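The third suggestion in the issue can be sketched as follows. Assuming the actions `x` in the batch were sampled from the old policy `q`, the average of `r - 1 - log r` with `r = p(x)/q(x)` estimates KL(q || p), because the `r - 1` term has zero mean under `q`. It only needs the stored log-probabilities of the sampled actions, not the old model. The function names and the example distributions below are made up for illustration.

```python
import math
import random


def exact_kl(q, p):
    """Exact KL(q || p) for two categorical distributions."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)


def sampled_kl(q, p, n_samples, seed=0):
    """Estimate KL(q || p) from actions sampled under q."""
    rng = random.Random(seed)
    actions = rng.choices(range(len(q)), weights=q, k=n_samples)
    total = 0.0
    for x in actions:
        r = p[x] / q[x]                  # probability ratio new / old
        total += r - 1.0 - math.log(r)   # non-negative for every sample
    return total / n_samples


q_old = [0.5, 0.3, 0.2]   # old policy (made-up numbers)
p_new = [0.4, 0.4, 0.2]   # current policy (made-up numbers)
kl_exact = exact_kl(q_old, p_new)
kl_estimate = sampled_kl(q_old, p_new, n_samples=20000)
```

Each per-sample term is non-negative, so the estimate never goes below zero, unlike the plain `-log r` estimator.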
