I am experimenting with autoregressive action distributions, using the RLlib examples as a starter kit. I am trying to solve a dummy environment: the observation is a random number (e.g. 5), and the goal is to provide two numbers which, when added to the observation, hit a target value (e.g. 10). You can find the complete definition of the environment here.
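For reference, a minimal sketch of what such an environment looks like (the class name, spaces, and the exact reward shaping are my assumptions, not the linked definition):

```python
import numpy as np
import gymnasium as gym
from gymnasium.spaces import Box, MultiDiscrete


class SumToTargetEnv(gym.Env):
    """Dummy env: obs is a random integer; the agent picks two numbers
    whose sum with the observation should hit a fixed target (e.g. 10)."""

    def __init__(self, config=None):
        self.target = 10
        self.observation_space = Box(
            low=0.0, high=self.target, shape=(1,), dtype=np.float32
        )
        # Two discrete sub-actions, each in [0, target].
        self.action_space = MultiDiscrete([self.target + 1, self.target + 1])
        self.obs = None

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        self.obs = np.array(
            [self.np_random.integers(0, self.target + 1)], dtype=np.float32
        )
        return self.obs, {}

    def step(self, action):
        a1, a2 = action
        # Reward: negative distance from the target; one-step episodes.
        reward = -abs(float(self.obs[0]) + a1 + a2 - self.target)
        return self.obs, reward, True, False, {}
```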
While working on this, I found that the AR models perform much worse than a naive approach, which is completely counter-intuitive. I started digging deeper and created a FakeTorchMultiCategorical action distribution, which mimics the behavior of TorchMultiCategorical, but instead of accepting the concatenated logits of both actions, it accepts the model's internal features, and the logits are computed and concatenated inside the constructor (see here). I also verified that the algorithm I am using, PPO, isn't doing anything strange between model inference and action-distribution instantiation. So, inside ppo_torch_policy.py I found:
```python
logits, state = model(train_batch)
curr_action_dist = dist_class(logits, model)
```
which looks fine. In effect, I've only moved the logits computation from the model into the dist_class, similar in spirit to one of your examples (see the sketch below).
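To make the change concrete, here is a minimal sketch of what my FakeTorchMultiCategorical does (the head names `a1_head`/`a2_head` are my assumptions for illustration; the real definition is in the linked code):

```python
import torch
from ray.rllib.models.torch.torch_action_dist import TorchMultiCategorical


class FakeTorchMultiCategorical(TorchMultiCategorical):
    """Mimics TorchMultiCategorical, but receives the model's internal
    features instead of ready-made logits and computes the logits itself."""

    def __init__(self, inputs, model):
        # `inputs` are features from the model's shared trunk, not logits.
        # The two heads (hypothetical names) map features to per-action logits.
        a1_logits = model.a1_head(inputs)  # shape: [B, num_a1_categories]
        a2_logits = model.a2_head(inputs)  # shape: [B, num_a2_categories]
        logits = torch.cat([a1_logits, a2_logits], dim=-1)
        super().__init__(
            logits,
            model,
            input_lens=[a1_logits.shape[-1], a2_logits.shape[-1]],
        )
```

Everything after the constructor behaves exactly like TorchMultiCategorical; the only difference is where the logits are computed.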
This change, however, completely breaks the training, and the agent can no longer achieve decent performance. You can see the training plots here. The plots (e.g. losses and entropy) of the baseline run look much better than those of the remaining runs. The question is: why is there a difference between the baseline run and the remaining runs at all?
Do you have any idea what I am doing wrong?