Jump-Start Reinforcement Learning

So I enabled the two options instead of implementing them myself:

{"log_std_clip_param": 1}
{"free_log_std": True}

It didn’t seem to slow down the training this time (as in how long it takes to process the data), but now my model just didn’t learn anything.

Update: I commented out {"free_log_std": True} for now and set log_std_clip_param to 20; if I still get NaNs I can work down from there.
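For anyone else hitting this, a rough sketch of where these settings live (assuming a recent RLlib with the PPOConfig builder and that log_std_clip_param is supported as a model config key in your version; the env name is just a placeholder):

from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    .environment("Pendulum-v1")  # placeholder continuous-action env
    .training(
        model={
            # "free_log_std": True,     # commented out for now, as above
            "log_std_clip_param": 20,   # start high, work down if NaNs persist
        },
    )
)
algo = config.build()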

Hey @Samuel_Fipps, that seems to be a good strategy. Setting {"free_log_std": True} makes the log_stds a parameter of the model, and just like you said, I found that the agent normally doesn’t learn anything. As you lower log_std_clip_param you’ll hopefully find a spot where the agent learns but no NaNs appear. What sort of action space are you working with? Is it normalized?
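As a side note, here is a toy illustration (not RLlib’s actual implementation) of what "free" log_stds mean: the log_std becomes a learned, state-independent parameter appended to the mean output, instead of a second head computed from the observation.

import torch
import torch.nn as nn

class FreeLogStdHead(nn.Module):
    """Toy policy head with state-independent (free) log_stds."""

    def __init__(self, in_dim: int, act_dim: int):
        super().__init__()
        self.mean = nn.Linear(in_dim, act_dim)
        # One learned log_std per action dimension, shared across all observations.
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        mean = self.mean(features)
        # Same layout as the usual [mean, log_std] concatenation.
        return torch.cat([mean, self.log_std.expand_as(mean)], dim=-1)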

best,

Tyler

@tlaurie99, @Samuel_Fipps

This is interesting to know. This is the way cleanrl and sb3 handle it. I wonder what is different in their implementation. Do either of you have a setup you could share with me? I would like to experiment with a few things, but I don’t have a problem handy that is experiencing this issue.

I have done some experimenting with hand-coded policies designed to produce the NaNs, and I can confidently say that a log_std value less than -25 will produce NaNs in the backward pass. I set the log_std_clip_param to -12 and even this did not prevent NaNs.
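If anyone wants to poke at this themselves, a minimal probe along these lines (a sketch of the setup, not my exact test harness) looks like this; the exact threshold will depend on the dtype and on how far the sampled action sits from the mean:

import torch

# Build the action distribution the way the Torch policy does,
# Normal(mean, exp(log_std)), with a very negative log_std, then check
# whether the log-prob and its gradients stay finite.
mean = torch.zeros(1, requires_grad=True)
log_std = torch.tensor([-30.0], requires_grad=True)

dist = torch.distributions.Normal(mean, torch.exp(log_std))
logp = dist.log_prob(torch.tensor([0.1]))  # an action far outside the tiny std
logp.sum().backward()

print("logp finite:", torch.isfinite(logp).all().item())
print("mean grad finite:", torch.isfinite(mean.grad).all().item())
print("log_std grad finite:", torch.isfinite(log_std.grad).all().item())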

One way I did manage to prevent the NaNs was to make the following modification, depending on whether you are using RLModules or not:

Without rl_module:

self.dist = torch.distributions.normal.Normal(mean, torch.exp(log_std) + 0.00001)

With rl_module:

return torch.distributions.normal.Normal(loc, scale + 0.00001)

Here the value of 0.00001 is an epsilon to prevent the NaNs. It corresponds to a log_std of -11.512925464970229; I picked it because it effectively floors the std around that value while having no appreciable effect on the distribution for larger stds.
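A quick way to see where that number comes from: the epsilon acts as a floor on the std, and in log-space that floor is log(1e-5).

import math

# The epsilon added to the scale floors the std near 1e-5, which in
# log-space is log(1e-5) ≈ -11.5129.
print(math.log(1e-5))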

I have verified that it does eliminate the NaNs in my cases, but I do not actually have a learning problem with this issue, so I do not know how it will affect learning. I would expect that it would not, but :man_shrugging:.

I can try that at some point. Have you tried something like this (see below)? Sadly, my setup is not easily shareable.

    @override(TorchModelV2)
    def forward(
        self,
        input_dict: Dict[str, TensorType],
        state: List[TensorType],
        seq_lens: TensorType,
    ) -> (TensorType, List[TensorType]):
        obs = input_dict["obs_flat"].float()
        self._last_flat_in = obs.reshape(obs.shape[0], -1)
        self._features = self._hidden_layers(self._last_flat_in)
        logits = self._logits(self._features) if self._logits else self._features
        if self.free_log_std:
            logits = self._append_free_log_std(logits)
 
 
        # ---------------------------------------------------------------------
        # Check for NaNs and replace them with a small constant (e.g. 1e-5).
        # ---------------------------------------------------------------------
        if torch.isnan(logits).any():
            logits[torch.isnan(logits)] = 1e-5

        return logits, state

Or checking this before it hits the log(0):

logp_ratio = torch.exp(
    curr_action_dist.logp(batch[Columns.ACTIONS]) - batch[Columns.ACTION_LOGP]
)

Checking for zeros and replacing them with 0.000001? Just a small number.
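One variant of that idea is sketched below (this is not RLlib’s actual loss code, just an assumption of how a guard could look): instead of replacing exact zeros after the fact, clamp the log-prob difference before exponentiating, so a -inf log-prob (a log(0)) cannot blow the ratio up into 0, inf, or NaN.

import torch

def safe_logp_ratio(curr_logp: torch.Tensor, old_logp: torch.Tensor,
                    clip: float = 20.0) -> torch.Tensor:
    # Clamp the difference of log-probs before exp(), so a -inf on either
    # side cannot produce a 0, inf, or NaN ratio.
    diff = torch.clamp(curr_logp - old_logp, min=-clip, max=clip)
    return torch.exp(diff)

# Example: an old log-prob of -inf would otherwise give ratio = inf.
print(safe_logp_ratio(torch.tensor([-1.0]), torch.tensor([float("-inf")])))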

This seems promising; my model needs a lot of steps to learn, so I won’t know until Monday.

Also, something I noticed (and mentioned already): log_std_clip_param can slow down my progress by 200%. I’m guessing it’s having to do a lot of clipping during training? Not too sure. I gather 51,200 steps before training, so that is quite a bit to loop through and clip. However, it still shouldn’t slow it down that much, unless the way it goes about it is not very efficient.

For example, numpy.append has to recreate the array every time you call it, so I am wondering if something similar is happening here.
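A quick illustration of the numpy.append behavior (this is just a generic NumPy fact, nothing RLlib-specific):

import numpy as np

# np.append always allocates a new array and copies the old data, so calling
# it in a loop costs a full copy per call and O(n^2) overall.
a = np.arange(3)
b = np.append(a, 4)
print(np.shares_memory(a, b))  # False: b is a fresh copy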

@Samuel_Fipps,

Fingers crossed.

I am not sure why log_std_clip_param is slowing down training so much. The torch clamp operation itself is not very computationally intensive, so I would not expect that to be the cause.

I am kind of taking a shot in the dark here, but my guess would be that there are a lot of instances being clamped. The clamping operation is not expensive, but a clamped value gets a zero gradient, so it kills the gradient for that sample. Any sample that is clamped will not be used to train any of the layers below it. That is my guess as to why training slows.
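A small check of the gradient point (a generic torch.clamp fact, not RLlib-specific): elements that land outside the clamp range get a zero gradient.

import torch

x = torch.tensor([-30.0, -5.0, 2.0], requires_grad=True)
y = torch.clamp(x, min=-12.0, max=12.0)
y.sum().backward()
# The first element was clamped, so its gradient is zero and that sample no
# longer trains the layers that produced it.
print(x.grad)  # tensor([0., 1., 1.])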