Will the added model be saved and loaded?

Halman · June 25, 2023, 6:31pm

How severe does this issue affect your experience of using Ray?

High: It blocks me to complete my task.

I am running reinforcement learning with the SAC algorithm in RLlib (Ray 2.3).

Since the input is image data, the model consists of a CNN connected to an MLP.

In my experiments, I add a VAE to the training: the CNN serves as the Encoder, and I add a Decoder alongside it.

When I resume training from a checkpoint, the model does not behave as if all of its parameters had been restored.

Therefore, I suspect that when a new model is added (the VAE in this case), some of the network parameters are not being saved or loaded properly.

Am I correct in my assumption? If so, what code modifications would need to be made?

kourosh · June 25, 2023, 7:16pm

Hi @Halman,

If you look at the RLlib code, we do save and load all the states.

github.com/ray-project/ray/blob/master/rllib/policy/torch_policy_v2.py#L955-L967

    @override(Policy)
    @DeveloperAPI
    def get_weights(self) -> ModelWeights:
        return {k: v.cpu().detach().numpy() for k, v in self.model.state_dict().items()}

    @override(Policy)
    @DeveloperAPI
    def set_weights(self, weights: ModelWeights) -> None:
        weights = convert_to_torch_tensor(weights, device=self.device)
        if self.config.get("_enable_rl_module_api", False):
            self.model.set_state(weights)
        else:
            self.model.load_state_dict(weights)

It would be helpful if you could share a minimal repro (simple and reproducible) to validate your hypothesis. Something like a unit test would work, and we could help you from there.
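A unit-test-style repro along these lines might look like the sketch below. It uses plain PyTorch rather than RLlib, and `TinyVAEModel`, `get_weights`, `set_weights`, and `weights_round_trip_ok` are hypothetical names that only mirror the state_dict round trip in the snippet above; the real custom model would take the place of the stand-in.

```python
import torch
import torch.nn as nn


class TinyVAEModel(nn.Module):
    """Hypothetical stand-in for a CNN encoder with an added decoder."""

    def __init__(self) -> None:
        super().__init__()
        self.encoder = nn.Linear(8, 4)
        self.decoder = nn.Linear(4, 8)  # the "added" part of the model


def get_weights(model: nn.Module) -> dict:
    # Same conversion as in the policy snippet: state_dict -> numpy arrays.
    return {k: v.cpu().detach().numpy() for k, v in model.state_dict().items()}


def set_weights(model: nn.Module, weights: dict) -> None:
    # Convert the numpy arrays back to tensors and load them into the model.
    tensors = {k: torch.from_numpy(v) for k, v in weights.items()}
    model.load_state_dict(tensors)


def weights_round_trip_ok() -> bool:
    # Two models with different random initializations.
    torch.manual_seed(0)
    source = TinyVAEModel()
    torch.manual_seed(1)
    target = TinyVAEModel()
    set_weights(target, get_weights(source))
    # After the round trip every parameter, decoder included, should match.
    target_state = target.state_dict()
    return all(
        torch.equal(value, target_state[key])
        for key, value in source.state_dict().items()
    )
```

If `weights_round_trip_ok()` returns `True` for the real model as well, the save/load path itself is probably not where the parameters are lost.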

Halman · June 26, 2023, 12:51am

@kourosh

Thank you for answering! I got it.

If all model parameters are saved and loaded, it is unclear what is causing the behavior I am seeing.

I cannot share my code in its entirety due to confidentiality issues.

However, I can share some of the results of the training to aid in the discussion.

As you can see, I resumed training for the second run just before the 3M-step mark, and there is a large discontinuity in actor_loss and td_error at that point.
The reward, on the other hand, drops a little but immediately returns to the saturation value of the first run.

This seems a bit strange, but can it be reasonably interpreted?

Sincerely,

Halman · June 26, 2023, 1:50am

I forgot to post the picture.

Rohan138 · June 29, 2023, 12:19am

Hi @Halman, I would recommend going to the code snippet kourosh mentioned above and printing the model’s state_dict in both get_weights and set_weights, as a starter. Check if the VAE layers are present in the state_dict printed.
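As an illustration of what that check can reveal, here is a minimal sketch in plain PyTorch (not RLlib). The class names are hypothetical. It relies on the fact that only submodules assigned as attributes of an `nn.Module` (or held in containers such as `nn.ModuleList`) are registered; a layer kept in a plain Python list is not, so it never shows up in `state_dict()`.

```python
import torch.nn as nn


class RegisteredDecoder(nn.Module):
    """Decoder assigned as an attribute, so PyTorch registers it."""

    def __init__(self) -> None:
        super().__init__()
        self.encoder = nn.Linear(8, 4)
        self.decoder = nn.Linear(4, 8)


class UnregisteredDecoder(nn.Module):
    """Decoder hidden in a plain list, so PyTorch does not register it."""

    def __init__(self) -> None:
        super().__init__()
        self.encoder = nn.Linear(8, 4)
        self.extra_layers = [nn.Linear(4, 8)]  # plain list, not nn.ModuleList


def state_dict_keys(model: nn.Module) -> list:
    # The keys printed here are exactly what get_weights would save.
    return sorted(model.state_dict().keys())


if __name__ == "__main__":
    print(state_dict_keys(RegisteredDecoder()))
    print(state_dict_keys(UnregisteredDecoder()))
```

If the VAE decoder keys are missing from the printed list for the real model, its parameters would not be included in any checkpoint built from `state_dict()`.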

Halman · June 29, 2023, 3:43am

@Rohan138 @kourosh

Thank you for your comment.
I found that the set_weights function in torch_policy_v2.py was not called when I resumed. Does this indicate that the resume is not working? Or is it possible that some other .py file is handling the resume? (I am using Ray 2.3 right now.)

Right now I am resuming training with resume=True, as follows. Is there anything I am doing incorrectly?

    results = tune.run(
        "SAC",
        stop=stop,
        config=config,
        verbose=True,
        checkpoint_at_end=True,
        local_dir=ARGS.exp,
        resume=True,
        fail_fast="raise",
        checkpoint_freq=1000,
    )

Rohan138 · June 30, 2023, 10:34pm

Ah, resume will restart the trial, but not restore the weights. To do the latter, you need to pass in restore=path_to_your_checkpoint, I believe. Also, if you upgrade to Ray 2.5.1, you can instead use the ray.tune.Tuner class, which is better documented and maintained; tune.run will be deprecated shortly.
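A rough sketch of that change, assuming (as suggested above) that `tune.run` in Ray 2.3 accepts a `restore` argument holding a checkpoint path. The helper `build_tune_run_kwargs` and the example path are hypothetical, and dropping `resume` when `restore` is set is a choice made for this sketch, not documented Ray behavior.

```python
from typing import Any, Dict, Optional


def build_tune_run_kwargs(
    base_kwargs: Dict[str, Any],
    checkpoint_path: Optional[str] = None,
) -> Dict[str, Any]:
    """Return keyword arguments for tune.run, optionally restoring weights.

    Assumption: `restore` points tune.run at a saved checkpoint, while
    `resume` only continues the experiment. `resume` is dropped here so
    the two options are not mixed.
    """
    kwargs = dict(base_kwargs)  # copy so the caller's dict is not mutated
    if checkpoint_path is not None:
        kwargs.pop("resume", None)
        kwargs["restore"] = checkpoint_path
    return kwargs


# Usage (requires Ray; the path is a placeholder, not a real checkpoint):
#
#   from ray import tune
#   kwargs = build_tune_run_kwargs(
#       {"stop": stop, "config": config, "checkpoint_freq": 1000},
#       checkpoint_path="/path/to/your/checkpoint",
#   )
#   results = tune.run("SAC", **kwargs)
```

Whether `set_weights` is then called on restore can be confirmed with the same print statements suggested earlier in the thread.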
