Unable to access custom model functions when wrapping it with torch.prepare_model

AxelN · March 10, 2023, 10:28am

We use:
ray 2.3.0
torch: 1.13.1

We have a model that has custom functions, for example:

class ExampleNetwork(torch.nn.Module):
    def forward(self):
        ...

    def custom_function_1(self):
        ...

    def custom_function_2(self):
        ...

model = ray.train.torch.prepare_model(ExampleNetwork())

When you use the function ray.train.torch.prepare_model, the resulting model works differently depending on the number of workers we use in the TorchTrainer (ray.train.torch.TorchTrainer).

If we only use one worker, the model does not need to be parallelized and is therefore still a plain nn.Module. If we use multiple workers, it needs to be parallelized and instead becomes a DistributedDataParallel (torch.nn.parallel.DistributedDataParallel).

The problem with this is that when the model is parallelized, the wrapper returned by prepare_model does not expose the custom functions, so to access them you need to change the calls from:
model.custom_function → model.module.custom_function
The standard functions like forward still work as intended.

Adding a type check on the model before every function call does not really scale, so I am wondering whether there is another way to prepare the model, or a way to wrap the resulting model, so that we do not need different training code depending on the number of workers we train on.

gjoliver · March 13, 2023, 7:57pm

Hi, this is a really good question.

As you said, we bypass the entire wrapping logic if world_size <= 1:

github.com/ray-project/ray/blob/master/python/ray/train/torch/train_loop_utils.py#L327

    # serializable. When serializing the model, we have to override the
    # `__getstate__` method to set back the original forward method.
    if hasattr(model, "__getstate__"):
        model._original_get_state = model.__getstate__
    # `__getstate__` must be a bound method rather than an callable attribute.
    # See https://stackoverflow.com/questions/972/adding-a-method-to-an-existing-object-instance.  # noqa: E501
    model.__getstate__ = types.MethodType(model_get_state, model)

world_size = session.get_world_size()

if parallel_strategy and world_size > 1:
    if parallel_strategy == "ddp":
        DataParallel = DistributedDataParallel
        if torch.cuda.is_available():
            parallel_strategy_kwargs = {
                "device_ids": [device],
                "output_device": device,
                **parallel_strategy_kwargs,
            }
    else:
        if not torch.cuda.is_available():

I will discuss this with the team internally and see how we can make this part of the experience better.
Thanks again for the feedback.

AxelN · March 14, 2023, 7:53am

Thanks for the response, and I appreciate you taking this further.

One temporary solution I have found so far is the following. I have not tested it extensively, so it might have unintended consequences:

wrapped_model = ray.train.torch.prepare_model(ExampleNetwork())
if ray.air.session.get_world_size() > 1:
    model = wrapped_model.module
else:
    model = wrapped_model
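
A variation on the workaround above, as a rough sketch rather than a tested recipe: instead of branching on the world size, a small helper can return the underlying model whenever the object it receives carries a `.module` attribute. The helper name `unwrap_model` and the class `FakeParallelWrapper` are hypothetical. The wrapper class is only a stand-in for DistributedDataParallel so the example runs without torch or a process group, and the sketch assumes the user's own model does not define an attribute called `module`.

```python
class ExampleNetwork:
    """Stand-in for the torch.nn.Module from the first post (no torch needed)."""

    def forward(self):
        return "forward"

    def custom_function_1(self):
        return "custom_1"


class FakeParallelWrapper:
    """Hypothetical stand-in for DistributedDataParallel: it keeps the
    original model on `.module` and exposes only `forward`."""

    def __init__(self, module):
        self.module = module

    def forward(self):
        return self.module.forward()


def unwrap_model(model):
    """Return the wrapped model if `model` has a `.module`, else `model` itself.

    Assumption: the user's model has no attribute of its own named `module`.
    """
    return getattr(model, "module", model)


# The same call covers the single-worker and the multi-worker case.
plain = ExampleNetwork()
wrapped = FakeParallelWrapper(plain)

for prepared in (plain, wrapped):
    custom_result = unwrap_model(prepared).custom_function_1()
    forward_result = prepared.forward()  # forward still goes through the wrapper
```

In real training code it is probably safer to keep calling the object returned by prepare_model for the forward pass and to use the unwrapped reference only for the custom functions, since, as far as I know, DistributedDataParallel coordinates gradient synchronization through its own forward call.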

gjoliver · March 14, 2023, 8:08am

yeah, seems like a nice workaround

Jules_Damji · March 14, 2023, 4:54pm

Thanks @gjoliver. @AxelN, it seems you are all sorted.
