How severe does this issue affect your experience of using Ray?
- High: It blocks me to complete my task.
I have trained and saved a PPO model. Next, I want to use Ray Train to train a deep learning model, with inference from this trained PPO model used as a preprocessing step for my data. However, when the code enters train_func_per_worker, it appears to get stuck there without showing any errors or output. I have tried loading the same model outside of Ray Train and it works fine (a minimal version of that standalone check is included after the second script below), so apparently the problem is with how I'm using Ray Train.
This is the script used to train and save the model:
import os
import gymnasium as gym
import ray
from ray.rllib.algorithms.ppo import PPOConfig
from ray.tune.registry import register_env

env_name = "CartPole-v1"
register_env(env_name, lambda env_config: gym.make(env_name))

ppo = PPOConfig().environment(env=env_name).build()
for i in range(10):
    result = ppo.train()

save_folder = os.path.abspath("RL_OUTPUTS")
save_result = ppo.save(save_folder)
path_to_checkpoint = save_result.checkpoint.path
print(f"Trained model saved: {path_to_checkpoint}")

ray.shutdown()
Below is the script I'm using to load the model and use its inference as a preprocessing step for my data before passing it to the main model (which will be trained with Ray Train):
from typing import Dict
import os
import gymnasium as gym
import ray
import ray.train
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer
from ray.rllib.algorithms.algorithm import Algorithm
from ray.tune.registry import register_env


def train_func_per_worker(config: Dict):
    print("DEBUG: Entered into train_func_per_worker")

    # Change to the script's directory so the checkpoint can be loaded from a relative path.
    current_script_dir = os.path.dirname(os.path.abspath(__file__))
    os.chdir(current_script_dir)

    env_name = "CartPole-v1"
    register_env(env_name, lambda env_config: gym.make(env_name))

    save_folder = os.path.abspath("RL_OUTPUTS")
    ppo = Algorithm.from_checkpoint(save_folder)
    print("Algorithm Loaded")

    # The main model is trained after this point.


def train_summarization_model(num_workers=2, use_gpu=False):
    global_batch_size = 32
    train_config = {
        "lr": 1e-3,
        "epochs": 10,
        "batch_size_per_worker": global_batch_size // num_workers,
    }

    # Configure computation resources.
    scaling_config = ScalingConfig(num_workers=num_workers, use_gpu=use_gpu)

    # Initialize the Ray TorchTrainer.
    trainer = TorchTrainer(
        train_loop_per_worker=train_func_per_worker,
        train_loop_config=train_config,
        scaling_config=scaling_config,
    )

    # Start distributed training.
    result = trainer.fit()


if __name__ == "__main__":
    # Had to set this environment variable to be able to run at all with PyTorch 2.5.
    ray.init(runtime_env={"env_vars": {"USE_LIBUV": "0"}})
    train_summarization_model(num_workers=2, use_gpu=False)
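For reference, this is roughly the standalone check I ran to confirm the checkpoint loads fine outside Ray Train (a minimal sketch; it assumes the RL_OUTPUTS folder produced by the first script is next to this file):

import os
import gymnasium as gym
from ray.rllib.algorithms.algorithm import Algorithm
from ray.tune.registry import register_env

# Same setup as inside train_func_per_worker, but run directly in the driver process.
env_name = "CartPole-v1"
register_env(env_name, lambda env_config: gym.make(env_name))

save_folder = os.path.abspath("RL_OUTPUTS")
ppo = Algorithm.from_checkpoint(save_folder)
print("Algorithm loaded outside Ray Train")  # this line is reached without any hang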
These are my library versions on a Windows machine:
PyTorch version: 2.5.1+cpu
Ray version: 2.40.0
I have also tried executing this on Google Colab (to rule out any issues with my OS), but the same issue persists.