How severe does this issue affect your experience of using Ray?
- High: It blocks me to complete my task.
I have trained and saved a PPO model. Next, I want to use Ray Train to train a deep learning model, with inference from this trained PPO model used as a preprocessing step for my data. However, when the code enters train_func_per_worker, it appears to get stuck there without showing any errors or output. I have tried loading the same model outside of Ray Train and it works fine (a minimal version of that standalone check is included after the second script below), so apparently the problem is with how I'm using Ray Train.
This is the script used to train and save the model:
import os
import gymnasium as gym
import ray
from ray.rllib.algorithms.ppo import PPOConfig
from ray.tune.registry import register_env

env_name = "CartPole-v1"
register_env(env_name, lambda env_config: gym.make(env_name))

ppo = PPOConfig().environment(env=env_name).build()
for i in range(10):
    result = ppo.train()

save_folder = os.path.abspath("RL_OUTPUTS")
save_result = ppo.save(save_folder)
path_to_checkpoint = save_result.checkpoint.path
print(f"Trained model saved: {path_to_checkpoint}")

ray.shutdown()
Below is the script I'm using to load the model and use its inference as a preprocessing step for my data before passing it to the main model (which will be trained with Ray Train):
from typing import Dict
import os
import gymnasium as gym
import ray
import ray.train
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer
from ray.rllib.algorithms.algorithm import Algorithm
from ray.tune.registry import register_env


def train_func_per_worker(config: Dict):
    print("DEBUG: Entered into train_func_per_worker")

    # Change to the script's directory so the checkpoint can be loaded from a relative path.
    current_script_dir = os.path.dirname(os.path.abspath(__file__))
    os.chdir(current_script_dir)

    env_name = "CartPole-v1"
    register_env(env_name, lambda env_config: gym.make(env_name))

    save_folder = os.path.abspath("RL_OUTPUTS")
    ppo = Algorithm.from_checkpoint(save_folder)
    print("Algorithm Loaded")

    # The main model is trained after this point.


def train_summarization_model(num_workers=2, use_gpu=False):
    global_batch_size = 32
    train_config = {
        "lr": 1e-3,
        "epochs": 10,
        "batch_size_per_worker": global_batch_size // num_workers,
    }

    # Configure computation resources.
    scaling_config = ScalingConfig(num_workers=num_workers, use_gpu=use_gpu)

    # Initialize the Ray TorchTrainer.
    trainer = TorchTrainer(
        train_loop_per_worker=train_func_per_worker,
        train_loop_config=train_config,
        scaling_config=scaling_config,
    )

    # Start distributed training.
    result = trainer.fit()


if __name__ == "__main__":
    # Had to set this environment variable to be able to run at all with PyTorch 2.5.
    ray.init(runtime_env={"env_vars": {"USE_LIBUV": "0"}})
    train_summarization_model(num_workers=2, use_gpu=False)
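For reference, this is roughly the standalone check I ran to confirm the checkpoint loads fine outside Ray Train (a minimal sketch; it assumes the RL_OUTPUTS folder produced by the first script is next to this file):

import os
import gymnasium as gym
from ray.rllib.algorithms.algorithm import Algorithm
from ray.tune.registry import register_env

# Same setup as inside train_func_per_worker, but run directly in the driver process.
env_name = "CartPole-v1"
register_env(env_name, lambda env_config: gym.make(env_name))

save_folder = os.path.abspath("RL_OUTPUTS")
ppo = Algorithm.from_checkpoint(save_folder)
print("Algorithm loaded outside Ray Train")  # this line is reached without any hang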
These are my library versions on a Windows machine:
PyTorch version: 2.5.1+cpu
Ray version: 2.40.0
I have also tried executing this on Google Colab (to rule out any issues with my OS), but the same issue persists.