Custom Trainable for RLlib in Ray 2.3.0 with Ray Tune

I am trying to build a Trainable class for a stock-market trading environment, where the agent trains for a particular span of time and then validates on a validation environment. The default RLlib algorithms, when passed as a string to the trainable, do not do this out of the box, so I am trying to build a custom Trainable class (if there is a way to do this that I am missing, please let me know).

My custom Trainable class looks like the following:

import os

from ray import tune
from ray.rllib.algorithms.ppo import PPO


class MyTrainableClass(tune.Trainable):
    def setup(self, config: dict):
        # Pop the custom key so only valid PPO settings are passed on.
        self.train_iters = config.pop("training_iterations")
        self.algo = PPO(config=config)

    def step(self):
        # Train for the configured number of iterations ...
        for i in range(self.train_iters):
            results = self.algo.train()
        # ... then evaluate on the validation environment.
        # validation_func / validation_env come from the rest of my script (not shown).
        sharpe_ratio = validation_func(self.algo, validation_env)
        # A class-based Trainable reports metrics via the dict returned from step().
        results["sharpe"] = sharpe_ratio
        return results

    def save_checkpoint(self, tmp_checkpoint_dir):
        checkpoint_path = os.path.join(tmp_checkpoint_dir, "model.pth")
        self.algo.save_checkpoint(checkpoint_path)
        return tmp_checkpoint_dir

    def load_checkpoint(self, tmp_checkpoint_dir):
        checkpoint_path = os.path.join(tmp_checkpoint_dir, "model.pth")
        # Algorithm.restore() restores in place; it does not return a new object.
        self.algo.restore(checkpoint_path)

But in Ray 2.3.0 we are expected to port our environments from the old Gym API to the Gymnasium API. After doing that, I am getting this error:

2023-03-20 02:15:22,685	INFO worker.py:1544 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265 
(RolloutWorker pid=1663) 2023-03-20 02:15:28,133	WARNING env.py:156 -- Your env doesn't have a .spec.max_episode_steps attribute. Your horizon will default to infinity, and your environment will not be reset.
(RolloutWorker pid=1663) 2023-03-20 02:15:28,133	WARNING env.py:166 -- Your env reset() method appears to take 'seed' or 'return_info' arguments. Note that these are not yet supported in RLlib. Seeding will take place using 'env.seed()' and the info dict will not be returned from reset.

This environment worked well when I used a default RLlib trainable such as trainable="PPO", but it fails with my custom Trainable. Even when I changed the environment back to the earlier style, it complained that the environment check fails (I also set disable_env_checking=True, and it still fails).

So is there a better way to do this, i.e. training followed by validation? And am I missing something in the code above?

python==3.10.6
Ray==2.3.0
Ubuntu 22.04

Hi @Athe-kunal,

Thanks, good catch. I’ll open a PR to remove these warnings.
They are actually warnings, not errors, though.
Are you also getting errors?

Cheers

My testing stops after getting these warnings; I had to change my environment back to the old style to get this working.
Also, is there a better way to run our RL agent on a validation set with RLlib? As mentioned in the question, I am trying to train my model and then tune it on the validation set to get the best hyperparameters. Is there a better way to do that apart from a custom Trainable? If not, please point me to a resource that shows how to do this properly.

Thanks
Astarag

Your training stops, as in it hangs? No error, but also no progress?
Usually you would use the same environment for training and validation.
If you are using offline learning, you can separate those datasets.
We have an offline RL example that illustrates how a config for evaluation is nested inside the regular config. You could use that to specify anything that diverges from the training config, such as a different path to a separate evaluation dataset.
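
Roughly, the nesting looks like this (a minimal sketch; the dataset paths below are placeholders, and the exact example in the docs may differ):

from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    # Training reads from one offline dataset (placeholder path).
    .offline_data(input_="/data/train_episodes.json")
    .evaluation(
        evaluation_interval=1,   # evaluate after every training iteration
        evaluation_duration=10,
        # Keys nested here override the training config during evaluation,
        # e.g. a different path for a held-out dataset (placeholder path).
        evaluation_config={"input": "/data/validation_episodes.json"},
    )
)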

Best

Hi @arturn, thank you for clearing that up.
I was using on-policy RL, and I use the same environment for training and validation. The only changes were to conform to the new Gymnasium style: reset() takes extra arguments, and step() additionally returns a truncated flag.
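
For reference, the Gymnasium-style signatures look roughly like this (class and helper names are placeholders, not my actual code):

import gymnasium as gym


class StockTradingEnv(gym.Env):  # placeholder name
    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)              # Gymnasium seeds through reset()
        obs = self._get_observation()         # hypothetical helper
        info = {}
        return obs, info                      # new: reset() returns (obs, info)

    def step(self, action):
        obs, reward, info = self._apply_action(action)  # hypothetical helper
        terminated = self._episode_done()     # natural end of the episode
        truncated = self._time_limit_hit()    # new: separate time-limit flag
        return obs, reward, terminated, truncated, info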

So I wanted to build a Trainable class where the RL agent trains on the training dataset; I would checkpoint it after training and then use it on the validation set. After running on the validation set, I report the metric (in the stock market, the Sharpe ratio), which helps the Bayesian hyperparameter-optimization search algorithm pick the next set of hyperparameters to try. I tried to do that (see the code in the first part of the thread), but it is not working for me. Can you tell me if I am missing something?

Also, if I am using a Bayesian search algorithm and passing metric="episode_reward_mean" and mode="max", is it picking the next set of hyperparameters based on the training episode reward mean? We generally use a validation set for the search algorithm, if I am not wrong. Is it doing something similar? If not, is it fine to use the training episode_reward_mean for the Bayesian search algorithm?
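
For context, this is roughly how I am wiring the searcher (simplified; the search space and sample count are placeholders):

from ray import tune
from ray.tune.search.bayesopt import BayesOptSearch

tuner = tune.Tuner(
    MyTrainableClass,
    tune_config=tune.TuneConfig(
        metric="episode_reward_mean",  # reported by RLlib from the training rollouts
        mode="max",
        search_alg=BayesOptSearch(),
        num_samples=20,                # placeholder
    ),
    param_space={"lr": tune.uniform(1e-5, 1e-3)},  # placeholder search space
)
results = tuner.fit()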

What is the point of checkpointing between training and validation?
RLlib’s validation uses the latest trained policy. It can also alternate between training and validation, which afaics is what you are looking for.
Maximizing episode_reward_mean is the default in RLlib; you should not have to specify it explicitly.
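
A rough sketch of such a config (the env name and the custom evaluation function are placeholders, not something from your code):

from ray.rllib.algorithms.ppo import PPOConfig


def sharpe_eval(algorithm, eval_workers):
    # Placeholder custom evaluation function: roll out the current policy on
    # the evaluation workers and return whatever metrics you care about.
    ...
    return {"sharpe": 0.0}


config = (
    PPOConfig()
    .environment(env="YourTradingEnv")   # placeholder env name
    .evaluation(
        evaluation_interval=1,           # alternate: evaluate after every train() call
        evaluation_num_workers=1,
        evaluation_duration=10,          # evaluation episodes per round
        custom_evaluation_function=sharpe_eval,
    )
)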

Please post a full reproduction script that produces the error you refer to so that we can look into it.

Thank you for your response. I am getting this error when I run with the custom Trainable class:


As you can see, the run stops after that even though it is only a warning. A reproduction script is hard to provide because I am using some locally saved dataframes, sorry about that. I am also attaching my custom Trainable class.

import os

from ray import tune
from ray.tune.logger import pretty_print
from ray.rllib.algorithms.ppo import PPOConfig


class MyTrainableClass(tune.Trainable):
    def setup(self, config: dict):
        self.algo = (
            PPOConfig()
            .environment(env="StockTrading_train_env")
            .framework("torch")
            .build()
        )

    def step(self):
        results = self.algo.train()
        print(pretty_print(results))
        return results

    def save_checkpoint(self, tmp_checkpoint_dir):
        checkpoint_path = os.path.join(tmp_checkpoint_dir, "model.pth")
        self.algo.save_checkpoint(checkpoint_path)
        return tmp_checkpoint_dir

    def load_checkpoint(self, tmp_checkpoint_dir):
        checkpoint_path = os.path.join(tmp_checkpoint_dir, "model.pth")
        # Algorithm.restore() restores in place; it does not return a new object.
        self.algo.restore(checkpoint_path)

My custom Trainable class looks like the code above. It seems RLlib still expects environments that follow the old gym API from before gymnasium 0.26.2. Please let me know if there is any information I have not provided.

Also, I wanted to ask whether Ray Tune automatically sets aside a validation set and computes its episode_reward_mean to select the next set of hyperparameters via Bayesian optimization. If so, can you point me to resources on how it splits the training and validation data?