Hello,
I am currently trying to apply RL to a global optimization problem. I was able to train a single-agent Soft Actor-Critic (SAC) agent on my custom environment using the Stable Baselines3 library, and I would now like to continue this research by moving to multi-agent RL with Ray's RLlib. As a first step I have been trying to reproduce my Stable Baselines3 results in RLlib, but so far I have not been able to.
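For reference, my Stable Baselines3 setup was roughly the following (simplified; the import path for the environment is just a placeholder and the hyperparameters are from memory):

from stable_baselines3 import SAC

from single_env import env  # placeholder import; this is my custom gym environment instance

# Train the SB3 SAC baseline that I am trying to reproduce in RLlib
model = SAC("MlpPolicy", env, learning_rate=3e-4, verbose=1)
model.learn(total_timesteps=200_000)
model.save("sac_baseline")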
My custom environment has the following properties (a simplified sketch of the environment follows the list):
- single-agent environment (to be converted to multi-agent in the future)
- continuous action space in the range (0, 1), dim = (4,)
- observation space, dim = (16,)
- each episode terminates after 1 timestep
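This is roughly what the environment looks like (a simplified sketch; the actual reward computation on the dataset is omitted):

import gym
import numpy as np
from gym import spaces


class SingleAgentEnv(gym.Env):
    """Simplified sketch of the custom environment (real reward logic omitted)."""

    def __init__(self, config=None):
        # 4 continuous actions in (0, 1) and a 16-dimensional observation
        self.action_space = spaces.Box(low=0.0, high=1.0, shape=(4,), dtype=np.float32)
        self.observation_space = spaces.Box(low=-np.inf, high=np.inf, shape=(16,), dtype=np.float32)

    def reset(self):
        return self.observation_space.sample()

    def step(self, action):
        reward = 0.0  # placeholder: the real environment computes an energy-efficiency reward (Mbits/J)
        done = True   # every episode terminates after a single timestep
        return self.observation_space.sample(), reward, done, {}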
The major problem I am facing is that the actions are always extreme: most of the time they are either very close to 1 or exactly 0.
Can someone tell me what I am doing wrong?
import ray
import numpy as np
import matplotlib.pyplot as plt
from ray.rllib.agents.registry import get_trainer_class

import RlHelper
from single_env_rllib import env
filename = '/home/ichbinram/Documents/IFN/dataset/dset4.h5'

# Start from RLlib's default SAC config and override the relevant fields
agent_class, config = get_trainer_class("SAC", return_config=True)
config['env'] = env
config['framework'] = 'torch'
config['lr'] = 0.0003
config['horizon'] = 1                      # each episode lasts a single timestep
config['normalize_actions'] = True         # action space is Box(0, 1, shape=(4,))
config['timesteps_per_iteration'] = 200

stop = {'timesteps_total': 200000}
log_dir = './trials'

# Train, reload the resulting checkpoint, and evaluate on the dataset
trainer = RlHelper.RlHelper(config=config, save_dir=log_dir)
checkpoint_path, analysis = trainer.train(stop_criteria=stop)
trainer.load(checkpoint_path)
reward, p_max = trainer.test(filename)

# Average the rewards over runs and plot them against p_max
plot = np.reshape(reward, (int(len(reward) / 51), 51))
plot = np.nanmean(plot, axis=0)
plt.plot(p_max, np.transpose(plot))
plt.ylabel('reward (Mbits/J)')
plt.xlabel('p_max (dBW)')
plt.show()
I have also created a custom training class (RlHelper) following the pattern shown in this GitHub issue; a simplified sketch of what it does is below.
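Roughly, RlHelper trains via tune, picks the best checkpoint, and restores a trainer from it (simplified sketch; the test() method that evaluates on the dataset is omitted):

from ray import tune
from ray.rllib.agents.registry import get_trainer_class


class RlHelper:
    """Simplified sketch of the helper class (test() omitted)."""

    def __init__(self, config, save_dir):
        self.config = config
        self.save_dir = save_dir
        self.agent = None

    def train(self, stop_criteria):
        # Run the SAC trainer through tune and keep the best checkpoint by episode reward
        analysis = tune.run(
            "SAC",
            config=self.config,
            stop=stop_criteria,
            local_dir=self.save_dir,
            checkpoint_at_end=True,
        )
        best_trial = analysis.get_best_trial("episode_reward_mean", mode="max")
        checkpoint_path = analysis.get_best_checkpoint(
            trial=best_trial, metric="episode_reward_mean", mode="max"
        )
        return checkpoint_path, analysis

    def load(self, checkpoint_path):
        # Rebuild the trainer from the same config and restore the checkpoint
        agent_class, _ = get_trainer_class("SAC", return_config=True)
        self.agent = agent_class(config=self.config)
        self.agent.restore(checkpoint_path)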