Dear Ray Team,
I am working on a project with a custom environment in which the agent is supposed to buy, sell, or do nothing at each hour of a single day (24 hourly steps). I have a vanilla DQN implementation in PyTorch, which solves this problem via trial and error fairly quickly (3,000 episodes of epsilon-greedy exploration, about 5 minutes).
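The real environment is more involved, but its interface looks roughly like the stripped-down sketch below (the feature contents and reward logic are placeholders, and it assumes a recent RLlib that uses the gymnasium API):

import numpy as np
import gymnasium as gym

class TradingEnvSketch(gym.Env):
    """Minimal stand-in for the custom env: 24 hourly steps, 3 discrete actions."""

    def __init__(self, env_config=None):
        self.action_space = gym.spaces.Discrete(3)  # 0 = do nothing, 1 = buy, 2 = sell
        self.observation_space = gym.spaces.Box(
            low=-np.inf, high=np.inf, shape=(2,), dtype=np.float32
        )  # e.g. [hour of day, price] -- placeholder features
        self.hour = 0

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        self.hour = 0
        return self._obs(), {}

    def step(self, action):
        reward = 0.0  # profit/loss of the chosen action is computed here in the real env
        self.hour += 1
        terminated = self.hour >= 24  # one episode = one day of hourly decisions
        return self._obs(), reward, terminated, False, {}

    def _obs(self):
        return np.array([self.hour, 0.0], dtype=np.float32)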
Now, given that I want to tune hyperparameters at scale and try other algorithms, I am using Ray Tune and RLlib. As a first step, I set up a DQN with RLlib to compare the results, but the agent learns to do nothing instead of buying and selling to make a profit. I have tried much longer exploration (up to 50,000 episodes instead of 3,000), different hyperparameters, and much longer training, but the result is always the same. The environment and training data are exactly the same as in the PyTorch implementation. Below is a picture of the agent's mean reward just converging to zero in RLlib:
I am defining the agent as such:
# Trainer
from ray.rllib.algorithms.dqn import DQNConfig

agent = (DQNConfig()
         .environment('VPS-custom', env_config=config)
         .framework('torch')
         .rollouts(num_rollout_workers=35)
         .training(model={'fcnet_hiddens': [64]},
                   gamma=0.999,
                   lr=0.0005,
                   target_network_update_freq=25 * 24,
                   tau=0.01)
         .exploration(explore=True))
agent.exploration_config.update({'type': 'EpsilonGreedy',
                                 'initial_epsilon': 1.0,
                                 'final_epsilon': 0.02,
                                 'epsilon_timesteps': 24 * 3_000})
agent = agent.build()  # build the DQN algorithm from the config
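For completeness, the 'VPS-custom' name is registered with Tune before the config is built, roughly like this (a sketch; TradingEnvSketch stands in for my actual environment class):

from ray.tune.registry import register_env

# Make the custom env available to RLlib under the name used in .environment()
register_env('VPS-custom', lambda env_config: TradingEnvSketch(env_config))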
I then proceed with training like so:
import itertools

from IPython.display import clear_output

actions = []
reward_mean = []
episodes_total = []

for idx in itertools.count(1):
    train_info = agent.train()
    reward_mean.append(train_info['episode_reward_mean'])
    episodes_total.append(train_info['episodes_total'])
    clear_output(wait=True)
    print(f"Training iteration: {idx}")
    if reward_mean[-1] > 30_000:
        print("Exiting training")
        break
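For context, the eventual goal is to sweep hyperparameters over this same config with Ray Tune, roughly along these lines (a sketch using the Tuner API; the grid-searched values are just examples, and the stop criterion mirrors the 30,000 threshold above):

from ray import air, tune

# Sweep a couple of hyperparameters over the same DQN setup (sketch only)
param_space = (DQNConfig()
               .environment('VPS-custom', env_config=config)
               .framework('torch')
               .training(lr=tune.grid_search([5e-4, 1e-3]),
                         gamma=0.999)
               .to_dict())

tuner = tune.Tuner(
    'DQN',
    param_space=param_space,
    run_config=air.RunConfig(stop={'episode_reward_mean': 30_000}),
)
results = tuner.fit()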
Is there an obvious mistake in my approach? If not, why can't I get the same results with RLlib's DQN as with my own PyTorch implementation, which solves the problem in about 5 minutes?
Thank you very much for your help; it's much appreciated, as I'm at a dead end here on my own.