DQN in RLlib not leading to the same results as Vanilla PyTorch Implementation

Dear Ray Team,

I am working on a project with a custom environment in which the agent can either buy, sell, or do nothing at any given hour of a single day (24 hours). I have a vanilla DQN implementation in PyTorch that solves this problem fairly quickly through trial and error (3,000 episodes of epsilon-greedy exploration, about 5 minutes).
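
For context, the environment follows the standard Gym/Gymnasium interface. Below is a heavily simplified sketch of what it looks like (the class name, observation features, and reward logic are placeholders, not my actual code):

import gymnasium as gym
import numpy as np


class TradingEnvSketch(gym.Env):
    """Sketch only: one episode = 24 hourly steps, 3 discrete actions."""

    def __init__(self, env_config=None):
        self.action_space = gym.spaces.Discrete(3)  # 0 = do nothing, 1 = buy, 2 = sell
        # Placeholder observation: a handful of market features for the current hour.
        self.observation_space = gym.spaces.Box(-np.inf, np.inf, shape=(5,), dtype=np.float32)
        self.hour = 0

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        self.hour = 0
        return self._obs(), {}

    def step(self, action):
        reward = 0.0  # real profit/loss calculation omitted in this sketch
        self.hour += 1
        terminated = self.hour >= 24
        return self._obs(), reward, terminated, False, {}

    def _obs(self):
        return np.zeros(5, dtype=np.float32)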

Now, since I want to tune hyperparameters at scale and try other algorithms, I am using Ray Tune and RLlib. As a first step, I set up a DQN with RLlib to compare the results, but the agent seems to learn to do nothing instead of buying and selling to reap a profit. I have tried much more extensive exploration (up to 50,000 episodes instead of 3,000), changed hyperparameters, and trained for a much longer time, but the result is always the same. The environment and training data are exactly the same as for the PyTorch implementation. Below is a picture of the agent simply converging to zero in RLlib:

I define the agent as follows:

# Trainer
from ray.rllib.algorithms.dqn import DQNConfig

dqn_config = (DQNConfig()
              .environment('VPS-custom', env_config=config)
              .framework('torch')
              .rollouts(num_rollout_workers=35)
              .training(model={'fcnet_hiddens': [64]},
                        gamma=0.999,
                        lr=0.0005,
                        # target net update every 25 episodes x 24 steps
                        target_network_update_freq=25*24,
                        tau=0.01)
              .exploration(explore=True))

# Anneal epsilon from 1.0 to 0.02 over 3,000 episodes x 24 steps.
dqn_config.exploration_config.update({'type': 'EpsilonGreedy',
                                      'initial_epsilon': 1.0,
                                      'final_epsilon': 0.02,
                                      'epsilon_timesteps': 24*3_000})

# build() returns the Algorithm instance that is actually trained.
agent = dqn_config.build()
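
(In case it matters: the custom environment is registered under the name 'VPS-custom' before building the agent, roughly like this; VPSEnv is a stand-in for my actual environment class.)

from ray.tune.registry import register_env

# Make the env available to RLlib under the string name used in .environment() above.
register_env('VPS-custom', lambda env_config: VPSEnv(env_config))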

And then proceed with training like so:

import itertools
from IPython.display import clear_output

actions = []
reward_mean = []
episodes_total = []

for idx in itertools.count(1):
    train_info = agent.train()
    reward_mean.append(train_info['episode_reward_mean'])
    episodes_total.append(train_info['episodes_total'])

    clear_output(wait=True)
    print(f"Training iteration: {idx}")

    # Stop once the mean episode reward crosses the target.
    if reward_mean[-1] > 30_000:
        print("Exiting training")
        break

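For reference, this is roughly how I check which actions the trained agent picks (a sketch: env is assumed to be an instance of the same custom environment, and I use RLlib's Algorithm.compute_single_action):

# Greedy rollout of one episode with the trained agent, recording the chosen actions.
obs, _ = env.reset()
terminated = truncated = False
while not (terminated or truncated):
    action = agent.compute_single_action(obs, explore=False)
    actions.append(action)
    obs, reward, terminated, truncated, _ = env.step(action)
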
Is there an obvious mistake in my approach? If not, why can't I get the same results with RLlib's DQN as with my own PyTorch implementation, which solves the problem in about 5 minutes?

Thank you very much for your help; it's much appreciated, as I'm at a dead end here on my own.