RLlib slows down when gpu available but not used

Hello everybody,

I was installing the CUDA drivers on my laptop to run RLlib on the GPU and noticed some strange behavior. I have not found an existing topic on this, so if there is one, please point me to it. I am using Ray 1.1.0. The issue can be reproduced with the following example:

import ray
from ray.rllib.agents.registry import get_agent_class
from ray.tune.logger import pretty_print

training_iterations = 10
method = 'DQN'
config = {
    "log_level": "WARN",
    "num_workers": 3,
    "num_envs_per_worker": 8,
    "dueling": True,
    "double_q": True,
    "train_batch_size": 128,
    "model": {"fcnet_hiddens": [128, 64]},
    "env": "CartPole-v0",
    "num_gpus_per_worker": 0,
    "num_gpus": 0,
}

ray.init()
cls = get_agent_class(method)
trainer = cls(config=config)
for i in range(training_iterations):
    result = trainer.train()
    print(pretty_print(result))

With the CUDA drivers installed, around 986 samples/s are achieved, as seen in the image below:

If I delete the cudnn64_8.dll file, which TensorFlow needs to enable GPU support, throughput rises to about 2175 samples/s:

The weirdest part is that the configuration explicitly tells RLlib not to use the GPU ("num_gpus": 0 and "num_gpus_per_worker": 0). Is there another parameter I am missing? Is this considered normal behavior?
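As a possible workaround (a sketch, not a confirmed fix): TensorFlow respects the CUDA_VISIBLE_DEVICES environment variable at process startup, so the GPU can be hidden from it entirely, independent of RLlib's num_gpus settings. The variable has to be set before TensorFlow is first imported (RLlib imports it internally):

```python
import os

# Hide all CUDA devices from TensorFlow before it (or RLlib, which
# imports it) is loaded in this process. "-1" means TF sees no GPU
# at all and falls back to its CPU code paths.
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"
```

Whether the rollout workers pick this up as well is an assumption on my part; they run in separate processes, so setting the variable before ray.init() (or on the shell level before launching the script) may be needed for them to inherit it.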