[rllib] gpu sampling memory and performance issues

I am using rllib with a large Transformer model, which is why I want to use the GPU for sampling.
However, as far as I know, rllib only supports one sampling thread per GPU model copy. This causes GPU OOM errors when I want to use all of my 24 CPU cores: rllib then wants to create 24 GPU models, and each of them uses a few GB of memory.
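For reference, here is a minimal sketch of the kind of config I mean (standard trainer config keys; the env name and the exact values are only illustrative):

```python
import ray
from ray import tune

ray.init()

# Sketch only: with one rollout worker per CPU core and a GPU slice per
# worker, every worker builds its own copy of the model on the GPU.
config = {
    "env": "CartPole-v0",           # stand-in for my real environment
    "num_workers": 24,              # one rollout worker per CPU core
    "num_envs_per_worker": 1,
    "num_gpus": 1,                  # GPU for the learner/driver
    "num_gpus_per_worker": 1 / 24,  # each worker still loads a full GPU model,
                                    # which is where the few GB per worker go
}

tune.run("PPO", config=config, stop={"timesteps_total": 100000})
```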

Did I overlook an option in rllib, or is someone currently working on this problem?

I would imagine a setup where the num_cpus_per_worker option creates that many actors, each of which drives several environments, and all of these actors receive their actions from one central GPU model (see the sketch below).
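To make the idea concrete, here is a rough sketch of that pattern written with plain Ray actors rather than any existing rllib option (the GPUPolicy and RolloutWorker names and the dummy observations/actions are placeholders):

```python
import numpy as np
import ray

ray.init()  # assumes a machine with at least one GPU and 24 CPU cores

@ray.remote(num_gpus=1)
class GPUPolicy:
    """The single actor that holds the one GPU copy of the transformer."""
    def compute_actions(self, observations):
        # Placeholder for a real batched forward pass through the model.
        return [0 for _ in observations]

@ray.remote
class RolloutWorker:
    """CPU-only actor that drives several environments."""
    def __init__(self, policy, num_envs=4):
        self.policy = policy        # handle to the shared GPU actor
        self.num_envs = num_envs

    def sample(self, steps):
        count = 0
        for _ in range(steps):
            obs = [np.zeros(8) for _ in range(self.num_envs)]  # dummy observations
            # Every worker asks the same central GPU model for its actions.
            actions = ray.get(self.policy.compute_actions.remote(obs))
            count += len(actions)
        return count

policy = GPUPolicy.remote()
workers = [RolloutWorker.remote(policy, num_envs=4) for _ in range(24)]
print(ray.get([w.sample.remote(10) for w in workers]))
```

In practice the central actor would also need to batch requests from many workers to keep the GPU busy, but the memory picture is the point here: only one copy of the model lives on the GPU.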