How to reduce GPU memory consumption overhead of actor workers

I am currently using ray[default] 1.3.0 with pytorch 1.3.1 to implement multi-agent reinforcement learning, with each agent running in one Ray actor and all agents sharing a single GPU. The problem I ran into is that no matter how small the neural network or the batch size is, each worker always takes at least 1 GB of GPU memory. This seems like a waste of resources; how can I avoid it?

Can you share a bit more context? What does your configuration look like? What does the training code look like? What kind of GPU are you using?

Also cc @sven1977 for rllib and @ericl for resource allocation

Thank you for your reply! I am using Nvidia Tesla V100 GPUs with CUDA 11.1. The actor class looks like this:

import ray
import torch

@ray.remote(num_gpus=1/8, num_cpus=1)
class Worker(object):
    """
    A ray actor wrapper class for multiprocessing
    """
    def __init__(self, agent_fn, device, **args):
        self.device = torch.device(device)
        self.instance = agent_fn(**args).to(self.device)

    def roll(self, **data):
        return self.instance.roll(**data)

    def updateP(self, **data):
        return self.instance.updateP(**data)

    def updateQ(self, **data):
        self.instance.updateQ(**data)

    def _evalQ(self, **data):
        return self.instance._evalQ(**data)

    def updatePi(self, **data):
        self.instance.updatePi(**data) 

    def act(self, s, deterministic=False, output_distribution=False):
        return self.instance.act(s, deterministic, output_distribution)

And the Ray initialization config is:

os.environ['RAY_OBJECT_STORE_ALLOW_SLOW_STORAGE']='1'
ray.init(ignore_reinit_error = True, num_gpus=1)
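
With this setup Ray can pack up to eight `num_gpus=1/8` actors onto the single visible GPU, but each actor still runs in its own worker process. The hypothetical probe below (my own sketch, not part of the training code) makes that visible:

import os
import ray

# Assumes ray.init(num_gpus=1) from the snippet above has already run;
# ignore_reinit_error makes the call safe to repeat.
ray.init(ignore_reinit_error=True, num_gpus=1)

@ray.remote(num_gpus=1/8)
class GpuProbe:  # hypothetical helper, only for inspection
    def info(self):
        # ray.get_gpu_ids() lists the GPUs assigned to this actor; Ray also sets
        # CUDA_VISIBLE_DEVICES for the worker process accordingly.
        return ray.get_gpu_ids(), os.environ.get("CUDA_VISIBLE_DEVICES"), os.getpid()

probes = [GpuProbe.remote() for _ in range(8)]
print(ray.get([p.info.remote() for p in probes]))
# Expected: every actor reports GPU 0 but a different PID, i.e. eight separate
# processes, and therefore eight separate CUDA contexts once they touch the GPU.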

The GPU memory is consumed immediately after the workers are initialized, so I assume the training code is irrelevant. The workers are created like this:

    self.agents = []
    for i in range(n_agent):
        agent = Worker.remote(agent_fn=agent_fn, device=device, logger=logger.child(f"{i}"), env=env, **agent_args)
        self.agents.append(agent)
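
To rule out the model itself, here is a minimal reproduction sketch with a hypothetical `EmptyWorker` (my own illustration, not part of the training code) that only creates a CUDA context and a single one-element tensor per actor. It should make it easy to check whether the ~1 GiB per process shows up even without any model loaded:

import ray
import torch

# Assumes Ray has already been initialized with num_gpus=1 as shown above.
@ray.remote(num_gpus=1/8, num_cpus=1)
class EmptyWorker:  # hypothetical class, not part of the training code
    def __init__(self):
        # Moving any tensor to the GPU forces CUDA context creation in this process.
        self.t = torch.zeros(1, device="cuda")

    def allocated_mib(self):
        # Memory held by PyTorch tensors only; the CUDA context is not counted here.
        return torch.cuda.memory_allocated() / 1024 ** 2

workers = [EmptyWorker.remote() for _ in range(8)]
print(ray.get([w.allocated_mib.remote() for w in workers]))
# If these numbers are near zero while nvidia-smi still reports a large fixed
# amount per process, the overhead is per-process (CUDA context plus framework
# initialization) rather than tied to the network or batch size.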

Also, when I run the same code on Nvidia P100 GPUs with CUDA 10.2, the overhead per worker drops to about 700 MiB, down from 1085 MiB with the V100 and CUDA 11.1.
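
To compare the two machines directly, a small standalone script (my own sketch, assuming the pynvml package is installed) can measure the bare per-process CUDA footprint without Ray at all: it reads the device's used memory via NVML before and after creating a single CUDA context with PyTorch.

import pynvml
import torch

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

def used_mib():
    # Total memory in use on GPU 0 as reported by the NVIDIA driver, in MiB.
    return pynvml.nvmlDeviceGetMemoryInfo(handle).used / 1024 ** 2

before = used_mib()
torch.zeros(1, device="cuda")   # triggers CUDA context creation in this process
after = used_mib()
print(f"per-process CUDA context overhead: ~{after - before:.0f} MiB")

Running this on both the V100/CUDA 11.1 machine and the P100/CUDA 10.2 machine should show whether the difference between 1085 MiB and ~700 MiB is simply the size of the bare context on each driver/CUDA combination.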