I am currently using ray[default] 1.3.0 with pytorch 1.3.1 to implement multi-agent reinforcement learning, with each agent running in its own Ray actor and all agents sharing a single GPU. The problem I ran into is that no matter how small the neural network or the batch size is, each worker always takes at least 1 GB of GPU memory. This seems like a waste of resources; how can I avoid it?
Can you share a bit more context? What does your configuration look like? What does the training code look like? What kind of GPU are you using?
Also cc @sven1977 for rllib and @ericl for resource allocation
Thank you for your reply! I am using Nvidia Tesla V100 GPUs with CUDA 11.1. The actor class looks like this:
# Each worker reserves 1/8 of a GPU and 1 CPU from Ray's scheduler.
@ray.remote(num_gpus=1/8, num_cpus=1)
class Worker(object):
    """
    A ray actor wrapper class for multiprocessing
    """
    def __init__(self, agent_fn, device, **args):
        self.device = torch.device(device)
        self.instance = agent_fn(**args).to(self.device)

    def roll(self, **data):
        return self.instance.roll(**data)

    def updateP(self, **data):
        return self.instance.updateP(**data)

    def updateQ(self, **data):
        self.instance.updateQ(**data)

    def _evalQ(self, **data):
        return self.instance._evalQ(**data)

    def updatePi(self, **data):
        self.instance.updatePi(**data)

    def act(self, s, deterministic=False, output_distribution=False):
        return self.instance.act(s, deterministic, output_distribution)
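For debugging, a standalone helper along these lines (report_torch_memory is just an illustrative name, not something in my actual code) could be called inside each worker to compare PyTorch's own tensor allocations with what nvidia-smi reports for the worker process:

import torch

def report_torch_memory(device):
    # Bytes currently held by tensors on `device` according to PyTorch's caching
    # allocator. The per-process CUDA context is not included, so the gap between
    # this number and what nvidia-smi shows for the worker process is fixed
    # per-worker overhead rather than anything allocated by the model.
    return torch.cuda.memory_allocated(device)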
And the Ray initialization config is:
os.environ['RAY_OBJECT_STORE_ALLOW_SLOW_STORAGE'] = '1'
ray.init(ignore_reinit_error=True, num_gpus=1)
The memory is consumed immediately after the workers are initialized, so I assume the training code is irrelevant. The workers are created like this:
self.agents = []
for i in range(n_agent):
    agent = Worker.remote(agent_fn=agent_fn, device=device, logger=logger.child(f"{i}"), env=env, **agent_args)
    self.agents.append(agent)
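As a sanity check on the fractional GPU request (the snippet below is only illustrative, not part of my training code), one can confirm which GPU Ray assigns to a worker, since Ray sets CUDA_VISIBLE_DEVICES for each worker process:

import ray
import torch

@ray.remote(num_gpus=1/8, num_cpus=1)
def check_gpu_assignment():
    # Ray restricts CUDA_VISIBLE_DEVICES for this process, so these ids are the
    # GPUs this task is allowed to use; PyTorch should see the same device.
    return ray.get_gpu_ids(), torch.cuda.is_available()

print(ray.get(check_gpu_assignment.remote()))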
Also, when I run the same code on Nvidia P100 GPUs with CUDA 10.2, the per-worker overhead drops to about 700 MiB, compared to 1085 MiB on the V100 with CUDA 11.1.
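To check how much of this is Ray-specific versus just the per-process CUDA context, a bare PyTorch script like the following (my own illustrative check, run outside Ray) can be left running while its footprint is inspected with nvidia-smi:

import time
import torch

if __name__ == "__main__":
    # Allocate a single tiny tensor so that the CUDA context is created. Nearly
    # all of the memory nvidia-smi then attributes to this process is the CUDA
    # context plus PyTorch's loaded CUDA kernels, not the tensor itself.
    x = torch.zeros(1, device="cuda:0")
    torch.cuda.synchronize()
    print("bytes tracked by PyTorch:", torch.cuda.memory_allocated())
    # Keep the process alive long enough to inspect it with nvidia-smi.
    time.sleep(60)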