Ray Actor not utilising GPU

How severe does this issue affect your experience of using Ray?

  • Medium: It contributes to significant difficulty to complete my task, but I can work around it.

Hello there. I’m a bit confused: from my understanding, we can assign fractional GPUs to Ray actors, and as long as we’re not doing anything too fancy it should be fine.

I’m running an RTX 4090 with 24564 MiB of memory on a local machine. On it I’m running two Ray actors: a “policy server” that wraps a neural network for inference calls, and a training function that trains a secondary network. Both use @ray.remote(num_gpus=0.25).

When viewing nvidia-smi I can see that the training network is fully on the device with memory allocated to it, while the policy server is recognised in nvidia-smi but with 0 MiB.

When I look at the dashboard, I see that it’s not using the GPU.

Screenshot attached to help.

I’m not really sure why, though. Any advice would be greatly appreciated.

Ray resources are logical. When you assign num_gpus, Ray only does bookkeeping and sets CUDA_VISIBLE_DEVICES and the like. To really use GPU memory you need to do it manually in code. Did you actually use the GPU in the Policy_Server code?
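For example, here is a minimal sketch of the difference (the GpuProbe name is just illustrative, assuming PyTorch and a single local GPU):

import os

import ray
import torch

@ray.remote(num_gpus=0.25)
class GpuProbe:
    def report(self):
        # Ray's bookkeeping: the logical GPU granted to this actor and the
        # CUDA_VISIBLE_DEVICES it set for this worker process.
        info = {
            "ray_gpu_ids": ray.get_gpu_ids(),
            "cuda_visible_devices": os.environ.get("CUDA_VISIBLE_DEVICES"),
            "allocated_before": torch.cuda.memory_allocated(),
        }
        # Actual GPU memory is only used once the code puts something on the device.
        _ = torch.zeros(1024, 1024, device="cuda")
        info["allocated_after"] = torch.cuda.memory_allocated()
        return info

ray.init(num_gpus=1)
probe = GpuProbe.remote()
print(ray.get(probe.report.remote()))

torch.cuda.memory_allocated() only counts tensor allocations, so it won’t match nvidia-smi exactly, but allocated_after should become non-zero once something is actually placed on the device.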

Hi, thanks for responding.

I believe so. The policy server wraps a neural network and moves it to the GPU, and then another worker calls inference on its data remotely.

At the moment I’m really only working on a skeleton of the full setup, so I can post the exact code:

from pathlib import Path
from typing import Dict, Tuple, Union
import threading

import numpy as np
import ray
import torch as t
from torch import Tensor


@ray.remote(num_gpus=0.25)
class Policy_Server:
    def __init__(self, device):
        self.device = device
        self.model = net().to(self.device)  # net() is my model class, defined elsewhere
        self.model.eval()
        self.lock = threading.Lock()  # guards parameter updates against concurrent inference

    def update_model_parameters(self, state_dict: Dict[str, Tensor]) -> str:
        with self.lock:
            self.model.load_state_dict(state_dict=state_dict)
            return "Model Parameters updated..."

    async def inference(
        self, observation: Union[Tensor, np.ndarray]
    ) -> Tuple[np.ndarray, np.ndarray]:
        # print(self, observation.shape)
        batch = t.tensor(observation).float().to(self.device)
        with t.no_grad():
            p, v = self.model(batch)
            return p.detach().cpu().numpy(), v.detach().cpu().numpy().flatten()

    def model_init(self, path: Path = Path("./Checkpoints/best_model.pth")):
        self.update_model_parameters(t.load(path, weights_only=True))

and

import asyncio

import numpy as np
import ray


@ray.remote
class self_play_worker:
    def __init__(self, policy_server, buffer, num_boards, num_reads, index):
        self.policy_server = policy_server
        self.buffer = buffer
        self.num_boards = num_boards
        self.num_reads = num_reads
        self.running = True
        self.index = index

    async def self_play(self):
        """Below is dummy code."""
        while self.running:
            data = []
            range_of_index = np.random.randint(5, 20)
            for _ in range(range_of_index):
                s = np.random.randint(0, 2, size=(111, 8, 8))
                ref = self.policy_server.inference.remote(s)
                # ObjectRefs are directly awaitable inside async actors,
                # so `p, v = await ref` would also work without the wrap.
                fut: asyncio.Future = asyncio.wrap_future(ref.future())
                p, v = await fut
                data.append(
                    np.hstack(
                        (s.flatten(), np.argmax(p, axis=1).flatten(), v.flatten())
                    )
                )
            result = np.vstack(data)

            await self.buffer.add.remote(result, self.index)
            # asyncio.sleep rather than time.sleep, so the actor's event loop isn't blocked
            await asyncio.sleep(np.random.randint(5, 20))
        # return False

    def stop_play(self):
        self.running = False

    def start_play(self):
        self.running = True

The only thing I can think of is the future wrap in self_play_worker, but I don’t see why that would matter.

device is defined in main and is passed to both the Policy_Server and the Trainer. The Trainer has memory allocated; the Policy_Server doesn’t, it seems.

Hi, how is your device passed to Policy_Server? Based on num_gpus=0.25 you will only be assigned a single entry in CUDA_VISIBLE_DEVICES, so you need to pass in cuda:0 to make it work.
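A rough sketch of what I mean (the class name is just illustrative): resolve the device inside the actor, where the single GPU Ray exposes to it is always cuda:0:

import ray
import torch

@ray.remote(num_gpus=0.25)
class PolicyServerSketch:
    def __init__(self):
        # Inside this worker, CUDA_VISIBLE_DEVICES contains exactly one GPU,
        # so from the actor's point of view it is always cuda:0.
        self.device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

    def where(self):
        return str(self.device), ray.get_gpu_ids()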

Hi there, I pass device via

device = t.device("cuda" if t.cuda.is_available() else "cpu")

specifically:

import time

import ray
import torch as t

# Policy_Server, ReplayBuffer, self_play_worker and ray_trainer are defined above in the same file.

if __name__ == "__main__":
    print("Test")
    device = t.device("cuda" if t.cuda.is_available() else "cpu")

    ray.init(num_gpus=1)

    num_of_workers = 5
    capacity = 10000

    # print("Starting Policy Server and Replay Buffer")
    ps = Policy_Server.remote(device=device)
    time.sleep(1)
    buffer = ReplayBuffer.options(max_concurrency=2).remote()

    print("Starting workers")
    sp_workers = [
        self_play_worker.remote(
            policy_server=ps, buffer=buffer, num_boards=0, num_reads=0, index=i
        )
        for i in range(num_of_workers)
    ]
    trainer = ray_trainer.remote(buffer=buffer, device=device)
    generators = [worker.self_play.remote() for worker in sp_workers]

    trainer.begin_training.remote()
    input("Press any key to exit...\n")
    ray.shutdown()
    print("Shutting down...")
    while ray.is_initialized():
        time.sleep(10)
    print("Shutdown complete...")

Changing ‘cuda’ to ‘cuda:0’ didn’t seem to make a difference.

Hi, this looks good. Can you print this in __init__ and in inference?

import torch

# Check if the model is on CPU or CUDA
device = next(self.model.parameters()).device
print(f"Model is on device: {device}")

# Calculate memory used by model parameters (in bytes)
model_memory = sum(param.element_size() * param.nelement() for param in self.model.parameters())
print(f"Memory used by model parameters: {model_memory} bytes")

# If on CUDA, you can also check the overall GPU memory usage
if device.type == "cuda":
    allocated_memory = torch.cuda.memory_allocated(device=device)
    reserved_memory = torch.cuda.memory_reserved(device=device)
    print(f"Allocated GPU memory: {allocated_memory} bytes")
    print(f"Reserved GPU memory: {reserved_memory} bytes")

Thanks for the response. Here are the requested prints:

(Policy_Server pid=111436) __init__:Model is on device: cuda:0
(Policy_Server pid=111436) __init__:Memory used by model parameters: 343194896 bytes
(Policy_Server pid=111436) __init__:Allocated GPU memory: 343408128 bytes
(Policy_Server pid=111436) __init__Reserved GPU memory: 364904448 bytes

Starting workers
(Policy_Server pid=111436) inference: Model is on device: cuda:0
(Policy_Server pid=111436) inference: Memory used by model parameters: 343194896 bytes
(Policy_Server pid=111436) inference: Allocated GPU memory: 343408128 bytes
(Policy_Server pid=111436) inference: Reserved GPU memory: 364904448 bytes

Which is odd; nvidia-smi and the dashboard still show nothing.

Odd. This means it’s on CUDA and takes memory, but nvidia-smi can’t see it. Maybe this is an issue with CUDA’s stats? Does this really affect the real workload?
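One way to narrow it down (a rough sketch; report_gpu_usage is a made-up helper, and it assumes nvidia-smi is on the PATH in the worker’s environment) is to print the actor process’s PID inside inference and compare it against nvidia-smi’s per-process table:

import os
import subprocess

import torch

def report_gpu_usage(tag: str) -> None:
    # PID of the Ray worker process hosting this actor; this is the PID
    # to look for in nvidia-smi's process table.
    print(f"{tag}: pid={os.getpid()}, allocated={torch.cuda.memory_allocated()} bytes")
    # Per-process GPU memory as the NVIDIA driver reports it.
    out = subprocess.run(
        ["nvidia-smi", "--query-compute-apps=pid,used_memory", "--format=csv"],
        capture_output=True,
        text=True,
    )
    print(out.stdout)

If the actor’s PID shows up there with non-zero memory, the model really is resident and the summary views are just attributing it oddly; if it doesn’t, the allocation is happening in a different process than the one you’re watching.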