Minimizing loading time - using GPUs

valentina · May 23, 2023, 11:23am

Hi,

I am using Ray to parallelize inference, and I store the model in the object store (via ray.put). I wonder whether the model is automatically stored on the GPU if I initialize Ray with GPUs, or whether it would make sense to move it to the GPU myself (via model.to(device_gpu)).
My goal is to minimize the time required to load the model. However, I also have multiple GPUs, so I also wonder whether this means having to copy the model to every GPU.

I’d appreciate any help, cheers!

Related topics

- Optimizing Real-Time ML Model Serving with Ray Serve on AWS GPU Cluster: Best Practices and Resource Allocation Strategies (Ray Data): 0 replies, 199 views, April 18, 2024
- Utilising Ray for Simple Parallelism (Batch Inference): 1 reply, 905 views, March 28, 2023
- [Core] Question on optimizing machine learning project speed using ray (Ray Core): 5 replies, 462 views, February 1, 2022
- Ray Serve Model Worker Replicas Created But GPU Usage is 0% during Inference (Ray Serve): 7 replies, 948 views, January 19, 2022
- Increase efficiency using PyTorch + GPU for inference (Ray Core): 1 reply, 725 views, July 17, 2022