This works. However, when I look at nvidia-smi, I can see that each GPU is used at only 12% at most. So I was wondering: why not set num_replicas to 4 and num_gpus to 0.5? Unfortunately, that does not work.
Can you explain why? (Note: I am using a single-node cluster for the moment.)
Edit: I am testing this configuration by sending a bunch of ~1000 HTTP requests at the same time to the server.
Thanks for reporting this. Could you please share more details about the error you’re running into? I think what you’re describing should work. One guess is that you don’t have 24 CPUs available on your machine, so there aren’t enough resources for 4 replicas.
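To make that guess concrete, here is a minimal sketch of the arithmetic behind "not enough resources". It is not Ray's scheduler, only the sum a request has to pass. The function name `replicas_fit` and all the numbers below (6 CPUs per replica, 15 CPUs, 2 GPUs) are assumptions for illustration; substitute whatever your deployment actually requests.

```python
def replicas_fit(num_replicas, cpus_per_replica, gpus_per_replica,
                 available_cpus, available_gpus):
    """Return True if the summed CPU and GPU requests fit the available totals.

    Hypothetical helper: it only compares totals and ignores how
    fractional GPU requests are packed onto individual devices.
    """
    needed_cpus = num_replicas * cpus_per_replica
    needed_gpus = num_replicas * gpus_per_replica
    return needed_cpus <= available_cpus and needed_gpus <= available_gpus


# Assumed numbers: 4 replicas at 6 CPUs each need 24 CPUs in total.
print(replicas_fit(4, 6, 0.5, available_cpus=15, available_gpus=2))
# Lowering the per-replica CPU request to 3 needs only 12 CPUs.
print(replicas_fit(4, 3, 0.5, available_cpus=15, available_gpus=2))
```

With these assumed numbers the first call fails on CPUs, not GPUs, which is why the GPU fraction alone may not be the problem.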
Actually, I have another issue: I am not sure how to allocate the CPUs/threads to my workers.
First, when I run htop I can see 15 CPUs, but when I use only one replica and set num_cpus to 15, it does not work. The only configuration that works is 13 CPUs, and I don’t know why. It was suggested that I use ray.nodes() to investigate, but when I do that I can clearly see that Ray finds 15 CPUs.
Second, I don’t know how to set OMP_NUM_THREADS in the following command: OMP_NUM_THREADS=16 ray start --head
It is said here that “to avoid performance degradation with many workers” it should be set to 1, but I have one worker and 1 to 3 replicas. Should I increase the number of threads in that case?
Hi Tim, sorry about that. The reason you can only allocate 13 CPUs to your replica is that two CPUs are already being used internally by Ray Serve: one for the Serve Controller actor and one for the HTTP proxy actor. We should probably make this clearer somewhere. In general, you can see what’s using your CPUs by looking at the Ray Dashboard.
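As a rough sketch of that budget: the two reserved CPUs come from the explanation above, while the helper names `cpus_left_for_replicas` and `max_replicas` are hypothetical and not part of Ray.

```python
# One CPU for the Serve Controller actor and one for the HTTP proxy actor,
# as described in the reply above.
SERVE_INTERNAL_CPUS = 2


def cpus_left_for_replicas(total_cpus, reserved=SERVE_INTERNAL_CPUS):
    """CPUs remaining for replicas after Serve's internal actors are placed."""
    return max(total_cpus - reserved, 0)


def max_replicas(total_cpus, cpus_per_replica, reserved=SERVE_INTERNAL_CPUS):
    """How many replicas of a given CPU size fit in the remaining budget."""
    return int(cpus_left_for_replicas(total_cpus, reserved) // cpus_per_replica)


print(cpus_left_for_replicas(15))  # 15 detected CPUs minus 2 reserved
print(max_replicas(15, 15))        # a 15-CPU replica cannot be placed
print(max_replicas(15, 13))        # a 13-CPU replica fits exactly once
```

On a running cluster, `ray.available_resources()` reports what is actually free at that moment, so it is a better source of truth than this static arithmetic.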
I’m not sure what the best answer to the OMP_NUM_THREADS question is. @simon-mo, do you know?