If I have, say, 10 projects I would like to use Ray with, and I deploy all of them in my cluster, they will all share the same runtime. Would this influence performance somehow? I mean, I'm planning to use Ray, if it fits my needs, for quite a lot of stuff. I received an answer stating that Ray does not guarantee segregation, which is unfortunate. But what happens if I have a Kubernetes cluster and I create 5 Ray clusters in it? In that case there should be segregation, but what about performance? I've had some performance bottlenecks in the past when running too many processes.
You can certainly run multiple applications on a single Ray cluster. However, there are some limitations that exist today. In particular, there is not much isolation between the applications, so one application could hog the whole cluster or a task could use up all the CPU resources or memory.
You can also run multiple Ray clusters independently on Kubernetes with one (or more) Ray application per cluster. Then you would inherit isolation properties from Kubernetes.
@delioda79, if I understand correctly I am trying to work through a similar issue: Ray exec multiple scripts w/ tune.run() to same ray cluster - #6 by Alex
I am now exploring different methods to fire up a cluster for each process, and I would be interested to hear how using Kubernetes to manage this is working out for you.
If you’re interested in starting Ray clusters on k8s, I’d recommend checking out our fancy new Ray operator. This essentially runs the autoscaler you know and love, but as a k8s operator (instead of a process on the head node).
It is fancy and new, so feedback, bug reports, etc are welcome.
The link doesn’t exist
I’m planning, as @rkn mentioned, to run one Ray cluster per pod, so it will be local mode in each pod. They will be isolated, but I’m wondering what this will imply in terms of performance and memory, as Ray starts a Redis server for each cluster.
I guess this: The Ray Kubernetes Operator — Ray v1.2.0
I would expect that, resource-wise, there is some waste, but scheduling latency and other performance characteristics should be the same.
I’m currently exploring two options for our users:
- share a ray cluster
- one app per cluster
Well, after some playing around, we are soon pushing to production the stuff we built with Ray, and we are going to use it for most of our services. We did not see any big overhead from using Ray, but there is some, so I think we really need to be careful about the use case. I experimented with just having a database access class running as multiple actors in a FastAPI project, and that actually degraded performance. However, most of our services communicate asynchronously via RabbitMQ messages, and we use Ray to run the RabbitMQ client in one process, do some computation in other processes, and get Prometheus metrics out of the box. This performs very well.