Hi, I understand you are encouraging people to use ray jobs submission instead of using ray.client to connect to a remote cluster.
However, these are two completely different use cases. If I have very short-lived “jobs” whose flow may change according to my inputs, then the Jobs API isn’t useful for me.
I need to understand whether you plan to deprecate Ray Client in the future, because if you do, I will need to stop using Ray.
Thanks for your message! We are not currently planning to deprecate the Ray Client, but we do encourage people to use Ray Jobs where appropriate, since the Ray Client has some downsides, especially for production scenarios (e.g. library versioning between client and cluster, the need for a long-running connection between client and server, and the client needing to stay active).
We are interested in understanding Ray Client use cases better. If you could give us more insight into your use case (maybe with example code), we would very much appreciate it. In some cases there are better solutions that are more robust.
At least in our use case, ray.client(remote_addr) is useful for interactive debugging and analysis. The programmer can quickly try out different implementations of a function foo in a Jupyter notebook (running off the cluster), re-running @ray.remote def foo(...): to harness the cluster's high degree of parallelism and get data analysis results within seconds. Then the programmer can try out a new idea: change a few lines in foo, remove the decorator to run it locally for a quick test (1 second, small input), then add the decorator back and run it remotely (3-5 seconds, huge input).
Using ray job submit would require separately editing a .py file and submitting the script (which might not be syntactically correct, etc.), adding considerable friction and (human-perceived) latency during interactive debugging and analysis.
So, although I totally agree that long-running jobs are best suited to the job submit API, please consider keeping ray.client for this interactive debugging use case!
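The interactive loop described above can be sketched roughly as follows. This is only an illustration under assumptions, not the poster's actual code: `foo` here is a toy function, `run_locally` and `run_on_cluster` are hypothetical helper names, and the `ray://<head-node-host>:10001` address is a placeholder for wherever the Ray Client server is reachable.

```python
def foo(x):
    """Toy stand-in for the analysis function being iterated on."""
    return x * x


def run_locally(inputs):
    # Quick sanity check on a small input; no cluster involved.
    return [foo(x) for x in inputs]


def run_on_cluster(inputs, address):
    # Imported lazily so the local path works without touching Ray.
    import ray

    # A "ray://..." address connects through Ray Client.
    # Skip the call if this session is already connected.
    if not ray.is_initialized():
        ray.init(address=address)
    # ray.remote(foo) is the function-call form of the @ray.remote decorator,
    # so the same foo can be used both locally and remotely.
    remote_foo = ray.remote(foo)
    return ray.get([remote_foo.remote(x) for x in inputs])


if __name__ == "__main__":
    print(run_locally([1, 2, 3]))
    # On a machine that can reach the cluster (placeholder address):
    # print(run_on_cluster(range(1000), "ray://<head-node-host>:10001"))
```

Wrapping with `ray.remote(foo)` instead of editing the decorator by hand is just one way to switch between the two modes; adding and removing `@ray.remote` in the notebook cell, as described in the post, works the same way.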
Right now I create a connection after I start the cluster and keep it open for days (I haven’t yet reached the point where it actually runs for days, but the system is supposed to run for a long time).
However, every 1 minute to 1 hour (depending on the use case) I plan to close all my existing actors and start a new batch of actors (so technically I could call init each time I create the actors and close the connection afterwards, if that is a better way to go).
The actors create their state during initialization, and it isn’t supposed to change during their runtime. It is used for caching more than as state, so calling individual tasks is not useful.
While the actors are running, I get new data from an outside source every 1 sec.
When the data is received, I call several actors, both simultaneously on many different servers and one after another, with the output of one passed as input to another. (Since the additional overhead of triggering a function on an actor is in the milliseconds as far as I measured, I didn’t see a problem with that.)
When the last actor is done, I get the output in the client (small data) and pass it on to the next system.
As far as I understand, this doesn’t fit the submit jobs use case.
If I didn’t understand correctly, please let me know.
If you think Ray isn’t a fit for this use case, please also let me know.
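A rough sketch of the chained-actor flow described above, for readers who want something concrete. Everything here is an assumption made for illustration: `CachedStage`, `start_pipeline`, `run_pipeline`, and `stop_pipeline` are hypothetical names, the "cache" is a single number, and only the sequential chaining is shown, not the fan-out across servers.

```python
class CachedStage:
    """Hypothetical actor body: builds its cache once, then only reads it."""

    def __init__(self, offset):
        # Stand-in for an expensive, read-only cache built at start-up.
        self.offset = offset

    def process(self, value):
        return value + self.offset


def start_pipeline(offsets):
    # Assumes ray.init(...) has already been called, e.g. via Ray Client.
    import ray

    actor_cls = ray.remote(CachedStage)
    return [actor_cls.remote(offset) for offset in offsets]


def run_pipeline(stages, value):
    import ray

    ref = value
    for stage in stages:
        # Passing the previous ObjectRef as an argument chains the calls, so
        # the driver does not have to fetch the intermediate results itself.
        ref = stage.process.remote(ref)
    # Only the final (small) result is pulled back to the driver.
    return ray.get(ref)


def stop_pipeline(stages):
    import ray

    # Tear down the current batch of actors before starting a new one.
    for stage in stages:
        ray.kill(stage)
```

Under these assumptions, rotating actors every few minutes would be a `stop_pipeline` followed by a new `start_pipeline` on the same connection, while `run_pipeline` is called once per incoming data item.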
data ==(1)==> (Ray) ==(2)==> output ==> other system
This is organized with a driver and the driver is not run within the ray cluster, but outside of the ray cluster using ray client.
And you can’t log into your cluster and run the driver there (network setup or security issues).
Ray Client gives you one way to handle (1) and (2), i.e. to interact with what you deployed in the cluster.
For (1) and (2), if there are other ways to interact with the cluster, they might also work for you, but if the syntax is different, it’ll increase your development cost.
Sorry, I don’t really understand what (1) and (2) are. Are they tasks that need to be done on specific data?
First of all, the way I interact with the outside world is through a gRPC service, which is part of a process that interacts with the cluster through Ray Client. I am not sure how to run this process within the cluster.
I would be interested in other ways to interact with the cluster. Different syntax doesn’t automatically increase my development cost, so I would be happy to understand what they are.
@yic any suggestions though?
on how to run a process within the cluster that interacts with the outside world via gRPC,
or on working in any other way that would still fit my use case?
Ray Client is implemented on top of gRPC, so whenever you use Ray Client, the interaction between your code and the Ray cluster actually goes through gRPC.
There are some gaps for complicated use cases, especially when some context is set in the code: because Ray primitives are sent through gRPC, that context setup might not be present on the server. But I do think this is a convenient way to interact with a Ray cluster.
So for now do you suggest I keep using ray client to connect to my cluster?
If not what would you recommend?
Do you think my use case doesn’t fit the Ray system at all and I should stop using it?
Is there an option to run a script (in its own virtual environment) inside the head node,
as part of the ray start command perhaps?
This way the client would be running as part of the cluster.
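Nobody in this thread confirms an answer to this. One possibility that may be worth checking is the Ray Jobs Python SDK, whose entrypoint script runs on the head node by default as far as I know, and whose `runtime_env` can request pip packages for the job. The sketch below assumes that behaviour; `build_job_request` and `submit_driver` are hypothetical helper names, and the script path, package list, and dashboard URL are placeholders. It is not tied to `ray start`.

```python
def build_job_request(script_path, pip_packages):
    """Assemble keyword arguments for JobSubmissionClient.submit_job."""
    return {
        # Shell command that starts the driver script on the cluster.
        "entrypoint": f"python {script_path}",
        "runtime_env": {
            # Upload the current directory so the script is available remotely.
            "working_dir": ".",
            # Packages Ray should install for this job's environment.
            "pip": list(pip_packages),
        },
    }


def submit_driver(dashboard_url, script_path, pip_packages):
    # Imported lazily so the request can be built and inspected without Ray.
    from ray.job_submission import JobSubmissionClient

    client = JobSubmissionClient(dashboard_url)
    # Returns the submission ID of the newly created job.
    return client.submit_job(**build_job_request(script_path, pip_packages))
```

For example, `submit_driver("http://<head-node-host>:8265", "driver.py", ["grpcio"])` would, under these assumptions, start `driver.py` inside the cluster, where it could call `ray.init()` and host the gRPC service itself. Whether a long-lived service is a good fit for a job is a separate question that this sketch does not settle.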