How severely does this issue affect your experience of using Ray?
Medium: It contributes to significant difficulty to complete my task, but I can work around it.
In the Ray project, I found several networking technologies, such as gRPC and boost::asio. I know Plasma is an important part of the data manager. Besides data transfer, remote functions and actors need to be scheduled, and they are also transferred within the Ray cluster. These protocols confuse me.
Can someone tell me which situation each of these protocols is used in? Thanks.
Hey @xyzyx, happy to clarify these for you. You listed a few different things here, so I first want to know which protocols exactly you would like to learn more about. Would you mind sharing a bit more about your question?
Yeah, I’m interested in how data is transferred in Ray. A program usually needs two parts: data and functions. How are functions and data transferred in a Ray cluster? As shown in the whitepaper, objects are stored on nodes and transferred between nodes. How is the transfer implemented?
I tried starting Ray over InfiniBand by specifying the address and got a speedup. But this approach doesn’t make full use of IB. I want to understand how it works and optimize Ray for IB. I think this would be meaningful for clusters with IB.
If I want to do so, can you give me some advice? Could this work be merged into the Ray project?
How are functions and data transferred in a Ray cluster? As shown in the whitepaper, objects are stored on nodes and transferred between nodes. How is the transfer implemented?
Assuming you are asking a code-specific question: object transfers are handled by the object manager and related classes (e.g. PullManager, PushManager). They interact with the distributed object store underneath, the Plasma store.
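To make the pull/push idea above more concrete, here is a minimal toy sketch in plain Python. It is not Ray's implementation (the real object manager is written in C++). The names ToyStore and pull_object, the chunk size, and the in-process "transfer" are all assumptions made for illustration only.

```python
from dataclasses import dataclass, field

# Tiny chunk size so the example is easy to trace by hand (assumed value).
CHUNK_SIZE = 4


@dataclass
class ToyStore:
    """Hypothetical stand-in for one node's local object store."""

    node_id: str
    objects: dict = field(default_factory=dict)  # object_id -> bytes

    def put(self, object_id: str, data: bytes) -> None:
        self.objects[object_id] = data

    def contains(self, object_id: str) -> bool:
        return object_id in self.objects

    def read_chunks(self, object_id: str):
        """Sender side: yield (offset, chunk) pairs for the stored object."""
        data = self.objects[object_id]
        for offset in range(0, len(data), CHUNK_SIZE):
            yield offset, data[offset:offset + CHUNK_SIZE]


def pull_object(object_id: str, local: ToyStore, remotes: list) -> int:
    """Receiver side: copy object_id into `local` from the first remote holding it.

    Returns the number of chunks transferred (0 if the object was already local).
    """
    if local.contains(object_id):
        return 0
    for remote in remotes:
        if remote.contains(object_id):
            buffer = bytearray(len(remote.objects[object_id]))
            chunks = 0
            for offset, chunk in remote.read_chunks(object_id):
                buffer[offset:offset + len(chunk)] = chunk
                chunks += 1
            local.put(object_id, bytes(buffer))
            return chunks
    raise KeyError(f"{object_id} not found on any node")


node_a = ToyStore("node-a")
node_b = ToyStore("node-b")
node_a.put("obj-1", b"hello, ray!")  # 11 bytes
chunks_moved = pull_object("obj-1", node_b, [node_a])
print(chunks_moved, node_b.objects["obj-1"])
```

In this sketch the receiving node asks for an object it does not have, and the node that owns a copy sends it piece by piece. In a real cluster those pieces would travel over the network instead of being copied in memory.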
I tried starting Ray over InfiniBand by specifying the address and got a speedup. But this approach doesn’t make full use of IB. I want to understand how it works and optimize Ray for IB. I think this would be meaningful for clusters with IB.
If I want to do so, can you give me some advice? Could this work be merged into the Ray project?
Would you mind sharing a bit more detail on how Ray is not making full use of IB? Is it not reaching the throughput you expected? Also, if you already know of anything Ray could do to make better use of IB, that would be awesome to hear.
I found this in Ray’s architecture whitepaper: Ray is built on top of gRPC. If so, can I optimize gRPC with IB to accelerate Ray?
Yes, Ray uses gRPC for cross-process communication. However, personally I am not aware of how gRPC could work with IB.
The easiest way to use IB is to set the IB address when starting the Ray head node. The command looks like ray start --head --node-address=IBaddress. This way, Ray can use IB with high throughput and low latency.
But CPUs are still used to process network requests. RDMA can avoid this, so CPUs can focus on computing tasks without interruption. I want to make full use of RDMA to maximize the share of CPU time spent on computation.
I see, thanks for the elaboration. So just to make sure I understand, the requirements are:
Allowing Ray to be configured with an IB address, which I believe Ray can already do with --node-ip-address. Or does this still not work for you?
Configuring Ray with RDMA: I am not entirely sure how this could be done in Ray yet. If you know how to do it, please give me some pointers to look at. Otherwise, I will figure it out with someone on the team.
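For the first requirement, a small sketch of how the head-node command could be built from the address of the IB interface. The interface name ib0, the sample addresses, and the helper names are assumptions for illustration; the only Ray-specific parts are the ray start flags --head, --node-ip-address, and --port. Note that this only selects the IP-over-InfiniBand address, so traffic still goes through the normal TCP/IP stack. It does not enable RDMA.

```python
import re
import shlex


def ipv4_of_interface(ip_addr_output: str, interface: str):
    """Return the first IPv4 address listed for `interface`, or None.

    Expects text in the one-line-per-address format of `ip -o -4 addr show`.
    """
    pattern = re.compile(
        r"^\d+:\s+(\S+)\s+inet\s+(\d+\.\d+\.\d+\.\d+)/\d+", re.MULTILINE
    )
    for name, address in pattern.findall(ip_addr_output):
        if name == interface:
            return address
    return None


def ray_head_command(node_ip: str, port: int = 6379) -> str:
    """Build the shell command that starts a Ray head node bound to node_ip."""
    args = ["ray", "start", "--head", f"--node-ip-address={node_ip}", f"--port={port}"]
    return shlex.join(args)


# Made-up sample output; on a real machine this text would come from
# running `ip -o -4 addr show`.
SAMPLE = (
    "2: eth0    inet 192.168.1.10/24 brd 192.168.1.255 scope global eth0\n"
    "4: ib0    inet 10.0.0.5/24 brd 10.0.0.255 scope global ib0\n"
)

ib_ip = ipv4_of_interface(SAMPLE, "ib0")
command = ray_head_command(ib_ip)
print(command)
```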
Ray is a sophisticated project with many lines of code. Ray core alone already contains 300k lines of code. If I want to understand how the code is organized, can you give me some advice or point me to some diagrams I can refer to? This will help me focus on the RDMA work.