Hello, I am new to Ray and am looking for a solution to the following scenario. I would like to know whether Ray is a good fit for it.
Suppose there are multiple companies, say A and B (in reality, there could be more). They want to collaboratively compute something with their own data. For example, company A has inputs x_A and y_A, company B has inputs u_B and v_B, and they are going to compute f(x_A, y_A, u_B, v_B) = (x_A + u_B) * (y_A + v_B).
According to a Slack discussion, there are two possible options for this problem using a Ray cluster (thanks to Ray Lover and Will Drevo for the suggestions).
- They build a common Ray cluster on a cloud service provider such as AWS or Azure. Then both companies use Ray Client to pass their data to this common cluster for further computation.
- Each of them runs its own Ray cluster locally. Then each of them can use its own Ray Client to remotely access the other's Ray cluster.
Both of them should work, and I would like to hear about other options for this scenario, if there are any.
Thanks in advance!
Do you mind telling me the benefit of doing this instead of just putting the data someplace and pulling it in the script?
Hi @yic, the actual requirement is data privacy: due to law/regulation, a company's sensitive data cannot be outsourced anywhere for computation, not even to a third party.
Can the data from company A be shared with company B and vice versa? Or must the computation be completed in a privacy-preserving fashion, i.e. A never gets to see B's data in the clear and B never gets to see A's data in the clear?
Thanks @alexanderzjs, the requirement makes sense.
Maybe I misunderstood something, but I guess you want A, B to upload their data to someplace and they can’t see data the others uploaded. Once everything is ready, the system will just run the code. Is this the pattern you want to achieve?
If this is true, maybe
- They build a common Ray cluster on a cloud service provider such as AWS or Azure. Then both companies use Ray Client to pass their data to this common cluster for further computation.
For example, you can have a detached actor there, and it could perhaps have an API like:
def upload(self, param, data):
    # store each party's input under its parameter name
    self._params[param] = data
    self._result = f(self._params)
But still, to prevent data leaks, you probably need other safeguards, such as making sure the long-running actor is running the right code and that there is no easy way to log into the cluster and read the data in the object store.
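A minimal sketch of the detached-actor idea above, assuming the four inputs from the original question. The class name `JointComputation`, the helper `deploy`, the actor name, and the namespace are all hypothetical, and the actor still sees every input in plain form, so this only shows the upload pattern, not a privacy guarantee.

```python
class JointComputation:
    """Collects named inputs from several parties and evaluates f once all are present."""

    REQUIRED = ("x_A", "y_A", "u_B", "v_B")  # parameter names taken from the example f

    def __init__(self):
        self._params = {}
        self._result = None

    def upload(self, param, data):
        """Store one input; return True once the result has been computed."""
        if param not in self.REQUIRED:
            raise KeyError(f"unexpected parameter: {param}")
        self._params[param] = data
        if all(name in self._params for name in self.REQUIRED):
            p = self._params
            # f(x_A, y_A, u_B, v_B) = (x_A + u_B) * (y_A + v_B)
            self._result = (p["x_A"] + p["u_B"]) * (p["y_A"] + p["v_B"])
        return self._result is not None

    def result(self):
        return self._result


def deploy(actor_name="joint_computation", namespace="shared"):
    """Hypothetical helper: start the class above as a named, detached Ray actor."""
    import ray  # imported lazily so the class itself can be tried without Ray

    ray.init(address="auto", namespace=namespace)  # connect to an already running cluster
    actor_cls = ray.remote(JointComputation)
    return actor_cls.options(name=actor_name, lifetime="detached").remote()
```

Each company would then connect with the same namespace, look the actor up with `ray.get_actor("joint_computation")`, and call something like `actor.upload.remote("x_A", 1)`.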
In my case, it is the second case: data from either company should never be shared with the other, and the computation should be done in a privacy-preserving fashion.
Well, as I explained in my previous reply to @bentay, all computations should be done in a privacy-preserving fashion. (The security assumption here is that each company trusts only itself; they do not trust a third party, since the third party may collude with one of the companies.)
In this way, what they can do is encrypt their data/model parameters and send them to the other party's Ray cluster, which then computes over the encrypted data/model parameters. Therefore, I am seeking a way to do this.
For example, assume company A has x=10 and company B has y (x and y are already in encrypted form, for simplicity). Company A has a Ray cluster and has already put x in its object store:
Now, company B needs to retrieve x and compute x*y, so B uses a Ray client to connect to A's cluster and needs to get x from the object store like the following:
I am not sure how this can be realized, since B cannot directly look up x_ref in the object store.
For this case, I think you need to store it in a named detached actor (in a dict?), and B uses that actor to retrieve the object.
Would you mind giving a very concise code snippet to show it?
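A rough sketch of the named detached actor suggested above might look like this. The names `ObjectRegistry`, `publish`, `fetch`, the actor name, and the namespace are assumptions for illustration, and for simplicity the actor keeps the (already encrypted) value itself in a dict rather than an object reference. Both sides need to use the same namespace, otherwise the lookup by name will not find the actor.

```python
class ObjectRegistry:
    """Keeps values under agreed-upon names so another client can look them up."""

    def __init__(self):
        self._objects = {}

    def put(self, name, value):
        self._objects[name] = value

    def get(self, name):
        return self._objects[name]


def publish(name, value, actor_name="registry", namespace="shared"):
    """Run by A on its own cluster: start the detached actor and store a value."""
    import ray  # lazy import; requires Ray to be installed

    ray.init(address="auto", namespace=namespace)
    registry = ray.remote(ObjectRegistry).options(
        name=actor_name, lifetime="detached"
    ).remote()
    ray.get(registry.put.remote(name, value))  # wait until the value is stored


def fetch(name, address, actor_name="registry", namespace="shared"):
    """Run by B: connect with Ray Client and ask the named actor for the value."""
    import ray

    # address is a Ray Client URL such as "ray://<A's head node>:10001"
    ray.init(address=address, namespace=namespace)
    registry = ray.get_actor(actor_name)
    return ray.get(registry.get.remote(name))
```

With this, A would call `publish("x", 10)` and B would call `fetch("x", ...)` with the address of A's cluster, then compute x*y locally.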
Should what I mention above work (