Ray on multiple company collaborations

Hello, I am new to Ray and am looking for a solution to the following scenario. I would like to know whether Ray is a good fit for it.

Suppose there are multiple companies, say A and B (in reality, there could be more). They want to collaboratively compute something with their own data. For example, company A has inputs x_A and y_A, company B has inputs u_B and v_B, and they want to compute f(x_A, y_A, u_B, v_B) = (x_A + u_B) * (y_A + v_B).

According to a Slack discussion, there are two possible options for this problem using a Ray cluster (thanks to Ray Lover and Will Drevo for the suggestions).

  1. They build a common Ray cluster on a cloud service provider such as AWS or Azure. Then both companies use Ray Client to pass their data to this common cluster for further computation.

  2. Each of them has its own Ray cluster locally. Then each of them can use its own Ray Client to remotely access the other’s Ray cluster.

Both of them should work, and I would like to hear about other options for this scenario, if there are any.

Thanks in advance!
Alexander

Do you mind telling me the benefit of doing this instead of just putting the data someplace and pulling it in from the script?

Hi @yic, the actual requirement is data privacy: due to law/regulation, a company’s sensitive data cannot be outsourced anywhere for computation, not even to a third party.

Can the data from company A be shared with company B and vice versa? Or must the computation be completed in a privacy-preserving fashion, i.e. A never gets to see B’s data in the clear and B never gets to see A’s data in the clear?

Thanks @alexanderzjs the requirement makes sense.

Maybe I misunderstood something, but I guess you want A and B to upload their data somewhere without being able to see the data the other uploaded. Once everything is ready, the system just runs the code. Is this the pattern you want to achieve?

If this is true, maybe

  1. They build a common Ray cluster on a cloud service provider such as AWS or Azure. Then both companies use Ray Client to pass their data to this common cluster for further computation.

will work.

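For example, you can have a detached actor there, and it could perhaps have an API like:

import ray

@ray.remote
class Algo:
    def __init__(self):
        self._params = []    # [param, data] pairs uploaded so far
        self._result = None  # stays None until all inputs have arrived

    def upload(self, param, data):
        self._params.append([param, data])
        # everything_is_ready and f are placeholders for the application logic
        if everything_is_ready(self._params):
            self._result = f(self._params)

    def get_result(self):
        return self._result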

But still, to prevent data leaks, you probably need other safeguards, such as making sure the long-running actor is running the right code and that there is no easy way to log into the cluster and read the data in the object store.

In my case, it is the second case: data from either company should never be shared with the other, and the computation should be done in a privacy-preserving fashion.

Well, as I explained in my previous reply to @bentay, all computations should be done in a privacy-preserving fashion. (The security assumption here is that each company trusts only itself; they do not trust a third party, since the third party may collude with one of the companies.)

Given this, what they can do is encrypt their data/model parameters and send them to the other party’s Ray cluster, which then computes over the encrypted data/model parameters. I am looking for a way to do this.

For example, assume company A has x=10 and company B has y=20 (suppose x and y are already in encrypted form for simplicity). Company A has a Ray cluster and has already put x in its object store:

Now, company B needs to retrieve x and compute x*y, so B uses Ray Client to connect to A’s cluster and needs to get x from the object store like the following:

I am not sure how this can be realized, since B cannot directly look up x_ref in the object store.

For this case, I think you need to store it in a named detached actor (holding a dict?), and B uses that actor to retrieve the object.

Would you mind giving a very concise code snippet to show it?

Should what I mention above work (class Algo)?
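To make the named-actor suggestion concrete, here is a minimal sketch, not a vetted setup. The class name SharedStore, the actor name "shared_store", the namespace "collab", and the address placeholder are all made up for illustration. It also assumes A's head node is reachable by B through Ray Client, and it does nothing by itself to protect the data, so x would still need to be encrypted before being stored.

```python
class SharedStore:
    """Plain key/value holder; it becomes a Ray actor once wrapped with ray.remote."""

    def __init__(self):
        self._items = {}

    def put(self, key, value):
        self._items[key] = value

    def get(self, key):
        # Returns None when the key has not been published yet.
        return self._items.get(key)


def publish_on_a(key, value):
    """Run by company A on its own cluster (hypothetical helper)."""
    import ray  # imported lazily so SharedStore stays usable without Ray

    ray.init(address="auto", namespace="collab")
    store = ray.remote(SharedStore).options(
        name="shared_store", lifetime="detached"
    ).remote()
    ray.get(store.put.remote(key, value))


def fetch_from_b(key, address="ray://<head-node-of-A>:10001"):
    """Run by company B through Ray Client; the address is a placeholder."""
    import ray

    # The namespace must match the one A used, since named actors are scoped by namespace.
    ray.init(address=address, namespace="collab")
    store = ray.get_actor("shared_store")
    return ray.get(store.get.remote(key))
```

B never touches A's object store directly here: it looks the actor up by name and asks it for the value, which is the role x_ref lookup was supposed to play.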

Hi @alexanderzjs, sorry for the delayed reply.

Your case is a basic MPC algorithm or a basic federated learning scenario.
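As a rough illustration of what "basic MPC" could mean for the f(x_A, y_A, u_B, v_B) = (x_A + u_B) * (y_A + v_B) example from the first post, here is a pure-Python sketch using two-party additive secret sharing with a Beaver triple for the multiplication. Everything in it is an assumption for illustration: the function names, the modulus, and especially the beaver_triple helper, which plays a trusted dealer. That dealer does not fit the "trust only yourself" requirement, and real protocols generate triples without one. Both parties are also simulated in one process, with no Ray involved.

```python
import secrets

P = 2**61 - 1  # prime modulus; inputs and the result must stay below it


def share(value):
    """Split value into two additive shares modulo P."""
    s0 = secrets.randbelow(P)
    return s0, (value - s0) % P


def reconstruct(s0, s1):
    """Recombine two additive shares."""
    return (s0 + s1) % P


def beaver_triple():
    """Dealer stand-in: shares of random a, b and of c = a * b."""
    a, b = secrets.randbelow(P), secrets.randbelow(P)
    return share(a), share(b), share(a * b % P)


def secure_mul(s_sh, t_sh):
    """Return shares of s * t given shares of s and t."""
    a_sh, b_sh, c_sh = beaver_triple()
    # Each party masks its shares; only the masked values d and e are opened.
    d = reconstruct((s_sh[0] - a_sh[0]) % P, (s_sh[1] - a_sh[1]) % P)
    e = reconstruct((t_sh[0] - b_sh[0]) % P, (t_sh[1] - b_sh[1]) % P)
    # s*t = c + d*b + e*a + d*e; only party 0 adds the public d*e term.
    z0 = (c_sh[0] + d * b_sh[0] + e * a_sh[0] + d * e) % P
    z1 = (c_sh[1] + d * b_sh[1] + e * a_sh[1]) % P
    return z0, z1


def f_shared(x_A, y_A, u_B, v_B):
    """Compute (x_A + u_B) * (y_A + v_B) on shares; index 0 is A, index 1 is B."""
    x, y = share(x_A), share(y_A)  # A keeps x[0], y[0] and sends x[1], y[1] to B
    u, v = share(u_B), share(v_B)  # B keeps u[1], v[1] and sends u[0], v[0] to A
    s = ((x[0] + u[0]) % P, (x[1] + u[1]) % P)  # additions are purely local
    t = ((y[0] + v[0]) % P, (y[1] + v[1]) % P)
    return reconstruct(*secure_mul(s, t))
```

Neither side ever holds the other's raw inputs, only random-looking shares and the masked values d and e. In a Ray setting, each party's share arithmetic would run on its own cluster, with only shares and opened values crossing the boundary.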

I believe your concern is that Ray’s protocols are complex and hard to control, e.g. B could retrieve A’s data through any of many Ray APIs (ray.get(), f.remote(), and more).

We have proposed RayFed, a connector layer that lets users build federated learning or privacy-preserving computing applications on top of Ray.
You can also see our initial proposal page for more details on how we preserve privacy in Ray.

Hi @jovany-wang, thanks for your reply.

Yes, I was looking for something that could integrate federated learning into Ray, and I am happy to learn that you have developed such a framework. As far as I know, there is another project, secretflow/secretflow on GitHub (“A unified framework for privacy-preserving data analysis and machine learning”), which integrates MPC into Ray to support privacy-preserving applications.

Both are great and I appreciate your great contributions.

Yes. SecretFlow is now built on RayFed.