Data transfer between nodes

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

I would like to know how to measure the time of data transfer between Ray nodes. Let’s say we have two nodes.

import numpy as np
import ray
import time

ray.init(address="auto")

data = np.zeros(1024 * 1024)
o_ref = ray.put(data)  # say Ray stores the data on one node

@ray.remote
def foo(o_refs):
    t1 = time.time()
    data = ray.get(o_refs[0])
    t2 = time.time()
    print("time of data transfer between nodes =", t2 - t1)

# Say Ray schedules the task on another node, so the data has to be transferred to it.
# ray.get on the returned ref makes the driver wait until the task has run and printed the timing.
ray.get(foo.remote([o_ref]))

Is this a proper way to measure the time of data transfer between nodes? Can I even use it to measure retrieving the data from the plasma store in a single-node case?
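For the single-node case I mean something like the following sketch; I expect ray.get here to mostly measure retrieval from the local plasma store plus deserialization, since no network transfer is involved:

import time
import numpy as np
import ray

ray.init()  # single node

o_ref = ray.put(np.zeros(1024 * 1024))

t1 = time.time()
ray.get(o_ref)  # read from the local object store (plasma); zero-copy for numpy arrays
t2 = time.time()
print("local retrieval time =", t2 - t1)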

If this is not a proper way, could you please tell me the best way to measure data transfer time?

Thanks in advance!

@Alex, @sangcho, thoughts/comments?

I think this could work. Because you pass a list of object refs (rather than the ref itself), the task starts immediately, even before the object is transferred.

Keep in mind that if the o_ref is not ready, the measurement will also include the time of computation. But I don’t think that’s an issue in your script, since the object is created by ray.put.
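For example, here is a minimal sketch (the task names are just illustrative) contrasting a ref created by ray.put with a ref produced by a task that is still running; in the second case the timing inside the task also covers the producer’s computation:

import time
import numpy as np
import ray

ray.init(address="auto")

@ray.remote
def slow_producer():
    time.sleep(5)  # simulated computation
    return np.zeros(1024 * 1024)

@ray.remote
def timed_get(refs):
    t1 = time.time()
    ray.get(refs[0])  # pulls the object from its owner node (if remote) and deserializes it
    return time.time() - t1

ready_ref = ray.put(np.zeros(1024 * 1024))  # already materialized
pending_ref = slow_producer.remote()        # not finished yet

# Ref created by ray.put: the measured time is mostly transfer + deserialization.
print(ray.get(timed_get.remote([ready_ref])))

# Ref of a still-running task: the measured time also includes the ~5 s of computation.
print(ray.get(timed_get.remote([pending_ref])))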

if the o_ref is not ready, the measurement will also include the time of computation

@yic, yes, that is a good point. In my scenario there is no such issue. I wonder whether Ray itself provides any tools to measure data transfer time?

We don’t have such a tool yet. We are planning to record the data transfer time for each task state and expose it in the output of ray list tasks. It will most likely be added in the Ray 2.4 release.

Ray pulls the data only when it is needed, so one workaround for now is:

  1. Schedule foo on a remote node, i.e. a node other than the one holding the object.
  2. Measure the time taken by ray.get(o_refs[0]) inside the task. The ray.get overhead itself is very small, so the measurement is mostly data transfer overhead. A sketch of this is shown below.
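For example, here is a rough sketch of that workaround. It assumes NodeAffinitySchedulingStrategy (available in Ray 2.x) to force the task onto another node; the exact call to look up the current node ID may differ slightly between Ray versions:

import time
import numpy as np
import ray
from ray.util.scheduling_strategies import NodeAffinitySchedulingStrategy

ray.init(address="auto")

o_ref = ray.put(np.zeros(1024 * 1024))  # the primary copy lives on the driver's node

@ray.remote
def timed_pull(refs):
    t1 = time.time()
    ray.get(refs[0])  # triggers the pull from the node that holds the object
    return time.time() - t1

# Pick any alive node other than the driver's node so the object has to travel over the network.
driver_node = ray.get_runtime_context().get_node_id()
other_nodes = [n["NodeID"] for n in ray.nodes() if n["Alive"] and n["NodeID"] != driver_node]

strategy = NodeAffinitySchedulingStrategy(node_id=other_nodes[0], soft=False)
elapsed = ray.get(timed_pull.options(scheduling_strategy=strategy).remote([o_ref]))
print("approximate data transfer time:", elapsed)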

@yic, @sangcho, got it, thank you!