Data transfer between nodes

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

I would like to know how to measure the time of data transfer between Ray nodes. Let’s say we have two nodes.

import numpy as np
import ray
import time

ray.init(address="auto")

data = np.zeros(1024 * 1024)
o_ref = ray.put(data)  # say Ray stores the data on one node

@ray.remote
def foo(o_refs):
    t1 = time.time()
    data = ray.get(o_refs[0])
    t2 = time.time()
    print("time of data transfer between nodes =", t2 - t1)

# Say Ray schedules the task on another node, so the data has to be transferred to it.
# ray.get on the returned ref makes the driver wait until the task has run and printed the timing.
ray.get(foo.remote([o_ref]))

Is this a proper way to measure the time of data transfer between nodes? Can I even use it to measure retrieving the data from the plasma store in a single-node case?
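For the single-node case I mean something like the following sketch; I expect ray.get here to mostly measure retrieval from the local plasma store plus deserialization, since no network transfer is involved:

import time
import numpy as np
import ray

ray.init()  # single node

o_ref = ray.put(np.zeros(1024 * 1024))

t1 = time.time()
ray.get(o_ref)  # read from the local object store (plasma); zero-copy for numpy arrays
t2 = time.time()
print("local retrieval time =", t2 - t1)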

If this is not a proper way, could you please tell me the best way to measure data transfer time?

Thanks in advance!

@Alex, @sangcho, thoughts/comments?

I think this could work. Because you pass a list of object refs (rather than the ref itself), the task starts immediately, even before the object is transferred.

Keep in mind that if the o_ref is not ready, the measurement will also include the time of computation. But I don’t think that’s an issue in your script, since the object is created by ray.put.
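For example, here is a minimal sketch (the task names are just illustrative) contrasting a ref created by ray.put with a ref produced by a task that is still running; in the second case the timing inside the task also covers the producer’s computation:

import time
import numpy as np
import ray

ray.init(address="auto")

@ray.remote
def slow_producer():
    time.sleep(5)  # simulated computation
    return np.zeros(1024 * 1024)

@ray.remote
def timed_get(refs):
    t1 = time.time()
    ray.get(refs[0])  # pulls the object from its owner node (if remote) and deserializes it
    return time.time() - t1

ready_ref = ray.put(np.zeros(1024 * 1024))  # already materialized
pending_ref = slow_producer.remote()        # not finished yet

# Ref created by ray.put: the measured time is mostly transfer + deserialization.
print(ray.get(timed_get.remote([ready_ref])))

# Ref of a still-running task: the measured time also includes the ~5 s of computation.
print(ray.get(timed_get.remote([pending_ref])))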

if the o_ref is not ready, the measurement will also include the time of computation

@yic, yes, that is a good point. In my scenario there is no such issue. I wonder whether Ray itself provides any tools to measure data transfer time?

We don’t have such a tool yet. We are planning to record the data transfer time for each task state and expose it in the output of ray list tasks. It will most likely be added in the Ray 2.4 release.

Ray pulls the data only when it is needed, so one workaround for now is:

  1. Schedule foo on a remote node, i.e. a node other than the one holding the object.
  2. Measure the time taken by ray.get(o_refs[0]) inside the task. The ray.get overhead itself is very small, so the measurement is mostly data transfer overhead. A sketch of this is shown below.
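For example, here is a rough sketch of that workaround. It assumes NodeAffinitySchedulingStrategy (available in Ray 2.x) to force the task onto another node; the exact call to look up the current node ID may differ slightly between Ray versions:

import time
import numpy as np
import ray
from ray.util.scheduling_strategies import NodeAffinitySchedulingStrategy

ray.init(address="auto")

o_ref = ray.put(np.zeros(1024 * 1024))  # the primary copy lives on the driver's node

@ray.remote
def timed_pull(refs):
    t1 = time.time()
    ray.get(refs[0])  # triggers the pull from the node that holds the object
    return time.time() - t1

# Pick any alive node other than the driver's node so the object has to travel over the network.
driver_node = ray.get_runtime_context().get_node_id()
other_nodes = [n["NodeID"] for n in ray.nodes() if n["Alive"] and n["NodeID"] != driver_node]

strategy = NodeAffinitySchedulingStrategy(node_id=other_nodes[0], soft=False)
elapsed = ray.get(timed_pull.options(scheduling_strategy=strategy).remote([o_ref]))
print("approximate data transfer time:", elapsed)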

@yic, @sangcho, got it, thank you!