Run time of remote function depends on time elapsed since the last remote function call

I have some questions about the run time of my code. My code follows below.

``````# python 3.7.12
# ray 1.9.2
import ray
import time

# Check if ray.init has been called yet.
if not ray.is_initialized():
# measure initialization time
start_time = time.perf_counter()

# Connect to an existing Ray cluster or start one and connect to it.
ray.init(num_cpus=1,
include_dashboard=False)

duration = time.perf_counter() - start_time
print("Ray initialized!\nDuration: {:0.3f}\n".format(duration))

else:

@ray.remote(num_returns=1, num_cpus=0.1, max_calls=0)
def inner_func():
time.sleep(2)
return 1

@ray.remote(num_returns=1, num_cpus=0.1, max_calls=0)
def outer_func():
# do inner loop
list_values = ray.get([inner_func.remote() for _ in range(25)])

# do stuff
pass

for i in range(6):
print("i={}".format(i))
if i in [0, 1, 2, 5]:
time.sleep(5)
print("Sleep")
else:
print("No sleep")

start_time = time.perf_counter()
ray.get(outer_func.remote())
print("Duration: {:0.3f}\n".format(time.perf_counter() - start_time))
``````

Code description:
In my code, I initialize Ray first. I used 1 CPU for the test. After initialization, I define two functions, both with the remote decorator. These functions are used in a nested for loop. This nested for loop consists of calling outer_func() 6 times. In the function outer_func(), the function inner_func() is called 25 times. So the nested for loop is: do 6 times a for loop of 25. For the outer loop, a sleep of 5 seconds is done before the function is called for iterations 0, 1, 2, and 5. So no sleep for iterations 3 and 4.

Ray initialized!
Duration: 4.775

i=0
Sleep
Duration: 17.848

i=1
Sleep
Duration: 9.848

i=2
Sleep
Duration: 9.842

i=3
No sleep
Duration: 7.559

i=4
No sleep
Duration: 7.830

i=5
Sleep
Duration: 9.876

My observations:

1. The first time the outer_func() is called, it takes about 2 times as long as for the other times.
2. For iterations 1, 2, and 5 it takes about 10 seconds, while for iterations 3 and 4 it takes about 8 seconds. This is remarkable because prior to iterations 3 and 4 no sleep is done.

My hypotheses:

1. The first iteration takes longer because Ray is still starting up.
2. The sleep causes Ray to become inactive. As a result, after sleeping, it takes about 2 seconds longer to execute outer_func(). This is the time it takes to make Ray active again.

I would like to hear the reason for this behavior, if my hypotheses are correct and if there is documentation on this behavior.

1 Like
1. The first iteration takes longer because Ray is still starting up.

This is correct. Basically, ray starts as many worker procs as num_cpus, meaning you have only 1 process after Ray is initialized. Since your task requires 0.1 CPU, you need total 10 worker processes, so it would be delayed as much as waiting for 10 processes to be started.

1. For iterations 1, 2, and 5 it takes about 10 seconds, while for iterations 3 and 4 it takes about 8 seconds. This is remarkable because prior to iterations 3 and 4 no sleep is done.

I am not quite sure, but here’s my hypothesis

Ray has soft limit of number of processes. The soft limit is equivalent to num_cpus by default. If there are more than soft limit # of processes, Ray kills them when they are idle. That said, if you have 5 seconds sleep, all of your worker processes will be killed (9 of them), and you would have delay as much as starting 9 worker processes. If this is the case, if you set num_cpus=10 by default, and if you requires 1 CPU instead of 0.1 for each task, it might have consistent performance.

2 Likes

Hi @sangcho ,

Thank you!
I’m working with Ted on the same project where we are trying to use Ray to parallelize lots of things. Our first goal is to understand Ray and what we see in simple tests like this.

I know Ted is currently on holiday from his studies. But he will be back in a week or so. When he is back, we will try what you suggest. It sounds interesting and we will report back our findings.

Thanks again!

1 Like

Thanks! Feel free to follow up or reach out to me through slack (sangcho). I will also suggest to improve documentation here for the existing behavior

2 Likes

Hello @sangcho,

Many thanks for your reply. I will try to test your hypothesis this coming week and I will post my findings in this thread.

1 Like

I have tested the proposal of @sangcho. My new results are partly in line with his first answer and are in line with his second answer. By setting num_cpus=10 (this was 1) and changing the required num_cpus per remote function to 1 (this was 0.1), the run time is not influenced by the sleep of 5 seconds. My changed code follows below (every changed line contains the comment “# changed: …”.

``````# python 3.7.12
# ray 1.9.2
import ray
import time

# Check if ray.init has been called yet.
if not ray.is_initialized():
# measure initialization time
start_time = time.perf_counter()

# Connect to an existing Ray cluster or start one and connect to it.
ray.init(num_cpus=10,   # changed: "num_cpus=1"
include_dashboard=False)

duration = time.perf_counter() - start_time
print("Ray initialized!\nDuration: {:0.3f}\n".format(duration))

else:

@ray.remote(num_returns=1, num_cpus=1, max_calls=0)   # changed: "num_cpus=0.1"
def inner_func():
time.sleep(2)
return 1

@ray.remote(num_returns=1, num_cpus=1, max_calls=0)   # changed: "num_cpus=0.1"
def outer_func():
# do inner loop
list_values = ray.get([inner_func.remote() for _ in range(25)])

# do stuff
pass

for i in range(6):
print("i={}".format(i))
if i in [0, 1, 2, 5]:
time.sleep(5)
print("Sleep")
else:
print("No sleep")

start_time = time.perf_counter()
ray.get(outer_func.remote())
print("Duration: {:0.3f}\n".format(time.perf_counter() - start_time))
``````

Ray initialized!
Duration: 4.697

i=0
Sleep
Duration: 9.404

i=1
Sleep
Duration: 6.027

i=2
Sleep
Duration: 6.022

i=3
No sleep
Duration: 6.020

i=4
No sleep
Duration: 6.024

i=5
Sleep
Duration: 6.023

My observations:

1. The first iteration is still longer than the other iterations, but now only 3-4 seconds instead of 8 seconds (note: run time of the iterations is a bit shorter now). This could indicate that there is still something that has to be initialized or set when Ray is used for the first time. So this is partly in line with the first answer/hypothesis of @sangcho because apparently, starting the 9 extra workers is not the only thing that has to be done when Ray is used for the first time.
2. The run time is not influenced by a sleep because iterations 1, 2, 3, 4, and 5 all have a run time of about 6 seconds. So this is in line with the second answer/hypothesis of @sangcho.
3. The overhead is very small. There are 10 workers while every outer loop has to do 25 inner loops. So every iteration has 26 tasks to do (1 outer loop + 25 inner loops). The bottleneck is the sleep of 2 seconds in the inner loop. ceiling(25/10) = 3 so the minimal run time is at least 3*2 = 6 seconds.
1 Like