Possible causes of ray.get timeout overshoot

I am using ray.get with a timeout, which works most of the time. However, it occasionally (something like once per million calls to get) takes significantly longer than requested. I am giving it 38 ms, and most recently it took 100 ms instead. Most of the time, it returns within 1 ms of the requested timeout. What are the possible causes of an occasional timeout overrun? Network issues while communicating with redis to tell it to remove data? A non-real-time OS not returning control to ray in time? Any other ideas?

The code I used to detect this problem is:

    import time

    import ray
    from ray.exceptions import GetTimeoutError

    timeout = 0.038  # the 38 ms budget mentioned above

    remote = foo.remote()  # foo is the remote task under test

    try:
        start = time.perf_counter()
        ray.get([remote], timeout=timeout)
    except GetTimeoutError:
        time_taken = time.perf_counter() - start

        if time_taken > timeout * 2:
            print(time_taken, timeout)

Also note that I have the Python garbage collector disabled during this process. The system has 12 cores, many of them unused, and ample free memory. Is there anything else that would be helpful in figuring out why this is occurring?
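For reference, the GC is disabled around the latency-sensitive section roughly like this (a minimal sketch reusing `remote`, `timeout`, and the imports from the snippet above; the exact placement in my code differs):

    import gc

    gc.disable()  # suppress automatic collection while the get is in flight
    try:
        ray.get([remote], timeout=timeout)
    except GetTimeoutError:
        pass
    finally:
        gc.enable()  # restore automatic collection afterwards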

Hi @dwiel, do you have a high-level overview of what application you are running with Ray? That could help us narrow down the possible cause. At a high level, ray.get does the following:

  1. Wait for the task to be scheduled and executed;
  2. Depending on the object size, once the task finishes execution the result is either put into the plasma store or inlined as part of the RPC reply (illustrated in the sketch below);
  3. If the object can’t be inlined in the RPC, Ray may first need to allocate enough space in plasma, which can require GC and object spilling;
  4. If the task executes on a node other than the one where ray.get(object) is called, the object also needs to be transferred between the nodes, and the destination node must allocate enough space to receive the object being transferred;
  5. Finally, once the object is in the local plasma store, there is a cross-process RPC over a Unix domain socket to get the object’s address from the local plasma store into the Python process where ray.get is called.

Depending on your deployment and workload, each of these steps could take a toll on your get performance.
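To make the inline-versus-plasma distinction concrete, here is a minimal sketch (the exact size cutoff for inlining is a Ray configuration detail and varies by version, so treat “small” and “large” as illustrative):

    import numpy as np
    import ray

    ray.init()


    @ray.remote
    def small_result():
        # A tiny return value is typically inlined directly in the RPC reply
        # (steps 2-3 above), so no plasma allocation is involved.
        return 42


    @ray.remote
    def large_result():
        # A large array is typically placed in the plasma object store instead;
        # this is where allocation, GC/spilling, and node-to-node transfer
        # (steps 3-5 above) can add latency to the corresponding ray.get.
        return np.zeros((5_000, 5_000), dtype=np.float32)  # roughly 100 MB


    print(ray.get(small_result.remote()))
    print(ray.get(large_result.remote()).shape)  # fetched via the local plasma store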

Thanks for the details on what is going on under the hood in a get call. I looked through the code some, but it is helpful to have the high level overview to guide that reading.

The high-level workload is a neural network inference service. I have a single on-prem machine with 12 cores (24 virtual) as well as a recent NVIDIA GPU. The result I am fetching with ray.get is primarily produced by PyTorch running on the GPU. A few of the CPU cores are also being used by other processes (including redis), but those are constrained to use at most 12 of the 24 CPUs and have statically assigned CPU affinities. There is no other workload on the GPU. I have the Python garbage collector disabled during this call to ray.get.
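The pinning of those other services is done outside of Ray; it is roughly equivalent to the following (illustrative only, using Linux’s os.sched_setaffinity; the actual CPU sets in my setup differ):

    import os

    # Illustrative only: restrict the current process to the first 12 logical CPUs,
    # leaving the remaining cores free for the Ray workers.
    os.sched_setaffinity(0, set(range(12)))
    print(os.sched_getaffinity(0))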

As a basic test, I ran the following code overnight. It isn’t the same workload, since I wanted it to be reproducible for others.

    import random
    import time

    import numpy as np
    import ray
    from ray.exceptions import GetTimeoutError

    mean_time = 0.04


    @ray.remote
    def foo():
        # Sleep for mean_time +/- 20 ms, so roughly half the calls hit the timeout.
        time.sleep(mean_time + random.uniform(-0.02, 0.02))


    ray.init()

    times = []
    while True:
        remote = foo.remote()

        try:
            start = time.perf_counter()
            ray.get([remote], timeout=mean_time)
        except GetTimeoutError:
            pass

        time_taken = time.perf_counter() - start

        # Flag any call that overshoots the requested timeout by more than 10%.
        if time_taken > mean_time * 1.1:
            print(time_taken, mean_time)

        times.append(time_taken)

        if len(times) % 1000 == 0:
            print(
                len(times),
                np.percentile(
                    times,
                    (50, 75, 90, 95, 99, 99.9, 99.99, 99.999, 99.9999,
                     99.99999, 99.999999, 99.9999999, 99.99999999, 99.999999999),
                ),
            )

    1215000 [0.04018716 0.04024355 0.04028641 0.04031676 0.04038924 0.04047388 0.04091035 0.0595979 0.06972685 0.07359515 0.07406085 0.07410742 0.07411208 0.07411254]

After 1.2M calls, it is clear that 99% of the time the call returns within 1% of the requested timeout, and 99.99% of the time it returns within 2.5% of the requested timeout, but in the worst case it takes almost double the requested timeout. In practice, on my workload, I’ve seen it occasionally take nearly 3x the requested timeout.

Perhaps I would be better off waiting until near the end of the timeout and then making a get request with a timeout of 0? I’m afraid, though, that the source of the delay is not in the loop watched by the timeout, so this wouldn’t have any practical effect.
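Something like the following is what I have in mind (a sketch using ray.wait to poll just before the deadline instead of relying on get’s internal timeout; foo and the 38 ms budget are taken from the examples above):

    import time

    import ray

    remote = foo.remote()
    deadline = time.perf_counter() + 0.038  # the 38 ms budget

    # Sleep until just before the deadline, then poll without blocking.
    time.sleep(max(0.0, deadline - time.perf_counter() - 0.002))
    ready, _ = ray.wait([remote], timeout=0)

    result = ray.get(ready[0]) if ready else None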

There are a few other services also using redis, so perhaps that is a possible cause of delay?

@dwiel Thanks for your reply. Given your setup, we can eliminate the network from the picture.

  • One possibility is that even though you disabled the Python GC, it could still be triggered by an internal component (see the sketch after this list);
  • Another guess is scheduling delay in the Ray scheduler;
  • If you allocate lots of plasma objects, like large models, delay from plasma GC/spilling is possible too.
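One way to check the first point is to register a gc callback, which fires even for collections triggered explicitly by library code while automatic collection is disabled (a minimal sketch using CPython’s standard gc.callbacks hook):

    import gc
    import time

    _gc_start = None


    def _log_gc(phase, info):
        # CPython calls this at the start and stop of every collection,
        # including explicit gc.collect() calls made by library/internal code.
        global _gc_start
        if phase == "start":
            _gc_start = time.perf_counter()
        elif phase == "stop" and _gc_start is not None:
            print(f"gc gen{info['generation']} took {time.perf_counter() - _gc_start:.4f}s")


    gc.callbacks.append(_log_gc)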

I’d suggest the best way to figure it out is by enabling debug logs: set the env var RAY_BACKEND_LOG_LEVEL=debug and read the logs around the times when those spiky timeouts happen; this will give a clear hint about what slows down the specific get requests.
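For example, roughly like this when Ray is started locally by ray.init (so the backend processes inherit the driver’s environment; if you start the cluster with ray start, set the variable in that shell instead):

    import os

    # Must be set before ray.init() so the raylet / core worker processes inherit it.
    os.environ["RAY_BACKEND_LOG_LEVEL"] = "debug"

    import ray

    ray.init()
    # The backend logs typically end up under /tmp/ray/session_latest/logs/.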

At a high level though, it’s usually very hard to ensure that all requests finish within a given timeout in a multi-threaded/distributed system. That’s why most such systems measure p99 performance.

Hope this helps, and let us know if you have more questions.