Actor Scheduling Bug?

How severely does this issue affect your experience of using Ray?

  • Medium: It contributes significant difficulty to completing my task, but I can work around it.

Hey Team -

We have been using Ray for a few years and are fairly comfortable with it, but recently we have been hitting an issue that we simply can't solve.

Essentially, all the script below does is test how fast we can spin up a list of actors, run a trivial computation on each, and print the total execution time.

I would say that 90% of the time, on the first run, we see the following:

‘The actor died unexpectedly before finishing this task.’

Shortly thereafter, we see that a few of the actors appear to have been recreated, with the following statement logged:

‘(MyActor pid=141374, ip=10.51.191.60) [2024-02-20 14:56:33,892 E 141374 141374] actor_scheduling_queue.cc:86: client skipping requests 0 to 0’

All subsequent requests to the newly created actors measure the time and print the results without issue. Unfortunately, there appears to be a problem with scheduling this many actors at once and expecting them to be immediately available without retry.
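One mitigation consistent with that theory is to block until every actor has finished construction before submitting any work. A minimal sketch, reusing the actors list from the code example below and a hypothetical no-op probe method (recent Ray releases also expose a built-in __ray_ready__ method on actor handles that can serve the same purpose):

# Given a trivial probe defined on the actor class:
#     def ready(self):
#         return True
# block until every actor has been constructed and scheduled:
ray.get([actor.ready.remote() for actor in actors])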

If we update the code to use ray.wait() instead of ray.get(), it appears stable, albeit with an initial performance hit (see the sketch after the code example below).

Code example

#%%
import ray
import numpy as np
import time

# Initialize Ray
ray.init(address="<endpoint>")

@ray.remote
class MyActor:
    def __init__(self, size_mb):
        # Allocate ~size_mb MB of float64 values (8 bytes each).
        self.vector = np.random.rand(size_mb * (1024**2) // 8)

    def sum_vector(self):
        # Vectorized NumPy sum; the builtin sum() would iterate the
        # array element by element and is far slower.
        return self.vector.sum()

# Create actors and calculate sums
num_actors = 1024
size_mb = 10

#%%
# create actors
actors = [MyActor.options(name=f"TestActor{i}").remote(size_mb) for i in range(num_actors)]

#%%
# Measure time and print results
start_time = time.time()
futures = [actor.sum_vector.remote() for actor in actors]
results = ray.get(futures)
time_taken = time.time() - start_time
print(f"Calculated {num_actors} sums in {time_taken} seconds")

@neoearth

Can you check the Ray logs to see if the actors die? How big is your cluster? Is it possible that an OOM happens?
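For example, with the Ray state API (available in Ray >= 2.0; a sketch, run from a machine connected to the cluster):

from ray.util.state import list_actors

# List actors the cluster has marked DEAD, along with the reported
# death cause. Raw logs also live under /tmp/ray/session_latest/logs
# on each node.
for actor in list_actors(filters=[("state", "=", "DEAD")], detail=True):
    print(actor.actor_id, actor.name, actor.death_cause)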

It wouldn't be at the node level; the nodes are very large and have plenty of resources. Is there somewhere else I could check?