Actor Scheduling Bug?

How severely does this issue affect your experience of using Ray?

  • Medium: It contributes significant difficulty to completing my task, but I can work around it.

Hey Team -

We have been using Ray for a few years and are fairly comfortable with it, but recently we have been hitting an issue that we simply can't solve.

Essentially, all the script below does is test how fast we can spin up a list of actors, run a trivial computation on each, and print the total execution time.

I would say that 90% of the time, on the first run, we see the following:

‘The actor died unexpectedly before finishing this task.’

Shortly thereafter, we see that a few of the actors appear to have been recreated, with the following statement logged:

‘(MyActor pid=141374, ip=10.51.191.60) [2024-02-20 14:56:33,892 E 141374 141374] actor_scheduling_queue.cc:86: client skipping requests 0 to 0’

All subsequent requests to the newly created actors measure the time and print the results without issue. Unfortunately, there appears to be a problem with scheduling this many actors at once and expecting them to be immediately available without retry.
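One mitigation consistent with that theory is to block until every actor has finished construction before submitting any work. A minimal sketch, reusing the actors list from the code example below and a hypothetical no-op probe method (recent Ray releases also expose a built-in __ray_ready__ method on actor handles that can serve the same purpose):

# Given a trivial probe defined on the actor class:
#     def ready(self):
#         return True
# block until every actor has been constructed and scheduled:
ray.get([actor.ready.remote() for actor in actors])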

If we update the code to use ray.wait() instead of ray.get(), it appears stable, albeit with an initial performance hit (see the sketch after the code example below).

Code example

#%%
import ray
import numpy as np
import time

# Initialize Ray
ray.init(address="<endpoint>")

@ray.remote
class MyActor:
    def __init__(self, size_mb):
        # Allocate ~size_mb MB of float64 values (8 bytes each).
        self.vector = np.random.rand(size_mb * (1024**2) // 8)

    def sum_vector(self):
        # Vectorized NumPy sum; the builtin sum() would iterate the
        # array element by element and is far slower.
        return self.vector.sum()

# Create actors and calculate sums
num_actors = 1024
size_mb = 10

#%%
# create actors
actors = [MyActor.options(name=f"TestActor{i}").remote(size_mb) for i in range(num_actors)]

#%%
# Measure time and print results
start_time = time.time()
futures = [actor.sum_vector.remote() for actor in actors]
results = ray.get(futures)
time_taken = time.time() - start_time
print(f"Calculated {num_actors} sums in {time_taken} seconds")

@neoearth

Can you check the Ray logs to see if the actors die? How big is your cluster? Is it possible that an OOM happens?
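For example, with the Ray state API (available in Ray >= 2.0; a sketch, run from a machine connected to the cluster):

from ray.util.state import list_actors

# List actors the cluster has marked DEAD, along with the reported
# death cause. Raw logs also live under /tmp/ray/session_latest/logs
# on each node.
for actor in list_actors(filters=[("state", "=", "DEAD")], detail=True):
    print(actor.actor_id, actor.name, actor.death_cause)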

It wouldn't be at the node level; the nodes are very large and have plenty of resources. Is there somewhere else I could check?