Is there a Ray task limit?

How severely does this issue affect your experience of using Ray?

  • Medium: It contributes to significant difficulty in completing my task, but I can work around it.

I have submitted a job using ray start which creates over 10,000 tasks. When I monitor the status of my job with ray status and ray summary tasks, I can see it completing my tasks. However, it only completes 10,000 tasks, then stops executing and hangs, even though the ray summary tasks output shows that I have more than 10,000 tasks. Is 10,000 tasks a hard limit for a job? Is there any way to work around it that isn't just submitting several jobs? Thanks in advance!

@adityatv Thanks for asking the question. 10K tasks at one time is pushing the limits; you may want to break them into batches of tasks.
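One common pattern is to cap the number of in-flight tasks with ray.wait instead of submitting everything up front. A rough sketch, using a hypothetical process_item task in place of your real function:

import ray

ray.init()

@ray.remote
def process_item(item):
    # Placeholder for the real per-item work.
    return item

MAX_IN_FLIGHT = 1_000
in_flight, results = [], []

for item in range(50_000):
    if len(in_flight) >= MAX_IN_FLIGHT:
        # Wait for at least one task to finish before submitting more.
        done, in_flight = ray.wait(in_flight, num_returns=1)
        results.extend(ray.get(done))
    in_flight.append(process_item.remote(item))

# Drain whatever is still pending.
results.extend(ray.get(in_flight))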

cc: @rickyyx @jjyao What's our hard limit on the number of tasks in a single submission? Is it 10K?

@adityatv do you have a simple script to reproduce this? I think we have stress tests for tasks, which look fine.

I don't have one currently, but after reducing my workload to fewer than 10K tasks, I ran across a similar issue where all the tasks (this time 4K) are marked as FINISHED in the ray summary tasks output, but the program still hangs. The part of the code that handles task completion looks like this:

results_ref = [
    _execute_function_on_list.remote(
        func,   # function to execute
        arr,    # array of inputs
        *args,  # parameters to func
    )
    for i in range(num_splits)
]
  
timeout = 30
while len(results_ref) > 0:
    ready_ref, results_ref = ray.wait(results_ref, num_returns=len(results_ref), timeout=timeout)
    ready_list = ray.get(ready_ref)
    for res in ready_list:
        ...  # update dict with res

print("Done")

I run this piece of code several times, and it seems to hang after the second iteration even though it marks all the tasks as FINISHED. That is, the first iteration produces 2K tasks, all of which complete properly, and then the second iteration produces another 2K tasks which are marked as FINISHED, but the program doesn't proceed past the while loop (i.e. doesn't print a second "Done").

All that being said, I'm wondering if maybe this isn't an issue with the 10K number but some other issue entirely. I'll try to come up with a repro script.
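In the meantime, I might instrument the wait loop roughly like this (same placeholder names as in the snippet above) to see whether anything is actually still pending when it stalls:

timeout = 30
pending = results_ref
while pending:
    ready_ref, pending = ray.wait(pending, num_returns=len(pending), timeout=timeout)
    print(f"ready this round: {len(ready_ref)}, still pending: {len(pending)}")
    for res in ray.get(ready_ref):
        ...  # update dict with res

print("Done")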

So, I don't have a repro script yet, but one other thing to add about my program: in between the iterations (i.e. after I print "Done") I save my dictionary to a file, and that takes several hours since it can be over 100 GB. Once the file was written and the second iteration of tasks started, the program failed because of the following:


Failed to retrieve object 89aee05378ca965dffffffffffffffffffffffff0100000001000000. To see information about where this ObjectRef was created in Python, set the environment variable RAY_record_ref_creation_sites=1 during ray start and ray.init().

The object’s owner has exited. This is the Python worker that first created the ObjectRef via .remote() or ray.put(). Check cluster logs (/tmp/ray/session_latest/logs/*01000000ffffffffffffffffffffffffffffffffffffffffffffffff* at IP address 10.4.5.1) for more information about the Python worker failure.


When I look at the appropriate log file I see the following issue:

[2023-08-18 11:06:34,649 W 169474 169638] reference_count.cc:1559: Object locations requested for 4b3161e62d4b261effffffffffffffffffffffff0100000001000000, but ref already removed. This may be a bug in the distributed reference counting protocol.
[2023-08-18 11:06:34,649 W 169474 169638] reference_count.cc:1559: Object locations requested for 5b48228d017ae908ffffffffffffffffffffffff0100000001000000, but ref already removed. This may be a bug in the distributed reference counting protocol.

Is there some sort of time limit on how long a program can go without using a Ray command before things get evicted?
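In case the timing matters, the overall shape of my driver is roughly the following (the task body, inputs, and file names here are placeholders, not the real ones):

import pickle
import ray

ray.init()

@ray.remote
def _execute_function_on_list(func, arr, *args):
    # Placeholder body; the real task applies func to each element of arr.
    return {x: func(x, *args) for x in arr}

def func(x):
    return x  # placeholder for the real function

results = {}

for iteration in range(2):
    # Roughly 2K tasks per iteration.
    refs = [_execute_function_on_list.remote(func, list(range(10))) for _ in range(2000)]

    # Same collection loop as above.
    while refs:
        ready, refs = ray.wait(refs, num_returns=len(refs), timeout=30)
        for res in ray.get(ready):
            results.update(res)

    print("Done")

    # In the real program this save alone takes several hours, since the file
    # can be over 100 GB; no Ray calls happen during it.
    with open(f"results_iter_{iteration}.pkl", "wb") as f:
        pickle.dump(results, f)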
