How severely does this issue affect your experience of using Ray?
Medium: It causes significant difficulty in completing my task, but I can work around it.
I have submitted a job using ray start which creates over 10,000 tasks. When I monitor the status of my job with ray status and ray summary tasks, I can see it completing my tasks. However, it completes only 10,000 tasks and then stops executing and hangs, even though the ray summary tasks output recognizes that I have more than 10,000 tasks. Is 10,000 tasks a hard limit for a job? Is there any way to work around it that isn’t just submitting several jobs? Thanks in advance!
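For reference, the closest workaround I can think of on my side is throttling submissions myself so only a bounded number of tasks are in flight at once, roughly like the sketch below (process_item, items, and max_in_flight are placeholders, not my real code) — though I’d prefer not to restructure things if 10K isn’t actually a hard limit:

import ray

ray.init()

@ray.remote
def process_item(item):
    # Placeholder for the real per-task work.
    return item

items = list(range(50_000))   # more tasks than I want in flight at once
max_in_flight = 1_000         # assumed cap, tuned to the cluster size

in_flight = []
results = []
for item in items:
    if len(in_flight) >= max_in_flight:
        # Block until at least one task finishes before submitting more.
        done, in_flight = ray.wait(in_flight, num_returns=1)
        results.extend(ray.get(done))
    in_flight.append(process_item.remote(item))

results.extend(ray.get(in_flight))
print(len(results))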
I don’t have one currently, but after reducing my workload to fewer than 10K tasks, I ran into a similar issue where all the tasks (4K this time) are marked as FINISHED in the ray summary tasks output, but the program still hangs. The part of the code that handles task completion looks like this:
results_ref = [
    _execute_function_on_list.remote(
        func,   # function to execute
        arr,    # array of inputs
        *args   # parameters to func
    )
    for i in range(num_splits)
]

timeout = 30
while len(results_ref) > 0:
    ready_ref, results_ref = ray.wait(results_ref, num_returns=len(results_ref), timeout=timeout)
    ready_list = ray.get(ready_ref)
    for res in ready_list:
        # Update dict with res
        ...
print("Done")
I run this piece of code several times, and it seems to hang on the second iteration, after it marks all the tasks as FINISHED. So the first iteration produces 2K tasks, all of which complete properly, and then the second iteration produces another 2K tasks, which it marks as FINISHED, but then it doesn’t proceed past the while loop (i.e. it doesn’t print a second "Done").
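One thing I plan to try to narrow this down is consuming results one at a time instead of waiting for the whole batch, roughly like this (just a sketch of what I mean, not my production code), so a hang would at least point at a specific ref:

timeout = 30
remaining = list(results_ref)
while remaining:
    # Wait for a single task at a time.
    ready, remaining = ray.wait(remaining, num_returns=1, timeout=timeout)
    if not ready:
        print(f"ray.wait timed out after {timeout}s; {len(remaining)} refs still pending")
        continue
    res = ray.get(ready[0])
    # Update dict with res
print("Done")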
All that being said, I’m wondering if maybe this isn’t an issue with the 10K number but something else. I’ll try to come up with a repro script.
So, I don’t have a repro script yet, but one other thing to add about my program: in between iterations (i.e. after I print "Done"), I save my dictionary to a file, which takes several hours since it can be over 100 GB. Once the file was written and the second iteration of tasks started, the program failed because of the following:
Failed to retrieve object 89aee05378ca965dffffffffffffffffffffffff0100000001000000. To see information about where this ObjectRef was created in Python, set the environment variable RAY_record_ref_creation_sites=1 during ray start and ray.init().
The object’s owner has exited. This is the Python worker that first created the ObjectRef via .remote() or ray.put(). Check cluster logs (/tmp/ray/session_latest/logs/*01000000ffffffffffffffffffffffffffffffffffffffffffffffff* at IP address 10.4.5.1) for more information about the Python worker failure.
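For the next run I’m planning to turn on ref-creation-site recording as the error message suggests; my understanding from the message is that the variable just has to be set before ray start on the nodes and before ray.init() in the driver (a minimal sketch, exact start command is just an example):

# On each node, before starting the cluster:
#   RAY_record_ref_creation_sites=1 ray start --head   (same env var on worker nodes)
#
# In the driver, before calling ray.init():
import os
os.environ["RAY_record_ref_creation_sites"] = "1"

import ray
ray.init(address="auto")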
When I look at the appropriate log file I see the following issue:
[2023-08-18 11:06:34,649 W 169474 169638] reference_count.cc:1559: Object locations requested for 4b3161e62d4b261effffffffffffffffffffffff0100000001000000, but ref already removed. This may be a bug in the distributed reference counting protocol.
[2023-08-18 11:06:34,649 W 169474 169638] reference_count.cc:1559: Object locations requested for 5b48228d017ae908ffffffffffffffffffffffff0100000001000000, but ref already removed. This may be a bug in the distributed reference counting protocol.
Is there some sort of time limit on how long a program can go without calling a Ray API before objects get evicted?
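In the meantime, the mitigation I’m going to try is materializing everything with ray.get and dropping the ObjectRefs before the multi-hour file write, so that nothing in the second iteration depends on refs created before the save. A rough sketch of that restructuring, with _execute_function_on_list, my_func, my_arr, save_dict, and result_dict standing in for my real code — I don’t know yet whether this actually avoids the owner-exited error:

def run_iteration(func, arr, *args, num_splits=2000):
    refs = [
        _execute_function_on_list.remote(func, arr, *args)
        for _ in range(num_splits)
    ]
    # Pull every result into plain driver-side Python objects so the
    # ObjectRefs can be dropped before the long save.
    return ray.get(refs)

results = run_iteration(my_func, my_arr)
# ... update result_dict from results ...
save_dict(result_dict)                      # the ~100 GB write that takes hours
results = run_iteration(my_func, my_arr)    # second iteration starts with no old refs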