If I bind all my DAG nodes before submitting my script to Ray, why would I see new tasks getting added after ~100 tasks finish?
Can you share a script that represents the scenario? I don’t think I understood the question.
It is a bit difficult to share a minimal reproducible script. I will still give it a try, but let me describe what I am seeing in detail first.
Say I have a DAG with 167 tasks in total. When I submit it to Ray, only some of them, e.g. 130 tasks, show up in the dashboard (these 130 finish really fast). Then the job appears to be “stuck”; after a while, the 131st task appears and runs. Eventually, after a long time, all 167 tasks finish, but this looks off.
Ray 2.6.1
Python 3.10.12
A somewhat related question:
If I run:
import ray

@ray.remote
def calc1(x, y):
    return x + y

@ray.remote
def calc2(x):
    return x + 1

dts = range(100)
outputs = {}
for _dt in dts:
    # One calc1 -> calc2 chain per date
    outputs[_dt] = calc2.bind(calc1.bind(1, 2))

@ray.remote
def join_for_dag(*args):
    return

tasks = [out for out in outputs.values()]
final = join_for_dag.bind(tasks)
print(ray.get(final.execute()))
Why would I see the job marked as SUCCEEDED on the dashboard when 190 of the 201 tasks actually failed?
Re: the original question, I find my job runs perfectly after I switch from .bind to .remote. This might be a bug in Ray DAG.
Let me try repro next week!