Seeing weirdness when running Ray DAG

If I bind all my DAG nodes before submitting my script to Ray, why would I see new tasks getting added after ~100 tasks finish?

Can you share a script that represents the scenario? I don’t think I understood the question.

It is a bit difficult to share a minimal reproducible script. I will still give it a try, but let me describe what I am seeing in detail first.

Say I have a DAG with 167 tasks in total. When I submit it to Ray, I find that, e.g., 130 tasks show up in the dashboard (these 130 finish really fast). Then the job appears to be “stuck”; after a while, the 131st task appears and runs. Eventually, after a long time, all 167 tasks finish, but this looks off.

Ray 2.6.1
Python 3.10.12

A somewhat related question:
If I run:

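import ray


@ray.remote
def calc1(x, y):
    return x + y


@ray.remote
def calc2(x):
    return x + 1


dts = range(100)
outputs = {}

# Build one calc1 -> calc2 chain per entry; nothing runs until execute().
for _dt in dts:
    outputs[_dt] = calc2.bind(calc1.bind(1, 2))


@ray.remote
def join_for_dag(*args):
    # Only joins the upstream nodes; returns None.
    return


tasks = list(outputs.values())

# Note: the whole list is passed as a single argument here.
final = join_for_dag.bind(tasks)

print(ray.get(final.execute()))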

Why would I see the job marked as SUCCEEDED on the dashboard even though 190 of the 201 tasks actually failed?
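
For reference, the 201 total is presumably just the number of nodes in the DAG built by the snippet: one calc1 and one calc2 per entry, plus the single join. A quick sanity check (the variable names here are illustrative only, and this assumes the dashboard counts one task per DAG node):

```python
# Count the nodes created by the snippet: two per loop iteration plus one join.
dts = range(100)
tasks_per_entry = 2   # calc1 and calc2
join_tasks = 1        # join_for_dag

total_tasks = len(dts) * tasks_per_entry + join_tasks
print(total_tasks)
```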

Re: the original question, I find that my job runs perfectly after I switch from .bind to .remote. This might be a bug in Ray DAG.

Let me try to repro this next week!
