How severely does this issue affect your experience of using Ray?
- Medium: It causes significant difficulty in completing my task, but I can work around it.
Say I have 200 jobs to run: 100 A jobs and 100 B jobs, each requiring 1 GPU. Each B job has a corresponding prerequisite A job that must finish first, and I have 10 GPUs in total. Is there any mechanism to make sure the GPUs are never left idle? In the extreme case, could Ray assign all 10 GPUs to B jobs even though none of those B jobs can begin, because their corresponding A jobs haven't finished yet?
Assume I submit the jobs without regard to priorities or dependencies: I just throw them all at the Ray cluster at once, in no particular order.
By the way, note that in Ray a "job" means a Ray driver (see Ray Jobs Overview — Ray 3.0.0.dev0). What you are describing are called tasks.
This should just work out of the box. Ray automatically schedules a new task whenever resources become available, so as soon as one task (either A or B) finishes, the next one is scheduled. Unless a B task requires multiple A tasks to finish, your GPUs should stay fully used until fewer than 10 tasks remain in the cluster.
Thanks. But what if the 10 GPUs are processing 10 B tasks whose A tasks are still in the queue? Those 10 B tasks would be stuck in the running state forever, right?
If I understand your question correctly, B cannot run without A, right? So it's something like this:
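all_a = [a.remote() for _ in range(100)]
all_b = []
for a_ref in all_a:
    # Passing a_ref makes each b wait until its corresponding a has finished
    all_b.append(b.remote(a_ref))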
Regarding "10 GPUs are processing 10 B tasks, whose A tasks are still in the queue": isn't that impossible in the first place? B can only run once A has completed.
Similar; it's more like this:
self.pre_requisite_task = a
Hmm, that code doesn't look executable? I just want to emphasize that if an object reference is passed to another remote task, that task is not scheduled until the upstream task producing the reference has completed. For example:
a_ref = a.remote()
# In this case, b wouldn't be scheduled until a is completed
b_ref = b.remote(a_ref)
Thanks, but how does Ray know that b depends on a?
If it runs b without a, b can still run fine without any error (though it didn't run the code of