How severely does this issue affect your experience of using Ray?
Medium: It contributes to significant difficulty in completing my task, but I can work around it.
If I have 200 jobs to run, 100 A jobs and 100 B jobs, both requiring 1 GPU per job, and each B job has a corresponding prerequisite A job that must finish first, say I have 10 GPUs in total. Is there any mechanism to make sure the GPU resources are not left idle? In the extreme case, could Ray take all 10 GPUs and assign them to B jobs, which then can't begin because their corresponding A jobs are not finished yet?
Assume I submit the jobs without regard to priorities or dependencies, and just throw them all at the Ray cluster together, without any ordering.
This should just work out of the box. Ray automatically schedules a new task whenever resources become available, so when one task (either A or B) finishes, it schedules the next runnable task. Unless one of your B tasks requires multiple A tasks to finish first, your GPUs should stay fully used until there are fewer than 10 runnable tasks left in the cluster.
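For example, a rough sketch of that pattern for your 100 A / 100 B case (job_a and job_b are hypothetical placeholders for your real jobs; each requests one GPU):

```python
import ray

ray.init()  # assuming a cluster with 10 GPUs available

@ray.remote(num_gpus=1)
def job_a(i):
    # placeholder for the real A workload
    return i

@ray.remote(num_gpus=1)
def job_b(a_result):
    # placeholder for the real B workload; receives A's return value
    return a_result * 2

# Submit everything at once, with no manual ordering.
a_refs = [job_a.remote(i) for i in range(100)]
b_refs = [job_b.remote(a_ref) for a_ref in a_refs]

# A B task is only scheduled (and only reserves its GPU) once its A task
# has finished, so the 10 GPUs keep running whichever tasks are ready.
results = ray.get(b_refs)
```

Because a pending B task doesn't hold a GPU while it waits for its A task, the extreme case you describe (all 10 GPUs assigned to blocked B jobs) can't happen.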
Hmm, is that code not executable? I just want to emphasize that if the reference is passed to other remote tasks, those tasks are not scheduled until the upstream dependency has completed. Like:
```python
a_ref = a.remote()
# In this case, b wouldn't be scheduled until a is completed
b_ref = b.remote(a_ref)
```
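For completeness, a runnable version of that illustration might look like this (the a and b bodies are just placeholders):

```python
import ray

ray.init()

@ray.remote
def a():
    return "a finished"

@ray.remote
def b(a_result):
    # By the time b starts, Ray has resolved a_ref into a's return value.
    return f"b saw: {a_result}"

a_ref = a.remote()
b_ref = b.remote(a_ref)   # b is not scheduled until a has completed
print(ray.get(b_ref))     # prints "b saw: a finished"
```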
Thanks, but how does Ray know that b depends on a?
If I run b without a, it runs fine without any error (though it didn't run the code of do_something_in_b).