How does Ray pick which job to run when there are more jobs than available resources?

Hello. I’ve been trying to find information about how Ray picks the “next” job to execute when there are multiple jobs in the queue that are submitted by multiple workflows.

For example, let’s say I submit a workflow with 1000 jobs, and while those 1000 jobs are being executed I submit another 100 jobs. It looks like Ray does not wait for the first 1000 jobs to finish; it starts executing some of the 100 new jobs concurrently. Does Ray pick the next job randomly from the queue? If it does, is there a way to “penalize” the priority of the first 1000 jobs so that other jobs are more likely to be processed? Can I configure this behavior somehow via configuration files?

There isn’t a way to configure priority directly, but you can wait for the first batch to complete before submitting the second. Once tasks are submitted, there is no way to deprioritize them in favor of tasks submitted later.

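import ray

@ray.remote
def one():
    pass

@ray.remote
def two():
    pass

# Submit the first batch and block until every task in it has finished.
ones = [one.remote() for _ in range(3)]

ray.get(ones)
print("finished ones")

# Submit the second batch only after the first batch is done.
twos = [two.remote() for _ in range(3)]

ray.get(twos)
print("finished twos")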

Thanks for the tip! @ClarenceNg

If I submit the second batch without waiting for the first batch to complete, how does Ray handle the ordering of job execution? Will Ray start processing the second batch before completing all of the jobs in the first?

Several factors affect when a task completes, such as variance in processing time and whether a task is retried. Even in the ideal scenario, we don’t guarantee any specific execution order once tasks are submitted. Tasks may execute in FIFO order, but that is not a guarantee. If the tasks in the two batches are identical, it is quite likely that they execute in FIFO order.

To wait for completion, your best bet is to look into functions like ray.get or ray.wait.

@ClarenceNg I actually don’t want FIFO. What I want is for Ray to provide “fair execution” of multiple workflows submitted at more or less the same time. For example, I am using Ray as a backend to run workflows submitted from a multi-user web application. When multiple users submit workflows simultaneously, all requests are submitted more or less concurrently to a single Ray cluster. I want all workflows to be executed concurrently without a single user taking over the whole cluster for an extended period of time. I definitely don’t want FIFO in this case, as that means all users will have to wait for other users’ jobs to finish, especially if the first user submits a large job and the other users’ jobs are much smaller.

Am I correct to assume that Ray itself does not provide such a “fair execution” policy? I am familiar with batch schedulers like Slurm and HTCondor that handle multi-user environments, but I am wondering if there is anything I can use on top of Ray to make such “fair execution” possible. Has anyone worked on a layer on top of Ray to make that happen? Ideally, I’d like to be able to configure Ray’s internal scheduling system, but is Ray strictly designed for a single-user environment?