Confusion around Ray Core task limit

I am still somewhat new to ray/ray-core and currently confused about potential limits on the number of tasks submitted in one job and across all jobs running on a Ray cluster.

I first came across this in practice after submitting a job with >1M tasks and noticing a significant slowdown compared to debugging runs with just a few thousand tasks (using ray[default] 2.40.0).
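
Schematically, the job boils down to something like this (heavily simplified; the function and item names are just placeholders):

```python
import ray

ray.init(address="auto")

@ray.remote
def process(item):
    # placeholder for the actual per-item work
    return item

# Submit everything up front, then block on all results at once.
refs = [process.remote(i) for i in range(1_000_000)]
results = ray.get(refs)
```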

The information I have found on the topic, however, is a bit scarce and partially outdated as far as I can tell. What does seem clear is that the dashboard interface as well as the Ray task API has a limit of 10k tasks that can be retrieved and tracked. This is a bit confusing since the dashboard does not show a warning, but otherwise I don’t mind.
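
For reference, querying tasks programmatically seems to hit the same cap; this is a minimal sketch of what I have been using, assuming `ray.util.state.list_tasks` is the right entry point here:

```python
import ray
from ray.util.state import list_tasks

ray.init(address="auto")

# The state API caps the number of returned task records server-side
# (10k as far as I can tell), so larger jobs cannot be fully inspected this way.
tasks = list_tasks(limit=10_000)
print(f"Retrieved {len(tasks)} task records")
```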

Regarding task submissions and the GCS, I found varying numbers and limits.
In this (somewhat outdated) issue, links are provided to Ray config files stating a 10k limit in the dashboard and a 100k limit in the GCS backend. In this older post, it is mentioned that the 100k limit is a global limit and that the per-job limit in the GCS is actually 10k.

Now, looking at up-to-date versions of these config files, I do not see such restrictions or config variables anymore. Some in-line comments suggest that submitting 100k+ tasks is fine. In practice, I was able to run some toy jobs with 100k tasks but experienced slowdowns with larger numbers.

So, what is the up-to-date info on this?

  • Is there a per-job and/or cluster-wide task limit in the GCS?
  • Or do you simply run into memory issues when submitting more?
  • Or are there other constraints that make task submission/queueing slow after hitting some limit?
  • What is the recommended way to handle a large number of tasks (say between 100k and 1M)? Batching is a bit annoying when dealing with failed tasks (see the sketch below).
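
To make that last point concrete, this is roughly the kind of `ray.wait`-based backpressure loop I have in mind instead of fixed batches; the cap on in-flight tasks and all names are made up for illustration:

```python
import ray
from ray.exceptions import RayTaskError

ray.init(address="auto")

# max_retries only covers system failures (e.g. a worker dying);
# application-level exceptions are resubmitted manually below.
@ray.remote(max_retries=3)
def process(item):
    # placeholder for the actual per-item work
    return item

MAX_IN_FLIGHT = 5_000  # illustrative cap on concurrently submitted tasks


def run(items):
    results = []
    in_flight = {}  # object ref -> original item, kept around for resubmission
    it = iter(items)
    exhausted = False
    while in_flight or not exhausted:
        # Top up submissions until the in-flight cap is reached.
        while not exhausted and len(in_flight) < MAX_IN_FLIGHT:
            try:
                item = next(it)
            except StopIteration:
                exhausted = True
                break
            in_flight[process.remote(item)] = item
        if not in_flight:
            break
        # Block until at least one task finishes, then collect it.
        done, _ = ray.wait(list(in_flight), num_returns=1)
        for ref in done:
            item = in_flight.pop(ref)
            try:
                results.append(ray.get(ref))
            except RayTaskError:
                # Task raised an exception: resubmit the item
                # (a real version would cap the number of resubmissions).
                in_flight[process.remote(item)] = item
    return results


results = run(range(1_000_000))
```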

Running 100k+ tasks should be possible on a large-enough (e.g. 100+ node) cluster with a beefy head node, in my opinion.

Any help and/or links to configs/docs would be appreciated :slight_smile:


I have also now come across this benchmark README, which mentions 1M+ tasks on a single node and 10k+ simultaneous tasks in the distributed setup, so I am wondering why I previously observed such a severe slowdown with ca. 3M tasks…

Hi @MaxSchambach

So you submitted 3M tasks and experienced a severe slowdown? Can you tell me more about your environment, such as the size of the cluster?

Yes, although it is hard to tell exactly, as the dashboard no longer shows the tasks and their progress. When first testing with <10k tasks, my local logging and the cluster utilization suggested that everything worked as expected, but after submitting the full workload with 3M tasks, the utilization suggested that nothing was happening once the future object refs had been submitted.

This was running on a 32-node cluster with rather small worker nodes (I think just 2 CPUs and 16 GB RAM per node) but a larger head node with 16 CPUs and 64 GB RAM.