I am still somewhat new to ray/ray-core and currently confused about potential limits on the number of tasks submitted in one job and across all jobs running on a Ray cluster.
I first ran into this in practice after submitting a job with >1M tasks and noticing a significant slowdown compared to debugging runs with just a few thousand tasks (using ray[default] 2.40.0).
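For context, a stripped-down sketch of what I'm doing looks roughly like the snippet below (the task body and the task count are just placeholders for the real workload):

```python
import ray

ray.init()

@ray.remote
def work(i):
    # stand-in for the real per-task workload
    return i * i

# naive pattern: submit everything up front, then gather all results at once
refs = [work.remote(i) for i in range(1_000_000)]
results = ray.get(refs)
```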
The information I found on the topic, however, is a bit scarce and partially outdated as far as I can tell. What seems clear is that both the dashboard interface and the Ray task API have a limit of 10k tasks to retrieve and track. This is a bit confusing since the dashboard does not show a warning, but otherwise I don't mind.
Regarding task submissions and the GCS, I found varying numbers and limits.
In this (somewhat outdated) issue, links are provided to Ray config files stating a 10k limit in the dashboard and a 100k limit in the GCS backend. In this older post, it is mentioned that the 100k limit is a global limit and the per-job limit in the GCS is actually 10k.
Now, looking at up-to-date versions of these config files, I no longer see such restrictions or config variables. Some in-line comments suggest that submitting 100k+ tasks is fine. Practically, I was able to run some toy jobs with 100k tasks but experienced slowdowns with larger numbers.
So, what is the up-to-date info on this?
- Is there a task limit in the GCS per job and/or node-wide?
- Or do you simply run into memory issues when submitting more?
- Or are there other constraints that make task submission/queuing slow after hitting some limit?
- What is the recommended way to handle a large number of tasks (say between 100k and 1M)? Batching is a bit annoying when dealing with failed tasks (see the sketch after this list for roughly what I mean).
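For concreteness, the batching/backpressure pattern I've been experimenting with looks roughly like this, using ray.wait to bound the number of in-flight tasks and max_retries for failed tasks (work, NUM_TASKS, and MAX_IN_FLIGHT are placeholders):

```python
import ray

ray.init()

@ray.remote(max_retries=3)  # let Ray retry failed tasks automatically
def work(i):
    return i * i

NUM_TASKS = 1_000_000    # placeholder for the real job size
MAX_IN_FLIGHT = 10_000   # bound on concurrently submitted tasks

in_flight, results = [], []
for i in range(NUM_TASKS):
    if len(in_flight) >= MAX_IN_FLIGHT:
        # wait for one task to finish before submitting the next
        done, in_flight = ray.wait(in_flight, num_returns=1)
        results.extend(ray.get(done))
    in_flight.append(work.remote(i))
results.extend(ray.get(in_flight))
```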
Running 100k+ tasks should be possible on a large enough cluster (e.g. 100+ nodes) with a beefy head node, in my opinion.
Any help and/or links to configs/docs would be appreciated!