Relationship between Ray Workers, trials, and CPUs

1. Severity of the issue: (select one)
None: I’m just curious or want clarification.
Low: Annoying but doesn’t hinder my work.
Medium: Significantly affects my productivity but can find a workaround.
High: Completely blocks me.

2. Environment:

  • Ray version: 2.43.0
  • Python version: 3.11.0
  • OS: Ubuntu 22.04.4 LTS
  • Cloud/Infrastructure: Azure
  • Other libs/tools (if relevant): Ray on Databricks Spark

3. Question:
Assumptions (correct me if I am wrong about any of these):

  • Each trial will by default be allocated 1 CPU.
  • Each trial corresponds to a Ray task.
  • Each Ray task is scheduled on a Ray Worker that has sufficient resources for it.
  • Each Ray Worker only runs one Ray task at a time.
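
To make the first two assumptions concrete, this is roughly how I picture each trial's resource request; the task below is only an illustration, not my actual trainable:

```python
import ray

# My mental model: a trial behaves like a Ray task that reserves 1 logical CPU
# by default, i.e. roughly equivalent to this explicit request.
@ray.remote(num_cpus=1)
def trial_like_task(x):
    # Stand-in for the work a single trial would do.
    return x * x
```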

Given the above, assume I have:

  • 8 Ray Workers (4 CPUs each)
  • 50 trials to be scheduled with default resource usage and no max_concurrency set (roughly the setup sketched below)
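
For concreteness, the run is set up roughly like this (the objective and search space are simplified placeholders, not my real code):

```python
from ray import tune

def objective(config):
    # Simplified stand-in for my actual training function.
    return {"score": config["x"] ** 2}

tuner = tune.Tuner(
    objective,
    param_space={"x": tune.uniform(0.0, 1.0)},
    tune_config=tune.TuneConfig(
        num_samples=50,             # 50 trials
        # max_concurrent_trials=8,  # not set in my run; no explicit concurrency cap
    ),
)
results = tuner.fit()
```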

With all of that, am I correct that the following will occur?

  • 8 trials will be started on the 8 Ray Workers at first
  • These running trials would be using 8 CPUs (1 per trial, as is the default), with the other 24 CPUs either used by other Ray processes or idle
  • When one of the trials finishes, another will be scheduled, so that as long as there are trials left to schedule, there will always be 8 running concurrently?

I am confused because I ran an experiment with this configuration and found that there were 22 trials in the RUNNING state at some point; I do not understand how 22 of them could be running at the same time.
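
In case it is relevant, this is how I have been checking what the cluster exposes; I believe these are the standard calls, but correct me if there is a better way:

```python
import ray

ray.init(address="auto")  # attach to the running cluster (connecting may differ on Databricks)

# Total logical CPUs Ray sees vs. CPUs currently unreserved; I expected
# 32 total (8 workers x 4 CPUs) with at most 8 reserved by trials.
print("total:", ray.cluster_resources().get("CPU"))
print("free: ", ray.available_resources().get("CPU"))
```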

Hey @ishaan-mehta, what do you mean by “Ray Worker” in this context? Are you referring to worker nodes?