Why is my autoscaling cluster not scaling up to max when tuning?

1. Severity of the issue: (select one)
None: I’m just curious or want clarification.
Low: Annoying but doesn’t hinder my work.
Medium: Significantly affects my productivity but can find a workaround.
High: Completely blocks me.

2. Environment:

  • Ray version: 2.43.0
  • Python version: 3.11.0
  • OS: Ubuntu 22.04.4 LTS
  • Cloud/Infrastructure: Azure
  • Other libs/tools (if relevant): Ray on Databricks Spark

3. What happened vs. what you expected:
Context:

  • num_samples=50
  • Spark cluster set up with 4 workers of 8 cores each (non-autoscaling Spark cluster)
  • Ray cluster set up so that one Ray worker has 8 CPUs (one Ray worker per Spark worker)
  • Ray cluster is set up to autoscale from 1 to 4 Ray workers
  • No resources per trial specified, so should default to 1 CPU per trial if I’m not mistaken

Behavior:

  • Expected: When I tune using TuneBOHB without specifying max_concurrency, there should be no limit on concurrency, so Tune should schedule as many trials as possible and use all 4 workers at once (for 4 trials). If the cluster starts with 1 worker, it should therefore request additional workers until it reaches 4.
  • Actual: The entire tuning run finishes using only 1 worker the whole time.

Hi there Ishaan and welcome to the Ray community :slight_smile:

Thanks for explaining your use case! I have a few questions. I know you mentioned not using max_concurrency, but what happens if you explicitly set max_concurrent_trials? Does it work as intended if you set it to 4? Being explicit is a quick way to test whether the cluster scales up at all.
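For reference, here is a minimal sketch of what setting max_concurrent_trials to 4 could look like. The objective function, the search space, and the helper names tune_settings and build_tuner are placeholders I made up for illustration. I've also left out the TuneBOHB searcher to keep it short, so treat this as a starting point rather than your exact setup. It assumes Ray is already connected in the notebook.

```python
def objective(config):
    # Toy trainable (hypothetical): returns its final metric as a dict.
    return {"score": (config["x"] - 3) ** 2}


def tune_settings(num_samples=50, max_concurrent_trials=4):
    """Plain dict of TuneConfig arguments, so the cap is easy to inspect and change."""
    return {
        "num_samples": num_samples,
        "max_concurrent_trials": max_concurrent_trials,
        "metric": "score",
        "mode": "min",
    }


def build_tuner(settings):
    # Imported lazily so the helpers above can be inspected without Ray installed.
    from ray import tune

    return tune.Tuner(
        # Request 1 CPU per trial explicitly instead of relying on the default.
        tune.with_resources(objective, {"cpu": 1}),
        param_space={"x": tune.uniform(0, 10)},
        tune_config=tune.TuneConfig(**settings),
    )


# Usage, in the notebook where the Ray cluster is already running:
# results = build_tuner(tune_settings()).fit()
# print(results.get_best_result().config)
```

If the cluster still stays at 1 worker with the cap set explicitly, that suggests the problem is on the autoscaling side rather than in how Tune limits concurrency.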

Also, from a Spark notebook or shell (where you’ve called ray.init(...)), run: import ray; print(ray.cluster_resources()). Once Ray has registered all the worker nodes, this should show a total of at least 32 CPUs (4 workers * 8 CPUs each). If you only see ~8 CPUs, then your Ray cluster is not actually attaching to all Spark workers yet.
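If it helps, the comparison can be wrapped in a small helper like the sketch below. The function name cpu_report and its defaults (4 workers, 8 CPUs each, 1 CPU per trial) are my own assumptions based on the setup you described. It only reads the "CPU" entry of the dict that ray.cluster_resources() returns.

```python
def cpu_report(resources, workers=4, cpus_per_worker=8, cpus_per_trial=1):
    """Compare the CPUs Ray reports with the CPUs expected at full scale."""
    expected = workers * cpus_per_worker
    seen = resources.get("CPU", 0.0)
    return {
        "expected": expected,
        "seen": seen,
        # How many trials fit at once on the CPUs Ray currently sees.
        "trial_slots": int(seen // cpus_per_trial),
        "fully_scaled": seen >= expected,
    }


# Usage, in the notebook where Ray is already connected:
# import ray
# print(cpu_report(ray.cluster_resources()))
```

Running it once before tuning and again while trials are in flight should show whether the CPU count ever grows past a single worker.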

Also check ray status or ray monitor (depending on how you’ve launched Ray) to confirm whether Ray is requesting new workers and whether they are being provisioned. Can you let me know what that says?