Scale up from 0

Is it possible for a Ray cluster to scale up from 0 workers? If I start a Kubernetes cluster with minWorkers: 0 and submit a job, no workers are launched. If I change that to minWorkers: 1, everything scales up and works fine. The autoscaler doesn't seem to recognize the demand when there are no workers.

======== Autoscaler status: 2021-07-13 05:56:40.044341 ========
Node status
---------------------------------------------------------------
Healthy:
 1 rayHeadType
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------

Usage:
 0.00/1.367 GiB memory
 0.00/0.571 GiB object_store_memory

Demands:
 (no resource demands)
ray,ray:2021-07-13 05:56:40,046	DEBUG legacy_info_string.py:24 -- Cluster status: 0 nodes
 - MostDelayedHeartbeats: {'10.16.202.163': 0.14696931838989258}
 - NodeIdleSeconds: Min=0 Mean=0 Max=0
 - ResourceUsage: 0.0 GiB/1.37 GiB memory, 0.0 GiB/0.57 GiB object_store_memory
 - TimeSinceLastHeartbeat: Min=0 Mean=0 Max=0
Worker node types:
ray,ray:2021-07-13 05:56:45,170	DEBUG resource_demand_scheduler.py:160 -- Cluster resources: [{'memory': 1468006400.0, 'object_store_memory': 612928463.0, 'node:10.16.202.163': 1.0}]
ray,ray:2021-07-13 05:56:45,170	DEBUG resource_demand_scheduler.py:161 -- Node counts: defaultdict(<class 'int'>, {'rayHeadType': 1})
ray,ray:2021-07-13 05:56:45,170	DEBUG resource_demand_scheduler.py:172 -- Placement group demands: []
ray,ray:2021-07-13 05:56:45,171	DEBUG resource_demand_scheduler.py:218 -- Resource demands: []
ray,ray:2021-07-13 05:56:45,171	DEBUG resource_demand_scheduler.py:219 -- Unfulfilled demands: []
ray,ray:2021-07-13 05:56:45,190	DEBUG resource_demand_scheduler.py:241 -- Node requests: {}
ray,ray:2021-07-13 05:56:45,218	INFO autoscaler.py:354 -- 

Got it. Can you provide the Python script you run after the cluster is launched?

This seems to be a recent regression. Are you deploying with helm and the default images?

import time

import ray

LOCAL_PORT = 10001


@ray.remote
def do_some_work(x):
    print('doing some work')
    time.sleep(1)  # Replace this with the work you need to do.
    return x * x


def main():
    start = time.time()
    results = [ray.get(do_some_work.remote(x)) for x in range(4)]
    print("duration =", time.time() - start)
    print("results = ", results)


if __name__ == '__main__':
    ray.client(f'127.0.0.1:{LOCAL_PORT}').connect()
    main()

I am deploying with helm; the head node uses the default image, and the worker nodes use a custom image derived from rayproject/ray:latest-py37-cpu.

Got it.
Would you mind pasting the values.yaml to make sure we’re not missing anything?

podTypes:
  rayHeadType:
    memory: 2000Mi
    nodeSelector:
      role: head
    rayResources:
      CPU: 0
  rayWorkerType:
    CPU: 62
    maxWorkers: 800
    memory: 244000Mi
    minWorkers: 1
    nodeSelector:
      role: worker

Thanks! The CPU: 0 annotation on the head node is an important detail.