Previously well-running script no longer allocates resources correctly

Hi everyone!

I am facing a weird issue: a script that worked perfectly well three months ago no longer does. I am running Ray on an AWS cluster with my own Docker images.

The YAML file:

cluster_name: yeehaw

min_workers: 17
max_workers: 17
upscaling_speed: 1.0

idle_timeout_minutes: 5

docker:
    image: account/container
    container_name: ray_container
    pull_before_run: True

provider:
    type: aws
    region: eu-central-1
    availability_zone: eu-central-1a,eu-central-1b,eu-central-1c
    cache_stopped_nodes: False

auth:
    ssh_user: ec2-user
    ssh_private_key: VLX.pem

head_node:
    InstanceType: c5a.4xlarge
    ImageId: ami-097d024805419a86e
    KeyName: VLX

worker_nodes:
    InstanceType: c5a.4xlarge
    ImageId: ami-097d024805419a86e
    KeyName: VLX
#    InstanceMarketOptions:
#        MarketType: spot
    
initialization_commands:
    - docker login -u user -p password

setup_commands: []


# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
    - ray stop
    - ulimit -n 65536; ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml

# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
    - ray stop
    - ulimit -n 65536; ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076

The script:

import ray

numCPUscluster = 288
# create the actors that will run the calculations
actorList = [mutator.remote(listPosPackable, pdb3) for i in range(0, numCPUscluster - 1)]
pool = ray.util.ActorPool(actorList)  # generate the Ray actor pool

# pass all calculations to the actor pool and distribute them evenly;
# map_unordered yields results as they complete
output = list(pool.map_unordered(lambda a, v: a.performMutationBinding.remote(v), listofmutations))
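
For context, the mutator actor class is not shown above; here is a minimal, self-contained sketch of the actor side as I understand it (only the names come from my script, the class body and the placeholder inputs are made up):

import ray
from ray.util import ActorPool

# hypothetical stand-in for the real mutator actor
@ray.remote
class mutator:
    def __init__(self, listPosPackable, pdb3):
        self.positions = listPosPackable
        self.pdb = pdb3

    def performMutationBinding(self, mutation):
        return mutation  # placeholder for the real calculation

ray.init(address="auto")  # attach to the running cluster

listPosPackable, pdb3 = [], None         # placeholder inputs
listofmutations = ["A1G", "C2T", "D3E"]  # placeholder work items

actors = [mutator.remote(listPosPackable, pdb3) for _ in range(4)]
pool = ActorPool(actors)
results = list(pool.map_unordered(lambda a, v: a.performMutationBinding.remote(v), listofmutations))

Each actor created up front has to be placed on the cluster before it can take work, which is where the pending-actor messages below come from.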

When I run this script, the cluster produces the following output:

ffffffffffffc2da15158350db76dee656d301000000 cannot be scheduled right now. It requires {CPU: 1.000000} for placement, but this node only has remaining {16.000000/16.000000 CPU, 21.710038 GiB/21.710038 GiB memory, 9.304302 GiB/9.304302 GiB object_store_memory, 1.000000/1.000000 node:172.31.23.131}. In total there are 0 pending tasks and 20 pending actors on this node. This is likely due to all cluster resources being claimed by actors. To resolve the issue, consider creating fewer actors or increase the resources available to this Ray cluster. You can ignore this message if this Ray cluster is expected to auto-scale.
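
The message suggests creating fewer actors. One way to tie the actor count to what the cluster actually reports, instead of hard-coding 288, would be something like this (just a sketch; keeping one CPU free for the driver is my assumption, not a Ray requirement):

import ray

ray.init(address="auto")

# ask the cluster how many CPUs it registered instead of hard-coding 288
total_cpus = int(ray.cluster_resources().get("CPU", 0))
num_actors = max(1, total_cpus - 1)  # leave one CPU of headroom for the driver
print(f"would create {num_actors} actors")
# actorList = [mutator.remote(listPosPackable, pdb3) for _ in range(num_actors)]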

Ray monitor tells me this:

2021-11-27 10:58:23,480	INFO autoscaler.py:309 -- 
======== Autoscaler status: 2021-11-27 10:58:23.480464 ========
Node status
---------------------------------------------------------------
Healthy:
 17 ray-legacy-worker-node-type
 1 ray-legacy-head-node-type
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------

Usage:
 240.0/288.0 CPU
 0.00/387.679 GiB memory
 0.16/167.477 GiB object_store_memory

Demands:
 {'CPU': 1.0}: 47+ pending tasks/actors
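
The same numbers can be cross-checked from the driver itself (a small sketch, assuming the driver is attached to the running cluster):

import ray

ray.init(address="auto")

print(ray.cluster_resources())    # totals; should match the 288.0 CPU above
print(ray.available_resources())  # what the scheduler currently sees as free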

So the resources are definitely there; for some reason Ray is not able to allocate the actors to the free CPUs. The script runs for ages and never finishes. The strange thing is that three months ago everything worked like a charm.
I am on Ray 1.3.0.
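
To rule out workers that never registered their resources, the per-node view can also be dumped from the driver (again just a diagnostic sketch):

import ray

ray.init(address="auto")

# one entry per node: address, liveness, and the CPUs it registered;
# all 17 workers plus the head should show up with 16 CPUs each
for node in ray.nodes():
    print(node["NodeManagerAddress"], node["Alive"], node["Resources"].get("CPU"))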

Debug outputs:

)
    @     0x55a10924da5a  _PyCFunction_FastCallDict
    @     0x55a1092d5a5c  call_function
    @     0x55a1092f825a  _PyEval_EvalFrameDefault
    @     0x55a1092cefd4  _PyEval_EvalCodeWithName
    @     0x55a1092cfe51  fast_function
    @     0x55a1092d5b35  call_function
    @     0x55a1092f825a  _PyEval_EvalFrameDefault
    @     0x55a1092cfc1b  fast_function
    @     0x55a1092d5b35  call_function
    @     0x55a1092f825a  _PyEval_EvalFrameDefault
    @     0x55a1092cfc1b  fast_function
    @     0x55a1092d5b35  call_function
    @     0x55a1092f825a  _PyEval_EvalFrameDefault
    @     0x55a1092cefd4  _PyEval_EvalCodeWithName
    @     0x55a1092cfe51  fast_function
    @     0x55a1092d5b35  call_function
    @     0x55a1092f9019  _PyEval_EvalFrameDefault
    @     0x55a1092cefd4  _PyEval_EvalCodeWithName
    @     0x55a1092cfe51  fast_function
    @     0x55a1092d5b35  call_function

Aborted (core dumped)

Any ideas as to what is going wrong are very welcome, as I cannot make sense of this. Thanks!

What version of Ray are you using?

Hi!

Thanks for the question.
I am on Ray 1.3.0.