Hi everyone!
I am facing a weird issue: a script that worked perfectly well three months ago no longer does. I am running Ray on an AWS cluster with my own Docker images.
The YAML file:
cluster_name: yeehaw
min_workers: 17
max_workers: 17
upscaling_speed: 1.0
idle_timeout_minutes: 5
docker:
    image: account/container
    container_name: ray_container
    pull_before_run: True
provider:
    type: aws
    region: eu-central-1
    availability_zone: eu-central-1a,eu-central-1b,eu-central-1c
    cache_stopped_nodes: False
auth:
    ssh_user: ec2-user
    ssh_private_key: VLX.pem
head_node:
    InstanceType: c5a.4xlarge
    ImageId: ami-097d024805419a86e
    KeyName: VLX
worker_nodes:
    InstanceType: c5a.4xlarge
    ImageId: ami-097d024805419a86e
    KeyName: VLX
    # InstanceMarketOptions:
    #     MarketType: spot
initialization_commands:
    - docker login -u user -p password
setup_commands: []
# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
    - ray stop
    - ulimit -n 65536; ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml
# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
    - ray stop
    - ulimit -n 65536; ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076
The script:
import ray

numCPUscluster = 288
# Create one actor per CPU (minus one); each actor runs the mutation calculations.
actorList = [mutator.remote(listPosPackable, pdb3) for i in range(numCPUscluster - 1)]
# Build an actor pool and distribute all calculations evenly; output collects the results.
pool = ray.util.ActorPool(actorList)
output = list(pool.map_unordered(lambda a, v: a.performMutationBinding.remote(v), listofmutations))
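For anyone who wants to try the same pattern without my data, here is a minimal, self-contained sketch of the ActorPool usage (the Mutator class, its dummy return value, and the mutation list are placeholders for my actual actor and inputs):

import ray
from ray.util import ActorPool

ray.init(address="auto")  # connect to the running cluster; use ray.init() to test locally

@ray.remote
class Mutator:  # placeholder for my actual mutator actor
    def performMutationBinding(self, mutation):
        return mutation  # placeholder for the real binding calculation

mutations = ["m1", "m2", "m3"]  # placeholder work items
actors = [Mutator.remote() for _ in range(4)]  # small pool just for the sketch
pool = ActorPool(actors)
results = list(pool.map_unordered(
    lambda actor, mutation: actor.performMutationBinding.remote(mutation),
    mutations,
))
print(results)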
When I run the actual script on the cluster, I get the following output:
ffffffffffffc2da15158350db76dee656d301000000 cannot be scheduled right now. It requires {CPU: 1.000000} for placement, but this node only has remaining {16.000000/16.000000 CPU, 21.710038 GiB/21.710038 GiB memory, 9.304302 GiB/9.304302 GiB object_store_memory, 1.000000/1.000000 node:172.31.23.131}. In total there are 0 pending tasks and 20 pending actors on this node. This is likely due to all cluster resources being claimed by actors. To resolve the issue, consider creating fewer actors or increase the resources available to this Ray cluster. You can ignore this message if this Ray cluster is expected to auto-scale.
Ray monitor tells me this:
2021-11-27 10:58:23,480 INFO autoscaler.py:309 --
======== Autoscaler status: 2021-11-27 10:58:23.480464 ========
Node status
---------------------------------------------------------------
Healthy:
17 ray-legacy-worker-node-type
1 ray-legacy-head-node-type
Pending:
(no pending nodes)
Recent failures:
(no failures)
Resources
---------------------------------------------------------------
Usage:
240.0/288.0 CPU
0.00/387.679 GiB memory
0.16/167.477 GiB object_store_memory
Demands:
{'CPU': 1.0}: 47+ pending tasks/actors
So the resources are definitely there; for some reason Ray is not able to allocate the actors to the free CPUs. The script runs for ages and never finishes. The strange thing is that three months ago everything worked like a charm.
I am on Ray 1.3.0.
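If it helps with debugging, the same resource view can also be queried directly from the driver (just a small sanity-check snippet, not part of my script):

import ray

ray.init(address="auto")  # attach to the existing cluster
# Total resources registered with the cluster vs. what is currently reported as free.
print("total:    ", ray.cluster_resources())
print("available:", ray.available_resources())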
Debug outputs:
)
@ 0x55a10924da5a _PyCFunction_FastCallDict
@ 0x55a1092d5a5c call_function
@ 0x55a1092f825a _PyEval_EvalFrameDefault
@ 0x55a1092cefd4 _PyEval_EvalCodeWithName
@ 0x55a1092cfe51 fast_function
@ 0x55a1092d5b35 call_function
@ 0x55a1092f825a _PyEval_EvalFrameDefault
@ 0x55a1092cfc1b fast_function
@ 0x55a1092d5b35 call_function
@ 0x55a1092f825a _PyEval_EvalFrameDefault
@ 0x55a1092cfc1b fast_function
@ 0x55a1092d5b35 call_function
@ 0x55a1092f825a _PyEval_EvalFrameDefault
@ 0x55a1092cefd4 _PyEval_EvalCodeWithName
@ 0x55a1092cfe51 fast_function
@ 0x55a1092d5b35 call_function
@ 0x55a1092f9019 _PyEval_EvalFrameDefault
@ 0x55a1092cefd4 _PyEval_EvalCodeWithName
@ 0x55a1092cfe51 fast_function
@ 0x55a1092d5b35 call_function
Aborted (core dumped)
Any ideas as to what is going wrong are very welcome, as I cannot make sense of this. Thanks!