Hello Ray veterans!
I am running a script that parallelizes my tasks with Ray Actors, with each node (16 vCPUs) running exactly one Actor. The cluster is set up on AWS. The script runs fine with this YAML file:
cluster_name: onceagain
min_workers: 17
max_workers: 17
upscaling_speed: 1.0
idle_timeout_minutes: 5

docker:
    image: my_own_image_based_on_rayp36_gpu
    container_name: ray_container
    pull_before_run: True

provider:
    type: aws
    region: eu-central-1
    availability_zone: eu-central-1a,eu-central-1b,eu-central-1c
    cache_stopped_nodes: False

auth:
    ssh_user: ec2-user
    ssh_private_key: key.pem

head_node:
    InstanceType: c5a.4xlarge
    ImageId: ami-097d024805419a86e
    KeyName: key

worker_nodes:
    InstanceType: c5a.4xlarge
    ImageId: ami-097d024805419a86e
    KeyName: key

initialization_commands:
    - docker login my_docker_account

setup_commands: []

# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
    - ray stop
    - ulimit -n 65536; ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml

# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
    - ray stop
    - ulimit -n 65536; ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076
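Each Actor is meant to occupy a whole 16-vCPU node. For illustration, a minimal sketch of that one-Actor-per-node pattern (simplified, not my exact class; the constructor arguments and the docking body are placeholders):

import ray

# Sketch: reserving all 16 vCPUs makes Ray place exactly one of these Actors
# on each c5a.4xlarge worker node.
@ray.remote(num_cpus=16)
class mutator:
    def __init__(self, pdb, cluster_rmsd, num_runs):
        self.pdb = pdb
        self.cluster_rmsd = cluster_rmsd
        self.num_runs = num_runs

    def performMutationDocking(self, mutation):
        # run the docking workflow for a single mutation string and return its result
        ...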
However, when I scale the cluster up to 38 worker nodes using spot instances, the run crashes. This is the YAML file for the crashing cluster:
cluster_name: onceagain
min_workers: 38
max_workers: 38
upscaling_speed: 1.0
idle_timeout_minutes: 5

docker:
    image: my_own_image_based_on_rayp36_gpu
    container_name: ray_container
    pull_before_run: True

provider:
    type: aws
    region: eu-central-1
    availability_zone: eu-central-1a,eu-central-1b,eu-central-1c
    cache_stopped_nodes: False

auth:
    ssh_user: ec2-user
    ssh_private_key: key.pem

head_node:
    InstanceType: c5a.4xlarge
    ImageId: ami-097d024805419a86e
    KeyName: key

worker_nodes:
    InstanceType: c5a.4xlarge
    ImageId: ami-097d024805419a86e
    KeyName: key
    InstanceMarketOptions:
        MarketType: spot

initialization_commands:
    - docker login my_docker_account

setup_commands: []

# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
    - ray stop
    - ulimit -n 65536; ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml

# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
    - ray stop
    - ulimit -n 65536; ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076
I can start the cluster just fine, and the autoscaler scales it to the required 38 nodes. However, when I start my script, it crashes. My script:
import ray

ray.init(address="auto")  # connect the driver to the running cluster

# One mutator Actor per node, all handled through a single pool
actorList = [mutator.remote(pdb3, ClusterRMSD, numRuns) for _ in range(numNodes)]
pool = ray.util.ActorPool(actorList)
output = list(pool.map_unordered(lambda a, v: a.performMutationDocking.remote(v), listofmutations))
listofmutations is a large list of strings with 104,000 entries, each of which is processed by performMutationDocking(). The total size of the list is around 2.5 MB.
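For reference, my mental model of what pool.map_unordered is doing with these 104,000 strings (a simplified sketch, not Ray's actual implementation): each idle Actor receives one mutation string at a time, and results are yielded in completion order.

import ray

def map_unordered_sketch(actors, items):
    # Simplified sketch of ActorPool.map_unordered behaviour (not Ray's real code):
    # prime one task per Actor, then hand each Actor a new item as soon as its
    # previous task finishes, yielding results in completion order.
    items = iter(items)
    pending = {}  # ObjectRef -> actor handle
    for actor in actors:
        try:
            pending[actor.performMutationDocking.remote(next(items))] = actor
        except StopIteration:
            break
    while pending:
        [done], _ = ray.wait(list(pending), num_returns=1)
        actor = pending.pop(done)
        yield ray.get(done)
        try:
            pending[actor.performMutationDocking.remote(next(items))] = actor
        except StopIteration:
            pass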
The debug output when the cluster crashes:
[2021-05-28 05:56:19,397 C 5389 5389] service_based_gcs_client.cc:228: Couldn't reconnect to GCS server. The last attempted GCS server address was 172.31.23.79:42551
*** StackTrace Information ***
@ 0x7f1a6fcd4925 google::GetStackTraceToString()
@ 0x7f1a6fca366e ray::GetCallTrace()
@ 0x7f1a6fcc86fc ray::SpdLogMessage::Flush()
@ 0x7f1a6fcc882d ray::RayLog::~RayLog()
@ 0x7f1a6f95d31f ray::gcs::ServiceBasedGcsClient::ReconnectGcsServer()
@ 0x7f1a6f95d435 ray::gcs::ServiceBasedGcsClient::GcsServiceFailureDetected()
@ 0x7f1a6f95d5ab ray::gcs::ServiceBasedGcsClient::PeriodicallyCheckGcsServerAddress()
@ 0x7f1a6fc80e04 ray::PeriodicalRunner::DoRunFnPeriodically()
@ 0x7f1a6fc817cf ray::PeriodicalRunner::RunFnPeriodically()
@ 0x7f1a6f95ed6e ray::gcs::ServiceBasedGcsClient::Connect()
@ 0x7f1a6f827f4d ray::gcs::GlobalStateAccessor::Connect()
@ 0x7f1a6f7614ab __pyx_pw_3ray_7_raylet_19GlobalStateAccessor_3connect()
@ 0x556cb7775a5a _PyCFunction_FastCallDict
@ 0x556cb77fda5c call_function
@ 0x556cb782025a _PyEval_EvalFrameDefault
@ 0x556cb77f6fd4 _PyEval_EvalCodeWithName
@ 0x556cb77f7e51 fast_function
@ 0x556cb77fdb35 call_function
@ 0x556cb782025a _PyEval_EvalFrameDefault
@ 0x556cb77f7c1b fast_function
@ 0x556cb77fdb35 call_function
@ 0x556cb782025a _PyEval_EvalFrameDefault
@ 0x556cb77f7c1b fast_function
@ 0x556cb77fdb35 call_function
@ 0x556cb782025a _PyEval_EvalFrameDefault
@ 0x556cb77f6fd4 _PyEval_EvalCodeWithName
@ 0x556cb77f7e51 fast_function
@ 0x556cb77fdb35 call_function
@ 0x556cb7821019 _PyEval_EvalFrameDefault
@ 0x556cb77f6fd4 _PyEval_EvalCodeWithName
@ 0x556cb77f7e51 fast_function
@ 0x556cb77fdb35 call_function
Aborted
Autoscaler status seems fine:
======== Autoscaler status: 2021-05-28 05:57:35.629152 ========
Node status
---------------------------------------------------------------
Healthy:
1 ray-legacy-head-node-type
38 ray-legacy-worker-node-type
Pending:
(no pending nodes)
Recent failures:
(no failures)
Resources
---------------------------------------------------------------
Usage:
0.0/624.0 CPU
0.00/843.590 GiB memory
0.00/362.868 GiB object_store_memory
Demands:
(no resource demands)
The cluster's Docker image is based on rayproject/ray:latest-py36-gpu and runs Ray 1.3.0.
Any hints on how I can get my script running on a large spot instance cluster are appreciated! Thank you!