Cluster crashes when using spot instances

Hello Ray veterans!

I am running a script that parallelizes my tasks using Ray Actors, with each node (16 vCPUs) running exactly one Actor. I have set up my cluster to run on AWS. The script runs fine with this YAML file:

cluster_name: onceagain

min_workers: 17
max_workers: 17
upscaling_speed: 1.0

idle_timeout_minutes: 5

docker:
    image: my_own_image_based_on_rayp36_gpu
    container_name: ray_container
    pull_before_run: True

provider:
    type: aws
    region: eu-central-1
    availability_zone: eu-central-1a,eu-central-1b,eu-central-1c
    cache_stopped_nodes: False

auth:
    ssh_user: ec2-user
    ssh_private_key: key.pem

head_node:
    InstanceType: c5a.4xlarge
    ImageId: ami-097d024805419a86e
    KeyName: key

worker_nodes:
    InstanceType: c5a.4xlarge
    ImageId: ami-097d024805419a86e
    KeyName: key
    
initialization_commands:
    - docker login my_docker_account

setup_commands: []


# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
    - ray stop
    - ulimit -n 65536; ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml

# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
    - ray stop
    - ulimit -n 65536; ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076
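
Once the cluster is up, I sanity-check its size from the head node with a quick snippet like this (not part of the docking script itself; the expected numbers below are for the 17-worker setup):

import ray

ray.init(address="auto")  # connect to the running cluster from the head node

# 17 workers plus the head node, all c5a.4xlarge (16 vCPUs each) -> 288 CPUs
print(ray.cluster_resources().get("CPU"))
print(len([n for n in ray.nodes() if n["Alive"]]))  # should print 18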

However, when I scale the cluster up to 38 worker nodes using spot instances, it crashes. The only changes are the worker count and the InstanceMarketOptions block under worker_nodes. This is the YAML file for the crashing cluster:

cluster_name: onceagain

min_workers: 38
max_workers: 38
upscaling_speed: 1.0

idle_timeout_minutes: 5

docker:
    image: my_own_image_based_on_rayp36_gpu
    container_name: ray_container
    pull_before_run: True

provider:
    type: aws
    region: eu-central-1
    availability_zone: eu-central-1a,eu-central-1b,eu-central-1c
    cache_stopped_nodes: False

auth:
    ssh_user: ec2-user
    ssh_private_key: key.pem

head_node:
    InstanceType: c5a.4xlarge
    ImageId: ami-097d024805419a86e
    KeyName: key

worker_nodes:
    InstanceType: c5a.4xlarge
    ImageId: ami-097d024805419a86e
    InstanceMarketOptions:
        MarketType: spot
    KeyName: key
    
initialization_commands:
    - docker login my_docker_account

setup_commands: []


# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
    - ray stop
    - ulimit -n 65536; ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml

# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
    - ray stop
    - ulimit -n 65536; ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076

I can start the cluster just fine, and the autoscaler scales it up to the required 38 workers. However, when I start my script, it crashes. My script:

import ray
from ray.util import ActorPool

# One mutator actor per node in the cluster
actorList = [mutator.remote(pdb3, ClusterRMSD, numRuns) for i in range(numNodes)]
pool = ActorPool(actorList)

# Dispatch every mutation to the pool and collect the results as they finish
output = list(pool.map_unordered(lambda a, v: a.performMutationDocking.remote(v), listofmutations))

listofmutations is a large list of strings with 104,000 entries, each of which is processed by performMutationDocking.remote(). The total size of the list is around 2.5 MB.
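
For reference, the mutator actor itself looks roughly like this (a simplified sketch; the actual docking code inside performMutationDocking is omitted, and reserving num_cpus=16 is how I get exactly one Actor per 16-vCPU node):

import ray

@ray.remote(num_cpus=16)  # reserve a full node per actor
class mutator:
    def __init__(self, pdb, cluster_rmsd, num_runs):
        self.pdb = pdb
        self.cluster_rmsd = cluster_rmsd
        self.num_runs = num_runs

    def performMutationDocking(self, mutation):
        # ... run the docking workflow for a single mutation string ...
        return mutation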

The debug output when the cluster crashes:

[2021-05-28 05:56:19,397 C 5389 5389] service_based_gcs_client.cc:228: Couldn't reconnect to GCS server. The last attempted GCS server address was 172.31.23.79:42551
*** StackTrace Information ***
    @     0x7f1a6fcd4925  google::GetStackTraceToString()
    @     0x7f1a6fca366e  ray::GetCallTrace()
    @     0x7f1a6fcc86fc  ray::SpdLogMessage::Flush()
    @     0x7f1a6fcc882d  ray::RayLog::~RayLog()
    @     0x7f1a6f95d31f  ray::gcs::ServiceBasedGcsClient::ReconnectGcsServer()
    @     0x7f1a6f95d435  ray::gcs::ServiceBasedGcsClient::GcsServiceFailureDetected()
    @     0x7f1a6f95d5ab  ray::gcs::ServiceBasedGcsClient::PeriodicallyCheckGcsServerAddress()
    @     0x7f1a6fc80e04  ray::PeriodicalRunner::DoRunFnPeriodically()
    @     0x7f1a6fc817cf  ray::PeriodicalRunner::RunFnPeriodically()
    @     0x7f1a6f95ed6e  ray::gcs::ServiceBasedGcsClient::Connect()
    @     0x7f1a6f827f4d  ray::gcs::GlobalStateAccessor::Connect()
    @     0x7f1a6f7614ab  __pyx_pw_3ray_7_raylet_19GlobalStateAccessor_3connect()
    @     0x556cb7775a5a  _PyCFunction_FastCallDict
    @     0x556cb77fda5c  call_function
    @     0x556cb782025a  _PyEval_EvalFrameDefault
    @     0x556cb77f6fd4  _PyEval_EvalCodeWithName
    @     0x556cb77f7e51  fast_function
    @     0x556cb77fdb35  call_function
    @     0x556cb782025a  _PyEval_EvalFrameDefault
    @     0x556cb77f7c1b  fast_function
    @     0x556cb77fdb35  call_function
    @     0x556cb782025a  _PyEval_EvalFrameDefault
    @     0x556cb77f7c1b  fast_function
    @     0x556cb77fdb35  call_function
    @     0x556cb782025a  _PyEval_EvalFrameDefault
    @     0x556cb77f6fd4  _PyEval_EvalCodeWithName
    @     0x556cb77f7e51  fast_function
    @     0x556cb77fdb35  call_function
    @     0x556cb7821019  _PyEval_EvalFrameDefault
    @     0x556cb77f6fd4  _PyEval_EvalCodeWithName
    @     0x556cb77f7e51  fast_function
    @     0x556cb77fdb35  call_function

Aborted

Autoscaler status seems fine:

======== Autoscaler status: 2021-05-28 05:57:35.629152 ========
Node status
---------------------------------------------------------------
Healthy:
 1 ray-legacy-head-node-type
 38 ray-legacy-worker-node-type
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------

Usage:
 0.0/624.0 CPU
 0.00/843.590 GiB memory
 0.00/362.868 GiB object_store_memory

Demands:
 (no resource demands) 

The cluster runs my own image based on rayproject/ray:latest-py36-gpu, with Ray 1.3.0.
Any hints on how I can get my script running on a large spot-instance cluster are appreciated! Thank you!
