1. Severity of the issue: (select one)
Medium: Significantly affects my productivity but can find a workaround.
2. Environment:
- Ray version: 2.53.0
- Python version: 3.13.11
- OS: Ubuntu
- Cloud/Infrastructure: GCP (google cloud)
- Other libs/tools (if relevant):
3. What happened vs. what you expected:
- Expected: When I submit jobs to the cluster, all jobs run to completion.
- Actual: Jobs will randomly terminate. I can eventually run them by repeatedly resubmitting the job.
I have a config.yaml, which I will post below. I use "ray up config.yaml" to set up my cluster on GCP. The head node boots successfully. I then submit each job with: ray job submit --address="$RAY_ADDRESS" --no-wait --entrypoint-num-gpus=1 -- bash -c "$CMD". I submit 11 jobs in total.
A random subset of the jobs will start to run; the rest will fail. Below I show the error logs, taken from the head node under /tmp/ray/session_latest/logs/monitor.log. When I resubmit the failed jobs, some will run while others fail again. Eventually I can get all jobs to run, but this defeats the purpose of using Ray. The error messages are very cryptic. Perhaps a developer recognizes some of these and can point me in the right direction?
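In case it helps reproduce, the submission loop looks roughly like this; RAY_ADDRESS and the per-job command are placeholders here, not my exact values:

# Rough sketch of how the 11 jobs are submitted (placeholder values).
export RAY_ADDRESS="http://<head-node-ip>:8265"   # placeholder: head node's job submission address
for i in $(seq 0 10); do
  CMD="python train.py --shard $i"                # placeholder for the real per-job command
  ray job submit --address="$RAY_ADDRESS" --no-wait --entrypoint-num-gpus=1 -- bash -c "$CMD"
done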
Config.yaml
# A unique identifier for the head node and workers of this cluster.
cluster_name: l4-cluster
# The maximum number of worker nodes to launch in addition to the head
# node.
max_workers: 48
# The autoscaler will scale up the cluster faster with higher upscaling speed.
# E.g., if the task requires adding more nodes, the autoscaler will gradually
# scale up the cluster in chunks of upscaling_speed*currently_running_nodes.
# This number should be > 0.
upscaling_speed: 1.0
# This executes all commands on all nodes in the docker container,
# and opens all the necessary ports to support the Ray cluster.
# An empty config disables Docker.
docker: {}
# If a node is idle for this many minutes, it will be removed.
idle_timeout_minutes: 10
# Cloud-provider specific configuration.
provider:
type: gcp
region: us-central1
availability_zone: us-central1-a
project_id: seqnn-170614 # Globally unique project id
cache_stopped_nodes: True
# How Ray will authenticate with newly launched nodes.
auth:
ssh_user: davidwang
# By default Ray creates a new private keypair, but you can also use your own.
# If you do so, make sure to also set "KeyName" in the head and worker node
# configurations below. This requires that you have added the key into the
# project wide meta-data.
# ssh_private_key: /path/to/your/key.pem
# Tell the autoscaler the allowed node types and the resources they provide.
# The key is the name of the node type, which is just for debugging purposes.
# The node config specifies the launch config and physical instance type.
available_node_types:
ray_head_default:
# The resources provided by this node type.
resources: {"CPU": 2}
# Provider-specific config for the head node, e.g. instance type. By default
# Ray will auto-configure unspecified fields such as subnets and ssh-keys.
# For more documentation on available fields, see:
# https://cloud.google.com/compute/docs/reference/rest/v1/instances/insert
node_config:
machineType: n1-standard-2
disks:
- boot: true
autoDelete: true
type: PERSISTENT
initializeParams:
diskSizeGb: 100
# See https://cloud.google.com/compute/docs/images for more images
sourceImage: projects/seqnn-170614/global/images/ray-dldsc-py313-v1
#sourceImage: projects/deeplearning-platform-release/global/images/family/pytorch-2-7-cu128-ubuntu-2204-nvidia-570
#sourceImage: projects/deeplearning-platform-release/global/images/common-cpu-v20240922
# Additional options can be found in the compute docs at
# https://cloud.google.com/compute/docs/reference/rest/v1/instances/insert
# If the network interface is specified as below in both head and worker
# nodes, the manual network config is used. Otherwise an existing subnet is
# used. To use a shared subnet, ask the subnet owner to grant permission
# for 'compute.subnetworks.use' to the ray autoscaler account...
# networkInterfaces:
# - kind: compute#networkInterface
# subnetwork: path/to/subnet
# aliasIpRanges: []
ray_worker_small:
# The minimum number of worker nodes of this type to launch.
# This number should be >= 0.
min_workers: 0
# The maximum number of worker nodes of this type to launch.
# This takes precedence over min_workers.
max_workers: 48
# The resources provided by this node type.
resources: {"CPU": 32, "GPU": 1}
# Provider-specific config for worker nodes of this type, e.g. instance type. By default
# Ray will auto-configure unspecified fields such as subnets and ssh-keys.
# For more documentation on available fields, see:
# https://cloud.google.com/compute/docs/reference/rest/v1/instances/insert
node_config:
machineType: g2-standard-32
scheduling:
provisioningModel: STANDARD
onHostMaintenance: TERMINATE
disks:
- boot: true
autoDelete: true
type: PERSISTENT
initializeParams:
diskSizeGb: 100 # Node will not boot with under 100
# See https://cloud.google.com/compute/docs/images for more images
sourceImage: projects/seqnn-170614/global/images/ray-dldsc-py313-v1
#sourceImage: projects/deeplearning-platform-release/global/images/family/pytorch-2-7-cu128-ubuntu-2204-nvidia-570
#sourceImage: projects/deeplearning-platform-release/global/images/common-cpu-v20240922
- boot: false
source: projects/seqnn-170614/zones/us-central1-a/disks/dldsc-read-only
autoDelete: false # IMPORTANT: Don't delete my data when the worker dies!
mode: READ_ONLY # IMPORTANT: Allows multiple VMs to attach at once
deviceName: dldsc_disk
#- boot: false
# autoDelete: true
# type: SCRATCH
# interface: NVME
# initializeParams:
# diskType: projects/seqnn-170614/zones/us-central1-a/diskTypes/local-ssd
# Run workers on preemptible instances by default.
# Comment this out to use on-demand.
#scheduling:
# - preemptible: true
# Launch workers with the Service Account of the Head Node.
serviceAccounts:
- email: 1090276179925-compute@developer.gserviceaccount.com
scopes:
- https://www.googleapis.com/auth/cloud-platform
# Additional options can be found in the compute docs at
# https://cloud.google.com/compute/docs/reference/rest/v1/instances/insert
# Specify the node type of the head node (as configured above).
head_node_type: ray_head_default
# Files or directories to copy to the head and worker nodes. The format is a
# dictionary from REMOTE_PATH: LOCAL_PATH, e.g.
file_mounts: {
"/home/davidwang/dldsc": "/home/davidwang/code_repos/dldsc",
}
# Files or directories to copy from the head node to the worker nodes. The format is a
# list of paths. The same path on the head node will be copied to the worker node.
# This behavior is a subset of the file_mounts behavior. In the vast majority of cases
# you should just use file_mounts. Only use this if you know what you're doing!
cluster_synced_files: []
# Whether changes to directories in file_mounts or cluster_synced_files in the head node
# should sync to the worker node continuously
file_mounts_sync_continuously: False
# Patterns for files to exclude when running rsync up or rsync down
rsync_exclude:
- "**/.git"
- "**/.git/**"
# Pattern files to use for filtering out files when running rsync up or rsync down. The file is searched for
# in the source directory and recursively through all subdirectories. For example, if .gitignore is provided
# as a value, the behavior will match git's behavior for finding and using .gitignore files.
rsync_filter:
- ".gitignore"
# List of commands that will be run before `setup_commands`. If docker is
# enabled, these commands will run outside the container and before docker
# is setup.
initialization_commands: []
# - curl -fsSL https://get.docker.com -o get-docker.sh
# - sudo sh get-docker.sh
# - sudo usermod -aG docker $USER
# - sudo systemctl restart docker -f
# List of shell commands to run to set up nodes.
setup_commands: []
#- pip install -U "ray[default]"
# Note: if you're developing Ray, you probably want to create a Docker image that
# has your Ray repo pre-cloned. Then, you can replace the pip installs
# below with a git checkout <your_sha> (and possibly a recompile).
# To run the nightly version of ray (as opposed to the latest), either use a rayproject docker image
# that has the "nightly" (e.g. "rayproject/ray-ml:nightly-gpu") or uncomment the following line:
# - pip install -U "ray[default] @ https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-3.0.0.dev0-cp37-cp37m-manylinux2014_x86_64.whl"
# Custom commands that will be run on the head node after common setup.
head_setup_commands:
- pip install google-api-python-client==1.7.8
# Custom commands that will be run on worker nodes after common setup.
worker_setup_commands:
#- sudo mkdir /mnt/local_ssd
#- sudo mkfs.ext4 -F /dev/nvme0n1
#- sudo mount /dev/nvme0n1 /mnt/local_ssd
#- sudo chmod 777 /mnt/local_ssd
- sudo mkdir -p /mnt/data
- sudo mount -o ro,noload /dev/disk/by-id/google-dldsc_disk /mnt/data
# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
- ray stop
- >-
ulimit -n 65536;
ray start
--head
--port=6379
--object-manager-port=8076
--autoscaling-config=~/ray_bootstrap_config.yaml
# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
- ray stop
- >-
ulimit -n 65536;
ray start
--address=$RAY_HEAD_IP:6379
--object-manager-port=8076
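For completeness, this is roughly how I bring the cluster up and keep an eye on it from my laptop; the ray monitor / ray exec checks are just what I use to watch the autoscaler while jobs are queued:

# Launch (or update) the cluster from the config above.
ray up config.yaml -y
# Tail the autoscaler output from my laptop (same content as monitor.log on the head node).
ray monitor config.yaml
# Ask the head node what resources the cluster currently sees.
ray exec config.yaml 'ray status'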
monitor.log, filtered to the lines for a single worker (ray-l4-cluster-worker-1c19f7f0):
(base) davidwang@ray-l4-cluster-head-81b470aa-compute:~$ cat /tmp/ray/session_latest/logs/monitor.log | grep "ray-l4-cluster-worker-1c19f7f0"
2026-02-07 03:00:36,684 INFO instance_manager.py:263 -- Update instance REQUESTED->ALLOCATED (id=83e8f242-0a06-4479-bcbf-c0d609589b73, type=ray_worker_small, cloud_instance_id=, ray_id=): allocated unassigned cloud instance ray-l4-cluster-worker-1c19f7f0-compute
2026-02-07 03:00:36,725 INFO instance_manager.py:263 -- Update instance ALLOCATED->RAY_INSTALLING (id=83e8f242-0a06-4479-bcbf-c0d609589b73, type=ray_worker_small, cloud_instance_id=ray-l4-cluster-worker-1c19f7f0-compute, ray_id=): installing ray
2026-02-07 03:00:36,727 INFO ray_installer.py:42 -- Creating new (spawn_updater) updater thread for node ray-l4-cluster-worker-1c19f7f0-compute.
2026-02-07 03:00:36,771 INFO discovery.py:873 -- URL being requested: GET https://compute.googleapis.com/compute/v1/projects/seqnn-170614/zones/us-central1-a/instances/ray-l4-cluster-worker-1c19f7f0-compute?alt=json
2026-02-07 03:00:57,346 INFO discovery.py:873 -- URL being requested: POST https://compute.googleapis.com/compute/v1/projects/seqnn-170614/zones/us-central1-a/instances/ray-l4-cluster-worker-1c19f7f0-compute/setLabels?alt=json
2026-02-07 03:02:09,829 INFO discovery.py:873 -- URL being requested: POST https://compute.googleapis.com/compute/v1/projects/seqnn-170614/zones/us-central1-a/instances/ray-l4-cluster-worker-1c19f7f0-compute/setLabels?alt=json
2026-02-07 03:03:41,723 INFO discovery.py:873 -- URL being requested: POST https://compute.googleapis.com/compute/v1/projects/seqnn-170614/zones/us-central1-a/instances/ray-l4-cluster-worker-1c19f7f0-compute/setLabels?alt=json
2026-02-07 03:05:26,336 INFO discovery.py:873 -- URL being requested: POST https://compute.googleapis.com/compute/v1/projects/seqnn-170614/zones/us-central1-a/instances/ray-l4-cluster-worker-1c19f7f0-compute/setLabels?alt=json
2026-02-07 03:05:32,007 INFO threaded_ray_installer.py:81 -- Ray installation failed on instance ray-l4-cluster-worker-1c19f7f0-compute: Ray installation failed with unexpected status: setting-up
2026-02-07 03:05:32,020 INFO instance_manager.py:263 -- Update instance RAY_INSTALLING->RAY_RUNNING (id=83e8f242-0a06-4479-bcbf-c0d609589b73, type=ray_worker_small, cloud_instance_id=ray-l4-cluster-worker-1c19f7f0-compute, ray_id=): ray node f4a319781d2a79da0195f889a138e02c43a4e038b671ed752f850af6 is RUNNING
2026-02-07 03:05:42,008 INFO ray_installer.py:42 -- Creating new (spawn_updater) updater thread for node ray-l4-cluster-worker-1c19f7f0-compute.
2026-02-07 03:05:42,026 INFO discovery.py:873 -- URL being requested: GET https://compute.googleapis.com/compute/v1/projects/seqnn-170614/zones/us-central1-a/instances/ray-l4-cluster-worker-1c19f7f0-compute?alt=json
2026-02-07 03:07:17,116 INFO discovery.py:873 -- URL being requested: POST https://compute.googleapis.com/compute/v1/projects/seqnn-170614/zones/us-central1-a/instances/ray-l4-cluster-worker-1c19f7f0-compute/setLabels?alt=json
2026-02-07 03:07:46,233 INFO discovery.py:873 -- URL being requested: POST https://compute.googleapis.com/compute/v1/projects/seqnn-170614/zones/us-central1-a/instances/ray-l4-cluster-worker-1c19f7f0-compute/setLabels?alt=json
2026-02-07 03:07:51,820 INFO threaded_ray_installer.py:81 -- Ray installation failed on instance ray-l4-cluster-worker-1c19f7f0-compute: Ray installation failed with unexpected status: waiting-for-ssh
2026-02-07 03:07:51,822 INFO instance_manager.py:263 -- Update instance RAY_RUNNING->RAY_STOPPED (id=83e8f242-0a06-4479-bcbf-c0d609589b73, type=ray_worker_small, cloud_instance_id=ray-l4-cluster-worker-1c19f7f0-compute, ray_id=f4a319781d2a79da0195f889a138e02c43a4e038b671ed752f850af6): ray node f4a319781d2a79da0195f889a138e02c43a4e038b671ed752f850af6 is DEAD
2026-02-07 03:07:51,832 INFO instance_manager.py:263 -- Update instance RAY_STOPPED->TERMINATING (id=83e8f242-0a06-4479-bcbf-c0d609589b73, type=ray_worker_small, cloud_instance_id=ray-l4-cluster-worker-1c19f7f0-compute, ray_id=f4a319781d2a79da0195f889a138e02c43a4e038b671ed752f850af6): terminating instance from RAY_STOPPED
2026-02-07 03:07:51,834 INFO node_provider.py:184 -- NodeProvider: ray-l4-cluster-worker-1c19f7f0-compute: Terminating node
2026-02-07 03:07:51,850 INFO discovery.py:873 -- URL being requested: POST https://compute.googleapis.com/compute/v1/projects/seqnn-170614/zones/us-central1-a/instances/ray-l4-cluster-worker-1c19f7f0-compute/stop?alt=json
2026-02-07 03:08:11,824 INFO ray_installer.py:42 -- Creating new (spawn_updater) updater thread for node ray-l4-cluster-worker-1c19f7f0-compute.
2026-02-07 03:08:11,838 INFO discovery.py:873 -- URL being requested: GET https://compute.googleapis.com/compute/v1/projects/seqnn-170614/zones/us-central1-a/instances/ray-l4-cluster-worker-1c19f7f0-compute?alt=json
2026-02-07 03:09:00,506 INFO discovery.py:873 -- URL being requested: GET https://compute.googleapis.com/compute/v1/projects/seqnn-170614/zones/us-central1-a/instances/ray-l4-cluster-worker-1c19f7f0-compute?alt=json
2026-02-07 03:09:00,784 INFO discovery.py:873 -- URL being requested: POST https://compute.googleapis.com/compute/v1/projects/seqnn-170614/zones/us-central1-a/instances/ray-l4-cluster-worker-1c19f7f0-compute/setLabels?alt=json
2026-02-07 03:09:01,413 INFO discovery.py:873 -- URL being requested: GET https://compute.googleapis.com/compute/v1/projects/seqnn-170614/zones/us-central1-a/instances/ray-l4-cluster-worker-1c19f7f0-compute?alt=json
2026-02-07 03:09:01,570 INFO instance_manager.py:263 -- Update instance TERMINATING->TERMINATED (id=83e8f242-0a06-4479-bcbf-c0d609589b73, type=ray_worker_small, cloud_instance_id=ray-l4-cluster-worker-1c19f7f0-compute, ray_id=f4a319781d2a79da0195f889a138e02c43a4e038b671ed752f850af6): cloud instance ray-l4-cluster-worker-1c19f7f0-compute no longer found
2026-02-07 03:09:01,571 INFO instance_manager.py:263 -- Update instance ALLOCATED->RAY_RUNNING (id=1725d5bf-95ff-4770-8326-60e5b29baa72, type=ray_worker_small, cloud_instance_id=ray-l4-cluster-worker-1c19f7f0-compute, ray_id=7edd216a9666384e1daea9bbc1a3ed64914941c12b54d2d084b962ed): ray node 7edd216a9666384e1daea9bbc1a3ed64914941c12b54d2d084b962ed is IDLE
2026-02-07 03:09:01,768 INFO discovery.py:873 -- URL being requested: GET https://compute.googleapis.com/compute/v1/projects/seqnn-170614/zones/us-central1-a/instances/ray-l4-cluster-worker-1c19f7f0-compute?alt=json
2026-02-07 03:09:01,936 INFO discovery.py:873 -- URL being requested: POST https://compute.googleapis.com/compute/v1/projects/seqnn-170614/zones/us-central1-a/instances/ray-l4-cluster-worker-1c19f7f0-compute/setLabels?alt=json
2026-02-07 03:09:02,420 INFO threaded_ray_installer.py:81 -- Ray installation failed on instance ray-l4-cluster-worker-1c19f7f0-compute: wait_ready aborting because node detected as terminated.
2026-02-07 03:10:00,055 INFO instance_manager.py:263 -- Update instance RAY_RUNNING->TERMINATED (id=1725d5bf-95ff-4770-8326-60e5b29baa72, type=ray_worker_small, cloud_instance_id=ray-l4-cluster-worker-1c19f7f0-compute, ray_id=7edd216a9666384e1daea9bbc1a3ed64914941c12b54d2d084b962ed): cloud instance ray-l4-cluster-worker-1c19f7f0-compute no longer found
2026-02-07 03:18:20,680 INFO discovery.py:873 -- URL being requested: POST https://compute.googleapis.com/compute/v1/projects/seqnn-170614/zones/us-central1-a/instances/ray-l4-cluster-worker-1c19f7f0-compute/start?alt=json
2026-02-07 03:18:53,025 INFO instance_manager.py:263 -- Update instance REQUESTED->ALLOCATED (id=48278b4d-667d-4ab4-ae70-9192e7e79da6, type=ray_worker_small, cloud_instance_id=, ray_id=): allocated unassigned cloud instance ray-l4-cluster-worker-1c19f7f0-compute
2026-02-07 03:18:53,077 INFO instance_manager.py:263 -- Update instance ALLOCATED->RAY_INSTALLING (id=48278b4d-667d-4ab4-ae70-9192e7e79da6, type=ray_worker_small, cloud_instance_id=ray-l4-cluster-worker-1c19f7f0-compute, ray_id=): installing ray
2026-02-07 03:18:53,080 INFO ray_installer.py:42 -- Creating new (spawn_updater) updater thread for node ray-l4-cluster-worker-1c19f7f0-compute.
2026-02-07 03:18:53,212 INFO discovery.py:873 -- URL being requested: GET https://compute.googleapis.com/compute/v1/projects/seqnn-170614/zones/us-central1-a/instances/ray-l4-cluster-worker-1c19f7f0-compute?alt=json
2026-02-07 03:19:32,332 INFO discovery.py:873 -- URL being requested: POST https://compute.googleapis.com/compute/v1/projects/seqnn-170614/zones/us-central1-a/instances/ray-l4-cluster-worker-1c19f7f0-compute/setLabels?alt=json
2026-02-07 03:20:22,047 INFO discovery.py:873 -- URL being requested: POST https://compute.googleapis.com/compute/v1/projects/seqnn-170614/zones/us-central1-a/instances/ray-l4-cluster-worker-1c19f7f0-compute/setLabels?alt=json
2026-02-07 03:20:41,642 INFO instance_manager.py:263 -- Update instance RAY_INSTALLING->RAY_RUNNING (id=48278b4d-667d-4ab4-ae70-9192e7e79da6, type=ray_worker_small, cloud_instance_id=ray-l4-cluster-worker-1c19f7f0-compute, ray_id=): ray node 3fde162a4386aa16a5746c1bf0318108b366e060371c5ba1606d79e0 is RUNNING
2026-02-07 03:32:07,731 INFO instance_manager.py:263 -- Update instance RAY_RUNNING->RAY_STOP_REQUESTED (id=48278b4d-667d-4ab4-ae70-9192e7e79da6, type=ray_worker_small, cloud_instance_id=ray-l4-cluster-worker-1c19f7f0-compute, ray_id=3fde162a4386aa16a5746c1bf0318108b366e060371c5ba1606d79e0): draining ray: idle for 604.971 secs > timeout=600 secs
2026-02-07 03:32:13,126 INFO instance_manager.py:263 -- Update instance RAY_STOP_REQUESTED->RAY_STOPPED (id=48278b4d-667d-4ab4-ae70-9192e7e79da6, type=ray_worker_small, cloud_instance_id=ray-l4-cluster-worker-1c19f7f0-compute, ray_id=3fde162a4386aa16a5746c1bf0318108b366e060371c5ba1606d79e0): ray node 3fde162a4386aa16a5746c1bf0318108b366e060371c5ba1606d79e0 is DEAD
2026-02-07 03:32:13,144 INFO instance_manager.py:263 -- Update instance RAY_STOPPED->TERMINATING (id=48278b4d-667d-4ab4-ae70-9192e7e79da6, type=ray_worker_small, cloud_instance_id=ray-l4-cluster-worker-1c19f7f0-compute, ray_id=3fde162a4386aa16a5746c1bf0318108b366e060371c5ba1606d79e0): terminating instance from RAY_STOPPED
2026-02-07 03:33:15,528 INFO node_provider.py:184 -- NodeProvider: ray-l4-cluster-worker-1c19f7f0-compute: Terminating node
2026-02-07 03:33:15,818 INFO discovery.py:873 -- URL being requested: POST https://compute.googleapis.com/compute/v1/projects/seqnn-170614/zones/us-central1-a/instances/ray-l4-cluster-worker-1c19f7f0-compute/stop?alt=json
2026-02-07 03:35:05,516 INFO instance_manager.py:263 -- Update instance TERMINATING->TERMINATED (id=48278b4d-667d-4ab4-ae70-9192e7e79da6, type=ray_worker_small, cloud_instance_id=ray-l4-cluster-worker-1c19f7f0-compute, ray_id=3fde162a4386aa16a5746c1bf0318108b366e060371c5ba1606d79e0): cloud instance ray-l4-cluster-worker-1c19f7f0-compute no longer found
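For what it's worth, my current workaround is essentially the loop below: poll each job and resubmit anything that failed. This is only a rough sketch (it assumes each job was originally submitted with --submission-id job-$i, and the per-job command is a placeholder), but it is what I mean by "repeatedly resubmitting":

# Rough sketch of the resubmission workaround (placeholder ids/commands).
for i in $(seq 0 10); do
  id="job-$i"
  CMD="python train.py --shard $i"   # placeholder for the real per-job command
  if ray job status "$id" 2>/dev/null | grep -q "FAILED"; then
    ray job submit --address="$RAY_ADDRESS" --submission-id "${id}-retry" \
      --no-wait --entrypoint-num-gpus=1 -- bash -c "$CMD"
  fi
done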