Our Ray cluster went into a bad state today. I’m not sure if this is the root cause, but the logs from ray_client_server.err seem relevant.
Could you provide some details on your deployment setup and the Ray versions used?
It was running Ray at this commit: 1eecb7d80b3b24b4e5caa837de2250019b9cc967.
Here is the Ray config:
apiVersion: cluster.ray.io/v1
kind: RayCluster
metadata:
  name: ray-cluster
spec:
  # The maximum number of worker nodes to launch in addition to the head node.
  maxWorkers: 50
  # The autoscaler will scale up the cluster faster with higher upscaling speed.
  # E.g., if the task requires adding more nodes then the autoscaler will gradually
  # scale up the cluster in chunks of upscaling_speed*currently_running_nodes.
  # This number should be > 0.
  upscalingSpeed: 10.0
  # If a node is idle for this many minutes, it will be removed.
  idleTimeoutMinutes: 10
  # Specify the pod type for the ray head node (as configured below).
  headPodType: head-node
  # Optionally, configure ports for the Ray head service.
  # The ports specified below are the defaults.
  headServicePorts:
    - name: client
      port: 10001
      targetPort: 10001
    - name: dashboard
      port: 8265
      targetPort: 8265
    - name: ray-serve
      port: 8000
      targetPort: 8000
    - name: redis-primary
      port: 6379
      targetPort: 6379
  # Specify the allowed pod types for this ray cluster and the resources they provide.
  podTypes:
    - name: head-node
      # Minimum number of Ray workers of this Pod type.
      minWorkers: 0
      # Maximum number of Ray workers of this Pod type. Takes precedence over minWorkers.
      maxWorkers: 0
      # Prevent tasks on the head node. https://docs.ray.io/en/master/cluster/guide.html#configuring-the-head-node
      rayResources: {"CPU": 0}
      podConfig:
        apiVersion: v1
        kind: Pod
        metadata:
          # The operator automatically prepends the cluster name to this field.
          generateName: ray-head-
        spec:
          tolerations:
            - key: imerso-ray-head
              operator: Equal
              value: "true"
              effect: NoSchedule
          restartPolicy: Never
          nodeSelector:
            imerso-ray-head: "true"
          # This volume allocates shared memory for Ray to use for its plasma
          # object store. If you do not provide this, Ray will fall back to
          # /tmp, which causes slowdowns if it is not a shared-memory volume.
          volumes:
            - name: dshm
              emptyDir:
                medium: Memory
            - name: filestore-ray
              persistentVolumeClaim:
                claimName: fileserver-ray-claim
                readOnly: false
          containers:
            - name: ray-node
              imagePullPolicy: Always
              image: eu.gcr.io/imerso-3dscanner-backend/imerso-ray:${VERSION_TAG}
              # Do not change this command - it keeps the pod alive until it is
              # explicitly killed.
              command: ["/bin/bash", "-c", "--"]
              args: ["trap : TERM INT; touch /tmp/raylogs; tail -f /tmp/raylogs; sleep infinity & wait;"]
              ports:
                - containerPort: 6379   # Redis port
                - containerPort: 10001  # Used by Ray Client
                - containerPort: 8265   # Used by Ray Dashboard
                - containerPort: 8000   # Used by Ray Serve
              # This volume allocates shared memory for Ray to use for its plasma
              # object store. If you do not provide this, Ray will fall back to
              # /tmp, which causes slowdowns if it is not a shared-memory volume.
              volumeMounts:
                - mountPath: /dev/shm
                  name: dshm
                - mountPath: /filestore
                  name: filestore-ray
              resources:
                requests:
                  cpu: 1
                  memory: 5Gi
                limits:
                  memory: 5Gi
    - name: worker-node-cpu
      # Minimum number of Ray workers of this Pod type.
      minWorkers: 0
      # Maximum number of Ray workers of this Pod type. Takes precedence over minWorkers.
      maxWorkers: 50
      # User-specified custom resources for use by Ray.
      # (Ray detects CPU and GPU from pod spec resource requests and limits, so no need to fill those here.)
      # rayResources: {"example-resource-a": 1, "example-resource-b": 1}
      podConfig:
        apiVersion: v1
        kind: Pod
        metadata:
          # The operator automatically prepends the cluster name to this field.
          generateName: ray-worker-cpu-
        spec:
          tolerations:
            - key: cloud.google.com/gke-preemptible
              operator: Equal
              value: "true"
              effect: NoSchedule
            - key: imerso-ray-worker
              operator: Equal
              value: "true"
              effect: NoSchedule
          serviceAccountName: ray-prod
          restartPolicy: Never
          volumes:
            - name: dshm
              emptyDir:
                medium: Memory
            - name: filestore-ray
              persistentVolumeClaim:
                claimName: fileserver-ray-claim
                readOnly: false
          containers:
            - name: ray-node
              imagePullPolicy: Always
              image: eu.gcr.io/imerso-3dscanner-backend/imerso-ray:${VERSION_TAG}
              command: ["/bin/bash", "-c", "--"]
              args: ["trap : TERM INT; touch /tmp/raylogs; tail -f /tmp/raylogs; sleep infinity & wait;"]
              # This volume allocates shared memory for Ray to use for its plasma
              # object store. If you do not provide this, Ray will fall back to
              # /tmp, which causes slowdowns if it is not a shared-memory volume.
              volumeMounts:
                - mountPath: /dev/shm
                  name: dshm
                - mountPath: /filestore
                  name: filestore-ray
              resources:
                requests:
                  cpu: 7
                  memory: 26G
                limits:
                  memory: 26G
    - name: worker-node-cpu-highmem-8
      # Minimum number of Ray workers of this Pod type.
      minWorkers: 0
      # Maximum number of Ray workers of this Pod type. Takes precedence over minWorkers.
      maxWorkers: 5
      # User-specified custom resources for use by Ray.
      # (Ray detects CPU and GPU from pod spec resource requests and limits, so no need to fill those here.)
      # rayResources: {"example-resource-a": 1, "example-resource-b": 1}
      podConfig:
        apiVersion: v1
        kind: Pod
        metadata:
          # The operator automatically prepends the cluster name to this field.
          generateName: ray-worker-cpu-highmem-8-
        spec:
          tolerations:
            - key: cloud.google.com/gke-preemptible
              operator: Equal
              value: "true"
              effect: NoSchedule
            - key: imerso-ray-worker
              operator: Equal
              value: "true"
              effect: NoSchedule
            - key: imerso-ray-worker-highmem-8
              operator: Equal
              value: "true"
              effect: NoSchedule
          serviceAccountName: ray-prod
          restartPolicy: Never
          volumes:
            - name: dshm
              emptyDir:
                medium: Memory
            - name: filestore-ray
              persistentVolumeClaim:
                claimName: fileserver-ray-claim
                readOnly: false
          containers:
            - name: ray-node
              imagePullPolicy: Always
              image: eu.gcr.io/imerso-3dscanner-backend/imerso-ray:${VERSION_TAG}
              command: ["/bin/bash", "-c", "--"]
              args: ["trap : TERM INT; touch /tmp/raylogs; tail -f /tmp/raylogs; sleep infinity & wait;"]
              # This volume allocates shared memory for Ray to use for its plasma
              # object store. If you do not provide this, Ray will fall back to
              # /tmp, which causes slowdowns if it is not a shared-memory volume.
              volumeMounts:
                - mountPath: /dev/shm
                  name: dshm
                - mountPath: /filestore
                  name: filestore-ray
              resources:
                requests:
                  cpu: 7
                  memory: 60G
                limits:
                  memory: 60G
    - name: worker-node-cpu-highmem-16
      # Minimum number of Ray workers of this Pod type.
      minWorkers: 0
      # Maximum number of Ray workers of this Pod type. Takes precedence over minWorkers.
      maxWorkers: 5
      # User-specified custom resources for use by Ray.
      # (Ray detects CPU and GPU from pod spec resource requests and limits, so no need to fill those here.)
      # rayResources: {"example-resource-a": 1, "example-resource-b": 1}
      podConfig:
        apiVersion: v1
        kind: Pod
        metadata:
          # The operator automatically prepends the cluster name to this field.
          generateName: ray-worker-cpu-highmem-16-
        spec:
          tolerations:
            - key: cloud.google.com/gke-preemptible
              operator: Equal
              value: "true"
              effect: NoSchedule
            - key: imerso-ray-worker
              operator: Equal
              value: "true"
              effect: NoSchedule
            - key: imerso-ray-worker-highmem-16
              operator: Equal
              value: "true"
              effect: NoSchedule
          serviceAccountName: ray-prod
          restartPolicy: Never
          volumes:
            - name: dshm
              emptyDir:
                medium: Memory
            - name: filestore-ray
              persistentVolumeClaim:
                claimName: fileserver-ray-claim
                readOnly: false
          containers:
            - name: ray-node
              imagePullPolicy: Always
              image: eu.gcr.io/imerso-3dscanner-backend/imerso-ray:${VERSION_TAG}
              command: ["/bin/bash", "-c", "--"]
              args: ["trap : TERM INT; touch /tmp/raylogs; tail -f /tmp/raylogs; sleep infinity & wait;"]
              # This volume allocates shared memory for Ray to use for its plasma
              # object store. If you do not provide this, Ray will fall back to
              # /tmp, which causes slowdowns if it is not a shared-memory volume.
              volumeMounts:
                - mountPath: /dev/shm
                  name: dshm
                - mountPath: /filestore
                  name: filestore-ray
              resources:
                requests:
                  cpu: 15
                  memory: 124G
                limits:
                  memory: 124G
    - name: worker-node-gpu
      # Minimum number of Ray workers of this Pod type.
      minWorkers: 0
      # Maximum number of Ray workers of this Pod type. Takes precedence over minWorkers.
      maxWorkers: 20
      # User-specified custom resources for use by Ray.
      # (Ray detects CPU and GPU from pod spec resource requests and limits, so no need to fill those here.)
      # rayResources: {"example-resource-a": 1, "example-resource-b": 1}
      podConfig:
        apiVersion: v1
        kind: Pod
        metadata:
          # The operator automatically prepends the cluster name to this field.
          generateName: ray-worker-gpu-
        spec:
          tolerations:
            - key: cloud.google.com/gke-preemptible
              operator: Equal
              value: "true"
              effect: NoSchedule
            - key: imerso-ray-worker
              operator: Equal
              value: "true"
              effect: NoSchedule
          serviceAccountName: ray-prod
          restartPolicy: Never
          volumes:
            - name: dshm
              emptyDir:
                medium: Memory
            - name: filestore-ray
              persistentVolumeClaim:
                claimName: fileserver-ray-claim
                readOnly: false
          containers:
            - name: ray-node
              imagePullPolicy: Always
              image: eu.gcr.io/imerso-3dscanner-backend/imerso-ray:${VERSION_TAG}
              command: ["/bin/bash", "-c", "--"]
              args: ["trap : TERM INT; touch /tmp/raylogs; tail -f /tmp/raylogs; sleep infinity & wait;"]
              # This volume allocates shared memory for Ray to use for its plasma
              # object store. If you do not provide this, Ray will fall back to
              # /tmp, which causes slowdowns if it is not a shared-memory volume.
              volumeMounts:
                - mountPath: /dev/shm
                  name: dshm
                - mountPath: /filestore
                  name: filestore-ray
              resources:
                requests:
                  cpu: 7
                  memory: 26G
                limits:
                  memory: 26G
                  nvidia.com/gpu: 1
  # Commands to start Ray on the head node. You don't need to change this.
  # Note dashboard-host is set to 0.0.0.0 so that Kubernetes can port forward.
  headStartRayCommands:
    - ray stop
    - ulimit -n 65536; export AUTOSCALER_MAX_NUM_FAILURES=inf; ray start --head --num-cpus=0 --object-store-memory 1073741824 --no-monitor --dashboard-host 0.0.0.0 &> /tmp/raylogs
  # Commands to start Ray on worker nodes. You don't need to change this.
  workerStartRayCommands:
    - ray stop
    - ulimit -n 65536; ray start --object-store-memory 1073741824 --address=$RAY_HEAD_IP:6379 &> /tmp/raylogs
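For context, our jobs reach the cluster through the Ray Client port (10001 in headServicePorts above), which is the path that ray_client_server.err logs for. A minimal connection looks roughly like the sketch below; the head service hostname is a placeholder for whatever DNS name the head service gets in our namespace, and the ping task is just an illustrative sanity check.

import ray

# Connect through the Ray Client server exposed on the head service
# (port 10001 above). "ray-cluster-ray-head" is a placeholder hostname.
ray.init("ray://ray-cluster-ray-head:10001")

@ray.remote
def ping():
    return "pong"

print(ray.get(ping.remote()))  # verifies the client connection works
ray.shutdown()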
Let me know if there are any other details that could be useful.
Just extracting some info – the commit used is this one from Oct 4:
The relevant section of the logs:
ERROR:ray.util.client.server.proxier:Proxying call to GetObject failed!
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/ray/util/client/server/proxier.py", line 391, in _call_inner_function
return getattr(stub, method)(request, metadata=metadata)
File "/usr/local/lib/python3.8/dist-packages/grpc/_channel.py", line 946, in __call__
return _end_unary_response_blocking(state, call, False, None)
File "/usr/local/lib/python3.8/dist-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "Socket closed"
debug_error_string = "{"created":"@1634815527.832445756","description":"Error received from peer ipv6:[::1]:23113","file":"src/core/lib/surface/call.cc","file_line":1069,"grpc_message":"Socket closed","grpc_status":14}"
>
ERROR:grpc._common:Exception serializing message!
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/grpc/_common.py", line 86, in _transform
return transformer(message)
TypeError: descriptor 'SerializeToString' for 'google.protobuf.pyext._message.CMessage' objects doesn't apply to a 'NoneType' object
INFO:ray.util.client.server.proxier:Specific server ffa755c9e86f4c5399a6172ce22a26e5 is no longer running, freeing its port 23113
ERROR:ray.util.client.server.proxier:Proxying call to GetObject failed!
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/ray/util/client/server/proxier.py", line 391, in _call_inner_function
return getattr(stub, method)(request, metadata=metadata)
File "/usr/local/lib/python3.8/dist-packages/grpc/_channel.py", line 946, in __call__
return _end_unary_response_blocking(state, call, False, None)
File "/usr/local/lib/python3.8/dist-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "Socket closed"
debug_error_string = "{"created":"@1634815551.054329699","description":"Error received from peer ipv6:[::1]:23115","file":"src/core/lib/surface/call.cc","file_line":1069,"grpc_message":"Socket closed","grpc_status":14}"
>
ERROR:grpc._common:Exception serializing message!
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/grpc/_common.py", line 86, in _transform
return transformer(message)
TypeError: descriptor 'SerializeToString' for 'google.protobuf.pyext._message.CMessage' objects doesn't apply to a 'NoneType' object
ERROR:ray.util.client.server.proxier:Proxying call to GetObject failed!
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/ray/util/client/server/proxier.py", line 391, in _call_inner_function
return getattr(stub, method)(request, metadata=metadata)
File "/usr/local/lib/python3.8/dist-packages/grpc/_channel.py", line 946, in __call__
return _end_unary_response_blocking(state, call, False, None)
File "/usr/local/lib/python3.8/dist-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "Socket closed"
debug_error_string = "{"created":"@1634815554.959582707","description":"Error received from peer ipv6:[::1]:23116","file":"src/core/lib/surface/call.cc","file_line":1069,"grpc_message":"Socket closed","grpc_status":14}"
>
ERROR:grpc._common:Exception serializing message!
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/grpc/_common.py", line 86, in _transform
return transformer(message)
TypeError: descriptor 'SerializeToString' for 'google.protobuf.pyext._message.CMessage' objects doesn't apply to a 'NoneType' object
ERROR:ray.util.client.server.proxier:Proxying call to GetObject failed!
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/ray/util/client/server/proxier.py", line 391, in _call_inner_function
return getattr(stub, method)(request, metadata=metadata)
File "/usr/local/lib/python3.8/dist-packages/grpc/_channel.py", line 946, in __call__
return _end_unary_response_blocking(state, call, False, None)
File "/usr/local/lib/python3.8/dist-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "Socket closed"
debug_error_string = "{"created":"@1634815555.062253720","description":"Error received from peer ipv6:[::1]:23109","file":"src/core/lib/surface/call.cc","file_line":1069,"grpc_message":"Socket closed","grpc_status":14}"
>
ERROR:grpc._common:Exception serializing message!
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/grpc/_common.py", line 86, in _transform
return transformer(message)
TypeError: descriptor 'SerializeToString' for 'google.protobuf.pyext._message.CMessage' objects doesn't apply to a 'NoneType' object
ERROR:ray.util.client.server.proxier:Proxying call to GetObject failed!
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/ray/util/client/server/proxier.py", line 391, in _call_inner_function
return getattr(stub, method)(request, metadata=metadata)
File "/usr/local/lib/python3.8/dist-packages/grpc/_channel.py", line 946, in __call__
return _end_unary_response_blocking(state, call, False, None)
File "/usr/local/lib/python3.8/dist-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "Socket closed"
debug_error_string = "{"created":"@1634815556.039259950","description":"Error received from peer ipv6:[::1]:23110","file":"src/core/lib/surface/call.cc","file_line":1069,"grpc_message":"Socket closed","grpc_status":14}"
>
ERROR:grpc._common:Exception serializing message!
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/grpc/_common.py", line 86, in _transform
return transformer(message)
TypeError: descriptor 'SerializeToString' for 'google.protobuf.pyext._message.CMessage' objects doesn't apply to a 'NoneType' object
ERROR:ray.util.client.server.proxier:Proxying call to GetObject failed!
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/ray/util/client/server/proxier.py", line 391, in _call_inner_function
return getattr(stub, method)(request, metadata=metadata)
File "/usr/local/lib/python3.8/dist-packages/grpc/_channel.py", line 946, in __call__
return _end_unary_response_blocking(state, call, False, None)
File "/usr/local/lib/python3.8/dist-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "Socket closed"
debug_error_string = "{"created":"@1634815557.234403991","description":"Error received from peer ipv6:[::1]:23111","file":"src/core/lib/surface/call.cc","file_line":1069,"grpc_message":"Socket closed","grpc_status":14}"
>
ERROR:grpc._common:Exception serializing message!
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/grpc/_common.py", line 86, in _transform
return transformer(message)
TypeError: descriptor 'SerializeToString' for 'google.protobuf.pyext._message.CMessage' objects doesn't apply to a 'NoneType' object
INFO:ray.util.client.server.proxier:Specific server 80f291c08c4845e4aadd2dd0d1f0fcf0 is no longer running, freeing its port 23108
INFO:ray.util.client.server.proxier:Specific server a9aafbcdbcba4dbfb21263a600cf28ce is no longer running, freeing its port 23112
INFO:ray.util.client.server.proxier:08e5995d1981437d99377558233317c6 last started stream at 1634815167.3072171. Current stream started at 1634815167.3072171.
INFO:ray.util.client.server.proxier:Specific server 08e5995d1981437d99377558233317c6 is no longer running, freeing its port 23114
INFO:ray.util.client.server.proxier:New data connection from client 8744d9033cb3416aa25902803291b065:
INFO:ray.util.client.server.proxier:SpecificServer started on port: 23117 with PID: 54518 for client: 8744d9033cb3416aa25902803291b065
INFO:ray.util.client.server.proxier:New data connection from client 87f98529f5454a73bdac549f05d89c94:
INFO:ray.util.client.server.proxier:SpecificServer started on port: 23118 with PID: 59342 for client: 87f98529f5454a73bdac549f05d89c94
INFO:ray.util.client.server.proxier:2ebaacf4dd714a05805759b4b3d7dfe3 last started stream at 1634814361.1348305. Current stream started at 1634814361.1348305.
ERROR:ray.util.client.server.proxier:Proxying call to GetObject failed!
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/ray/util/client/server/proxier.py", line 391, in _call_inner_function
return getattr(stub, method)(request, metadata=metadata)
File "/usr/local/lib/python3.8/dist-packages/grpc/_channel.py", line 946, in __call__
return _end_unary_response_blocking(state, call, False, None)
File "/usr/local/lib/python3.8/dist-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "Socket closed"
debug_error_string = "{"created":"@1634819231.796495424","description":"Error received from peer ipv6:[::1]:23107","file":"src/core/lib/surface/call.cc","file_line":1069,"grpc_message":"Socket closed","grpc_status":14}"
>
ERROR:grpc._common:Exception serializing message!
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/grpc/_common.py", line 86, in _transform
return transformer(message)
TypeError: descriptor 'SerializeToString' for 'google.protobuf.pyext._message.CMessage' objects doesn't apply to a 'NoneType' object
INFO:ray.util.client.server.proxier:New data connection from client 1b4fc481d86744ce9636ca328dfc30ad:
INFO:ray.util.client.server.proxier:SpecificServer started on port: 23119 with PID: 61946 for client: 1b4fc481d86744ce9636ca328dfc30ad
INFO:ray.util.client.server.proxier:Specific server 2ebaacf4dd714a05805759b4b3d7dfe3 is no longer running, freeing its port 23107
INFO:ray.util.client.server.proxier:New data connection from client b1c7576deae440bc99114d0b69828f44:
INFO:ray.util.client.server.proxier:SpecificServer started on port: 23120 with PID: 64378 for client: b1c7576deae440bc99114d0b69828f44
Hmm, I’m not super sure what happened here. @ckw017 any ideas?
Not sure either. It looks like the issue is between the proxier and the specific server (the specific server closing its socket?). I'm not sure what would kill the specific server in this case, though.
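For what it's worth, the pairing of the two errors in the logs is consistent with the proxy handler swallowing the failed forward and returning nothing. Below is a rough sketch of that failure mode; it is not the actual proxier.py source, and the stub/method names beyond what the traceback shows are assumptions.

import logging
import grpc

logger = logging.getLogger("ray.util.client.server.proxier")

def proxy_get_object(stub, request, metadata):
    # Forward GetObject to the per-client "specific server", mirroring the
    # traceback above (_call_inner_function -> stub method call).
    try:
        return stub.GetObject(request, metadata=metadata)
    except grpc.RpcError:
        # If the specific server process has died, its socket is closed and
        # the forwarded RPC fails with StatusCode.UNAVAILABLE ("Socket closed").
        logger.exception("Proxying call to GetObject failed!")
        # Returning None means gRPC later tries to serialize None as the
        # response message, which would produce the second error in the logs:
        # TypeError: descriptor 'SerializeToString' ... doesn't apply to a
        # 'NoneType' object.
        return None

That would explain why every "Proxying call to GetObject failed!" is immediately followed by "Exception serializing message!", but not why the specific servers are dying in the first place.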