Proxying call to GetObject failed

Our Ray cluster went into a bad state today. I’m not sure if this is the root cause, but the logs from ray_client_server.err seem relevant.

https://controlc.com/519f16fe

Could you provide some details on your deployment setup and the Ray versions used?

It was running Ray at commit 1eecb7d80b3b24b4e5caa837de2250019b9cc967.

Here is the Ray config:

apiVersion: cluster.ray.io/v1
kind: RayCluster
metadata:
  name: ray-cluster
spec:
  # The maximum number of worker nodes to launch in addition to the head node.
  maxWorkers: 50
  # The autoscaler will scale up the cluster faster with higher upscaling speed.
  # E.g., if a task requires adding more nodes, then the autoscaler will gradually
  # scale up the cluster in chunks of upscaling_speed*currently_running_nodes.
  # This number should be > 0.
  upscalingSpeed: 10.0
  # If a node is idle for this many minutes, it will be removed.
  idleTimeoutMinutes: 10
  # Specify the pod type for the ray head node (as configured below).
  headPodType: head-node
  # Optionally, configure ports for the Ray head service.
  # The ports specified below are the defaults.
  headServicePorts:
    - name: client
      port: 10001
      targetPort: 10001
    - name: dashboard
      port: 8265
      targetPort: 8265
    - name: ray-serve
      port: 8000
      targetPort: 8000
    - name: redis-primary
      port: 6379
      targetPort: 6379
  # Specify the allowed pod types for this ray cluster and the resources they provide.
  podTypes:
  - name: head-node
    # Minimum number of Ray workers of this Pod type.
    minWorkers: 0
    # Maximum number of Ray workers of this Pod type. Takes precedence over minWorkers.
    maxWorkers: 0
    # Prevent tasks on head node. https://docs.ray.io/en/master/cluster/guide.html#configuring-the-head-node
    rayResources: {"CPU": 0}
    podConfig:
      apiVersion: v1
      kind: Pod
      metadata:
        # The operator automatically prepends the cluster name to this field.
        generateName: ray-head-
      spec:
        tolerations:
        - key: imerso-ray-head
          operator: Equal
          value: "true"
          effect: NoSchedule
        restartPolicy: Never
        nodeSelector:
          imerso-ray-head: "true"

        # This volume allocates shared memory for Ray to use for its plasma
        # object store. If you do not provide this, Ray will fall back to
        # /tmp, which can cause slowdowns if it is not a shared memory volume.
        volumes:
        - name: dshm
          emptyDir:
            medium: Memory
        - name: filestore-ray
          persistentVolumeClaim:
            claimName: fileserver-ray-claim
            readOnly: false
        containers:
        - name: ray-node
          imagePullPolicy: Always
          image: eu.gcr.io/imerso-3dscanner-backend/imerso-ray:${VERSION_TAG}
          # Do not change this command - it keeps the pod alive until it is
          # explicitly killed.
          command: ["/bin/bash", "-c", "--"]
          args: ["trap : TERM INT; touch /tmp/raylogs; tail -f /tmp/raylogs; sleep infinity & wait;"]
          ports:
          - containerPort: 6379  # Redis port
          - containerPort: 10001  # Used by Ray Client
          - containerPort: 8265  # Used by Ray Dashboard
          - containerPort: 8000 # Used by Ray Serve

          # This volume allocates shared memory for Ray to use for its plasma
          # object store. If you do not provide this, Ray will fall back to
          # /tmp, which can cause slowdowns if it is not a shared memory volume.
          volumeMounts:
          - mountPath: /dev/shm
            name: dshm
          - mountPath: /filestore
            name: filestore-ray
          resources:
            requests:
              cpu: 1
              memory: 5Gi
            limits:
              memory: 5Gi
  - name: worker-node-cpu
    # Minimum number of Ray workers of this Pod type.
    minWorkers: 0
    # Maximum number of Ray workers of this Pod type. Takes precedence over minWorkers.
    maxWorkers: 50
    # User-specified custom resources for use by Ray.
    # (Ray detects CPU and GPU from pod spec resource requests and limits, so no need to fill those here.)
    # rayResources: {"example-resource-a": 1, "example-resource-b": 1}
    podConfig:
      apiVersion: v1
      kind: Pod
      metadata:
        # The operator automatically prepends the cluster name to this field.
        generateName: ray-worker-cpu-
      spec:
        tolerations:
        - key: cloud.google.com/gke-preemptible
          operator: Equal
          value: "true"
          effect: NoSchedule
        - key: imerso-ray-worker
          operator: Equal
          value: "true"
          effect: NoSchedule
        serviceAccountName: ray-prod
        restartPolicy: Never
        volumes:
        - name: dshm
          emptyDir:
            medium: Memory
        - name: filestore-ray
          persistentVolumeClaim:
            claimName: fileserver-ray-claim
            readOnly: false
        containers:
        - name: ray-node
          imagePullPolicy: Always
          image: eu.gcr.io/imerso-3dscanner-backend/imerso-ray:${VERSION_TAG}
          command: ["/bin/bash", "-c", "--"]
          args: ["trap : TERM INT; touch /tmp/raylogs; tail -f /tmp/raylogs; sleep infinity & wait;"]
          # This volume allocates shared memory for Ray to use for its plasma
          # object store. If you do not provide this, Ray will fall back to
          # /tmp, which can cause slowdowns if it is not a shared memory volume.
          volumeMounts:
          - mountPath: /dev/shm
            name: dshm
          - mountPath: /filestore
            name: filestore-ray
          resources:
            requests:
              cpu: 7
              memory: 26G
            limits:
              memory: 26G
  - name: worker-node-cpu-highmem-8
    # Minimum number of Ray workers of this Pod type.
    minWorkers: 0
    # Maximum number of Ray workers of this Pod type. Takes precedence over minWorkers.
    maxWorkers: 5
    # User-specified custom resources for use by Ray.
    # (Ray detects CPU and GPU from pod spec resource requests and limits, so no need to fill those here.)
    # rayResources: {"example-resource-a": 1, "example-resource-b": 1}
    podConfig:
      apiVersion: v1
      kind: Pod
      metadata:
        # The operator automatically prepends the cluster name to this field.
        generateName: ray-worker-cpu-highmem-8-
      spec:
        tolerations:
        - key: cloud.google.com/gke-preemptible
          operator: Equal
          value: "true"
          effect: NoSchedule
        - key: imerso-ray-worker
          operator: Equal
          value: "true"
          effect: NoSchedule
        - key: imerso-ray-worker-highmem-8
          operator: Equal
          value: "true"
          effect: NoSchedule
        serviceAccountName: ray-prod
        restartPolicy: Never
        volumes:
        - name: dshm
          emptyDir:
            medium: Memory
        - name: filestore-ray
          persistentVolumeClaim:
            claimName: fileserver-ray-claim
            readOnly: false
        containers:
        - name: ray-node
          imagePullPolicy: Always
          image: eu.gcr.io/imerso-3dscanner-backend/imerso-ray:${VERSION_TAG}
          command: ["/bin/bash", "-c", "--"]
          args: ["trap : TERM INT; touch /tmp/raylogs; tail -f /tmp/raylogs; sleep infinity & wait;"]
          # This volume allocates shared memory for Ray to use for its plasma
          # object store. If you do not provide this, Ray will fall back to
          # /tmp, which can cause slowdowns if it is not a shared memory volume.
          volumeMounts:
          - mountPath: /dev/shm
            name: dshm
          - mountPath: /filestore
            name: filestore-ray
          resources:
            requests:
              cpu: 7
              memory: 60G
            limits:
              memory: 60G
  - name: worker-node-cpu-highmem-16
    # Minimum number of Ray workers of this Pod type.
    minWorkers: 0
    # Maximum number of Ray workers of this Pod type. Takes precedence over minWorkers.
    maxWorkers: 5
    # User-specified custom resources for use by Ray.
    # (Ray detects CPU and GPU from pod spec resource requests and limits, so no need to fill those here.)
    # rayResources: {"example-resource-a": 1, "example-resource-b": 1}
    podConfig:
      apiVersion: v1
      kind: Pod
      metadata:
        # The operator automatically prepends the cluster name to this field.
        generateName: ray-worker-cpu-highmem-16-
      spec:
        tolerations:
        - key: cloud.google.com/gke-preemptible
          operator: Equal
          value: "true"
          effect: NoSchedule
        - key: imerso-ray-worker
          operator: Equal
          value: "true"
          effect: NoSchedule
        - key: imerso-ray-worker-highmem-16
          operator: Equal
          value: "true"
          effect: NoSchedule
        serviceAccountName: ray-prod
        restartPolicy: Never
        volumes:
        - name: dshm
          emptyDir:
            medium: Memory
        - name: filestore-ray
          persistentVolumeClaim:
            claimName: fileserver-ray-claim
            readOnly: false
        containers:
        - name: ray-node
          imagePullPolicy: Always
          image: eu.gcr.io/imerso-3dscanner-backend/imerso-ray:${VERSION_TAG}
          command: ["/bin/bash", "-c", "--"]
          args: ["trap : TERM INT; touch /tmp/raylogs; tail -f /tmp/raylogs; sleep infinity & wait;"]
          # This volume allocates shared memory for Ray to use for its plasma
          # object store. If you do not provide this, Ray will fall back to
          # /tmp, which can cause slowdowns if it is not a shared memory volume.
          volumeMounts:
          - mountPath: /dev/shm
            name: dshm
          - mountPath: /filestore
            name: filestore-ray
          resources:
            requests:
              cpu: 15
              memory: 124G
            limits:
              memory: 124G
  - name: worker-node-gpu
    # Minimum number of Ray workers of this Pod type.
    minWorkers: 0
    # Maximum number of Ray workers of this Pod type. Takes precedence over minWorkers.
    maxWorkers: 20
    # User-specified custom resources for use by Ray.
    # (Ray detects CPU and GPU from pod spec resource requests and limits, so no need to fill those here.)
    # rayResources: {"example-resource-a": 1, "example-resource-b": 1}
    podConfig:
      apiVersion: v1
      kind: Pod
      metadata:
        # The operator automatically prepends the cluster name to this field.
        generateName: ray-worker-gpu-
      spec:
        tolerations:
        - key: cloud.google.com/gke-preemptible
          operator: Equal
          value: "true"
          effect: NoSchedule
        - key: imerso-ray-worker
          operator: Equal
          value: "true"
          effect: NoSchedule
        serviceAccountName: ray-prod
        restartPolicy: Never
        volumes:
        - name: dshm
          emptyDir:
            medium: Memory
        - name: filestore-ray
          persistentVolumeClaim:
            claimName: fileserver-ray-claim
            readOnly: false
        containers:
        - name: ray-node
          imagePullPolicy: Always
          image: eu.gcr.io/imerso-3dscanner-backend/imerso-ray:${VERSION_TAG}
          command: ["/bin/bash", "-c", "--"]
          args: ["trap : TERM INT; touch /tmp/raylogs; tail -f /tmp/raylogs; sleep infinity & wait;"]
          # This volume allocates shared memory for Ray to use for its plasma
          # object store. If you do not provide this, Ray will fall back to
          # /tmp, which can cause slowdowns if it is not a shared memory volume.
          volumeMounts:
          - mountPath: /dev/shm
            name: dshm
          - mountPath: /filestore
            name: filestore-ray
          resources:
            requests:
              cpu: 7
              memory: 26G
            limits:
              memory: 26G
              nvidia.com/gpu: 1
  # Commands to start Ray on the head node. You don't need to change this.
  # Note dashboard-host is set to 0.0.0.0 so that Kubernetes can port forward.
  headStartRayCommands:
    - ray stop
    - ulimit -n 65536; export AUTOSCALER_MAX_NUM_FAILURES=inf; ray start --head --num-cpus=0 --object-store-memory 1073741824 --no-monitor --dashboard-host 0.0.0.0 &> /tmp/raylogs
  # Commands to start Ray on worker nodes. You don't need to change this.
  workerStartRayCommands:
    - ray stop
    - ulimit -n 65536; ray start --object-store-memory 1073741824 --address=$RAY_HEAD_IP:6379 &> /tmp/raylogs
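
For context on how this config gets exercised: our jobs connect through the Ray Client port exposed above (10001), so every call goes through the proxier that shows up in the logs further down. A minimal sketch of that path, assuming a hypothetical head-service hostname:

import ray

# Connect through the head service's "client" port (10001 above); the proxier
# there spawns a per-client SpecificServer and relays each RPC to it.
ray.init("ray://ray-cluster-ray-head:10001")  # hostname is an assumption

@ray.remote
def ping():
    return "pong"

# ray.get() becomes a GetObject RPC, which is the call failing in the logs below.
print(ray.get(ping.remote()))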

Let me know if there are any other details that could be useful.

cc @ckw017 @ijrsvt

Just extracting some info – the commit used is this one from Oct 4:

The relevant section of the logs:

ERROR:ray.util.client.server.proxier:Proxying call to GetObject failed!
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/ray/util/client/server/proxier.py", line 391, in _call_inner_function
    return getattr(stub, method)(request, metadata=metadata)
  File "/usr/local/lib/python3.8/dist-packages/grpc/_channel.py", line 946, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/usr/local/lib/python3.8/dist-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
    raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "Socket closed"
	debug_error_string = "{"created":"@1634815527.832445756","description":"Error received from peer ipv6:[::1]:23113","file":"src/core/lib/surface/call.cc","file_line":1069,"grpc_message":"Socket closed","grpc_status":14}"
>
ERROR:grpc._common:Exception serializing message!
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/grpc/_common.py", line 86, in _transform
    return transformer(message)
TypeError: descriptor 'SerializeToString' for 'google.protobuf.pyext._message.CMessage' objects doesn't apply to a 'NoneType' object
INFO:ray.util.client.server.proxier:Specific server ffa755c9e86f4c5399a6172ce22a26e5 is no longer running, freeing its port 23113
ERROR:ray.util.client.server.proxier:Proxying call to GetObject failed!
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/ray/util/client/server/proxier.py", line 391, in _call_inner_function
    return getattr(stub, method)(request, metadata=metadata)
  File "/usr/local/lib/python3.8/dist-packages/grpc/_channel.py", line 946, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/usr/local/lib/python3.8/dist-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
    raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "Socket closed"
	debug_error_string = "{"created":"@1634815551.054329699","description":"Error received from peer ipv6:[::1]:23115","file":"src/core/lib/surface/call.cc","file_line":1069,"grpc_message":"Socket closed","grpc_status":14}"
>
ERROR:grpc._common:Exception serializing message!
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/grpc/_common.py", line 86, in _transform
    return transformer(message)
TypeError: descriptor 'SerializeToString' for 'google.protobuf.pyext._message.CMessage' objects doesn't apply to a 'NoneType' object
ERROR:ray.util.client.server.proxier:Proxying call to GetObject failed!
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/ray/util/client/server/proxier.py", line 391, in _call_inner_function
    return getattr(stub, method)(request, metadata=metadata)
  File "/usr/local/lib/python3.8/dist-packages/grpc/_channel.py", line 946, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/usr/local/lib/python3.8/dist-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
    raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "Socket closed"
	debug_error_string = "{"created":"@1634815554.959582707","description":"Error received from peer ipv6:[::1]:23116","file":"src/core/lib/surface/call.cc","file_line":1069,"grpc_message":"Socket closed","grpc_status":14}"
>
ERROR:grpc._common:Exception serializing message!
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/grpc/_common.py", line 86, in _transform
    return transformer(message)
TypeError: descriptor 'SerializeToString' for 'google.protobuf.pyext._message.CMessage' objects doesn't apply to a 'NoneType' object
ERROR:ray.util.client.server.proxier:Proxying call to GetObject failed!
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/ray/util/client/server/proxier.py", line 391, in _call_inner_function
    return getattr(stub, method)(request, metadata=metadata)
  File "/usr/local/lib/python3.8/dist-packages/grpc/_channel.py", line 946, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/usr/local/lib/python3.8/dist-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
    raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "Socket closed"
	debug_error_string = "{"created":"@1634815555.062253720","description":"Error received from peer ipv6:[::1]:23109","file":"src/core/lib/surface/call.cc","file_line":1069,"grpc_message":"Socket closed","grpc_status":14}"
>
ERROR:grpc._common:Exception serializing message!
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/grpc/_common.py", line 86, in _transform
    return transformer(message)
TypeError: descriptor 'SerializeToString' for 'google.protobuf.pyext._message.CMessage' objects doesn't apply to a 'NoneType' object
ERROR:ray.util.client.server.proxier:Proxying call to GetObject failed!
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/ray/util/client/server/proxier.py", line 391, in _call_inner_function
    return getattr(stub, method)(request, metadata=metadata)
  File "/usr/local/lib/python3.8/dist-packages/grpc/_channel.py", line 946, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/usr/local/lib/python3.8/dist-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
    raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "Socket closed"
	debug_error_string = "{"created":"@1634815556.039259950","description":"Error received from peer ipv6:[::1]:23110","file":"src/core/lib/surface/call.cc","file_line":1069,"grpc_message":"Socket closed","grpc_status":14}"
>
ERROR:grpc._common:Exception serializing message!
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/grpc/_common.py", line 86, in _transform
    return transformer(message)
TypeError: descriptor 'SerializeToString' for 'google.protobuf.pyext._message.CMessage' objects doesn't apply to a 'NoneType' object
ERROR:ray.util.client.server.proxier:Proxying call to GetObject failed!
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/ray/util/client/server/proxier.py", line 391, in _call_inner_function
    return getattr(stub, method)(request, metadata=metadata)
  File "/usr/local/lib/python3.8/dist-packages/grpc/_channel.py", line 946, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/usr/local/lib/python3.8/dist-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
    raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "Socket closed"
	debug_error_string = "{"created":"@1634815557.234403991","description":"Error received from peer ipv6:[::1]:23111","file":"src/core/lib/surface/call.cc","file_line":1069,"grpc_message":"Socket closed","grpc_status":14}"
>
ERROR:grpc._common:Exception serializing message!
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/grpc/_common.py", line 86, in _transform
    return transformer(message)
TypeError: descriptor 'SerializeToString' for 'google.protobuf.pyext._message.CMessage' objects doesn't apply to a 'NoneType' object
INFO:ray.util.client.server.proxier:Specific server 80f291c08c4845e4aadd2dd0d1f0fcf0 is no longer running, freeing its port 23108
INFO:ray.util.client.server.proxier:Specific server a9aafbcdbcba4dbfb21263a600cf28ce is no longer running, freeing its port 23112
INFO:ray.util.client.server.proxier:08e5995d1981437d99377558233317c6 last started stream at 1634815167.3072171. Current stream started at 1634815167.3072171.
INFO:ray.util.client.server.proxier:Specific server 08e5995d1981437d99377558233317c6 is no longer running, freeing its port 23114
INFO:ray.util.client.server.proxier:New data connection from client 8744d9033cb3416aa25902803291b065: 
INFO:ray.util.client.server.proxier:SpecificServer started on port: 23117 with PID: 54518 for client: 8744d9033cb3416aa25902803291b065
INFO:ray.util.client.server.proxier:New data connection from client 87f98529f5454a73bdac549f05d89c94: 
INFO:ray.util.client.server.proxier:SpecificServer started on port: 23118 with PID: 59342 for client: 87f98529f5454a73bdac549f05d89c94
INFO:ray.util.client.server.proxier:2ebaacf4dd714a05805759b4b3d7dfe3 last started stream at 1634814361.1348305. Current stream started at 1634814361.1348305.
ERROR:ray.util.client.server.proxier:Proxying call to GetObject failed!
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/ray/util/client/server/proxier.py", line 391, in _call_inner_function
    return getattr(stub, method)(request, metadata=metadata)
  File "/usr/local/lib/python3.8/dist-packages/grpc/_channel.py", line 946, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/usr/local/lib/python3.8/dist-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
    raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "Socket closed"
	debug_error_string = "{"created":"@1634819231.796495424","description":"Error received from peer ipv6:[::1]:23107","file":"src/core/lib/surface/call.cc","file_line":1069,"grpc_message":"Socket closed","grpc_status":14}"
>
ERROR:grpc._common:Exception serializing message!
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/grpc/_common.py", line 86, in _transform
    return transformer(message)
TypeError: descriptor 'SerializeToString' for 'google.protobuf.pyext._message.CMessage' objects doesn't apply to a 'NoneType' object
INFO:ray.util.client.server.proxier:New data connection from client 1b4fc481d86744ce9636ca328dfc30ad: 
INFO:ray.util.client.server.proxier:SpecificServer started on port: 23119 with PID: 61946 for client: 1b4fc481d86744ce9636ca328dfc30ad
INFO:ray.util.client.server.proxier:Specific server 2ebaacf4dd714a05805759b4b3d7dfe3 is no longer running, freeing its port 23107
INFO:ray.util.client.server.proxier:New data connection from client b1c7576deae440bc99114d0b69828f44: 
INFO:ray.util.client.server.proxier:SpecificServer started on port: 23120 with PID: 64378 for client: b1c7576deae440bc99114d0b69828f44

Hmm, I’m not super sure what happened here. @ckw017, any ideas?

Not sure either. It looks like the issue is between the proxier and the specific server (the specific server closing the socket?). I’m not sure what would kill the specific server in this case, though.
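
If it helps narrow things down, here is a rough triage sketch (not something that has been run here, and the log filenames are an assumption based on the default Ray log layout) for checking, inside the head pod, whether those SpecificServer processes logged anything before dying or were OOM-killed:

import glob
import subprocess

# Tail the per-client server logs under the default Ray session directory.
# The ray_client_server_<port>.err naming is an assumption.
for path in sorted(glob.glob("/tmp/ray/session_latest/logs/ray_client_server_*")):
    print("====", path)
    with open(path, errors="replace") as f:
        print("".join(f.readlines()[-20:]))

# Look for kernel OOM kills that would explain the abrupt "Socket closed" errors.
# (dmesg may require extra privileges inside the container.)
out = subprocess.run(["dmesg"], capture_output=True, text=True)
print([line for line in out.stdout.splitlines() if "Killed process" in line])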