Ray k8s cluster, cannot run new task when previous task failed

  1. Did you restart your job in the same directory? For the working dir of the driver, it’ll be added automatically.

Yes. I restarted my job in the same directory.

  1. If this still happens, do you mind trying to get a minimal reproducible script and submitting an issue for this?

I think the reason may be that the Ray cluster operator was created by the cluster administrator. I can give you the ray_cluster.yaml file that was used to create the new k8s Ray cluster, and I will try my best to put together a reproducible script. The code itself is nothing special; I use ray.init("auto") to initialize Ray.
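In the meantime, here is a minimal sketch of how the driver is structured (the task names and bodies are placeholders, not my actual workload):

import ray
from ray.exceptions import RayTaskError

# Connect to the existing cluster started by the operator.
ray.init("auto")

@ray.remote
def failing_task():
    # Placeholder for the task that fails in my real workload.
    raise RuntimeError("simulated failure")

@ray.remote
def follow_up_task():
    # Placeholder for the next task submitted after the failure.
    return "ok"

try:
    ray.get(failing_task.remote())
except RayTaskError:
    pass

# After the failure above, this is the new task that does not run.
print(ray.get(follow_up_task.remote()))

The ray_cluster.yaml that was used to create the cluster is below.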

apiVersion: cluster.ray.io/v1
kind: RayCluster
metadata:
  name: ray-cluster
spec:
  # The maximum number of worker nodes to launch in addition to the head node.
  maxWorkers: 100
  # The autoscaler will scale up the cluster faster with higher upscaling speed.
  # E.g., if the task requires adding more nodes then autoscaler will gradually
  # scale up the cluster in chunks of upscaling_speed*currently_running_nodes.
  # This number should be > 0.
  upscalingSpeed: 1.0
  # If a node is idle for this many minutes, it will be removed.
  idleTimeoutMinutes: 5
  # Specify the pod type for the ray head node (as configured below).
  headPodType: rayHead
  # Specify the allowed pod types for this ray cluster and the resources they provide.
  podTypes:
    - name: rayHead
      minWorkers: 0
      maxWorkers: 0
      rayResources: {}
      podConfig:
        apiVersion: v1
        kind: Pod
        metadata:
          generateName: ray-head-
        spec:
          imagePullSecrets:
            - name: gitlab-cr-pull-secret
            - name: regcred
          priorityClassName: high
          restartPolicy: Never
          # This volume allocates shared memory for Ray to use for its plasma
          # object store. If you do not provide this, Ray will fall back to
          # /tmp, which causes slowdowns if it is not a shared memory volume.
          volumes:
            - name: workspace-vol
              hostPath:
                path: /mnt/home/%USER/Projects/work_dir
                type: Directory 
            - name: dshm
              emptyDir:
                medium: Memory
          containers:
            - name: ray-node
              imagePullPolicy: Always
              image: "the.image:tag"
              # Do not change this command - it keeps the pod alive until it is
              # explicitly killed.
              command: ["/bin/bash", "-c", "--"]
              args: ["trap : TERM INT; sleep infinity & wait;"]
              env:
                - name: RAY_gcs_server_rpc_server_thread_num
                  value: "1"
              ports:
                - containerPort: 6379 # Redis port for Ray <= 1.10.0. GCS server port for Ray >= 1.11.0.
                - containerPort: 10001 # Used by Ray Client
                - containerPort: 8265 # Used by Ray Dashboard
                - containerPort: 8000 # Used by Ray Serve

              # This volume allocates shared memory for Ray to use for its plasma
              # object store. If you do not provide this, Ray will fall back to
              # /tmp, which causes slowdowns if it is not a shared memory volume.
              volumeMounts:
                - name: workspace-vol
                  mountPath: /home/me/app/
                  readOnly: false
                - mountPath: /dev/shm
                  name: dshm
              resources:
                requests:
                  cpu: 10
                  memory: 100Gi
                  nvidia.com/gpu: 1
                limits:
                  cpu: 10
                  # The maximum memory that this pod is allowed to use. The
                  # limit will be detected by ray and split to use 10% for
                  # redis, 30% for the shared memory object store, and the
                  # rest for application memory. If this limit is not set and
                  # the object store size is not set manually, ray will
                  # allocate a very large object store in each pod that may
                  # cause problems for other pods.
                  memory: 100Gi
                  nvidia.com/gpu: 1
          nodeSelector: {}
          tolerations: []
    - name: rayWorker
      minWorkers: 2
      maxWorkers: 2
      rayResources: {}
      podConfig:
        apiVersion: v1
        kind: Pod
        metadata:
          generateName: ray-worker-
        spec:
          imagePullSecrets:
            - name: gitlab-cr-pull-secret
            - name: regcred
          priorityClassName: high
          restartPolicy: Never
          # This volume allocates shared memory for Ray to use for its plasma
          # object store. If you do not provide this, Ray will fall back to
          # /tmp, which causes slowdowns if it is not a shared memory volume.
          volumes:
            - name: workspace-vol
              hostPath:
                path: /mnt/home/%USER/Projects/work_dir
                type: Directory
            - name: dshm
              emptyDir:
                medium: Memory
          containers:
            - name: ray-node
              imagePullPolicy: Always
              image: "the.image:tag"
              # Do not change this command - it keeps the pod alive until it is
              # explicitly killed.
              command: ["/bin/bash", "-c", "--"]
              args: ["trap : TERM INT; sleep infinity & wait;"]
              env:
                - name: RAY_gcs_server_rpc_server_thread_num
                  value: "1"
              ports:
                - containerPort: 6379 # Redis port for Ray <= 1.10.0. GCS server port for Ray >= 1.11.0.
                - containerPort: 10001 # Used by Ray Client
                - containerPort: 8265 # Used by Ray Dashboard
                - containerPort: 8000 # Used by Ray Serve

              # This volume allocates shared memory for Ray to use for its plasma
              # object store. If you do not provide this, Ray will fall back to
              # /tmp, which causes slowdowns if it is not a shared memory volume.
              volumeMounts:
                - name: workspace-vol
                  mountPath: /home/me/app/
                  readOnly: false
                - mountPath: /dev/shm
                  name: dshm
              resources:
                requests:
                  cpu: 33
                  memory: 50Gi
                  nvidia.com/gpu: 0
                limits:
                  cpu: 33
                  # The maximum memory that this pod is allowed to use. The
                  # limit will be detected by ray and split to use 10% for
                  # redis, 30% for the shared memory object store, and the
                  # rest for application memory. If this limit is not set and
                  # the object store size is not set manually, ray will
                  # allocate a very large object store in each pod that may
                  # cause problems for other pods.
                  memory: 100Gi
                  nvidia.com/gpu: 0
          nodeSelector: {}
          tolerations: []
          
  # Commands to start Ray on the head node. You don't need to change this.
  # Note dashboard-host is set to 0.0.0.0 so that Kubernetes can port forward.
  headStartRayCommands:
    - ray stop
    - ulimit -n 65536; ray start --head --port=6379 --no-monitor --dashboard-host 0.0.0.0
  # Commands to start Ray on worker nodes. You don't need to change this.
  workerStartRayCommands:
    - ray stop
    - ulimit -n 65536; ray start --address=$RAY_HEAD_IP:6379

I use the following command to create the cluster.

kubectl -n ray-cluster apply -f ray_cluster.yaml
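
After applying, the head and worker pods can be checked with:

kubectl -n ray-cluster get pods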