Ray k8s cluster, cannot run new task when previous task failed

  1. Did you restart your job in the same directory? For the working dir of the driver, it’ll be added automatically.

Yes. I restarted my job in the same directory.

  1. If this still happens, do you mind trying to get a minimal reproducible script and submitting an issue for this?

I think the reason may be that the Ray cluster operator was created by the cluster administrator. I can give you the ray_cluster.yaml file that was used to create the new k8s Ray cluster, and I will try my best to put together a reproducible script. The code itself is nothing special; I use ray.init("auto") to initialize Ray.
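In the meantime, here is a minimal sketch of how the driver is structured (the task names and bodies are placeholders, not my actual workload):

import ray
from ray.exceptions import RayTaskError

# Connect to the existing cluster started by the operator.
ray.init("auto")

@ray.remote
def failing_task():
    # Placeholder for the task that fails in my real workload.
    raise RuntimeError("simulated failure")

@ray.remote
def follow_up_task():
    # Placeholder for the next task submitted after the failure.
    return "ok"

try:
    ray.get(failing_task.remote())
except RayTaskError:
    pass

# After the failure above, this is the new task that does not run.
print(ray.get(follow_up_task.remote()))

The ray_cluster.yaml that was used to create the cluster is below.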

apiVersion: cluster.ray.io/v1
kind: RayCluster
metadata:
  name: ray-cluster
spec:
  # The maximum number of worker nodes to launch in addition to the head node.
  maxWorkers: 100
  # The autoscaler will scale up the cluster faster with higher upscaling speed.
  # E.g., if the task requires adding more nodes then autoscaler will gradually
  # scale up the cluster in chunks of upscaling_speed*currently_running_nodes.
  # This number should be > 0.
  upscalingSpeed: 1.0
  # If a node is idle for this many minutes, it will be removed.
  idleTimeoutMinutes: 5
  # Specify the pod type for the ray head node (as configured below).
  headPodType: rayHead
  # Specify the allowed pod types for this ray cluster and the resources they provide.
  podTypes:
    - name: rayHead
      minWorkers: 0
      maxWorkers: 0
      rayResources: {}
      podConfig:
        apiVersion: v1
        kind: Pod
        metadata:
          generateName: ray-head-
        spec:
          imagePullSecrets:
            - name: gitlab-cr-pull-secret
            - name: regcred
          priorityClassName: high
          restartPolicy: Never
          # This volume allocates shared memory for Ray to use for its plasma
          # object store. If you do not provide this, Ray will fall back to
          # /tmp, which causes slowdowns if it is not a shared memory volume.
          volumes:
            - name: workspace-vol
              hostPath:
                path: /mnt/home/%USER/Projects/work_dir
                type: Directory 
            - name: dshm
              emptyDir:
                medium: Memory
          containers:
            - name: ray-node
              imagePullPolicy: Always
              image: "the.image:tag"
              # Do not change this command - it keeps the pod alive until it is
              # explicitly killed.
              command: ["/bin/bash", "-c", "--"]
              args: ["trap : TERM INT; sleep infinity & wait;"]
              env:
                - name: RAY_gcs_server_rpc_server_thread_num
                  value: "1"
              ports:
                - containerPort: 6379 # Redis port for Ray <= 1.10.0. GCS server port for Ray >= 1.11.0.
                - containerPort: 10001 # Used by Ray Client
                - containerPort: 8265 # Used by Ray Dashboard
                - containerPort: 8000 # Used by Ray Serve

              # This volume allocates shared memory for Ray to use for its plasma
              # object store. If you do not provide this, Ray will fall back to
              # /tmp, which causes slowdowns if it is not a shared memory volume.
              volumeMounts:
                - name: workspace-vol
                  mountPath: /home/me/app/
                  readOnly: false
                - mountPath: /dev/shm
                  name: dshm
              resources:
                requests:
                  cpu: 10
                  memory: 100Gi
                  nvidia.com/gpu: 1
                limits:
                  cpu: 10
                  # The maximum memory that this pod is allowed to use. The
                  # limit will be detected by ray and split to use 10% for
                  # redis, 30% for the shared memory object store, and the
                  # rest for application memory. If this limit is not set and
                  # the object store size is not set manually, ray will
                  # allocate a very large object store in each pod that may
                  # cause problems for other pods.
                  memory: 100Gi
                  nvidia.com/gpu: 1
          nodeSelector: {}
          tolerations: []
    - name: rayWorker
      minWorkers: 2
      maxWorkers: 2
      rayResources: {}
      podConfig:
        apiVersion: v1
        kind: Pod
        metadata:
          generateName: ray-worker-
        spec:
          imagePullSecrets:
            - name: gitlab-cr-pull-secret
            - name: regcred
          priorityClassName: high
          restartPolicy: Never
          # This volume allocates shared memory for Ray to use for its plasma
          # object store. If you do not provide this, Ray will fall back to
          # /tmp, which causes slowdowns if it is not a shared memory volume.
          volumes:
            - name: workspace-vol
              hostPath:
                path: /mnt/home/%USER/Projects/work_dir
                type: Directory
            - name: dshm
              emptyDir:
                medium: Memory
          containers:
            - name: ray-node
              imagePullPolicy: Always
              image: "the.image:tag"
              # Do not change this command - it keeps the pod alive until it is
              # explicitly killed.
              command: ["/bin/bash", "-c", "--"]
              args: ["trap : TERM INT; sleep infinity & wait;"]
              env:
                - name: RAY_gcs_server_rpc_server_thread_num
                  value: "1"
              ports:
                - containerPort: 6379 # Redis port for Ray <= 1.10.0. GCS server port for Ray >= 1.11.0.
                - containerPort: 10001 # Used by Ray Client
                - containerPort: 8265 # Used by Ray Dashboard
                - containerPort: 8000 # Used by Ray Serve

              # This volume allocates shared memory for Ray to use for its plasma
              # object store. If you do not provide this, Ray will fall back to
              # /tmp, which causes slowdowns if it is not a shared memory volume.
              volumeMounts:
                - name: workspace-vol
                  mountPath: /home/me/app/
                  readOnly: false
                - mountPath: /dev/shm
                  name: dshm
              resources:
                requests:
                  cpu: 33
                  memory: 50Gi
                  nvidia.com/gpu: 0
                limits:
                  cpu: 33
                  # The maximum memory that this pod is allowed to use. The
                  # limit will be detected by ray and split to use 10% for
                  # redis, 30% for the shared memory object store, and the
                  # rest for application memory. If this limit is not set and
                  # the object store size is not set manually, ray will
                  # allocate a very large object store in each pod that may
                  # cause problems for other pods.
                  memory: 100Gi
                  nvidia.com/gpu: 0
          nodeSelector: {}
          tolerations: []
          
  # Commands to start Ray on the head node. You don't need to change this.
  # Note dashboard-host is set to 0.0.0.0 so that Kubernetes can port forward.
  headStartRayCommands:
    - ray stop
    - ulimit -n 65536; ray start --head --port=6379 --no-monitor --dashboard-host 0.0.0.0
  # Commands to start Ray on worker nodes. You don't need to change this.
  workerStartRayCommands:
    - ray stop
    - ulimit -n 65536; ray start --address=$RAY_HEAD_IP:6379

I use the following command to create the cluster.

kubectl -n ray-cluster apply -f ray_cluster.yaml
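
After applying, the head and worker pods can be checked with:

kubectl -n ray-cluster get pods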