- Did you restart your job in the same directory? The driver's working directory is added automatically.
Yes. I restarted my job in the same directory.
- If this still happens, do you mind trying to get a minimal reproducible script and submitting an issue for this?
I think the reason may be that the Ray cluster operator was created by the cluster administrator. I can give you the ray_cluster.yaml
file that was used to create a new Kubernetes Ray cluster. I will try my best to put together reproducible code, but I think the code is nothing
unusual: I use ray.init("auto") to initialize Ray.
apiVersion: cluster.ray.io/v1
kind: RayCluster
metadata:
  name: ray-cluster
spec:
  # The maximum number of worker nodes to launch in addition to the head node.
  maxWorkers: 100
  # The autoscaler will scale up the cluster faster with higher upscaling speed.
  # E.g., if the task requires adding more nodes then the autoscaler will gradually
  # scale up the cluster in chunks of upscaling_speed * currently_running_nodes.
  # This number should be > 0.
  upscalingSpeed: 1.0
  # If a node is idle for this many minutes, it will be removed.
  idleTimeoutMinutes: 5
  # Specify the pod type for the ray head node (as configured below).
  headPodType: rayHead
  # Specify the allowed pod types for this ray cluster and the resources they provide.
  podTypes:
    - name: rayHead
      minWorkers: 0
      maxWorkers: 0
      rayResources: {}
      podConfig:
        apiVersion: v1
        kind: Pod
        metadata:
          generateName: ray-head-
        spec:
          imagePullSecrets:
            - name: gitlab-cr-pull-secret
            - name: regcred
          priorityClassName: high
          restartPolicy: Never
          # This volume allocates shared memory for Ray to use for its plasma
          # object store. If you do not provide this, Ray will fall back to
          # /tmp, which causes slowdowns if it is not a shared-memory volume.
          volumes:
            - name: workspace-vol
              hostPath:
                path: /mnt/home/%USER/Projects/work_dir
                type: Directory
            - name: dshm
              emptyDir:
                medium: Memory
          containers:
            - name: ray-node
              imagePullPolicy: Always
              image: "the.image:tag"
              # Do not change this command - it keeps the pod alive until it is
              # explicitly killed.
              command: ["/bin/bash", "-c", "--"]
              args: ["trap : TERM INT; sleep infinity & wait;"]
              env:
                - name: RAY_gcs_server_rpc_server_thread_num
                  value: "1"
              ports:
                - containerPort: 6379 # Redis port for Ray <= 1.10.0. GCS server port for Ray >= 1.11.0.
                - containerPort: 10001 # Used by Ray Client
                - containerPort: 8265 # Used by Ray Dashboard
                - containerPort: 8000 # Used by Ray Serve
              # This mount provides the shared memory for Ray's plasma object
              # store. If you do not provide it, Ray will fall back to /tmp,
              # which causes slowdowns if it is not a shared-memory volume.
              volumeMounts:
                - name: workspace-vol
                  mountPath: /home/me/app/
                  readOnly: false
                - name: dshm
                  mountPath: /dev/shm
              resources:
                requests:
                  cpu: 10
                  memory: 100Gi
                  nvidia.com/gpu: 1
                limits:
                  cpu: 10
                  # The maximum memory that this pod is allowed to use. The
                  # limit will be detected by ray and split to use 10% for
                  # redis, 30% for the shared memory object store, and the
                  # rest for application memory. If this limit is not set and
                  # the object store size is not set manually, ray will
                  # allocate a very large object store in each pod that may
                  # cause problems for other pods.
                  memory: 50Gi
                  nvidia.com/gpu: 1
          nodeSelector: {}
          tolerations: []
    - name: rayWorker
      minWorkers: 2
      maxWorkers: 2
      rayResources: {}
      podConfig:
        apiVersion: v1
        kind: Pod
        metadata:
          generateName: ray-worker-
        spec:
          imagePullSecrets:
            - name: gitlab-cr-pull-secret
            - name: regcred
          priorityClassName: high
          restartPolicy: Never
          # This volume allocates shared memory for Ray to use for its plasma
          # object store. If you do not provide this, Ray will fall back to
          # /tmp, which causes slowdowns if it is not a shared-memory volume.
          volumes:
            - name: workspace-vol
              hostPath:
                path: /mnt/home/%USER/Projects/work_dir
                type: Directory
            - name: dshm
              emptyDir:
                medium: Memory
          containers:
            - name: ray-node
              imagePullPolicy: Always
              image: "the.image:tag"
              # Do not change this command - it keeps the pod alive until it is
              # explicitly killed.
              command: ["/bin/bash", "-c", "--"]
              args: ["trap : TERM INT; sleep infinity & wait;"]
              env:
                - name: RAY_gcs_server_rpc_server_thread_num
                  value: "1"
              ports:
                - containerPort: 6379 # Redis port for Ray <= 1.10.0. GCS server port for Ray >= 1.11.0.
                - containerPort: 10001 # Used by Ray Client
                - containerPort: 8265 # Used by Ray Dashboard
                - containerPort: 8000 # Used by Ray Serve
              # This mount provides the shared memory for Ray's plasma object
              # store. If you do not provide it, Ray will fall back to /tmp,
              # which causes slowdowns if it is not a shared-memory volume.
              volumeMounts:
                - name: workspace-vol
                  mountPath: /home/me/app/
                  readOnly: false
                - name: dshm
                  mountPath: /dev/shm
              resources:
                requests:
                  cpu: 33
                  memory: 50Gi
                  nvidia.com/gpu: 0
                limits:
                  cpu: 33
                  # The maximum memory that this pod is allowed to use. The
                  # limit will be detected by ray and split to use 10% for
                  # redis, 30% for the shared memory object store, and the
                  # rest for application memory. If this limit is not set and
                  # the object store size is not set manually, ray will
                  # allocate a very large object store in each pod that may
                  # cause problems for other pods.
                  memory: 100Gi
                  nvidia.com/gpu: 0
          nodeSelector: {}
          tolerations: []
  # Commands to start Ray on the head node. You don't need to change this.
  # Note dashboard-host is set to 0.0.0.0 so that Kubernetes can port forward.
  headStartRayCommands:
    - ray stop
    - ulimit -n 65536; ray start --head --port=6379 --no-monitor --dashboard-host 0.0.0.0
  # Commands to start Ray on worker nodes. You don't need to change this.
  workerStartRayCommands:
    - ray stop
    - ulimit -n 65536; ray start --address=$RAY_HEAD_IP:6379
I use the following command to create the cluster:
kubectl -n ray_cluster apply -f ray_cluster.yaml
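After applying, these commands (assuming the same namespace as above; pod names are generated from the generateName prefixes in the YAML, so substitute the actual suffixes) show whether the operator created the head and worker pods:

```shell
# List the pods the operator created for this RayCluster.
kubectl -n ray_cluster get pods

# Tail the head pod's logs to see how Ray started (substitute the generated name).
kubectl -n ray_cluster logs -f ray-head-<suffix>
```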