Yes and no: it is currently working with a network load balancer, but without SSL. We added to the configuration below to enable the autoscaler and leverage Karpenter nodes. The key thing to note in the NLB definition is the selector for the Ray head node, “ray.io/identifier: raycluster-complete-head”. Depending on your use case, and since you already have the dashboard endpoint exposed, you may be able to use Ray Job - Python SDK — Ray 2.0.0 instead. Our use case seemed better suited for interactive development, though, which is why we needed to expose the client: we wanted a better local development experience for data scientists who need more compute or want to do distributed processing/training. We also use PrefectHQ/prefect-ray: Prefect integrations with Ray (github.com), which requires access to the client.
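For reference, with the client port (10001) exposed through the NLB, connecting from a laptop is a single Ray Client call. A minimal sketch, assuming a placeholder NLB DNS name (substitute your own) and a local Ray install matching the cluster version (2.0.0 on Python 3.9 here):

import ray

# Placeholder internal NLB DNS name from the Service below; replace with yours.
ray.init("ray://ray-cluster-nlb-xxxx.elb.us-east-1.amazonaws.com:10001")

@ray.remote
def square(x):
    return x * x

# Tasks run on the cluster's workers; results come back to the laptop.
print(ray.get([square.remote(i) for i in range(4)]))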
If you have any questions around the EKS setup itself, let me know.
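On the prefect-ray point above, the flow just needs its task runner pointed at the same client endpoint. A rough sketch, assuming a recent Prefect 2.x and the same placeholder NLB address:

from prefect import flow, task
from prefect_ray.task_runners import RayTaskRunner

@task
def double(x):
    return 2 * x

# The address is a placeholder; it targets the NLB's client port (10001).
@flow(task_runner=RayTaskRunner(address="ray://ray-cluster-nlb-xxxx.elb.us-east-1.amazonaws.com:10001"))
def demo_flow():
    futures = [double.submit(i) for i in range(4)]
    return [f.result() for f in futures]

if __name__ == "__main__":
    print(demo_flow())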
ray-complete.yaml
# The resource requests and limits in this config are too small for production!
# For examples with more realistic resource configuration, see
# ray-cluster.complete.large.yaml and
# ray-cluster.autoscaler.large.yaml.
apiVersion: ray.io/v1alpha1
kind: RayCluster
metadata:
  labels:
    controller-tools.k8s.io: "1.0"
  # A unique identifier for the head node and workers of this cluster.
  name: raycluster-complete
spec:
  rayVersion: '2.0.0'
  ######################headGroupSpec#################################
  # Head group template and specs.
  headGroupSpec:
    # Kubernetes Service type; valid values are 'ClusterIP', 'NodePort' and 'LoadBalancer'.
    serviceType: NodePort
    # For the head group, replicas should always be 1.
    # headGroupSpec.replicas is deprecated in KubeRay >= 0.3.0.
    replicas: 1
    # The following params are used to complete the ray start command: ray start --head --block --dashboard-host='0.0.0.0' ...
    rayStartParams:
      dashboard-host: '0.0.0.0'
      block: 'true'
    # Pod template
    template:
      metadata:
        labels:
          # Custom labels. NOTE: do not define custom labels starting with `raycluster.`; they may be used by the controller.
          # Refer to https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/
          rayCluster: raycluster-complete # will be injected if missing
          rayNodeType: head # will be injected if missing; must be head or worker
          groupName: headgroup # will be injected if missing
        # Annotations for the pod
        annotations:
          key: value
      spec:
        containers:
          - name: ray-head
            image: rayproject/ray:2.0.0-py39-cpu
            ports:
              - containerPort: 6379
                name: gcs
              - containerPort: 8265
                name: dashboard
              - containerPort: 10001
                name: client
            lifecycle:
              preStop:
                exec:
                  command: ["/bin/sh", "-c", "ray stop"]
            volumeMounts:
              - mountPath: /tmp/ray
                name: ray-logs
            resources:
              limits:
                cpu: "1"
                memory: "2G"
              requests:
                cpu: "500m"
                memory: "1G"
        volumes:
          - name: ray-logs
            emptyDir: {}
  workerGroupSpecs:
    # The pod replicas in this group are typed as workers.
    - replicas: 1
      minReplicas: 1
      maxReplicas: 10
      # Logical group name; here called large-group, but it can also be functional.
      groupName: large-group
      # If worker pods need to be added, we simply increment the replicas.
      # If worker pods need to be removed, we decrement the replicas and populate the workersToDelete list.
      # The operator will remove pods from the list until the number of replicas is satisfied.
      # When a pod is confirmed to be deleted, its name is removed from the list below.
      #scaleStrategy:
      #  workersToDelete:
      #    - raycluster-complete-worker-large-group-bdtwh
      #    - raycluster-complete-worker-large-group-hv457
      #    - raycluster-complete-worker-large-group-k8tj7
      # The following params are used to complete the ray start command: ray start --block ...
      rayStartParams:
        block: 'true'
      # Pod template
      template:
        metadata:
          labels:
            rayCluster: raycluster-complete # will be injected if missing
            rayNodeType: worker # will be injected if missing
            groupName: large-group # will be injected if missing
        spec:
          containers:
            - name: machine-learning # must consist of lowercase alphanumeric characters or '-', and must start and end with an alphanumeric character (e.g. 'my-name' or '123-abc')
              image: rayproject/ray:2.0.0-py39-cpu
              # Environment variables to set in the container. Optional.
              # Refer to https://kubernetes.io/docs/tasks/inject-data-application/define-environment-variable-container/
              lifecycle:
                preStop:
                  exec:
                    command: ["/bin/sh", "-c", "ray stop"]
              # Use volumeMounts. Optional.
              # Refer to https://kubernetes.io/docs/concepts/storage/volumes/
              volumeMounts:
                - mountPath: /tmp/ray
                  name: ray-logs
              resources:
                limits:
                  cpu: "1"
                  memory: "512Mi"
                requests:
                  cpu: "500m"
                  memory: "256Mi"
          initContainers:
            # The env var $RAY_IP is set by the operator if missing, with the value of the head service name.
            - name: init-myservice
              image: busybox:1.28
              # Change the cluster domain suffix ('cluster.local') if your cluster doesn't use the default.
              command: ['sh', '-c', "until nslookup $RAY_IP.$(cat /var/run/secrets/kubernetes.io/serviceaccount/namespace).svc.cluster.local; do echo waiting for myservice; sleep 2; done"]
          # Use volumes.
          # Refer to https://kubernetes.io/docs/concepts/storage/volumes/
          volumes:
            - name: ray-logs
              emptyDir: {}
ray-nlb.yaml
apiVersion: v1
kind: Service
metadata:
  name: ray-lb
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-name: ray-cluster-nlb
    service.beta.kubernetes.io/aws-load-balancer-type: nlb
    service.beta.kubernetes.io/aws-load-balancer-proxy-protocol: "*"
    service.beta.kubernetes.io/aws-load-balancer-internal: "true"
    service.beta.kubernetes.io/aws-load-balancer-target-group-attributes: preserve_client_ip.enabled=true
spec:
  type: LoadBalancer
  ports:
    - name: ray-api
      port: 10001
      protocol: TCP
      targetPort: 10001
    - name: ray-dashboard
      port: 8265
      protocol: TCP
      targetPort: 8265
  selector:
    ray.io/cluster: raycluster-complete
    ray.io/identifier: raycluster-complete-head
    ray.io/node-type: head
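And since the NLB also forwards the dashboard port (8265), the Ray Jobs SDK route mentioned above would look roughly like this (a sketch; the address and entrypoint script are placeholders):

from ray.job_submission import JobSubmissionClient

# Placeholder NLB DNS name; 8265 is the dashboard port forwarded by the Service above.
client = JobSubmissionClient("http://ray-cluster-nlb-xxxx.elb.us-east-1.amazonaws.com:8265")
job_id = client.submit_job(
    entrypoint="python my_script.py",  # hypothetical entrypoint
    runtime_env={"working_dir": "./"},
)
print(client.get_job_status(job_id))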