(k8s) Ray Operator + Ray Client example seems to not use all pods

twillkens · June 25, 2021, 7:19pm

Hi all,

I worked through the "Deploying on Kubernetes" tutorial and wanted to check whether the behavior I observed is expected.

After performing the installation, the resources output looks good:

$ kubectl -n ray get rayclusters
NAME              STATUS    RESTARTS   AGE
example-cluster   Running   0          41m
$ kubectl -n ray get pods
NAME                                    READY   STATUS    RESTARTS   AGE
example-cluster-ray-head-type-lzffq     1/1     Running   0          40m
example-cluster-ray-worker-type-rmsrv   1/1     Running   0          39m
example-cluster-ray-worker-type-xnzww   1/1     Running   0          39m
$ kubectl -n ray get service
NAME                       TYPE        CLUSTER-IP    EXTERNAL-IP   PORT(S)                       AGE
example-cluster-ray-head   ClusterIP   10.12.6.199   <none>        10001/TCP,8265/TCP,8000/TCP   41m
$ kubectl get deployment ray-operator
NAME           READY   UP-TO-DATE   AVAILABLE   AGE
ray-operator   1/1     1            1           41m
$ kubectl get pod -l cluster.ray.io/component=operator
NAME                            READY   STATUS    RESTARTS   AGE
ray-operator-799f457484-wzqkg   1/1     Running   0          42m
$ kubectl get crd rayclusters.cluster.ray.io
NAME                         CREATED AT
rayclusters.cluster.ray.io   2021-06-25T18:18:32Z

Then, after forwarding the Ray Client server port and running run_local_example.py, I received the following output:

Iteration 0
Counter({('example-cluster-ray-head-type-lzffq', 'example-cluster-ray-head-type-lzffq'): 100})
Iteration 1
Counter({('example-cluster-ray-head-type-lzffq', 'example-cluster-ray-head-type-lzffq'): 100})
Iteration 2
Counter({('example-cluster-ray-head-type-lzffq', 'example-cluster-ray-head-type-lzffq'): 100})
Iteration 3
Counter({('example-cluster-ray-head-type-lzffq', 'example-cluster-ray-head-type-lzffq'): 100})
Iteration 4
Counter({('example-cluster-ray-head-type-lzffq', 'example-cluster-ray-head-type-lzffq'): 100})
Iteration 5
Counter({('example-cluster-ray-head-type-lzffq', 'example-cluster-ray-head-type-lzffq'): 100})
Iteration 6
Counter({('example-cluster-ray-head-type-lzffq', 'example-cluster-ray-head-type-lzffq'): 100})
Iteration 7
Counter({('example-cluster-ray-head-type-lzffq', 'example-cluster-ray-head-type-lzffq'): 100})
Iteration 8
Counter({('example-cluster-ray-head-type-lzffq', 'example-cluster-ray-head-type-lzffq'): 100})
Iteration 9
Counter({('example-cluster-ray-head-type-lzffq', 'example-cluster-ray-head-type-lzffq'): 100})
Success!

It seems that only the head pod ran tasks and the two worker pods went unused. Is this the expected behavior, and am I interpreting the results correctly? If it is not expected, how can I resolve it?
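One way to check this reading of the Counter output is to total how often each pod name appears across the tuples. This is only a sketch: it assumes each key is a tuple of hostnames, one per task in the chain, as the output suggests, and `tasks_per_pod` is a hypothetical helper that is not part of the Ray example.

```python
from collections import Counter


def tasks_per_pod(results):
    """Sum how many task executions each pod contributed to a results Counter.

    Hypothetical helper: assumes every key is a tuple of hostnames and
    every value is how many times that tuple was returned.
    """
    per_pod = Counter()
    for hostnames, count in results.items():
        for host in hostnames:
            per_pod[host] += count
    return per_pod


# Iteration 0 from the output above: every tuple names only the head pod.
iteration_0 = Counter({
    ("example-cluster-ray-head-type-lzffq",
     "example-cluster-ray-head-type-lzffq"): 100,
})

per_pod = tasks_per_pod(iteration_0)
print(per_pod)
```

For iteration 0 this reports 200 task executions, all on the head pod, which matches the impression that the workers were never used.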

Thanks!

twillkens · June 25, 2021, 8:32pm

After playing around with the example a bit, I found that increasing the sleep time seemed to induce the expected behavior. With the sleep changed from 0.01 to 2 seconds, as follows:

import ray

@ray.remote
def gethostname(x):
    import platform
    import time
    time.sleep(2)  # was 0.01 in the original example
    # Append this node's hostname to the incoming tuple.
    return x + (platform.node(), )

results were more in line with what I expected:

Iteration 0
Counter({('example-cluster-ray-worker-type-xnzww', 'example-cluster-ray-worker-type-xnzww'): 8, ('example-cluster-ray-worker-type-xnzww', 'example-cluster-ray-worker-type-99rcx'): 8, ('example-cluster-ray-worker-type-99rcx', 'example-cluster-ray-worker-type-99rcx'): 7, ('example-cluster-ray-head-type-lzffq', 'example-cluster-ray-head-type-lzffq'): 7, ('example-cluster-ray-head-type-lzffq', 'example-cluster-ray-worker-type-xnzww'): 7, ('example-cluster-ray-worker-type-rmsrv', 'example-cluster-ray-worker-type-rmsrv'): 7, ('example-cluster-ray-worker-type-rmsrv', 'example-cluster-ray-worker-type-xnzww'): 7, ('example-cluster-ray-head-type-lzffq', 'example-cluster-ray-worker-type-rmsrv'): 7, ('example-cluster-ray-worker-type-99rcx', 'example-cluster-ray-head-type-lzffq'): 7, ('example-cluster-ray-worker-type-rmsrv', 'example-cluster-ray-head-type-lzffq'): 6, ('example-cluster-ray-worker-type-rmsrv', 'example-cluster-ray-worker-type-99rcx'): 6, ('example-cluster-ray-worker-type-99rcx', 'example-cluster-ray-worker-type-rmsrv'): 6, ('example-cluster-ray-head-type-lzffq', 'example-cluster-ray-worker-type-99rcx'): 5, ('example-cluster-ray-worker-type-99rcx', 'example-cluster-ray-worker-type-xnzww'): 4, ('example-cluster-ray-worker-type-xnzww', 'example-cluster-ray-head-type-lzffq'): 4, ('example-cluster-ray-worker-type-xnzww', 'example-cluster-ray-worker-type-rmsrv'): 4})
Iteration 1
Counter({('example-cluster-ray-head-type-lzffq', 'example-cluster-ray-head-type-lzffq'): 20, ('example-cluster-ray-worker-type-xnzww', 'example-cluster-ray-worker-type-xnzww'): 20, ('example-cluster-ray-worker-type-rmsrv', 'example-cluster-ray-worker-type-rmsrv'): 18, ('example-cluster-ray-worker-type-99rcx', 'example-cluster-ray-worker-type-99rcx'): 18, ('example-cluster-ray-worker-type-99rcx', 'example-cluster-ray-worker-type-rmsrv'): 4, ('example-cluster-ray-worker-type-rmsrv', 'example-cluster-ray-worker-type-99rcx'): 4, ('example-cluster-ray-worker-type-xnzww', 'example-cluster-ray-worker-type-99rcx'): 2, ('example-cluster-ray-worker-type-rmsrv', 'example-cluster-ray-head-type-lzffq'): 2, ('example-cluster-ray-head-type-lzffq', 'example-cluster-ray-worker-type-rmsrv'): 2, ('example-cluster-ray-head-type-lzffq', 'example-cluster-ray-worker-type-xnzww'): 2, ('example-cluster-ray-worker-type-xnzww', 'example-cluster-ray-head-type-lzffq'): 2, ('example-cluster-ray-worker-type-99rcx', 'example-cluster-ray-worker-type-xnzww'): 2, ('example-cluster-ray-head-type-lzffq', 'example-cluster-ray-worker-type-99rcx'): 1, ('example-cluster-ray-worker-type-99rcx', 'example-cluster-ray-head-type-lzffq'): 1, ('example-cluster-ray-worker-type-rmsrv', 'example-cluster-ray-worker-type-xnzww'): 1, ('example-cluster-ray-worker-type-xnzww', 'example-cluster-ray-worker-type-rmsrv'): 1})

And so on for the remaining iterations. Could it be that on some cluster infrastructures a 0.01-second sleep is too short to trigger the cross-node object passing that the tutorial is meant to demonstrate? (I am using GKE.)
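To put a number on how much object passing actually happened, the pairs can be split into those where both hostnames match and those where they differ. The sketch below does this for iteration 1, assuming a tuple with two different hostnames means the result crossed between pods. The pod names are shortened to their suffixes for readability, and `split_same_vs_cross` is a hypothetical helper.

```python
from collections import Counter


def split_same_vs_cross(results):
    """Return (same_pod, cross_pod) totals for a Counter of hostname pairs.

    Hypothetical helper: a pair counts as cross-pod when its two
    hostnames differ.
    """
    same = sum(n for (first, second), n in results.items() if first == second)
    cross = sum(n for (first, second), n in results.items() if first != second)
    return same, cross


# Iteration 1 from the output above, with pod names shortened to suffixes.
iteration_1 = Counter({
    ("head", "head"): 20, ("xnzww", "xnzww"): 20,
    ("rmsrv", "rmsrv"): 18, ("99rcx", "99rcx"): 18,
    ("99rcx", "rmsrv"): 4, ("rmsrv", "99rcx"): 4,
    ("xnzww", "99rcx"): 2, ("rmsrv", "head"): 2,
    ("head", "rmsrv"): 2, ("head", "xnzww"): 2,
    ("xnzww", "head"): 2, ("99rcx", "xnzww"): 2,
    ("head", "99rcx"): 1, ("99rcx", "head"): 1,
    ("rmsrv", "xnzww"): 1, ("xnzww", "rmsrv"): 1,
})

same, cross = split_same_vs_cross(iteration_1)
print(f"same-pod pairs: {same}, cross-pod pairs: {cross}")
```

For iteration 1 this gives 76 same-pod pairs and 24 cross-pod pairs, compared with 100 same-pod pairs and no cross-pod pairs in every iteration of the 0.01-second run.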
