(k8s) Ray Operator + Ray Client example seems not to use all pods

Hi all,

I worked through the "Deploying on Kubernetes" tutorial and wanted to check whether the behavior I observed is what is expected.

After performing the installation, the resource listings looked good:

$ kubectl -n ray get rayclusters
NAME              STATUS    RESTARTS   AGE
example-cluster   Running   0          41m
$ kubectl -n ray get pods
NAME                                    READY   STATUS    RESTARTS   AGE
example-cluster-ray-head-type-lzffq     1/1     Running   0          40m
example-cluster-ray-worker-type-rmsrv   1/1     Running   0          39m
example-cluster-ray-worker-type-xnzww   1/1     Running   0          39m
$ kubectl -n ray get service
NAME                       TYPE        CLUSTER-IP    EXTERNAL-IP   PORT(S)                       AGE
example-cluster-ray-head   ClusterIP   10.12.6.199   <none>        10001/TCP,8265/TCP,8000/TCP   41m
$ kubectl get deployment ray-operator
NAME           READY   UP-TO-DATE   AVAILABLE   AGE
ray-operator   1/1     1            1           41m
$ kubectl get pod -l cluster.ray.io/component=operator
NAME                            READY   STATUS    RESTARTS   AGE
ray-operator-799f457484-wzqkg   1/1     Running   0          42m
$ kubectl get crd rayclusters.cluster.ray.io
NAME                         CREATED AT
rayclusters.cluster.ray.io   2021-06-25T18:18:32Z

Then, after forwarding the Ray Client server port and running run_local_example.py, I received the following output:

Iteration 0
Counter({('example-cluster-ray-head-type-lzffq', 'example-cluster-ray-head-type-lzffq'): 100})
Iteration 1
Counter({('example-cluster-ray-head-type-lzffq', 'example-cluster-ray-head-type-lzffq'): 100})
Iteration 2
Counter({('example-cluster-ray-head-type-lzffq', 'example-cluster-ray-head-type-lzffq'): 100})
Iteration 3
Counter({('example-cluster-ray-head-type-lzffq', 'example-cluster-ray-head-type-lzffq'): 100})
Iteration 4
Counter({('example-cluster-ray-head-type-lzffq', 'example-cluster-ray-head-type-lzffq'): 100})
Iteration 5
Counter({('example-cluster-ray-head-type-lzffq', 'example-cluster-ray-head-type-lzffq'): 100})
Iteration 6
Counter({('example-cluster-ray-head-type-lzffq', 'example-cluster-ray-head-type-lzffq'): 100})
Iteration 7
Counter({('example-cluster-ray-head-type-lzffq', 'example-cluster-ray-head-type-lzffq'): 100})
Iteration 8
Counter({('example-cluster-ray-head-type-lzffq', 'example-cluster-ray-head-type-lzffq'): 100})
Iteration 9
Counter({('example-cluster-ray-head-type-lzffq', 'example-cluster-ray-head-type-lzffq'): 100})
Success!

It seems that only the head pod was used and that the two worker pods were not. Is this the expected behavior, and am I interpreting the results correctly? If it is not expected, how might it be resolved?
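One way to check that reading of the output is to compare the pods listed by kubectl against the hostnames that show up in the printed Counter. The following is only a minimal sketch: unused_pods is a hypothetical helper name, not part of Ray or the example script, and it assumes each Counter key is a tuple of hostnames of the pods that ran the tasks.

```python
from collections import Counter

# Pod names copied from the `kubectl -n ray get pods` output above.
EXPECTED_PODS = {
    "example-cluster-ray-head-type-lzffq",
    "example-cluster-ray-worker-type-rmsrv",
    "example-cluster-ray-worker-type-xnzww",
}


def unused_pods(expected, pair_counts):
    """Return the expected pods that never appear in any hostname tuple.

    Hypothetical helper for illustration; not part of Ray.
    """
    seen = {host for hosts in pair_counts for host in hosts}
    return set(expected) - seen


# The Counter printed for every iteration of the run above.
first_run = Counter({
    ("example-cluster-ray-head-type-lzffq",
     "example-cluster-ray-head-type-lzffq"): 100,
})

idle = unused_pods(EXPECTED_PODS, first_run)
print(sorted(idle))
```

For the output above this reports both worker pods as idle, which matches the interpretation that only the head pod executed tasks.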

Thanks!

After experimenting with the example a bit, I found that increasing the sleep time in gethostname seemed to induce the expected behavior. With the sleep changed from 0.01 to 2 seconds as follows:

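import ray

@ray.remote
def gethostname(x):
    # These imports run on whichever pod executes the task.
    import platform
    import time
    time.sleep(2)  # was 0.01 before the change
    return x + (platform.node(), )  # append this pod's hostname to the tuple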

results were more in line with what I expected:

Iteration 0
Counter({('example-cluster-ray-worker-type-xnzww', 'example-cluster-ray-worker-type-xnzww'): 8, ('example-cluster-ray-worker-type-xnzww', 'example-cluster-ray-worker-type-99rcx'): 8, ('example-cluster-ray-worker-type-99rcx', 'example-cluster-ray-worker-type-99rcx'): 7, ('example-cluster-ray-head-type-lzffq', 'example-cluster-ray-head-type-lzffq'): 7, ('example-cluster-ray-head-type-lzffq', 'example-cluster-ray-worker-type-xnzww'): 7, ('example-cluster-ray-worker-type-rmsrv', 'example-cluster-ray-worker-type-rmsrv'): 7, ('example-cluster-ray-worker-type-rmsrv', 'example-cluster-ray-worker-type-xnzww'): 7, ('example-cluster-ray-head-type-lzffq', 'example-cluster-ray-worker-type-rmsrv'): 7, ('example-cluster-ray-worker-type-99rcx', 'example-cluster-ray-head-type-lzffq'): 7, ('example-cluster-ray-worker-type-rmsrv', 'example-cluster-ray-head-type-lzffq'): 6, ('example-cluster-ray-worker-type-rmsrv', 'example-cluster-ray-worker-type-99rcx'): 6, ('example-cluster-ray-worker-type-99rcx', 'example-cluster-ray-worker-type-rmsrv'): 6, ('example-cluster-ray-head-type-lzffq', 'example-cluster-ray-worker-type-99rcx'): 5, ('example-cluster-ray-worker-type-99rcx', 'example-cluster-ray-worker-type-xnzww'): 4, ('example-cluster-ray-worker-type-xnzww', 'example-cluster-ray-head-type-lzffq'): 4, ('example-cluster-ray-worker-type-xnzww', 'example-cluster-ray-worker-type-rmsrv'): 4})
Iteration 1
Counter({('example-cluster-ray-head-type-lzffq', 'example-cluster-ray-head-type-lzffq'): 20, ('example-cluster-ray-worker-type-xnzww', 'example-cluster-ray-worker-type-xnzww'): 20, ('example-cluster-ray-worker-type-rmsrv', 'example-cluster-ray-worker-type-rmsrv'): 18, ('example-cluster-ray-worker-type-99rcx', 'example-cluster-ray-worker-type-99rcx'): 18, ('example-cluster-ray-worker-type-99rcx', 'example-cluster-ray-worker-type-rmsrv'): 4, ('example-cluster-ray-worker-type-rmsrv', 'example-cluster-ray-worker-type-99rcx'): 4, ('example-cluster-ray-worker-type-xnzww', 'example-cluster-ray-worker-type-99rcx'): 2, ('example-cluster-ray-worker-type-rmsrv', 'example-cluster-ray-head-type-lzffq'): 2, ('example-cluster-ray-head-type-lzffq', 'example-cluster-ray-worker-type-rmsrv'): 2, ('example-cluster-ray-head-type-lzffq', 'example-cluster-ray-worker-type-xnzww'): 2, ('example-cluster-ray-worker-type-xnzww', 'example-cluster-ray-head-type-lzffq'): 2, ('example-cluster-ray-worker-type-99rcx', 'example-cluster-ray-worker-type-xnzww'): 2, ('example-cluster-ray-head-type-lzffq', 'example-cluster-ray-worker-type-99rcx'): 1, ('example-cluster-ray-worker-type-99rcx', 'example-cluster-ray-head-type-lzffq'): 1, ('example-cluster-ray-worker-type-rmsrv', 'example-cluster-ray-worker-type-xnzww'): 1, ('example-cluster-ray-worker-type-xnzww', 'example-cluster-ray-worker-type-rmsrv'): 1})

And so on for the remaining iterations. Could it be that, on some cluster infrastructure, the 0.01-second sleep falls below the threshold needed to trigger the kind of object passing between pods that the tutorial hopes to induce? (I am using GKE.)
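To put a number on how much object passing between pods an iteration actually exercised, the printed Counter can be collapsed into tasks per pod and a count of pairs whose two calls ran on different pods. This is a rough sketch, not part of the example script: summarize is a hypothetical helper, and it assumes each tuple lists the pod that ran the first gethostname call followed by the pod that ran the second, which is what x + (platform.node(), ) suggests.

```python
from collections import Counter


def summarize(pair_counts):
    """Return (tasks per pod, pairs whose two calls ran on different pods).

    Hypothetical helper for illustration; not part of Ray.
    """
    per_pod = Counter()
    cross_pod = 0
    for hosts, n in pair_counts.items():
        for host in hosts:
            per_pod[host] += n
        if len(set(hosts)) > 1:
            cross_pod += n
    return per_pod, cross_pod


# A few entries copied from Iteration 1 of the 2-second run (not the full Counter).
subset = Counter({
    ("example-cluster-ray-head-type-lzffq",
     "example-cluster-ray-head-type-lzffq"): 20,
    ("example-cluster-ray-worker-type-xnzww",
     "example-cluster-ray-worker-type-xnzww"): 20,
    ("example-cluster-ray-worker-type-99rcx",
     "example-cluster-ray-worker-type-rmsrv"): 4,
})

per_pod, cross_pod = summarize(subset)
print(per_pod)
print(cross_pod, "of", sum(subset.values()), "pairs crossed pods")
```

Applied to the 0.01-second run, where every key is (head, head), the cross-pod count is 0, so no object would have moved between pods in that run.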