Error Scaling Ray Serve to 2 Replicas

Hello community. I'm new to both Ray and Kubernetes, and yet here I am, trying to set up a Ray Serve service on Kubernetes.

I managed to wade through some of the initial learning curve and set up a local Kubernetes cluster with Minikube, with 2 nodes. I then set up a simple Ray Serve deployment following this quickstart: Deploying Ray Serve — Ray v2.0.0.dev0, and it worked well with just 1 replica.
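For reference, the deployment was essentially the quickstart's hello example, roughly like the sketch below (illustrative only; the names and details are assumptions rather than my exact script):

import ray
from ray import serve

# Minimal quickstart-style Serve deployment (illustrative sketch).
ray.init(address="auto")    # connect to the running Ray cluster
serve.start(detached=True)  # start Serve (or attach to a running instance)

@serve.deployment           # one replica by default; each replica requests 1 CPU
def hello(request):
    return "hello world"

hello.deploy()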

When I added "num_replicas=2" to the serve.deployment decorator (i.e., @serve.deployment(num_replicas=2)), all of a sudden I'm getting this message:

WARNING backend_state.py:928 -- Backend 'hello' has 1 replicas that have taken more than 30s to start up. This may be caused by waiting for the cluster to auto-scale or because the constructor is slow. Resources required for each replica: {'CPU': 1}, resources available: {}

I couldn't find any other error messages, but maybe there are logs I'm not aware of? I tried running another Minikube example and was able to spin up 2 replicas of an API endpoint across the 2 nodes, so other programs seem to run fine on my cluster. I'm confused by this message and would appreciate any help understanding what's going on.

Thanks!

@omnific9 this error message means that Serve is trying to create the requested replica but there aren’t currently enough resources in the cluster (given that you’re running on k8s, this means there aren’t enough Pods created). Assuming you’re using the Kubernetes operator / helm chart, the autoscaler should be creating more Pods to satisfy this request but it seems like it might not be.
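A quick first check (assuming the helm chart's default "ray" namespace) is whether worker Pods are being created at all:

# Watch the Ray pods; worker pods should appear and reach Running
kubectl -n ray get pods -w
# If a worker pod shows up but never becomes Ready, its events usually say why
kubectl -n ray describe pod <worker pod>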

One thing you can do to figure out what's going on is run ray status on the head node; that should give you a periodic log from the autoscaler showing what it's trying to do. You can also see the autoscaler logs in /tmp/ray/session_latest/logs/autoscaler.{log,err}. If you post the output of either of those, I should be able to help more!

@eoakes thanks for the response!

I ran the following command to get the status:

(venv) ➜ ray git:(releases/1.4.1) ray status --address localhost:6379 --redis_password ""
No cluster status.

So a couple of things here. My local redis cluster doesn’t have a password: is this the correct parameter to pass to the status command? Secondly, there seems to be no cluster status, and I wonder why that is.

Also, I'm on a MacBook and there's no /tmp/ray folder locally.

Are these things that I should’ve set up manually, or is there something wrong with my ray setup?

I think /tmp/ray/session_latest/logs/autoscaler.{log,err} would be in the filesystem of the nodes in the Kubernetes cluster. Do you have a way of accessing that? cc'ing @Dmitri for Kubernetes/autoscaler help
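For example, something like this should work (assuming the chart's "ray" namespace; substitute your head pod's name):

# List the Ray log directory inside the head pod
kubectl -n ray exec <ray head pod> -- ls /tmp/ray/session_latest/logs
# Or copy the whole log directory out for local inspection
kubectl cp ray/<ray head pod>:/tmp/ray/session_latest/logs ./ray-logs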

On K8s, to get autoscaling status, it’s
kubectl -n <ray namespace> exec <ray head pod> -- ray status

If using the helm chart / operator, to get the full autoscaling logs, it’s
kubectl -n <operator namespace> logs <operator pod>

Thanks for getting back to me, @architkulkarni and @Dmitri.

I attached to the pod, but there doesn't seem to be any autoscaler.* file. I've copied the list of all the files under /tmp/ray/session_latest/logs/ below:
(base) ray@example-cluster-ray-head-j2bqx:~$ vi /tmp/ray/session_latest/logs/
dashboard.log redis-shard_0.out
dashboard_agent.log redis.err
gcs_server.err redis.out
gcs_server.out worker-10e4f82206fd98f0eb80337c7959b55ce475b5459223841dc5dfb0b2-01000000-552.err
log_monitor.log worker-10e4f82206fd98f0eb80337c7959b55ce475b5459223841dc5dfb0b2-01000000-552.out
monitor.err worker-15f2b2a5f40a30c9fd730facd5d02a1463d42b086fb3c8e9c5067755-01000000-887.err
monitor.log worker-15f2b2a5f40a30c9fd730facd5d02a1463d42b086fb3c8e9c5067755-01000000-887.out
monitor.out worker-1d42d3ae8f262bf75e4cdc02ea834d7726d1a7b43471c4e7ad80f76b-01000000-368.err
old/ worker-1d42d3ae8f262bf75e4cdc02ea834d7726d1a7b43471c4e7ad80f76b-01000000-368.out
python-core-driver-01000000ffffffffffffffffffffffffffffffffffffffffffffffff_334.log worker-20120211afcd1281c52f4cae64d2493ba251a112d484f8444d4c7608-01000000-755.err
python-core-driver-02000000ffffffffffffffffffffffffffffffffffffffffffffffff_519.log worker-20120211afcd1281c52f4cae64d2493ba251a112d484f8444d4c7608-01000000-755.out
python-core-driver-03000000ffffffffffffffffffffffffffffffffffffffffffffffff_687.log worker-3484d04f1f74910e7c225783f4ac12547d7b60d69d2c7965ddd45761-01000000-447.err
python-core-driver-04000000ffffffffffffffffffffffffffffffffffffffffffffffff_854.log worker-3484d04f1f74910e7c225783f4ac12547d7b60d69d2c7965ddd45761-01000000-447.out
python-core-worker-10e4f82206fd98f0eb80337c7959b55ce475b5459223841dc5dfb0b2_552.log worker-55ad6b7f62cb49237812276a3d96afc175566dabc3471b9e78bc7da2-01000000-888.err
python-core-worker-15f2b2a5f40a30c9fd730facd5d02a1463d42b086fb3c8e9c5067755_887.log worker-55ad6b7f62cb49237812276a3d96afc175566dabc3471b9e78bc7da2-01000000-888.out
python-core-worker-1d42d3ae8f262bf75e4cdc02ea834d7726d1a7b43471c4e7ad80f76b_368.log worker-5eb829eaeacced6d9f9ca863a2b4951d6da7472c56361889592f9418-01000000-614.err
python-core-worker-20120211afcd1281c52f4cae64d2493ba251a112d484f8444d4c7608_755.log worker-5eb829eaeacced6d9f9ca863a2b4951d6da7472c56361889592f9418-01000000-614.out
python-core-worker-3484d04f1f74910e7c225783f4ac12547d7b60d69d2c7965ddd45761_447.log worker-61e1b794bb4aae224d03ab1c28802f5bef0a8b3174c3be664c65a988-01000000-982.err
python-core-worker-55ad6b7f62cb49237812276a3d96afc175566dabc3471b9e78bc7da2_888.log worker-61e1b794bb4aae224d03ab1c28802f5bef0a8b3174c3be664c65a988-01000000-982.out
python-core-worker-5eb829eaeacced6d9f9ca863a2b4951d6da7472c56361889592f9418_614.log worker-7a7889e06bebf2494651bd1a586892f01f8f31d26792b5511d65e94c-01000000-555.err
python-core-worker-61e1b794bb4aae224d03ab1c28802f5bef0a8b3174c3be664c65a988_982.log worker-7a7889e06bebf2494651bd1a586892f01f8f31d26792b5511d65e94c-01000000-555.out
python-core-worker-7a7889e06bebf2494651bd1a586892f01f8f31d26792b5511d65e94c_555.log worker-92060c48bf0e3414f036b5092bdf6928c064d43d827dacf5c7c9dbbe-01000000-889.err
python-core-worker-92060c48bf0e3414f036b5092bdf6928c064d43d827dacf5c7c9dbbe_889.log worker-92060c48bf0e3414f036b5092bdf6928c064d43d827dacf5c7c9dbbe-01000000-889.out
python-core-worker-92c19dc12e47d97c46259208152f8204fe22ac5ff1055e7e2369b288_752.log worker-92c19dc12e47d97c46259208152f8204fe22ac5ff1055e7e2369b288-01000000-752.err
python-core-worker-c57e0eba806f6e3b7edf903eec28466dd57d52ea78b209043d276a54_395.log worker-92c19dc12e47d97c46259208152f8204fe22ac5ff1055e7e2369b288-01000000-752.out
python-core-worker-dfea5b24e8dd6d2029b091119b37cfcb4c6537a9ed217f7e451febab_890.log worker-c57e0eba806f6e3b7edf903eec28466dd57d52ea78b209043d276a54-01000000-395.err
python-core-worker-f5e3465db81b4d0b69fb85881ae55723e51c3ea707382b77b059e89a_720.log worker-c57e0eba806f6e3b7edf903eec28466dd57d52ea78b209043d276a54-01000000-395.out
python-core-worker-f6d3eb7eeeb69c8b014e28db15e0d2e82bad337e67416f55504b96e0_396.log worker-dfea5b24e8dd6d2029b091119b37cfcb4c6537a9ed217f7e451febab-01000000-890.err
ray_client_server.err worker-dfea5b24e8dd6d2029b091119b37cfcb4c6537a9ed217f7e451febab-01000000-890.out
ray_client_server.out worker-f5e3465db81b4d0b69fb85881ae55723e51c3ea707382b77b059e89a-01000000-720.err
raylet.err worker-f5e3465db81b4d0b69fb85881ae55723e51c3ea707382b77b059e89a-01000000-720.out
raylet.out worker-f6d3eb7eeeb69c8b014e28db15e0d2e82bad337e67416f55504b96e0-01000000-396.err
redis-shard_0.err worker-f6d3eb7eeeb69c8b014e28db15e0d2e82bad337e67416f55504b96e0-01000000-396.out

As for @Dmitri's method, here's the status right after I tried running 5 replicas:

➜ ray git:(releases/1.4.1) ✗ kubectl -n ray exec example-cluster-ray-head-j2bqx -- ray status

======== Autoscaler status: 2021-08-10 16:41:14.985306 ========
Node status
---------------------------------------------------------------
Healthy:
 1 head_node
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------

Usage:
 4.0/4.0 CPU
 0.0/1.0 CPU_group_0_1a05a614d31404430b219ade936e9e03
 0.0/1.0 CPU_group_0_68f029c34779e27ecdc0ab25c8c5c6ea
 0.0/1.0 CPU_group_0_a56f2612e58b97d6310a9db96c6aa948
 0.0/1.0 CPU_group_0_ea8b9e5878f0d39ea69a6cea41f1d919
 1.0/1.0 CPU_group_1a05a614d31404430b219ade936e9e03
 1.0/1.0 CPU_group_68f029c34779e27ecdc0ab25c8c5c6ea
 1.0/1.0 CPU_group_a56f2612e58b97d6310a9db96c6aa948
 1.0/1.0 CPU_group_ea8b9e5878f0d39ea69a6cea41f1d919
 0.00/1.400 GiB memory
 0.00/0.585 GiB object_store_memory

Demands:
 {'CPU_group_41c2a9cef5e5273468fe536c57e34ffa': 1.0}: 1+ pending tasks/actors
 {'CPU': 1.0} * 1 (PACK): 1+ pending placement groups

Although I did set it up with the Ray helm chart, running "kubectl -n ray logs example-cluster-ray-head-j2bqx" showed nothing. Did I get that command right?

It's a little confusing: for the logs command, your target should be the operator pod, which by default is in the default namespace.
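For example, with the default helm install (which puts the operator in the default namespace):

# Find the operator pod, then read its logs; the autoscaler output lives there
kubectl get pod -l cluster.ray.io/component=operator
kubectl logs <operator pod>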

It appears from the ray status output that there’s only one ray node (the head) up.

@Dmitri thanks for the clarification. I thought it might have to do with installing the helm chart before starting the second Minikube node, so I did the following:

kubectl -n ray delete raycluster example-cluster
helm -n ray uninstall example-cluster

And then, with the two Minikube nodes, I ran the following commands:

(venv) ➜  ray git:(releases/1.4.1) ✗ kubectl -n ray delete raycluster example-cluster
raycluster.cluster.ray.io "example-cluster" deleted
(venv) ➜  ray git:(releases/1.4.1) ✗ helm -n ray uninstall example-cluster
release "example-cluster" uninstalled
(venv) ➜  ray git:(releases/1.4.1) ✗ kubectl -n ray get rayclusters
NAME              STATUS    RESTARTS   AGE
example-cluster   Running   0          26s
(venv) ➜  ray git:(releases/1.4.1) ✗ kubectl -n ray get pods
NAME                                    READY   STATUS        RESTARTS   AGE
example-cluster-ray-head-type-rw89l     1/1     Running       0          29s
example-cluster-ray-worker-type-7jn6p   0/1     Terminating   0          17s
example-cluster-ray-worker-type-kj4mq   0/1     Terminating   0          17s
(venv) ➜  ray git:(releases/1.4.1) ✗ kubectl -n ray get service
NAME                       TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                       AGE
example-cluster-ray-head   ClusterIP   10.96.164.162   <none>        10001/TCP,8265/TCP,8000/TCP   45s
(venv) ➜  ray git:(releases/1.4.1) ✗ kubectl get deployment ray-operator
NAME           READY   UP-TO-DATE   AVAILABLE   AGE
ray-operator   1/1     1            1           57s
(venv) ➜  ray git:(releases/1.4.1) ✗ kubectl get pod -l cluster.ray.io/component=operator
NAME                            READY   STATUS    RESTARTS   AGE
ray-operator-55c86765c9-ml9l7   1/1     Running   0          64s
(venv) ➜  ray git:(releases/1.4.1) ✗ kubectl get crd rayclusters.cluster.ray.io
NAME                         CREATED AT
rayclusters.cluster.ray.io   2021-08-11T23:20:20Z

Here I noticed the two worker pods were Terminating; I'm not sure why.

I proceeded to run the following:
ray up -y python/ray/autoscaler/kubernetes/example-full.yaml --no-config-cache
ray submit python/ray/autoscaler/kubernetes/example-full.yaml deploy.py

Again, it showed me the same message:

(pid=326) 2021-08-11 16:42:05,282	WARNING backend_state.py:928 -- Backend 'hello' has 1 replicas that have taken more than 30s to start up. This may be caused by waiting for the cluster to auto-scale or because the constructor is slow. Resources required for each replica: {'CPU': 1}, resources available: {}.

I then looked at the log. It exceeds the character limit here, so I'll post it across the next few comments. Please let me know if this tells us anything useful, or if I can dig up anything else.

Hmm. The forum wouldn't let me post the long message even when I limited the post length, so here's another attempt at posting a small portion of the error output from the logs:

======== Autoscaler status: 2021-08-11 16:34:22.421266 ========
Node status
---------------------------------------------------------------
Healthy:
 1 rayHeadType
Pending:
 172.17.0.5: rayWorkerType, setting-up
 172.17.0.7: rayWorkerType, setting-up
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------

Usage:
 0.0/1.0 CPU
 0.00/0.350 GiB memory
 0.00/0.135 GiB object_store_memory

Demands:
 (no resource demands)
example-cluster,ray:2021-08-11 16:34:22,444	DEBUG legacy_info_string.py:24 -- Cluster status: 2 nodes (2 updating) (4 failed to update)
 - MostDelayedHeartbeats: {'172.17.0.4': 0.254608154296875}
 - NodeIdleSeconds: Min=54 Mean=54 Max=54
 - ResourceUsage: 0.0/1.0 CPU, 0.0 GiB/0.35 GiB memory, 0.0 GiB/0.13 GiB object_store_memory
 - TimeSinceLastHeartbeat: Min=0 Mean=0 Max=0
Worker node types:
 - rayWorkerType: 2
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/redis/connection.py", line 559, in connect
    sock = self._connect()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/redis/connection.py", line 615, in _connect
    raise err
  File "/home/ray/anaconda3/lib/python3.7/site-packages/redis/connection.py", line 603, in _connect
    sock.connect(socket_address)
ConnectionRefusedError: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/services.py", line 666, in wait_for_redis_to_start
    redis_client.client_list()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/redis/client.py", line 1194, in client_list
    return self.execute_command('CLIENT LIST')
  File "/home/ray/anaconda3/lib/python3.7/site-packages/redis/client.py", line 898, in execute_command
    conn = self.connection or pool.get_connection(command_name, **options)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/redis/connection.py", line 1192, in get_connection
    connection.connect()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/redis/connection.py", line 563, in connect
    raise ConnectionError(self._error_message(e))
redis.exceptions.ConnectionError: Error 111 connecting to 172.17.0.4:6379. Connection refused.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/ray/anaconda3/bin/ray", line 8, in <module>
    sys.exit(main())
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/scripts/scripts.py", line 1808, in main
    return cli()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 1137, in __call__
    return self.main(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 1062, in main
    rv = self.invoke(ctx)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 1668, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 763, in invoke
    return __callback(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/scripts/scripts.py", line 656, in start
    redis_address_ip, redis_address_port, password=redis_password)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/services.py", line 688, in wait_for_redis_to_start
    " attempts to ping the Redis server.") from connEx
RuntimeError: Unable to connect to Redis at 172.17.0.4:6379 after 12 retries. Check that 172.17.0.4:6379 is reachable from this machine. If it is not, your firewall may be blocking this port. If the problem is a flaky connection, try setting the environment variable `RAY_START_REDIS_WAIT_RETRIES` to increase the number of attempts to ping the Redis server.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/redis/connection.py", line 559, in connect
    sock = self._connect()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/redis/connection.py", line 615, in _connect
    raise err
  File "/home/ray/anaconda3/lib/python3.7/site-packages/redis/connection.py", line 603, in _connect
    sock.connect(socket_address)
ConnectionRefusedError: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/services.py", line 666, in wait_for_redis_to_start
    redis_client.client_list()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/redis/client.py", line 1194, in client_list
    return self.execute_command('CLIENT LIST')
  File "/home/ray/anaconda3/lib/python3.7/site-packages/redis/client.py", line 898, in execute_command
    conn = self.connection or pool.get_connection(command_name, **options)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/redis/connection.py", line 1192, in get_connection
    connection.connect()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/redis/connection.py", line 563, in connect
    raise ConnectionError(self._error_message(e))
redis.exceptions.ConnectionError: Error 111 connecting to 172.17.0.4:6379. Connection refused.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/ray/anaconda3/bin/ray", line 8, in <module>
    sys.exit(main())
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/scripts/scripts.py", line 1808, in main
    return cli()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 1137, in __call__
    return self.main(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 1062, in main
    rv = self.invoke(ctx)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 1668, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 763, in invoke
    return __callback(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/scripts/scripts.py", line 656, in start
    redis_address_ip, redis_address_port, password=redis_password)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/services.py", line 688, in wait_for_redis_to_start
    " attempts to ping the Redis server.") from connEx
RuntimeError: Unable to connect to Redis at 172.17.0.4:6379 after 12 retries. Check that 172.17.0.4:6379 is reachable from this machine. If it is not, your firewall may be blocking this port. If the problem is a flaky connection, try setting the environment variable `RAY_START_REDIS_WAIT_RETRIES` to increase the number of attempts to ping the Redis server.
command terminated with exit code 1
command terminated with exit code 1
Exception in thread Thread-15:
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/updater.py", line 134, in run
    self.do_update()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/updater.py", line 468, in do_update
    run_env="auto")
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/command_runner.py", line 178, in run
    self.process_runner.check_call(final_cmd, shell=True)
  File "/home/ray/anaconda3/lib/python3.7/subprocess.py", line 363, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'kubectl -n ray exec -it example-cluster-ray-worker-type-fnftz -- bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (export RAY_OVERRIDE_RESOURCES='"'"'{"CPU":1,"memory":375809638}'"'"';export RAY_HEAD_IP=172.17.0.4; ulimit -n 65536; ray start --address=$RAY_HEAD_IP:6379)'' returned non-zero exit status 1.

Exception in thread Thread-16:
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/updater.py", line 134, in run
    self.do_update()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/updater.py", line 468, in do_update
    run_env="auto")
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/command_runner.py", line 178, in run
    self.process_runner.check_call(final_cmd, shell=True)
  File "/home/ray/anaconda3/lib/python3.7/subprocess.py", line 363, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'kubectl -n ray exec -it example-cluster-ray-worker-type-ptmtt -- bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (export RAY_OVERRIDE_RESOURCES='"'"'{"CPU":1,"memory":375809638}'"'"';export RAY_HEAD_IP=172.17.0.4; ulimit -n 65536; ray start --address=$RAY_HEAD_IP:6379)'' returned non-zero exit status 1.

example-cluster,ray:2021-08-11 16:34:27,660	DEBUG resource_demand_scheduler.py:160 -- Cluster resources: [{'object_store_memory': 144745267.0, 'node:172.17.0.4': 1.0, 'CPU': 1.0, 'memory': 375809638.0}, {'CPU': 1, 'memory': 375809638}, {'CPU': 1, 'memory': 375809638}]
example-cluster,ray:2021-08-11 16:34:27,661	DEBUG resource_demand_scheduler.py:161 -- Node counts: defaultdict(<class 'int'>, {'rayHeadType': 1, 'rayWorkerType': 2})
example-cluster,ray:2021-08-11 16:34:27,661	DEBUG resource_demand_scheduler.py:172 -- Placement group demands: []
example-cluster,ray:2021-08-11 16:34:27,661	DEBUG resource_demand_scheduler.py:218 -- Resource demands: []
example-cluster,ray:2021-08-11 16:34:27,661	DEBUG resource_demand_scheduler.py:219 -- Unfulfilled demands: []
example-cluster,ray:2021-08-11 16:34:27,693	DEBUG resource_demand_scheduler.py:241 -- Node requests: {}
example-cluster,ray:2021-08-11 16:34:27,702	ERROR autoscaler.py:306 -- StandardAutoscaler: example-cluster-ray-worker-type-fnftz: Terminating. Failed to setup/initialize node.
example-cluster,ray:2021-08-11 16:34:27,708	ERROR autoscaler.py:306 -- StandardAutoscaler: example-cluster-ray-worker-type-ptmtt: Terminating. Failed to setup/initialize node.
example-cluster,ray:2021-08-11 16:34:27,718	INFO node_provider.py:171 -- KubernetesNodeProvider: calling delete_namespaced_pod
example-cluster,ray:2021-08-11 16:34:27,731	INFO node_provider.py:171 -- KubernetesNodeProvider: calling delete_namespaced_pod
example-cluster,ray:2021-08-11 16:34:27,815	INFO autoscaler.py:354 -- 

======== Autoscaler status: 2021-08-11 16:34:27.815676 ========
Node status
---------------------------------------------------------------
Healthy:
 1 rayHeadType
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------

Usage:
 0.0/1.0 CPU
 0.00/0.350 GiB memory
 0.00/0.135 GiB object_store_memory

Demands:
 (no resource demands)
example-cluster,ray:2021-08-11 16:34:27,816	DEBUG legacy_info_string.py:24 -- Cluster status: 0 nodes (6 failed to update)
 - MostDelayedHeartbeats: {'172.17.0.4': 0.3220522403717041}
 - NodeIdleSeconds: Min=59 Mean=59 Max=59
 - ResourceUsage: 0.0/1.0 CPU, 0.0 GiB/0.35 GiB memory, 0.0 GiB/0.13 GiB object_store_memory
 - TimeSinceLastHeartbeat: Min=0 Mean=0 Max=0
Worker node types:
example-cluster,ray:2021-08-11 16:34:27,853	INFO monitor.py:224 -- :event_summary:Removing 2 nodes of type rayWorkerType (launch failed).
example-cluster,ray:2021-08-11 16:34:32,828	ERROR monitor.py:285 -- Error in monitor loop
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/monitor.py", line 317, in run
    self._run()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/monitor.py", line 207, in _run
    self.update_load_metrics()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/monitor.py", line 170, in update_load_metrics
    request, timeout=4)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/grpc/_channel.py", line 946, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
    raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "Socket closed"
	debug_error_string = "{"created":"@1628724872.826523900","description":"Error received from peer ipv4:172.17.0.4:38277","file":"src/core/lib/surface/call.cc","file_line":1066,"grpc_message":"Socket closed","grpc_status":14}"
>
Process example-cluster,ray:
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/monitor.py", line 317, in run
    self._run()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/monitor.py", line 207, in _run
    self.update_load_metrics()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/monitor.py", line 170, in update_load_metrics
    request, timeout=4)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/grpc/_channel.py", line 946, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
    raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "Socket closed"
	debug_error_string = "{"created":"@1628724872.826523900","description":"Error received from peer ipv4:172.17.0.4:38277","file":"src/core/lib/surface/call.cc","file_line":1066,"grpc_message":"Socket closed","grpc_status":14}"
>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/redis/connection.py", line 1198, in get_connection
    if connection.can_read():
  File "/home/ray/anaconda3/lib/python3.7/site-packages/redis/connection.py", line 734, in can_read
    return self._parser.can_read(timeout)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/redis/connection.py", line 417, in can_read
    raise_on_timeout=False)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/redis/connection.py", line 429, in read_from_socket
    raise ConnectionError(SERVER_CLOSED_CONNECTION_ERROR)
redis.exceptions.ConnectionError: Connection closed by server.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/redis/connection.py", line 559, in connect
    sock = self._connect()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/redis/connection.py", line 615, in _connect
    raise err
  File "/home/ray/anaconda3/lib/python3.7/site-packages/redis/connection.py", line 603, in _connect
    sock.connect(socket_address)
ConnectionRefusedError: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/home/ray/anaconda3/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/ray_operator/operator.py", line 87, in _create_or_update
    self.start_monitor()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/ray_operator/operator.py", line 125, in start_monitor
    self.mtr.run()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/monitor.py", line 319, in run
    self._handle_failure(traceback.format_exc())
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/monitor.py", line 296, in _handle_failure
    _internal_kv_put(DEBUG_AUTOSCALING_ERROR, message, overwrite=True)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 62, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/experimental/internal_kv.py", line 57, in _internal_kv_put
    key, "value", value)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/redis/client.py", line 3050, in hset
    return self.execute_command('HSET', name, *items)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/redis/client.py", line 898, in execute_command
    conn = self.connection or pool.get_connection(command_name, **options)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/redis/connection.py", line 1202, in get_connection
    connection.connect()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/redis/connection.py", line 563, in connect
    raise ConnectionError(self._error_message(e))
redis.exceptions.ConnectionError: Error 111 connecting to 172.17.0.4:6379. Connection refused.

Also, here’s what I see about the two minikube nodes:

➜  charts git:(master) ✗ minikube status -p minikube      
minikube
type: Control Plane
host: Running
kubelet: Running
apiserver: Running
kubeconfig: Configured

minikube-m02
type: Worker
host: Running
kubelet: Running

➜  charts git:(master) ✗ kubectl -n ray get nodes                            
NAME           STATUS   ROLES    AGE    VERSION
minikube       Ready    master   300d   v1.19.2
minikube-m02   Ready    <none>   2d2h   v1.19.2