Hmm. The forum wouldn't let me post a long message, even after I cut the post down. Here's another attempt at posting a small portion of the error message from the logs:
======== Autoscaler status: 2021-08-11 16:34:22.421266 ========
Node status
---------------------------------------------------------------
Healthy:
1 rayHeadType
Pending:
172.17.0.5: rayWorkerType, setting-up
172.17.0.7: rayWorkerType, setting-up
Recent failures:
(no failures)
Resources
---------------------------------------------------------------
Usage:
0.0/1.0 CPU
0.00/0.350 GiB memory
0.00/0.135 GiB object_store_memory
Demands:
(no resource demands)
example-cluster,ray:2021-08-11 16:34:22,444 DEBUG legacy_info_string.py:24 -- Cluster status: 2 nodes (2 updating) (4 failed to update)
- MostDelayedHeartbeats: {'172.17.0.4': 0.254608154296875}
- NodeIdleSeconds: Min=54 Mean=54 Max=54
- ResourceUsage: 0.0/1.0 CPU, 0.0 GiB/0.35 GiB memory, 0.0 GiB/0.13 GiB object_store_memory
- TimeSinceLastHeartbeat: Min=0 Mean=0 Max=0
Worker node types:
- rayWorkerType: 2
Traceback (most recent call last):
File "/home/ray/anaconda3/lib/python3.7/site-packages/redis/connection.py", line 559, in connect
sock = self._connect()
File "/home/ray/anaconda3/lib/python3.7/site-packages/redis/connection.py", line 615, in _connect
raise err
File "/home/ray/anaconda3/lib/python3.7/site-packages/redis/connection.py", line 603, in _connect
sock.connect(socket_address)
ConnectionRefusedError: [Errno 111] Connection refused
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/services.py", line 666, in wait_for_redis_to_start
redis_client.client_list()
File "/home/ray/anaconda3/lib/python3.7/site-packages/redis/client.py", line 1194, in client_list
return self.execute_command('CLIENT LIST')
File "/home/ray/anaconda3/lib/python3.7/site-packages/redis/client.py", line 898, in execute_command
conn = self.connection or pool.get_connection(command_name, **options)
File "/home/ray/anaconda3/lib/python3.7/site-packages/redis/connection.py", line 1192, in get_connection
connection.connect()
File "/home/ray/anaconda3/lib/python3.7/site-packages/redis/connection.py", line 563, in connect
raise ConnectionError(self._error_message(e))
redis.exceptions.ConnectionError: Error 111 connecting to 172.17.0.4:6379. Connection refused.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/ray/anaconda3/bin/ray", line 8, in <module>
sys.exit(main())
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/scripts/scripts.py", line 1808, in main
return cli()
File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 1137, in __call__
return self.main(*args, **kwargs)
File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 1062, in main
rv = self.invoke(ctx)
File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 1668, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 763, in invoke
return __callback(*args, **kwargs)
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/scripts/scripts.py", line 656, in start
redis_address_ip, redis_address_port, password=redis_password)
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/services.py", line 688, in wait_for_redis_to_start
" attempts to ping the Redis server.") from connEx
RuntimeError: Unable to connect to Redis at 172.17.0.4:6379 after 12 retries. Check that 172.17.0.4:6379 is reachable from this machine. If it is not, your firewall may be blocking this port. If the problem is a flaky connection, try setting the environment variable `RAY_START_REDIS_WAIT_RETRIES` to increase the number of attempts to ping the Redis server.
Traceback (most recent call last):
File "/home/ray/anaconda3/lib/python3.7/site-packages/redis/connection.py", line 559, in connect
sock = self._connect()
File "/home/ray/anaconda3/lib/python3.7/site-packages/redis/connection.py", line 615, in _connect
raise err
File "/home/ray/anaconda3/lib/python3.7/site-packages/redis/connection.py", line 603, in _connect
sock.connect(socket_address)
ConnectionRefusedError: [Errno 111] Connection refused
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/services.py", line 666, in wait_for_redis_to_start
redis_client.client_list()
File "/home/ray/anaconda3/lib/python3.7/site-packages/redis/client.py", line 1194, in client_list
return self.execute_command('CLIENT LIST')
File "/home/ray/anaconda3/lib/python3.7/site-packages/redis/client.py", line 898, in execute_command
conn = self.connection or pool.get_connection(command_name, **options)
File "/home/ray/anaconda3/lib/python3.7/site-packages/redis/connection.py", line 1192, in get_connection
connection.connect()
File "/home/ray/anaconda3/lib/python3.7/site-packages/redis/connection.py", line 563, in connect
raise ConnectionError(self._error_message(e))
redis.exceptions.ConnectionError: Error 111 connecting to 172.17.0.4:6379. Connection refused.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/ray/anaconda3/bin/ray", line 8, in <module>
sys.exit(main())
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/scripts/scripts.py", line 1808, in main
return cli()
File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 1137, in __call__
return self.main(*args, **kwargs)
File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 1062, in main
rv = self.invoke(ctx)
File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 1668, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 763, in invoke
return __callback(*args, **kwargs)
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/scripts/scripts.py", line 656, in start
redis_address_ip, redis_address_port, password=redis_password)
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/services.py", line 688, in wait_for_redis_to_start
" attempts to ping the Redis server.") from connEx
RuntimeError: Unable to connect to Redis at 172.17.0.4:6379 after 12 retries. Check that 172.17.0.4:6379 is reachable from this machine. If it is not, your firewall may be blocking this port. If the problem is a flaky connection, try setting the environment variable `RAY_START_REDIS_WAIT_RETRIES` to increase the number of attempts to ping the Redis server.
command terminated with exit code 1
command terminated with exit code 1
Exception in thread Thread-15:
Traceback (most recent call last):
File "/home/ray/anaconda3/lib/python3.7/threading.py", line 926, in _bootstrap_inner
self.run()
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/updater.py", line 134, in run
self.do_update()
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/updater.py", line 468, in do_update
run_env="auto")
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/command_runner.py", line 178, in run
self.process_runner.check_call(final_cmd, shell=True)
File "/home/ray/anaconda3/lib/python3.7/subprocess.py", line 363, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'kubectl -n ray exec -it example-cluster-ray-worker-type-fnftz -- bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (export RAY_OVERRIDE_RESOURCES='"'"'{"CPU":1,"memory":375809638}'"'"';export RAY_HEAD_IP=172.17.0.4; ulimit -n 65536; ray start --address=$RAY_HEAD_IP:6379)'' returned non-zero exit status 1.
Exception in thread Thread-16:
Traceback (most recent call last):
File "/home/ray/anaconda3/lib/python3.7/threading.py", line 926, in _bootstrap_inner
self.run()
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/updater.py", line 134, in run
self.do_update()
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/updater.py", line 468, in do_update
run_env="auto")
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/command_runner.py", line 178, in run
self.process_runner.check_call(final_cmd, shell=True)
File "/home/ray/anaconda3/lib/python3.7/subprocess.py", line 363, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'kubectl -n ray exec -it example-cluster-ray-worker-type-ptmtt -- bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (export RAY_OVERRIDE_RESOURCES='"'"'{"CPU":1,"memory":375809638}'"'"';export RAY_HEAD_IP=172.17.0.4; ulimit -n 65536; ray start --address=$RAY_HEAD_IP:6379)'' returned non-zero exit status 1.
example-cluster,ray:2021-08-11 16:34:27,660 DEBUG resource_demand_scheduler.py:160 -- Cluster resources: [{'object_store_memory': 144745267.0, 'node:172.17.0.4': 1.0, 'CPU': 1.0, 'memory': 375809638.0}, {'CPU': 1, 'memory': 375809638}, {'CPU': 1, 'memory': 375809638}]
example-cluster,ray:2021-08-11 16:34:27,661 DEBUG resource_demand_scheduler.py:161 -- Node counts: defaultdict(<class 'int'>, {'rayHeadType': 1, 'rayWorkerType': 2})
example-cluster,ray:2021-08-11 16:34:27,661 DEBUG resource_demand_scheduler.py:172 -- Placement group demands: []
example-cluster,ray:2021-08-11 16:34:27,661 DEBUG resource_demand_scheduler.py:218 -- Resource demands: []
example-cluster,ray:2021-08-11 16:34:27,661 DEBUG resource_demand_scheduler.py:219 -- Unfulfilled demands: []
example-cluster,ray:2021-08-11 16:34:27,693 DEBUG resource_demand_scheduler.py:241 -- Node requests: {}
example-cluster,ray:2021-08-11 16:34:27,702 ERROR autoscaler.py:306 -- StandardAutoscaler: example-cluster-ray-worker-type-fnftz: Terminating. Failed to setup/initialize node.
example-cluster,ray:2021-08-11 16:34:27,708 ERROR autoscaler.py:306 -- StandardAutoscaler: example-cluster-ray-worker-type-ptmtt: Terminating. Failed to setup/initialize node.
example-cluster,ray:2021-08-11 16:34:27,718 INFO node_provider.py:171 -- KubernetesNodeProvider: calling delete_namespaced_pod
example-cluster,ray:2021-08-11 16:34:27,731 INFO node_provider.py:171 -- KubernetesNodeProvider: calling delete_namespaced_pod
example-cluster,ray:2021-08-11 16:34:27,815 INFO autoscaler.py:354 --
======== Autoscaler status: 2021-08-11 16:34:27.815676 ========
Node status
---------------------------------------------------------------
Healthy:
1 rayHeadType
Pending:
(no pending nodes)
Recent failures:
(no failures)
Resources
---------------------------------------------------------------
Usage:
0.0/1.0 CPU
0.00/0.350 GiB memory
0.00/0.135 GiB object_store_memory
Demands:
(no resource demands)
example-cluster,ray:2021-08-11 16:34:27,816 DEBUG legacy_info_string.py:24 -- Cluster status: 0 nodes (6 failed to update)
- MostDelayedHeartbeats: {'172.17.0.4': 0.3220522403717041}
- NodeIdleSeconds: Min=59 Mean=59 Max=59
- ResourceUsage: 0.0/1.0 CPU, 0.0 GiB/0.35 GiB memory, 0.0 GiB/0.13 GiB object_store_memory
- TimeSinceLastHeartbeat: Min=0 Mean=0 Max=0
Worker node types:
example-cluster,ray:2021-08-11 16:34:27,853 INFO monitor.py:224 -- :event_summary:Removing 2 nodes of type rayWorkerType (launch failed).
example-cluster,ray:2021-08-11 16:34:32,828 ERROR monitor.py:285 -- Error in monitor loop
Traceback (most recent call last):
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/monitor.py", line 317, in run
self._run()
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/monitor.py", line 207, in _run
self.update_load_metrics()
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/monitor.py", line 170, in update_load_metrics
request, timeout=4)
File "/home/ray/anaconda3/lib/python3.7/site-packages/grpc/_channel.py", line 946, in __call__
return _end_unary_response_blocking(state, call, False, None)
File "/home/ray/anaconda3/lib/python3.7/site-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "Socket closed"
debug_error_string = "{"created":"@1628724872.826523900","description":"Error received from peer ipv4:172.17.0.4:38277","file":"src/core/lib/surface/call.cc","file_line":1066,"grpc_message":"Socket closed","grpc_status":14}"
>
Process example-cluster,ray:
Traceback (most recent call last):
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/monitor.py", line 317, in run
self._run()
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/monitor.py", line 207, in _run
self.update_load_metrics()
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/monitor.py", line 170, in update_load_metrics
request, timeout=4)
File "/home/ray/anaconda3/lib/python3.7/site-packages/grpc/_channel.py", line 946, in __call__
return _end_unary_response_blocking(state, call, False, None)
File "/home/ray/anaconda3/lib/python3.7/site-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "Socket closed"
debug_error_string = "{"created":"@1628724872.826523900","description":"Error received from peer ipv4:172.17.0.4:38277","file":"src/core/lib/surface/call.cc","file_line":1066,"grpc_message":"Socket closed","grpc_status":14}"
>
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/ray/anaconda3/lib/python3.7/site-packages/redis/connection.py", line 1198, in get_connection
if connection.can_read():
File "/home/ray/anaconda3/lib/python3.7/site-packages/redis/connection.py", line 734, in can_read
return self._parser.can_read(timeout)
File "/home/ray/anaconda3/lib/python3.7/site-packages/redis/connection.py", line 417, in can_read
raise_on_timeout=False)
File "/home/ray/anaconda3/lib/python3.7/site-packages/redis/connection.py", line 429, in read_from_socket
raise ConnectionError(SERVER_CLOSED_CONNECTION_ERROR)
redis.exceptions.ConnectionError: Connection closed by server.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/ray/anaconda3/lib/python3.7/site-packages/redis/connection.py", line 559, in connect
sock = self._connect()
File "/home/ray/anaconda3/lib/python3.7/site-packages/redis/connection.py", line 615, in _connect
raise err
File "/home/ray/anaconda3/lib/python3.7/site-packages/redis/connection.py", line 603, in _connect
sock.connect(socket_address)
ConnectionRefusedError: [Errno 111] Connection refused
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/ray/anaconda3/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
self.run()
File "/home/ray/anaconda3/lib/python3.7/multiprocessing/process.py", line 99, in run
self._target(*self._args, **self._kwargs)
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/ray_operator/operator.py", line 87, in _create_or_update
self.start_monitor()
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/ray_operator/operator.py", line 125, in start_monitor
self.mtr.run()
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/monitor.py", line 319, in run
self._handle_failure(traceback.format_exc())
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/monitor.py", line 296, in _handle_failure
_internal_kv_put(DEBUG_AUTOSCALING_ERROR, message, overwrite=True)
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 62, in wrapper
return func(*args, **kwargs)
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/experimental/internal_kv.py", line 57, in _internal_kv_put
key, "value", value)
File "/home/ray/anaconda3/lib/python3.7/site-packages/redis/client.py", line 3050, in hset
return self.execute_command('HSET', name, *items)
File "/home/ray/anaconda3/lib/python3.7/site-packages/redis/client.py", line 898, in execute_command
conn = self.connection or pool.get_connection(command_name, **options)
File "/home/ray/anaconda3/lib/python3.7/site-packages/redis/connection.py", line 1202, in get_connection
connection.connect()
File "/home/ray/anaconda3/lib/python3.7/site-packages/redis/connection.py", line 563, in connect
raise ConnectionError(self._error_message(e))
redis.exceptions.ConnectionError: Error 111 connecting to 172.17.0.4:6379. Connection refused.
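
The RuntimeError above suggests checking that 172.17.0.4:6379 is reachable from the worker. A minimal sketch of such a check, using only the Python standard library, is below. The function name `port_is_reachable` and the 3-second timeout are my own choices for illustration, not anything from Ray. It only tests whether a TCP connection can be opened. It does not speak the Redis protocol, so a `True` result would not prove that Redis itself is healthy.

```python
import socket


def port_is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) can be opened.

    Hypothetical helper for illustration only; it is not part of Ray.
    """
    try:
        # create_connection raises an OSError subclass (for example
        # ConnectionRefusedError, as in the traceback above) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Example call, using the head address and port from the logs above:
# port_is_reachable("172.17.0.4", 6379)
```

If this were run from inside a worker pod and returned `False` while the head pod is still up, that would point toward networking. The final traceback shows the operator's own connection to the same address being refused as well, though, so the Redis process on the head may simply have stopped.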