I am running the Ray nightly build, and after a few training iterations Ray crashes with the following error:
Demands:
(no resource demands)
2022-02-06 14:23:30,456 ERROR monitor.py:395 -- Monitor: Execution exception. Trying again...
Traceback (most recent call last):
  File "/home/exx/miniconda3/envs/carla/lib/python3.8/site-packages/ray/autoscaler/_private/monitor.py", line 363, in _run
    self.update_load_metrics()
  File "/home/exx/miniconda3/envs/carla/lib/python3.8/site-packages/ray/autoscaler/_private/monitor.py", line 269, in update_load_metrics
    response = self.gcs_node_resources_stub.GetAllResourceUsage(request, timeout=60)
  File "/home/exx/miniconda3/envs/carla/lib/python3.8/site-packages/grpc/_channel.py", line 946, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/home/exx/miniconda3/envs/carla/lib/python3.8/site-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
    raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "Socket closed"
debug_error_string = "{"created":"@1644182610.456250161","description":"Error received from peer ipv4:10.218.105.239:60750","file":"src/core/lib/surface/call.cc","file_line":1064,"grpc_message":"Socket closed","grpc_status":14}"
>
2022-02-06 14:23:30,560 ERROR monitor.py:450 -- Error in monitor loop
NoneType: None
2022-02-06 14:23:30,561 ERROR gcs_utils.py:157 -- Connecting to gcs failed. Error <_InactiveRpcError of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "Broken pipe"
debug_error_string = "{"created":"@1644182610.561153441","description":"Error received from peer ipv4:10.218.105.239:60750","file":"src/core/lib/surface/call.cc","file_line":1064,"grpc_message":"Broken pipe","grpc_status":14}"
The crash is intermittent: it does not happen on every run. Any tips for resolving it? Thanks!