Ray nightly build crashes after a few training iterations with RLlib

I am running the Ray nightly build, and after a few training iterations Ray crashes with the following error:

Demands:
 (no resource demands)
2022-02-06 14:23:30,456	ERROR monitor.py:395 -- Monitor: Execution exception. Trying again...
Traceback (most recent call last):
  File "/home/exx/miniconda3/envs/carla/lib/python3.8/site-packages/ray/autoscaler/_private/monitor.py", line 363, in _run
    self.update_load_metrics()
  File "/home/exx/miniconda3/envs/carla/lib/python3.8/site-packages/ray/autoscaler/_private/monitor.py", line 269, in update_load_metrics
    response = self.gcs_node_resources_stub.GetAllResourceUsage(request, timeout=60)
  File "/home/exx/miniconda3/envs/carla/lib/python3.8/site-packages/grpc/_channel.py", line 946, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/home/exx/miniconda3/envs/carla/lib/python3.8/site-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
    raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "Socket closed"
	debug_error_string = "{"created":"@1644182610.456250161","description":"Error received from peer ipv4:10.218.105.239:60750","file":"src/core/lib/surface/call.cc","file_line":1064,"grpc_message":"Socket closed","grpc_status":14}"
>
2022-02-06 14:23:30,560	ERROR monitor.py:450 -- Error in monitor loop
NoneType: None
2022-02-06 14:23:30,561	ERROR gcs_utils.py:157 -- Connecting to gcs failed. Error <_InactiveRpcError of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "Broken pipe"
	debug_error_string = "{"created":"@1644182610.561153441","description":"Error received from peer ipv4:10.218.105.239:60750","file":"src/core/lib/surface/call.cc","file_line":1064,"grpc_message":"Broken pipe","grpc_status":14}"

The crash does not happen on every run. Any tips to resolve the issue? Thanks!

Could you please open an issue on GitHub and include a script that reproduces the problem?
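
A reproduction script for a report like this can be quite small. The following is only a rough sketch, not the original poster's code: the PPO algorithm, the CartPole-v0 environment, the worker count, and the helper names build_config and main are all placeholder assumptions, since the thread does not say what was being trained. It also assumes a Ray version in which tune.run accepts a registered RLlib algorithm name.

```python
def build_config(num_workers=1):
    """Return a placeholder RLlib config (hypothetical values)."""
    # Replace these with the environment and settings from the failing run.
    return {
        "env": "CartPole-v0",      # assumption: stand-in environment
        "framework": "torch",      # assumption: either framework may apply
        "num_workers": num_workers,
    }


def main(iterations=20):
    """Train for a fixed number of iterations so the crash window is covered."""
    # Imported here so the config helper can be inspected without Ray installed.
    import ray
    from ray import tune

    ray.init()
    try:
        # "PPO" is a placeholder; use the algorithm from the failing run.
        tune.run(
            "PPO",
            config=build_config(),
            stop={"training_iteration": iterations},
        )
    finally:
        ray.shutdown()
```

Calling main() runs the training loop. Because the crash is intermittent, the report should also mention how many runs it typically takes to trigger it.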

I think this was an issue with my code. I made some changes and this error does not seem to occur anymore!
