TimeoutError: [WinError 10060]

Hi all,

I am using a policy client/server setup that works perfectly for anywhere from 1 to 10 hours. Eventually, however, all the clients hit the following error at the same time:

Traceback (most recent call last):
  File "C:\Users\Denys\AppData\Local\Programs\Python\Python38\lib\site-packages\urllib3\connection.py", line 169, in _new_conn
    conn = connection.create_connection(
  File "C:\Users\Denys\AppData\Local\Programs\Python\Python38\lib\site-packages\urllib3\util\connection.py", line 96, in create_connection
    raise err
  File "C:\Users\Denys\AppData\Local\Programs\Python\Python38\lib\site-packages\urllib3\util\connection.py", line 86, in create_connection
    sock.connect(sa)
TimeoutError: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\Denys\AppData\Local\Programs\Python\Python38\lib\site-packages\urllib3\connectionpool.py", line 699, in urlopen
    httplib_response = self._make_request(
  File "C:\Users\Denys\AppData\Local\Programs\Python\Python38\lib\site-packages\urllib3\connectionpool.py", line 394, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "C:\Users\Denys\AppData\Local\Programs\Python\Python38\lib\site-packages\urllib3\connection.py", line 234, in request
    super(HTTPConnection, self).request(method, url, body=body, headers=headers)
  File "C:\Users\Denys\AppData\Local\Programs\Python\Python38\lib\http\client.py", line 1252, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "C:\Users\Denys\AppData\Local\Programs\Python\Python38\lib\http\client.py", line 1298, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "C:\Users\Denys\AppData\Local\Programs\Python\Python38\lib\http\client.py", line 1247, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "C:\Users\Denys\AppData\Local\Programs\Python\Python38\lib\http\client.py", line 1007, in _send_output
    self.send(msg)
  File "C:\Users\Denys\AppData\Local\Programs\Python\Python38\lib\http\client.py", line 947, in send
    self.connect()
  File "C:\Users\Denys\AppData\Local\Programs\Python\Python38\lib\site-packages\urllib3\connection.py", line 200, in connect
    conn = self._new_conn()
  File "C:\Users\Denys\AppData\Local\Programs\Python\Python38\lib\site-packages\urllib3\connection.py", line 181, in _new_conn
    raise NewConnectionError(
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x00000000312110D0>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\Denys\AppData\Local\Programs\Python\Python38\lib\site-packages\requests\adapters.py", line 439, in send
    resp = conn.urlopen(
  File "C:\Users\Denys\AppData\Local\Programs\Python\Python38\lib\site-packages\urllib3\connectionpool.py", line 755, in urlopen
    retries = retries.increment(
  File "C:\Users\Denys\AppData\Local\Programs\Python\Python38\lib\site-packages\urllib3\util\retry.py", line 574, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='192.168.0.18', port=55556): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x00000000312110D0>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "policy_client.py", line 138, in <module>
    action = client.get_action(episode_id=episode_id, observation=gameObservation)
  File "C:\Users\Denys\AppData\Local\Programs\Python\Python38\lib\site-packages\ray\rllib\env\policy_client.py", line 129, in get_action
    return self._send({
  File "C:\Users\Denys\AppData\Local\Programs\Python\Python38\lib\site-packages\ray\rllib\env\policy_client.py", line 222, in _send
    response = requests.post(self.address, data=payload)
  File "C:\Users\Denys\AppData\Local\Programs\Python\Python38\lib\site-packages\requests\api.py", line 117, in post
    return request('post', url, data=data, json=json, **kwargs)
  File "C:\Users\Denys\AppData\Local\Programs\Python\Python38\lib\site-packages\requests\api.py", line 61, in request
    return session.request(method=method, url=url, **kwargs)
  File "C:\Users\Denys\AppData\Local\Programs\Python\Python38\lib\site-packages\requests\sessions.py", line 542, in request
    resp = self.send(prep, **send_kwargs)
  File "C:\Users\Denys\AppData\Local\Programs\Python\Python38\lib\site-packages\requests\sessions.py", line 655, in send
    r = adapter.send(request, **kwargs)
  File "C:\Users\Denys\AppData\Local\Programs\Python\Python38\lib\site-packages\requests\adapters.py", line 516, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='192.168.0.18', port=55556): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x00000000312110D0>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond'))

As far as I can tell, this happens with different models and after different numbers of successful iterations. Moreover, I doubt it's because the policy server (ppo_trainer) is blocked doing an iteration, since the reported learning time is only about 5 seconds.

Any ideas what might be causing this?

The policy server is running on Win10 with Python 3.8 and the latest dev RLlib wheel.

Hey @Denys_Ashikhin, this looks more like the PolicyServer “went away” and the PolicyClient can no longer communicate with it through the HTTP socket. Do you see any crashes/errors on the policy server side? Is it still up and running OK? Since you say that your clients all hit the same problem at the same time, this suggests that your server becomes unhealthy at some point.

Strangely enough, there are 0 issues on the server side. In fact, all I have to do is relaunch the clients and they keep going without issue for anywhere from 1-10 hours, like before. This is without me touching the policy_server. I also want to add that I am using remote_inference (since my environment needs only a small number of actions and the delay is fine).

So despite the weird issue, I don’t need to restart the server at all, it will continue to train and save checkpoints to disk as soon as it has enough data points. Only thing is, the clients need an occasional restart.
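Since restarting the clients is enough to recover, one possible stopgap is to retry the failing call inside the client instead of letting the script die. Below is a minimal sketch, not something RLlib provides: call_with_retry and its parameters are made-up names for illustration. It catches OSError because requests.exceptions.ConnectionError (the last exception in the traceback above) derives from it. Whether re-sending a get_action request is safe for the server's episode bookkeeping is an assumption worth checking, and this only masks the underlying connectivity problem rather than explaining it.

```python
import time


def call_with_retry(fn, *args, retries=5, base_delay=1.0, max_delay=30.0,
                    retry_on=(OSError,), sleep=time.sleep, **kwargs):
    """Call fn(*args, **kwargs); on a retry_on error, wait and try again.

    The wait doubles after each failure (capped at max_delay). After
    `retries` failed retries the last exception is re-raised.
    """
    delay = base_delay
    for attempt in range(retries + 1):
        try:
            return fn(*args, **kwargs)
        except retry_on:
            if attempt == retries:
                raise  # out of retries: surface the original error
            sleep(delay)
            delay = min(delay * 2, max_delay)


# Hypothetical use in policy_client.py, replacing the direct call on line 138:
# action = call_with_retry(client.get_action,
#                          episode_id=episode_id,
#                          observation=gameObservation)
```

With the defaults this waits 1, 2, 4, 8 and 16 seconds between attempts before giving up, so a short outage would be ridden out while a permanent one still surfaces as the original exception.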

Edit:
I currently have 2-3 clients, 2 of which are inside an individual VM on a Windows host (they crash at the same time). I have recorded the screen with OBS, and the internet connection does not drop during the crash, so it's not that.
The other client is on another Windows host that I sometimes leave on. From what I can tell, the crash is localised per machine; however, I have not recorded both machines at the same time to see whether they all crash at once or whether it is localised to a host.
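One way to answer the "all at once or per host" question without screen recordings would be to run a small timestamped probe on every client machine and compare the logs afterwards. The sketch below is only an illustration: can_connect and log_reachability are hypothetical helper names, and the address is the one from the traceback. It checks only that a TCP connection to the server port can be opened, which is the step that times out with WinError 10060; it does not exercise the policy server's HTTP handling.

```python
import socket
import time
from datetime import datetime


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port opens within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers timeouts, refused connections, unreachable hosts
        return False


def log_reachability(host, port, interval=5.0, checks=None):
    """Print one timestamped status line per check; loop forever if checks is None."""
    done = 0
    while checks is None or done < checks:
        status = "ok" if can_connect(host, port) else "UNREACHABLE"
        stamp = datetime.now().isoformat(timespec="seconds")
        print(f"{stamp} {host}:{port} {status}", flush=True)
        done += 1
        if checks is None or done < checks:
            time.sleep(interval)


if __name__ == "__main__":
    # Address taken from the traceback; adjust for your own setup.
    log_reachability("192.168.0.18", 55556)
```

If the logs from the two VMs and the other host show UNREACHABLE at the same wall-clock time, that would point at the server machine or the network path to it; if only one host's log shows it, the problem is more likely local to that host.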