Cannot launch Ray crash course notebook: ray.init fails with "Connection reset by peer"

I am having trouble getting started with the Ray crash course from GitHub (anyscale/academy: Ray tutorials from Anyscale). I set up a virtual environment using:

conda env create -f environment.yml
conda activate anyscale-academy
tools/fix-jupyter.sh

but when I run

import ray
ray.init(ignore_reinit_error=True)

I get this error:

2021-02-01 20:02:44,716	INFO resource_spec.py:231 -- Starting Ray with 4.79 GiB memory available for workers and up to 2.41 GiB for objects. You can adjust these settings with ray.init(memory=<bytes>, object_store_memory=<bytes>).
---------------------------------------------------------------------------
ConnectionResetError                      Traceback (most recent call last)
~/anaconda3/envs/anyscale-academy/lib/python3.7/site-packages/redis/connection.py in read_from_socket(self, timeout, raise_on_timeout)
    419             if HIREDIS_USE_BYTE_BUFFER:
--> 420                 bufflen = recv_into(self._sock, self._buffer)
    421                 if bufflen == 0:

~/anaconda3/envs/anyscale-academy/lib/python3.7/site-packages/redis/_compat.py in recv_into(sock, *args, **kwargs)
     73     def recv_into(sock, *args, **kwargs):
---> 74         return sock.recv_into(*args, **kwargs)
     75 

ConnectionResetError: [Errno 54] Connection reset by peer

During handling of the above exception, another exception occurred:

ConnectionError                           Traceback (most recent call last)
<ipython-input-3-5a82cb60c68c> in <module>
----> 1 ray.init(ignore_reinit_error=True)

~/anaconda3/envs/anyscale-academy/lib/python3.7/site-packages/ray/worker.py in init(address, redis_address, redis_port, num_cpus, num_gpus, memory, object_store_memory, resources, driver_object_store_memory, redis_max_memory, log_to_driver, node_ip_address, object_ref_seed, local_mode, redirect_worker_output, redirect_output, ignore_reinit_error, num_redis_shards, redis_max_clients, redis_password, plasma_directory, huge_pages, include_java, include_dashboard, dashboard_host, dashboard_port, job_id, configure_logging, logging_level, logging_format, plasma_store_socket_name, raylet_socket_name, temp_dir, load_code_from_local, java_worker_options, use_pickle, _internal_config, lru_evict, enable_object_reconstruction)
    725             shutdown_at_exit=False,
    726             spawn_reaper=True,
--> 727             ray_params=ray_params)
    728     else:
    729         # In this case, we are connecting to an existing cluster.

~/anaconda3/envs/anyscale-academy/lib/python3.7/site-packages/ray/node.py in __init__(self, ray_params, head, shutdown_at_exit, spawn_reaper, connect_only)
    192         # Start processes.
    193         if head:
--> 194             self.start_head_processes()
    195             redis_client = self.create_redis_client()
    196             redis_client.set("session_name", self.session_name)

~/anaconda3/envs/anyscale-academy/lib/python3.7/site-packages/ray/node.py in start_head_processes(self)
    744         assert self._redis_address is None
    745         # If this is the head node, start the relevant head node processes.
--> 746         self.start_redis()
    747 
    748         self.start_gcs_server()

~/anaconda3/envs/anyscale-academy/lib/python3.7/site-packages/ray/node.py in start_redis(self)
    536              password=self._ray_params.redis_password,
    537              include_java=self._ray_params.include_java,
--> 538              fate_share=self.kernel_fate_share)
    539         assert (
    540             ray_constants.PROCESS_TYPE_REDIS_SERVER not in self.all_processes)

~/anaconda3/envs/anyscale-academy/lib/python3.7/site-packages/ray/services.py in start_redis(node_ip_address, redirect_files, resource_spec, port, redis_shard_ports, num_redis_shards, redis_max_clients, redirect_worker_output, password, use_credis, include_java, fate_share)
    789     primary_redis_client = redis.StrictRedis(
    790         host=node_ip_address, port=port, password=password)
--> 791     primary_redis_client.set("NumRedisShards", str(num_redis_shards))
    792 
    793     # Put the redirect_worker_output bool in the Redis shard so that workers

~/anaconda3/envs/anyscale-academy/lib/python3.7/site-packages/redis/client.py in set(self, name, value, ex, px, nx, xx)
   1764         if xx:
   1765             pieces.append('XX')
-> 1766         return self.execute_command('SET', *pieces)
   1767 
   1768     def __setitem__(self, name, value):

~/anaconda3/envs/anyscale-academy/lib/python3.7/site-packages/redis/client.py in execute_command(self, *args, **options)
    873         pool = self.connection_pool
    874         command_name = args[0]
--> 875         conn = self.connection or pool.get_connection(command_name, **options)
    876         try:
    877             conn.send_command(*args)

~/anaconda3/envs/anyscale-academy/lib/python3.7/site-packages/redis/connection.py in get_connection(self, command_name, *keys, **options)
   1183             try:
   1184                 # ensure this connection is connected to Redis
-> 1185                 connection.connect()
   1186                 # connections that the pool provides should be ready to send
   1187                 # a command. if not, the connection was either returned to the

~/anaconda3/envs/anyscale-academy/lib/python3.7/site-packages/redis/connection.py in connect(self)
    559         self._sock = sock
    560         try:
--> 561             self.on_connect()
    562         except RedisError:
    563             # clean up after any error in on_connect

~/anaconda3/envs/anyscale-academy/lib/python3.7/site-packages/redis/connection.py in on_connect(self)
    635 
    636             try:
--> 637                 auth_response = self.read_response()
    638             except AuthenticationWrongNumberOfArgsError:
    639                 # a username and password were specified but the Redis

~/anaconda3/envs/anyscale-academy/lib/python3.7/site-packages/redis/connection.py in read_response(self)
    732         "Read the response from a previously sent command"
    733         try:
--> 734             response = self._parser.read_response()
    735         except socket.timeout:
    736             self.disconnect()

~/anaconda3/envs/anyscale-academy/lib/python3.7/site-packages/redis/connection.py in read_response(self)
    461         response = self._reader.gets()
    462         while response is False:
--> 463             self.read_from_socket()
    464             response = self._reader.gets()
    465         # if an older version of hiredis is installed, we need to attempt

~/anaconda3/envs/anyscale-academy/lib/python3.7/site-packages/redis/connection.py in read_from_socket(self, timeout, raise_on_timeout)
    444                 return False
    445             raise ConnectionError("Error while reading from socket: %s" %
--> 446                                   (ex.args,))
    447         finally:
    448             if custom_timeout:

ConnectionError: Error while reading from socket: (54, 'Connection reset by peer')

I am using macOS and Anaconda. I am not sure how to troubleshoot this. Help please.

@sangcho Can you help me take a look at this? Thank you very much!!!

Are you running this locally on your laptop?

Yes, I am running this on a MacBook Pro with macOS Catalina 10.15.5. Please let me know what other information I can provide.

If you run it without Jupyter, do you still see the issue?

I set up a toy example from the Ray website and ran it in the terminal with the virtual environment activated:

import time
import ray

ray.init()

@ray.remote
def do_some_work(x):
    time.sleep(1) # Replace this with work you need to do.
    return x

if __name__ == "__main__":
    start = time.time()
    results = [do_some_work.remote(x) for x in range(4)]
    print("duration =", time.time() - start)
    print("results = ", results)

using the virtual environment defined in environment.yml:

name: anyscale-academy
channels:
  - conda-forge
  - pyviz
dependencies:
  - python=3.7
  - pip
  - gym >= 0.17.2
  - numpy >= 1.18.5
  - pandas
  - requests
  - pytorch
  - torchvision
  - tqdm >= 4.37.0
  - keras
  - scikit-learn
  - holoviews
  - bokeh
  - ipywidgets
  - psutil
  - jupyterlab
  - jupyter-server-proxy
  - beautifulsoup4
  - lxml
  - pytz
  - nodejs
  - pip:
    - ray[all]==0.8.7
    - tensorboard >= 2.3.0
    - tensorflow >= 2.3.0
    - atoma
    - box2d-py

Here is the error message:

(anyscale-academy) tenggao@TENGMGAO-MB0 ray_tune_trial % python ray_sample.py
File descriptor limit 256 is too low for production servers and may result in connection errors. At least 8192 is recommended. --- Fix with 'ulimit -n 8192'
Traceback (most recent call last):
  File "/Users/tenggao/anaconda3/envs/anyscale-academy/lib/python3.7/site-packages/redis/connection.py", line 427, in read_from_socket
    bufflen = recv_into(self._sock, self._buffer)
  File "/Users/tenggao/anaconda3/envs/anyscale-academy/lib/python3.7/site-packages/redis/_compat.py", line 75, in recv_into
    return sock.recv_into(*args, **kwargs)
ConnectionResetError: [Errno 54] Connection reset by peer

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "ray_sample.py", line 4, in <module>
    ray.init()
  File "/Users/tenggao/anaconda3/envs/anyscale-academy/lib/python3.7/site-packages/ray/worker.py", line 722, in init
    ray_params=ray_params)
  File "/Users/tenggao/anaconda3/envs/anyscale-academy/lib/python3.7/site-packages/ray/node.py", line 216, in __init__
    self.start_head_processes()
  File "/Users/tenggao/anaconda3/envs/anyscale-academy/lib/python3.7/site-packages/ray/node.py", line 767, in start_head_processes
    self.start_redis()
  File "/Users/tenggao/anaconda3/envs/anyscale-academy/lib/python3.7/site-packages/ray/node.py", line 597, in start_redis
    fate_share=self.kernel_fate_share)
  File "/Users/tenggao/anaconda3/envs/anyscale-academy/lib/python3.7/site-packages/ray/_private/services.py", line 851, in start_redis
    primary_redis_client.set("NumRedisShards", str(num_redis_shards))
  File "/Users/tenggao/anaconda3/envs/anyscale-academy/lib/python3.7/site-packages/redis/client.py", line 1801, in set
    return self.execute_command('SET', *pieces)
  File "/Users/tenggao/anaconda3/envs/anyscale-academy/lib/python3.7/site-packages/redis/client.py", line 898, in execute_command
    conn = self.connection or pool.get_connection(command_name, **options)
  File "/Users/tenggao/anaconda3/envs/anyscale-academy/lib/python3.7/site-packages/redis/connection.py", line 1192, in get_connection
    connection.connect()
  File "/Users/tenggao/anaconda3/envs/anyscale-academy/lib/python3.7/site-packages/redis/connection.py", line 567, in connect
    self.on_connect()
  File "/Users/tenggao/anaconda3/envs/anyscale-academy/lib/python3.7/site-packages/redis/connection.py", line 643, in on_connect
    auth_response = self.read_response()
  File "/Users/tenggao/anaconda3/envs/anyscale-academy/lib/python3.7/site-packages/redis/connection.py", line 739, in read_response
    response = self._parser.read_response()
  File "/Users/tenggao/anaconda3/envs/anyscale-academy/lib/python3.7/site-packages/redis/connection.py", line 470, in read_response
    self.read_from_socket()
  File "/Users/tenggao/anaconda3/envs/anyscale-academy/lib/python3.7/site-packages/redis/connection.py", line 453, in read_from_socket
    (ex.args,))
redis.exceptions.ConnectionError: Error while reading from socket: (54, 'Connection reset by peer')
zsh: terminated  python ray_sample.py

I have tried running the sample both with and without starting the Redis server beforehand, and I have restarted my computer, but none of this helps.
Thanks in advance; I really appreciate your help.

Are you in our public slack channel? Can you send me a DM there? (@sangcho)

I can try looking at it with you on a call, if you don't mind.

Yes, I am on the public Slack channel. Thank you, I really appreciate your time. I will send you a DM now.


@Mike_Gao and I discussed this offline. The issue was that the public IP address detected by Ray was not usable for local connections on his machine because of the VPN he is using.

I had a similar issue due to my company’s network security.

When Ray initializes, it tries to pick a server IP address that other processes (even on other hosts) can connect to. It finds this address by opening a socket to "8.8.8.8:53" (Google DNS) and checking which local IP address that socket was given. Naturally, this returns the VPN-controlled IP address, because that connection is routed out to the Internet.
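A minimal Python sketch of the detection trick described above, written from that description rather than copied from Ray's source. The function name `detect_node_ip` and the loopback fallback when there is no route are our own assumptions. Connecting a UDP socket only selects a route; no packet is actually sent.

```python
import socket


def detect_node_ip(address: str = "8.8.8.8:53") -> str:
    """Return the local IPv4 address the OS would use to reach `address`.

    Hypothetical helper illustrating the "connect to a public DNS server and
    read back our own address" technique; it is not Ray's actual function.
    """
    host, port = address.rsplit(":", 1)
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # UDP connect() picks an outgoing interface without sending data.
        sock.connect((host, int(port)))
        return sock.getsockname()[0]
    except OSError:
        # No route at all (e.g. fully offline): our own fallback, not Ray's.
        return "127.0.0.1"
    finally:
        sock.close()


if __name__ == "__main__":
    # With a VPN active, this typically prints the VPN-assigned address
    # rather than the address of your physical network interface.
    print("Detected node IP:", detect_node_ip())
```

Comparing the printed address with the VPN on and off is a quick way to see whether the VPN is what determines the address.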

Subsequently, when Ray and Redis tried to contact each other over that IP, our VPN said "No, you won't" and closed the connection.
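To check whether local traffic over a given address is being reset, one option is a small self-test like the following sketch. It is a hypothetical diagnostic (`loopback_over` is not part of Ray): it opens a throwaway TCP listener on the address, connects to it from the same process, and echoes a few bytes. It does not involve Redis, so it only approximates what Ray does at startup.

```python
import socket


def loopback_over(ip: str, timeout: float = 2.0) -> bool:
    """Return True if this machine can open a TCP connection to itself via `ip`."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.settimeout(timeout)
    client.settimeout(timeout)
    try:
        server.bind((ip, 0))  # port 0 lets the OS choose a free port
        server.listen(1)
        # The kernel completes the handshake via the listen backlog,
        # so connect() succeeds before accept() without needing a thread.
        client.connect(server.getsockname())
        conn, _ = server.accept()
        with conn:
            conn.settimeout(timeout)
            client.sendall(b"ping")
            data = b""
            while len(data) < 4:
                chunk = conn.recv(4 - len(data))
                if not chunk:  # peer closed the connection early
                    break
                data += chunk
            return data == b"ping"
    except OSError:
        # ConnectionResetError, refused connections, timeouts and bind
        # failures are all OSError subclasses.
        return False
    finally:
        client.close()
        server.close()


if __name__ == "__main__":
    import sys

    # Pass candidate addresses on the command line; defaults to loopback only.
    for ip in sys.argv[1:] or ["127.0.0.1"]:
        print(ip, "OK" if loopback_over(ip) else "FAILED")
```

If `127.0.0.1` reports OK but the VPN-assigned address reports FAILED, that matches the behaviour described in this thread.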

I found a workaround: override the node IP address Ray uses to start this conversation by setting it to "0.0.0.0".

ray.init(_node_ip_address="0.0.0.0")

This avoids the VPN interference, and since we are not starting a Ray server for general use, the local connections work fine. I don't do this in production; it is only needed for local development.
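If you want the workaround only during development, one way is to gate it behind an environment variable. This is a sketch; `RAY_LOCAL_DEV`, `ray_init_kwargs`, and `start_ray` are hypothetical names. Note the keyword spelling: the Ray 0.8.7 traceback earlier in this thread shows `node_ip_address` in the `ray.init` signature, while the workaround above uses the underscore-prefixed `_node_ip_address`, so match whichever spelling your installed Ray accepts.

```python
import os


def ray_init_kwargs(env=None) -> dict:
    """Build keyword arguments for ray.init().

    Applies the "0.0.0.0" node-IP workaround only when the hypothetical
    RAY_LOCAL_DEV environment variable is "1", so production runs keep
    Ray's normal address detection.
    """
    env = os.environ if env is None else env
    kwargs = {"ignore_reinit_error": True}
    if env.get("RAY_LOCAL_DEV") == "1":
        # Spelling used in the workaround above; older Ray (e.g. 0.8.7)
        # names this parameter node_ip_address instead.
        kwargs["_node_ip_address"] = "0.0.0.0"
    return kwargs


def start_ray():
    """Start Ray with the environment-dependent settings above."""
    import ray  # imported lazily so the helper can be used without Ray installed

    return ray.init(**ray_init_kwargs())
```

Running `RAY_LOCAL_DEV=1 python ray_sample.py` with `start_ray()` in place of `ray.init()` would then apply the workaround only on your development machine.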
