Hello all! Ray novice here.
I have encountered a curious bug. I have a large array of strings, which I pass
to a function that calculates some numbers based on each string. The calculations are quite expensive:
my trial array needs 45 minutes of calculation time on a 4x64-core AWS EC2 instance cluster.
At that size, Ray performs wonderfully.
However, as soon as I increase the array size (in this case, from 190 MB to 1.9 GB), the cluster crashes
shortly after I start the processing script, and I cannot connect to it anymore.
I run the script in a tmux session on the head node. This is the output I get from this session:
ENGAGE! # output of the script, telling me it has started to run
Shared connection to 3.125.45.70 closed.
Error: Command failed: "ip-172-31-30-166" 13:46 17-Mar-21
ssh -tt -i /home/msl/VLX.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_8cf205e11d/afdaa39097/%C -o ControlPersist=10s -o ConnectTimeout=120s ubuntu@3.125.45.70 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (docker exec -it ray_container /bin/bash -c '"'"'bash --login -c -i '"'"'"'"'"'"'"'"'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && ($SHELL)'"'"'"'"'"'"'"'"''"'"' )'
When I try to connect to the dashboard, this is what I get:
ray dashboard config2.yaml
Attempting to establish dashboard locally at localhost:8265 connected to remote port 8265
2021-03-17 21:47:25,895 VWARN commands.py:255 -- Loaded cached provider configuration from /tmp/ray-config-07e1cc0e8b2c9c9bf8c2776e266a60f754847d4c
2021-03-17 21:47:25,895 WARN commands.py:260 -- If you experience issues with the cloud provider, try re-running the command with --no-config-cache.
2021-03-17 21:47:27,065 INFO command_runner.py:356 -- Fetched IP: 3.125.45.70
2021-03-17 21:47:27,067 INFO log_timer.py:25 -- NodeUpdater: i-099977f1d0c388556: Got IP [LogTimer=2ms]
2021-03-17 21:47:27,068 INFO command_runner.py:484 -- Forwarding ports
2021-03-17 21:47:27,069 VINFO command_runner.py:488 -- Forwarding port 8265 to port 8265 on localhost.
2021-03-17 21:47:27,071 VINFO command_runner.py:508 -- Running `None`
2021-03-17 21:47:27,072 VVINFO command_runner.py:510 -- Full command is `ssh -tt -L 8265:localhost:8265 -i /home/msl/VLX.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_8cf205e11d/afdaa39097/%C -o ControlPersist=10s -o ConnectTimeout=120s ubuntu@3.125.45.70 while true; do sleep 86400; done`
mux_client_request_session: read from master failed: Broken pipe
Connection timed out during banner exchange
Error: Failed to forward dashboard from remote port 8265 to local port 8265. There are a couple possibilities:
1. The remote port is incorrectly specified
2. The local port 8265 is already in use.
The exception is: Command failed:
ssh -tt -L 8265:localhost:8265 -i /home/msl/VLX.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_8cf205e11d/afdaa39097/%C -o ControlPersist=10s -o ConnectTimeout=120s ubuntu@3.125.45.70 while true; do sleep 86400; done
The code I am running:
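import ray

ray.init(address='auto')  # connect to the already running cluster

@ray.remote
def calculateStuff(array, index):
    # Ray resolves the object reference, so `array` is the full list here
    sequence = array[index]
    throwlist = somecalc(sequence)  # somecalc is the expensive calculation (not shown)
    return throwlist

permutlist_id = ray.put(permutlist)  # store the large array once in the object store
result_ids = [calculateStuff.remote(permutlist_id, i) for i in range(len(permutlist))]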
The large array is permutlist.
Has anyone else experienced Ray crashing at a certain data size? I mean, I would understand it beyond a certain point, but 2 GB should be doable, right?
EDIT: Forgot to add specs. I am starting and controlling the cluster from a machine running Ubuntu 20.04.
The cluster runs Ray 1.2.0 on AWS EC2 instances running Ubuntu 20.04 LTS.
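One variation that may be worth testing is sketched below. It rests on an unconfirmed assumption: that the crash is memory-related, because every calculateStuff task receives the whole of permutlist and a Python list of strings is copied into each worker process when it is deserialized. The sketch splits the list into batches so that each task only receives its own slice. The names make_batches, run_batched, calculate_batch and batch_size are hypothetical and not part of the original script, and somecalc stands for the expensive calculation.

```python
from typing import Callable, List, Sequence


def make_batches(items: Sequence[str], batch_size: int) -> List[Sequence[str]]:
    """Split items into consecutive slices of at most batch_size elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


def run_batched(permutlist: Sequence[str],
                somecalc: Callable[[str], list],
                batch_size: int = 10_000) -> list:
    """Submit one Ray task per batch instead of one task per element."""
    # Imported here so that make_batches can be tried without Ray installed.
    import ray

    ray.init(address="auto", ignore_reinit_error=True)

    @ray.remote
    def calculate_batch(batch):
        # Each task receives only its own slice, not the whole array.
        return [somecalc(sequence) for sequence in batch]

    result_ids = [calculate_batch.remote(batch)
                  for batch in make_batches(permutlist, batch_size)]
    # Flatten the per-batch results back into one list, in input order.
    return [result
            for batch_result in ray.get(result_ids)
            for result in batch_result]
```

The default batch_size of 10,000 is an arbitrary placeholder. Larger batches mean fewer tasks and less scheduling overhead, while smaller batches spread the work more evenly across the cores.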