Error occurs under high memory use

Hi,

  • None: Just asking a question out of curiosity

I am new to Ray. I was testing it out and got the following error. The error only occurs under high memory use.

System I am using

Microsoft Windows 10 Pro Version 10.0.19044 Build 19044
Python 3.7.9; ray==1.13.0
2 x Intel Xeon E5-2630 v4 @ 2.2 GHz
Physical memory 64 GB, virtual memory 110 GB (SSD)

The code is given below.

import os
import time

import numpy as np
import ray


def poly2d(x, y, z, order):
    # Least-squares fit of a 2D polynomial of the given total order to z(x, y).
    A = []
    for i in range(order + 1):
        for j in range(i + 1):
            A.append(x**(i - j) * y**j)
    A = np.array(A).T  # design matrix: one row per point, one column per term
    coeff, r, rank, s = np.linalg.lstsq(A, z, rcond=None)
    return coeff


def polyval2d(x, y, coeff, order):
    # Evaluate the fitted polynomial, using the same term order as poly2d.
    z = 0
    count = 0
    for i in range(order + 1):
        for j in range(i + 1):
            z = z + x**(i - j) * y**j * coeff[count]
            count += 1
    return z


@ray.remote
def calc():
    ni = 500
    [x, y] = np.meshgrid(np.linspace(-1, 1, ni), np.linspace(-1, 1, ni))
    z = np.sqrt(x**2 + y**2) + np.random.normal(0, 0.1, (ni, ni))
    coeff = poly2d(x.flatten(), y.flatten(), z.flatten(), 18)
    z1 = polyval2d(x.flatten(), y.flatten(), coeff, 18)
    z1 = z1.reshape(ni, ni)
    return [np.sqrt(np.mean((z - z1)**2))]  # RMS error of the fit


def test():
    start_time = time.time()
    # Launch one task per logical CPU and wait for all of them.
    results = ray.get([calc.remote() for _ in range(os.cpu_count())])
    duration = time.time() - start_time
    print('Remote execution time: {}'.format(duration))

    print(results)


try:
    ray.init()
    test()
    ray.shutdown()
except Exception:
    ray.shutdown()
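
For a rough sense of why this workload is memory-heavy, the size of the design matrix that poly2d builds in each task can be estimated with a few lines of arithmetic. This is a back-of-the-envelope sketch, not a measurement: the helper names are made up for illustration, and it assumes float64 values (8 bytes each). The real peak per task is likely several times larger, since the list of columns, the array copy, and the working copies inside lstsq can all be alive at once.

```python
def n_poly_terms(order):
    """Number of terms x**(i-j) * y**j with 0 <= j <= i <= order."""
    return (order + 1) * (order + 2) // 2


def design_matrix_bytes(n_points, order, itemsize=8):
    """Bytes needed for one dense design matrix (itemsize=8 assumes float64)."""
    return n_points * n_poly_terms(order) * itemsize


ni, order = 500, 18                      # values used in calc()
per_task = design_matrix_bytes(ni * ni, order)

print(n_poly_terms(order))               # 190 columns
print(f"{per_task / 1e6:.0f} MB")        # 380 MB for a single copy of the matrix
```

Multiplying that figure by the number of tasks running at once (one per logical CPU in test()) gives a lower bound on the total memory the run needs.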

The error I am getting.

(pid=) [2022-07-23 14:47:29,885 E 11508 13272] (raylet.exe) agent_manager.cc:107: The raylet exited immediately because the Ray agent failed. The raylet fate shares with the agent. This can happen because the Ray agent was unexpectedly killed or failed. See dashboard_agent.log for the root cause.
(pid=) [2022-07-23 14:47:30,700 E 37748 31676] (gcs_server.exe) gcs_server.cc:283: Failed to get the resource load: GrpcUnknown: RPC Error message: Stream removed; RPC Error details:
(pid=) [2022-07-23 14:47:33,745 E 37748 31676] (gcs_server.exe) gcs_server.cc:283: Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details:
(pid=) [2022-07-23 14:47:33,768 E 37748 31676] (gcs_server.exe) gcs_server.cc:283: Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details:
(pid=) [2022-07-23 14:47:34,326 E 37748 31676] (gcs_server.exe) gcs_server.cc:283: Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details:
(pid=) [2022-07-23 14:47:34,789 E 37748 31676] (gcs_server.exe) gcs_server.cc:283: Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details:
(pid=) [2022-07-23 14:47:36,612 E 37748 31676] (gcs_server.exe) gcs_server.cc:283: Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details:
(pid=) [2022-07-23 14:47:37,336 E 37748 31676] (gcs_server.exe) gcs_server.cc:283: Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details:
(pid=) [2022-07-23 14:47:38,973 E 37748 31676] (gcs_server.exe) gcs_server.cc:283: Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details:
(pid=) [2022-07-23 14:47:39,881 E 37748 31676] (gcs_server.exe) gcs_server.cc:283: Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details:
(pid=) [2022-07-23 14:47:40,666 E 37748 31676] (gcs_server.exe) gcs_server.cc:283: Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details:
(pid=) [2022-07-23 14:47:41,826 E 37748 31676] (gcs_server.exe) gcs_server.cc:283: Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details:
(pid=) [2022-07-23 14:47:42,870 E 37748 31676] (gcs_server.exe) gcs_server.cc:283: Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details:
(pid=) [2022-07-23 14:47:44,266 E 37748 31676] (gcs_server.exe) gcs_server.cc:283: Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details:
(pid=) [2022-07-23 14:47:45,235 E 37748 31676] (gcs_server.exe) gcs_server.cc:283: Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details:
(pid=) [2022-07-23 14:47:46,488 E 37748 31676] (gcs_server.exe) gcs_server.cc:283: Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details:
(pid=) [2022-07-23 14:47:47,996 E 37748 31676] (gcs_server.exe) gcs_server.cc:283: Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details:
(pid=) [2022-07-23 14:47:48,968 E 37748 31676] (gcs_server.exe) gcs_server.cc:283: Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details:
(pid=) [2022-07-23 14:47:50,241 E 37748 31676] (gcs_server.exe) gcs_server.cc:283: Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details:
(pid=) [2022-07-23 14:47:51,757 E 37748 31676] (gcs_server.exe) gcs_server.cc:283: Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details:
(pid=) [2022-07-23 14:47:52,860 E 37748 31676] (gcs_server.exe) gcs_server.cc:283: Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details:
(pid=) [2022-07-23 14:47:54,089 E 37748 31676] (gcs_server.exe) gcs_server.cc:283: Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details:
(pid=) [2022-07-23 14:47:55,863 E 37748 31676] (gcs_server.exe) gcs_server.cc:283: Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details:
(pid=) [2022-07-23 14:47:56,975 E 37748 31676] (gcs_server.exe) gcs_server.cc:283: Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details:
(pid=) [2022-07-23 14:47:57,850 E 37748 31676] (gcs_server.exe) gcs_server.cc:283: Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details:
(pid=) [2022-07-23 14:47:58,944 E 37748 31676] (gcs_server.exe) gcs_server.cc:283: Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details:
(pid=) [2022-07-23 14:48:00,099 E 37748 31676] (gcs_server.exe) gcs_server.cc:283: Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details:
(pid=) [2022-07-23 14:48:01,075 E 37748 31676] (gcs_server.exe) gcs_server.cc:283: Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details:
(pid=) [2022-07-23 14:48:02,229 E 37748 31676] (gcs_server.exe) gcs_server.cc:283: Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details:
(pid=) [2022-07-23 14:48:05,557 E 37748 31676] (gcs_server.exe) gcs_server.cc:283: Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details:
(pid=) [2022-07-23 14:48:05,558 E 37748 31676] (gcs_server.exe) gcs_server.cc:283: Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details:
(pid=) [2022-07-23 14:48:05,559 E 37748 31676] (gcs_server.exe) gcs_server.cc:283: Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details:
(pid=) [2022-07-23 14:48:06,323 E 37748 31676] (gcs_server.exe) gcs_server.cc:283: Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details:
(pid=) [2022-07-23 14:48:07,383 E 37748 31676] (gcs_server.exe) gcs_server.cc:283: Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details:
(pid=) [2022-07-23 14:48:08,432 E 37748 31676] (gcs_server.exe) gcs_server.cc:283: Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details:
2022-07-23 14:48:11,909 WARNING worker.py:1404 -- The node with node id: d429ece4708fb61efa00bbba0e320ea111bb89e4d8e49b59a920e142 and address: 127.0.0.1 and node name: 127.0.0.1 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a raylet crashes unexpectedly or has lagging heartbeats.
(pid=) [2022-07-23 14:48:09,428 E 37748 31676] (gcs_server.exe) gcs_server.cc:283: Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details:
(pid=) [2022-07-23 14:48:10,441 E 37748 31676] (gcs_server.exe) gcs_server.cc:283: Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details:
(pid=) [2022-07-23 14:48:11,461 E 37748 31676] (gcs_server.exe) gcs_server.cc:283: Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details:

Hey @mathew, thanks for sharing the context. Could you also share dashboard_agent.log? It looks like the dashboard agent dies first.

By default it should be at /tmp/ray/session_latest/logs/dashboard_agent.log. If you are on nightly, you can also get it with ray logs dashboard_agent.log.

2022-07-23 23:15:17,178 INFO agent.py:109 -- Dashboard agent grpc address: 127.0.0.1:56805
2022-07-23 23:15:17,180 INFO utils.py:99 -- Get all modules by type: DashboardAgentModule
2022-07-23 23:15:17,193 INFO utils.py:111 -- Module ray.dashboard.modules.actor.actor_head cannot be loaded because we cannot import all dependencies. Download pip install ray[default] for the full dashboard functionality. Error: No module named 'aiohttp'
2022-07-23 23:15:17,583 INFO utils.py:111 -- Module ray.dashboard.modules.event.event_head cannot be loaded because we cannot import all dependencies. Download pip install ray[default] for the full dashboard functionality. Error: No module named 'aiohttp'
2022-07-23 23:15:17,600 INFO utils.py:111 -- Module ray.dashboard.modules.job.job_head cannot be loaded because we cannot import all dependencies. Download pip install ray[default] for the full dashboard functionality. Error: No module named 'aiohttp'
2022-07-23 23:15:17,612 INFO utils.py:111 -- Module ray.dashboard.modules.log.log_agent cannot be loaded because we cannot import all dependencies. Download pip install ray[default] for the full dashboard functionality. Error: No module named 'aiohttp'
2022-07-23 23:15:17,616 INFO utils.py:111 -- Module ray.dashboard.modules.log.log_head cannot be loaded because we cannot import all dependencies. Download pip install ray[default] for the full dashboard functionality. Error: No module named 'aiohttp'
2022-07-23 23:15:17,623 INFO utils.py:111 -- Module ray.dashboard.modules.node.node_head cannot be loaded because we cannot import all dependencies. Download pip install ray[default] for the full dashboard functionality. Error: No module named 'aiohttp'
2022-07-23 23:15:17,635 INFO utils.py:111 -- Module ray.dashboard.modules.reporter.reporter_agent cannot be loaded because we cannot import all dependencies. Download pip install ray[default] for the full dashboard functionality. Error: No module named 'opencensus'
2022-07-23 23:15:17,639 INFO utils.py:111 -- Module ray.dashboard.modules.reporter.reporter_head cannot be loaded because we cannot import all dependencies. Download pip install ray[default] for the full dashboard functionality. Error: No module named 'aiohttp'
2022-07-23 23:15:17,660 INFO utils.py:111 -- Module ray.dashboard.modules.serve.serve_head cannot be loaded because we cannot import all dependencies. Download pip install ray[default] for the full dashboard functionality. Error: No module named 'aiohttp'
2022-07-23 23:15:17,668 INFO utils.py:111 -- Module ray.dashboard.modules.snapshot.snapshot_head cannot be loaded because we cannot import all dependencies. Download pip install ray[default] for the full dashboard functionality. Error: No module named 'aiohttp'
2022-07-23 23:15:17,675 INFO utils.py:111 -- Module ray.dashboard.modules.state.state_head cannot be loaded because we cannot import all dependencies. Download pip install ray[default] for the full dashboard functionality. Error: No module named 'aiohttp'
2022-07-23 23:15:17,682 INFO utils.py:111 -- Module ray.dashboard.modules.test.test_agent cannot be loaded because we cannot import all dependencies. Download pip install ray[default] for the full dashboard functionality. Error: No module named 'aiohttp'
2022-07-23 23:15:17,688 INFO utils.py:111 -- Module ray.dashboard.modules.test.test_head cannot be loaded because we cannot import all dependencies. Download pip install ray[default] for the full dashboard functionality. Error: No module named 'aiohttp'
2022-07-23 23:15:17,693 INFO utils.py:111 -- Module ray.dashboard.modules.test.test_utils cannot be loaded because we cannot import all dependencies. Download pip install ray[default] for the full dashboard functionality. Error: No module named 'async_timeout'
2022-07-23 23:15:17,700 INFO utils.py:111 -- Module ray.dashboard.modules.tune.tune_head cannot be loaded because we cannot import all dependencies. Download pip install ray[default] for the full dashboard functionality. Error: No module named 'aiohttp'
2022-07-23 23:15:17,708 INFO utils.py:132 -- Available modules: [<class 'ray.dashboard.modules.runtime_env.runtime_env_agent.RuntimeEnvAgent'>]
2022-07-23 23:15:17,708 INFO agent.py:130 -- Loading DashboardAgentModule: <class 'ray.dashboard.modules.runtime_env.runtime_env_agent.RuntimeEnvAgent'>
2022-07-23 23:15:17,714 INFO agent.py:134 -- Loaded 1 modules.

@mathew Thanks for the logs - a couple of follow-up questions from this:

  1. Was the log above produced during a run that crashed under high memory usage?
  2. If you start your workload script without the dashboard, i.e. ray.init(include_dashboard=False), does the error still occur?
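
To try the second question, and optionally cap how many tasks run at once, something like the following might work. It is only a sketch: include_dashboard and num_cpus are existing ray.init arguments, but the helper names and the 2 GiB per task and 48 GiB budget figures are made-up placeholders, not measurements from this machine.

```python
import os


def tasks_that_fit(per_task_gib, budget_gib, logical_cpus=None):
    """Number of tasks to run at once, limited by a memory budget and the CPU count."""
    if logical_cpus is None:
        logical_cpus = os.cpu_count() or 1
    fit = int(budget_gib // per_task_gib)
    return max(1, min(logical_cpus, fit))


def start_ray(num_cpus):
    """Start a local Ray instance without the dashboard."""
    import ray  # imported here so the helper above works without Ray installed

    # Each task requests 1 CPU by default, so num_cpus caps concurrent tasks.
    return ray.init(include_dashboard=False, num_cpus=num_cpus)


# Hypothetical numbers: about 2 GiB peak per task, 48 GiB memory budget.
# start_ray(tasks_that_fit(per_task_gib=2.0, budget_gib=48.0))
```

If the crash disappears with fewer concurrent tasks but the dashboard still disabled, that would point at memory pressure rather than the dashboard agent itself.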