How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
Hello, I want to use a parameter server to speed up training, so I created a Ray cluster for the task. It works well with 10, 20, or even 30 workers, but when I increase the number of workers further, an error occurs:
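For context, my script follows the usual Ray parameter-server pattern. The sketch below is a simplified, Ray-free stand-in for what PS_mod_X.py does; in the real script, ParameterServer and DataWorker are @ray.remote actors, the gradients are object refs, and the driver calls ray.get(). The averaging step and the placeholder gradients here are illustrative, not my exact code:

```python
# Simplified stand-in for the actor logic in PS_mod_X.py.
# In the real script both classes are decorated with @ray.remote
# and the driver resolves results with ray.get().

class ParameterServer:
    def __init__(self, lr, num_params):
        self.lr = lr
        self.weights = [0.0] * num_params

    def apply_gradients(self, *gradients):
        # Average the gradients from all workers, then take one SGD step.
        summed = [sum(g) for g in zip(*gradients)]
        mean = [s / len(gradients) for s in summed]
        self.weights = [w - self.lr * g for w, g in zip(self.weights, mean)]
        return self.weights

    def get_weights(self):
        return self.weights


class DataWorker:
    def compute_gradients(self, weights):
        # Placeholder gradient; the real workers run a training step here.
        return [1.0 for _ in weights]


ps = ParameterServer(lr=0.1, num_params=3)
workers = [DataWorker() for _ in range(2)]

# One iteration shown; the real script loops many times and finally does
# model.set_weights(ray.get(current_weights)) on the driver (line 610 below).
grads = [w.compute_gradients(ps.get_weights()) for w in workers]
current_weights = ps.apply_gradients(*grads)
print(current_weights)
```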
2022-11-10 21:05:26,857 WARNING worker.py:1829 -- The node with node id:
0025330f481cea913acf4a3b3c207cb8b111e5db20bd418ff6c85f21 and address: 12.8.14.75 and node name: 12.8.14.75 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a (1) raylet crashes unexpectedly (OOM, preempted node, etc.) (2) raylet has lagging heartbeats due to slow network or busy workload.
Traceback (most recent call last):
File "PS_mod_X.py", line 610, in <module>
model.set_weights(ray.get(current_weights))
File "/public/lhl/envs/FBGAN_for_lhl/lib/python3.6/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
return func(*args, **kwargs)
File "/public/lhl/envs/FBGAN_for_lhl/lib/python3.6/site-packages/ray/_private/worker.py", line 2275, in get
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError: ray::ParameterServer.apply_gradients() (pid=46604, ip=12.8.15.50, repr=<PS_mod_X.ParameterServer object at 0x2b36efa437f0>)
At least one of the input arguments for this task could not be computed:
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
class_name: DataWorker
actor_id: d722c2ce069bf8245095d22801000000
pid: 73805
namespace: 2256a23b-515f-419d-aebb-3643a82bf79b
ip: 12.8.14.75
The actor is dead because its node has died. Node Id: 0025330f481cea913acf4a3b3c207cb8b111e5db20bd418ff6c85f21
Traceback (most recent call last):
File "PS_mod_X.py", line 610, in <module>
model.set_weights(ray.get(current_weights))
File "/public/lhl/envs/FBGAN_for_lhl/lib/python3.6/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
return func(*args, **kwargs)
File "/public/lhl/envs/FBGAN_for_lhl/lib/python3.6/site-packages/ray/_private/worker.py", line 2277, in get
raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
class_name: ParameterServer
actor_id: 9c238342fbe13eeae4e4334c01000000
pid: 89719
namespace: db1c534d-af19-4936-887e-6681244de81d
ip: 12.7.16.12
The actor is dead because its node has died. Node Id: 91875c8e405b40bb8c9b32f10d20f5768ce6702d3ebbd2b95167ecb8
I'm sure the node's memory is not full, so I think this happens because the heavy load makes the raylet's heartbeats lag. I therefore increased num_heartbeats_timeout using the script below (via the --system-config parameter):
srun --nodes=1 --ntasks=1 -w $node1 ray start --node-ip-address=$ip_prefix --block --head --port=6379 --num-gpus=1 --num-cpus=1 --object-store-memory=65000000000 --system-config={"num_heartbeats_timeout":300} >ray_headX.log 2>&1 & # Starting the head
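One thing I am not sure about is the shell quoting of the JSON: written unquoted as above, bash strips the inner double quotes before `ray start` ever sees them. This small check (the variable name and quoting style are just my illustration) shows the string that actually reaches the command:

```shell
# Single-quote the JSON so bash passes it to `ray start` verbatim.
# Unquoted, bash removes the inner double quotes and Ray would receive
# {num_heartbeats_timeout:300}, which is not valid JSON.
SYSTEM_CONFIG='{"num_heartbeats_timeout":300}'
echo "$SYSTEM_CONFIG"
# usage: ray start --head --port=6379 --system-config="$SYSTEM_CONFIG" ...
```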
But unfortunately, another error occurred: the Ray cluster is not found at 12.7.13.73:6379.
Traceback (most recent call last):
File "/public/lhl/envs/FBGAN_for_lhl/lib/python3.6/site-packages/ray/_private/gcs_utils.py", line 120, in check_health
resp = stub.CheckAlive(req, timeout=timeout)
File "/public/lhl/envs/FBGAN_for_lhl/lib/python3.6/site-packages/grpc/_channel.py", line 946, in __call__
return _end_unary_response_blocking(state, call, False, None)
File "/public/lhl/envs/FBGAN_for_lhl/lib/python3.6/site-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "failed to connect to all addresses"
debug_error_string = "{"created":"@1668323012.889256827","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3134,"referenced_errors":[{"created":"@1668323012.889255767","description":"failed to connect to all addresses","file":"src/core/lib/transport/error_utils.cc","file_line":163,"grpc_status":14}]}"
>
+ python -u PS.py
+ set +x
Could you please help me solve this problem?