When I increase the number of workers, actors die and the parameter server fails due to lagging heartbeats

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

Hello, I want to use a parameter server to speed up training, so I created a Ray cluster for my task. It works well when I use 10, 20, or even 30 workers, but when I increase the number of workers further, an error like this occurs:

2022-11-10 21:05:26,857 WARNING worker.py:1829 -- The node with node id: 
0025330f481cea913acf4a3b3c207cb8b111e5db20bd418ff6c85f21 and address: 12.8.14.75 and node name: 12.8.14.75 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a    (1) raylet crashes unexpectedly (OOM, preempted node, etc.) (2) raylet has lagging heartbeats due to slow network or busy workload.
Traceback (most recent call last):
  File "PS_mod_X.py", line 610, in <module>
    model.set_weights(ray.get(current_weights))
  File "/public/lhl/envs/FBGAN_for_lhl/lib/python3.6/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/public/lhl/envs/FBGAN_for_lhl/lib/python3.6/site-packages/ray/_private/worker.py", line 2275, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError: ray::ParameterServer.apply_gradients() (pid=46604, ip=12.8.15.50, repr=<PS_mod_X.ParameterServer object at 0x2b36efa437f0>)
  At least one of the input arguments for this task could not be computed:
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
        class_name: DataWorker
        actor_id: d722c2ce069bf8245095d22801000000
        pid: 73805
        namespace: 2256a23b-515f-419d-aebb-3643a82bf79b
        ip: 12.8.14.75
The actor is dead because its node has died. Node Id: 0025330f481cea913acf4a3b3c207cb8b111e5db20bd418ff6c85f21
Traceback (most recent call last):
  File "PS_mod_X.py", line 610, in <module>
    model.set_weights(ray.get(current_weights))
  File "/public/lhl/envs/FBGAN_for_lhl/lib/python3.6/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/public/lhl/envs/FBGAN_for_lhl/lib/python3.6/site-packages/ray/_private/worker.py", line 2277, in get
    raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
        class_name: ParameterServer
        actor_id: 9c238342fbe13eeae4e4334c01000000
        pid: 89719
        namespace: db1c534d-af19-4936-887e-6681244de81d
        ip: 12.7.16.12
The actor is dead because its node has died. Node Id: 91875c8e405b40bb8c9b32f10d20f5768ce6702d3ebbd2b95167ecb8

I’m sure the node’s memory is not full, so I think this happens because the heavy workload makes the raylet’s heartbeats lag. I therefore increased num_heartbeats_timeout using the script below (via the --system-config parameter).

srun --nodes=1 --ntasks=1  -w $node1 ray start --node-ip-address=$ip_prefix --block --head --port=6379 --num-gpus=1 --num-cpus=1 --object-store-memory=65000000000 --system-config={"num_heartbeats_timeout":300} >ray_headX.log 2>&1 & # Starting the head

But unfortunately, an error occurred and the Ray cluster could not be found at 12.7.13.73:6379:

Traceback (most recent call last):
  File "/public/lhl/envs/FBGAN_for_lhl/lib/python3.6/site-packages/ray/_private/gcs_utils.py", line 120, in check_health
    resp = stub.CheckAlive(req, timeout=timeout)
  File "/public/lhl/envs/FBGAN_for_lhl/lib/python3.6/site-packages/grpc/_channel.py", line 946, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/public/lhl/envs/FBGAN_for_lhl/lib/python3.6/site-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
    raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "failed to connect to all addresses"
	debug_error_string = "{"created":"@1668323012.889256827","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3134,"referenced_errors":[{"created":"@1668323012.889255767","description":"failed to connect to all addresses","file":"src/core/lib/transport/error_utils.cc","file_line":163,"grpc_status":14}]}"
>
+ python -u PS.py
+ set +x  

Could you please help me solve this problem?

@lihuiling
Could you check the logs in /tmp/ray/session_latest/logs/gcs_server.* and share them with me?
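For example, on the head node (a sketch, assuming the default Ray temp directory):

tail -n 200 /tmp/ray/session_latest/logs/gcs_server.out /tmp/ray/session_latest/logs/gcs_server.err  # assumes the default /tmp/ray session directory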

Hi @lihuiling, any updates on your end?

By the way, I believe --system-config takes in a JSON-formatted dictionary, and without quoting the shell strips the double quotes from the value before ray start sees it, so that may be why you’re getting the second error about Ray not being able to start. What was the output of the ray start command? You should see an error about this.
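For example, a minimal sketch based on your srun command above, with the JSON value wrapped in single quotes so the shell passes it through intact (other flags omitted for brevity):

ray start --head --port=6379 --system-config='{"num_heartbeats_timeout":300}'  # single quotes keep the double quotes in the JSON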

You can try the environment variable RAY_num_heartbeats_timeout=300 instead.
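For example, a minimal sketch; I’m assuming the heartbeat timeout is enforced by the GCS, so the variable needs to be set in the environment of the ray start process on the head node:

RAY_num_heartbeats_timeout=300 ray start --head --port=6379  # assumption: the GCS on the head node reads this variable at startup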