How to understand the "unexpected error" messaging

Hi all,

We’re facing seemingly random failures when running relatively simple tasks via the Ray Job Submission API.

The reported error is "Unexpected error occurred: The actor died unexpectedly before finishing the task", apparently caused by "The actor died because its node has died" (which in turn happened because the node missed too many heartbeats and was marked dead).

We structure our cluster as one central head node (marked as having 0 resources, so no tasks are scheduled there by default) plus some workers, and we have a script that assigns tasks to workers (using scheduling policies to ensure each worker gets its own individual task). I understand this isn’t exactly how Ray was designed to be used, but we have some reasons to do so.

The thing is, I never see such errors if I just run python test.py (from the head node) to submit tasks - the job simply keeps “running” while waiting for resources - but we sometimes see these issues when we submit the script via the Job Submission API.
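For context, a minimal sketch of the second submission path, using the Jobs SDK client (ray.job_submission.JobSubmissionClient) rather than the raw REST endpoint; the RAY_DASHBOARD_URL environment variable and the entrypoint string are assumptions for illustration, not our exact setup:

```python
# Hypothetical sketch of submitting the same script through the Ray Jobs SDK.
# The client only needs network access to the cluster's dashboard address.
import os

ENTRYPOINT = "python test.py"  # the same script that works when run directly

def submit(dashboard_url: str) -> str:
    # Imported lazily so this module loads even without ray installed.
    from ray.job_submission import JobSubmissionClient
    client = JobSubmissionClient(dashboard_url)
    # Returns a job id; progress can then be polled with
    # client.get_job_status(job_id) / client.get_job_logs(job_id).
    return client.submit_job(entrypoint=ENTRYPOINT)

if __name__ == "__main__" and os.environ.get("RAY_DASHBOARD_URL"):
    print(submit(os.environ["RAY_DASHBOARD_URL"]))
```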

Could you help me interpret the error message correctly, and point out if I’m doing something entirely wrong?

Thanks in advance,
Vasilii

Hi, am I reading it correctly that:

  1. you have a head node with 0 resources + some worker nodes with resources
  2. you run python test.py on the head node, which schedules actors, and it works fine
  3. you submit test.py as a job, and the actors and their nodes die

?

This sounds strange to me, since the same script is not expected to bring down nodes in any way. Can you share a repro script for this? Thanks.

Hi @Ruiyang_Wang, thanks for responding!

I don’t think we run options 2 and 3 on the same cluster; it’s either one or the other. We now mostly use option 3.
The script doesn’t use the actor concept directly; it uses the @ray.remote decorator on some functions.

Please find a somewhat simplified version of the script below:

import ray
from ray.util.scheduling_strategies import NodeAffinitySchedulingStrategy
from ray._private.resource_spec import HEAD_NODE_RESOURCE_NAME

@ray.remote(num_gpus=1)
def test_task(gpu_id, node_name_or_id):
    try:
        import torch
        # do some simple stuff
        return "foo"
    except ImportError:
        return f"PyTorch not installed on node {node_name_or_id}"
    except RuntimeError as e:
        return f"exception {e}"

def create_task_with_node_affinity(node_id, node_name_or_id, gpu_id, num_gpus):
    # soft=True: prefer the given node, but allow scheduling elsewhere
    # if it becomes unavailable.
    scheduling_strategy = NodeAffinitySchedulingStrategy(
        node_id=node_id, soft=True
    )
    return test_task.options(
        scheduling_strategy=scheduling_strategy, num_gpus=num_gpus
    ).remote(gpu_id, node_name_or_id)

def run_test_on_all_nodes():
    nodes_info = ray.nodes()
    tasks = []
    for node in nodes_info:
        if node['alive']:
            node_id = node['NodeID']
            if HEAD_NODE_RESOURCE_NAME in node['Resources']:
                print(f'Skip testing {node_id} as it is a headnode')
                continue
            node_name_or_id = node.get('NodeName', node_id)
            num_gpus = int(node.get('Resources', {}).get('GPU', 0))
            for gpu_id in range(num_gpus):
                tasks.append(create_task_with_node_affinity(node_id, node_name_or_id, gpu_id, num_gpus))

    results = ray.get(tasks)
    for result in results:
        print(result)

if __name__ == "__main__":
    ray.init()
    run_test_on_all_nodes()

and this script is run via the Submission API, using a POST request to <dashboard_url>/api/jobs/ with the payload {"entrypoint": f"python -c {shlex.quote(script)}"}.
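To make the submission path above concrete, here is a minimal sketch of how that payload is built; the stand-in script string is an assumption for illustration, and the actual POST (shown in a comment) would go to the dashboard URL of the cluster:

```python
# Hypothetical sketch of building the REST payload described above.
import shlex

script = 'print("hello from the job")'  # stand-in for the real test.py contents

def build_payload(script_text: str) -> dict:
    # shlex.quote() wraps the script so it survives shell word-splitting
    # when the job driver executes `python -c <script>`.
    return {"entrypoint": f"python -c {shlex.quote(script_text)}"}

payload = build_payload(script)
# The actual submission would then be something like:
#   requests.post(f"{dashboard_url}/api/jobs/", json=payload)
print(payload["entrypoint"])
```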

P.S. Part of the problem is that it doesn’t happen all the time; it’s somewhat random, as far as I can tell.

Hi, we previously found a similar bug in job submission. Can you try Ray 2.38 to see if the problem persists?