How to understand the "unexpected error" messaging

Hi all,

We’re facing seemingly random failures when running relatively simple tasks via the Ray Job Submission API.

The reported error is "Unexpected error occurred: The actor died unexpectedly before finishing the task", apparently caused by "The actor died because its node has died" (which in turn happened because the node missed too many heartbeats and was marked dead).

We structure our cluster as one central head node (marked as having 0 resources, so no tasks are scheduled there by default) plus some workers, and we have a script that assigns tasks to workers (using scheduling policies to ensure each worker gets its own individual task). I understand this isn’t exactly how Ray was designed to be used, but we have some reasons to do so.

The thing is, I never see such errors if I just run python test.py (from the head node) to submit tasks - the job simply keeps “running” while waiting for resources - but we sometimes see these issues when we submit the script via the Job Submission API.
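For context, a minimal sketch of the second submission path, using the Jobs SDK client (ray.job_submission.JobSubmissionClient) rather than the raw REST endpoint; the RAY_DASHBOARD_URL environment variable and the entrypoint string are assumptions for illustration, not our exact setup:

```python
# Hypothetical sketch of submitting the same script through the Ray Jobs SDK.
# The client only needs network access to the cluster's dashboard address.
import os

ENTRYPOINT = "python test.py"  # the same script that works when run directly

def submit(dashboard_url: str) -> str:
    # Imported lazily so this module loads even without ray installed.
    from ray.job_submission import JobSubmissionClient
    client = JobSubmissionClient(dashboard_url)
    # Returns a job id; progress can then be polled with
    # client.get_job_status(job_id) / client.get_job_logs(job_id).
    return client.submit_job(entrypoint=ENTRYPOINT)

if __name__ == "__main__" and os.environ.get("RAY_DASHBOARD_URL"):
    print(submit(os.environ["RAY_DASHBOARD_URL"]))
```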

Could you help me interpret the error message correctly, and point out if I’m doing something entirely wrong?

Thanks in advance,
Vasilii

Hi, am I reading it correctly that:

  1. you have a head node with 0 resources + some worker nodes with resources
  2. you run python test.py on the head node, which schedules actors, and it works fine
  3. you submit test.py as a job, and the actors and their nodes die

?

This sounds strange to me, since the same script is not expected to bring down nodes in any way. Can you share a repro script for this? Thanks.

Hi @Ruiyang_Wang, thanks for responding!

I don’t think we run options 2 and 3 on the same cluster; it’s either one or the other. We now mostly use option 3.
The script doesn’t use the actor concept directly; it uses the @ray.remote decorator on some functions.

Please find a somewhat simplified version of the script below:

import ray
from ray.util.scheduling_strategies import NodeAffinitySchedulingStrategy
from ray._private.resource_spec import HEAD_NODE_RESOURCE_NAME

@ray.remote(num_gpus=1)
def test_task(gpu_id, node_name_or_id):
    try:
        import torch
        # do some simple stuff
        return "foo"
    except ImportError:
        return f"PyTorch not installed on node {node_name_or_id}"
    except RuntimeError as e:
        return f"exception {e}"

def create_task_with_node_affinity(node_id, node_name_or_id, gpu_id, num_gpus):
    # soft=True: prefer the given node, but allow scheduling elsewhere
    # if it becomes unavailable.
    scheduling_strategy = NodeAffinitySchedulingStrategy(
        node_id=node_id, soft=True
    )
    return test_task.options(
        scheduling_strategy=scheduling_strategy, num_gpus=num_gpus
    ).remote(gpu_id, node_name_or_id)

def run_test_on_all_nodes():
    nodes_info = ray.nodes()
    tasks = []
    for node in nodes_info:
        if node['alive']:
            node_id = node['NodeID']
            if HEAD_NODE_RESOURCE_NAME in node['Resources']:
                print(f'Skip testing {node_id} as it is a headnode')
                continue
            node_name_or_id = node.get('NodeName', node_id)
            num_gpus = int(node.get('Resources', {}).get('GPU', 0))
            for gpu_id in range(num_gpus):
                tasks.append(create_task_with_node_affinity(node_id, node_name_or_id, gpu_id, num_gpus))

    results = ray.get(tasks)
    for result in results:
        print(result)

if __name__ == "__main__":
    ray.init()
    run_test_on_all_nodes()

and this script is run via the Submission API, using a POST request to <dashboard_url>/api/jobs/ with the payload {"entrypoint": f"python -c {shlex.quote(script)}"}.
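To make the submission path above concrete, here is a minimal sketch of how that payload is built; the stand-in script string is an assumption for illustration, and the actual POST (shown in a comment) would go to the dashboard URL of the cluster:

```python
# Hypothetical sketch of building the REST payload described above.
import shlex

script = 'print("hello from the job")'  # stand-in for the real test.py contents

def build_payload(script_text: str) -> dict:
    # shlex.quote() wraps the script so it survives shell word-splitting
    # when the job driver executes `python -c <script>`.
    return {"entrypoint": f"python -c {shlex.quote(script_text)}"}

payload = build_payload(script)
# The actual submission would then be something like:
#   requests.post(f"{dashboard_url}/api/jobs/", json=payload)
print(payload["entrypoint"])
```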

P.S. Part of the problem is that it doesn’t happen all the time; it’s somewhat random, as far as I can tell.

Hi, we previously found a similar bug in job submission. Can you try Ray 2.38 to see if the problem persists?