Ray stuck when the number of tasks reaches 10000

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

I am working on computing the constraint of each node in a graph using the NetworkX library.

The following is the key part of my code:

import pickle

import networkx as nx
import pandas as pd
import ray

ray.init()

data = pd.read_csv('./connection.csv')
data.columns = ['src', 'dst', 'weight']

# Build the graph, which contains about 84000 nodes.
G = nx.Graph()
for _, row in data.iterrows():
    G.add_edge(row['src'], row['dst'], weight=row['weight'])

# Share the graph with all tasks through the object store.
g = ray.put(G)

@ray.remote
def get_constraint(v):
    G = ray.get(g)
    print(v)
    if len(G[v]) == 0:
        return {v: float("nan")}
    # local_constraint(G, v, n, weight) is the per-neighbor constraint term
    # (NetworkX provides this as nx.local_constraint).
    return {v: sum(
        local_constraint(G, v, n, 'weight') for n in set(nx.all_neighbors(G, v))
    )}
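As an aside, Ray also resolves an ObjectRef that is passed as a task argument, so the explicit ray.get(g) inside the task body could be avoided. A minimal sketch of that variant (the name get_constraint_arg is just illustrative, same logic as above):

@ray.remote
def get_constraint_arg(G, v):
    # g is passed at call time; Ray resolves the ObjectRef to the graph
    # before the function body runs, so no ray.get is needed here.
    if len(G[v]) == 0:
        return {v: float("nan")}
    return {v: sum(
        local_constraint(G, v, n, 'weight') for n in set(nx.all_neighbors(G, v))
    )}

futures = [get_constraint_arg.remote(g, v) for v in G.nodes()]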

First, I tried submitting one task per node, all at once:

futures = [get_constraint.remote(v) for v in G]
constraint = ray.get(futures)

Then, after about 10000 tasks finished, the program got stuck. I checked ray summary tasks and found the following result:
[Screenshot of the ray summary tasks output]

Then I split the total tasks into groups of 9000 tasks each:

idx = 0
while idx * 9000 < len(G):
    # Submit one batch of 9000 tasks, block until all finish, persist results.
    futures = [get_constraint.remote(v) for v in list(G.nodes())[idx * 9000:(idx + 1) * 9000]]
    constraint = ray.get(futures)
    with open(f'constraint_{idx}.pkl', 'wb') as f:
        pickle.dump(constraint, f)
    idx += 1

After two iterations (constraint_0 and constraint_1 were written properly), the program got stuck with the same result from ray summary tasks.

I checked gcs_server_log.err and raylet.err; no errors were recorded.

I also checked ps aux and found lots of Ray workers in the IDLE state; I am not sure why tasks are not being scheduled to those idle workers.
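One sanity check here is to compare the cluster's total resources with what is currently available, to distinguish "tasks still running or queued" from "tasks failing to schedule". A minimal sketch using Ray's built-in resource views:

# If available CPUs stay near zero while workers look idle in ps aux,
# tasks are likely still running (or queued), not failing to schedule.
print("total:    ", ray.cluster_resources())
print("available:", ray.available_resources())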

Any help here? Thanks in advance.

Here is the screenshot showing the many idle workers:

Could you please share the local_constraint code? I wrote a shim, but it did not reproduce your bug.


I have solved the problem.

It seems I ignored the warning given by ray summary tasks, which says that only 10000 tasks will be shown and the rest will be truncated. It is a slightly annoying default, but my mistake for ignoring it. I printed out the progress and confirmed that Ray was still processing the tasks.
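For reference, a sketch of the kind of progress printing I mean, draining the futures incrementally with ray.wait instead of one blocking ray.get (the 1000-task print interval is an arbitrary choice):

remaining = [get_constraint.remote(v) for v in G.nodes()]
results = []
while remaining:
    # Wait for at least one task to finish, then collect its result.
    done, remaining = ray.wait(remaining, num_returns=1)
    results.extend(ray.get(done))
    if len(results) % 1000 == 0:
        print(f'{len(results)} tasks finished, {len(remaining)} remaining')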

The second mistake I made was:

sum(
    local_constraint(G, v, n, 'weight') for n in set(nx.all_neighbors(G, v))
)

For some v there are thousands of neighbor nodes, which makes computing sum(local_constraint) very slow and thus makes the program appear stuck.

I should also distribute the sum computation to accelerate the program.
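A rough sketch of that follow-up, splitting a high-degree node's neighbors into chunks so the local_constraint terms are summed in parallel (the names and the chunk size of 500 are arbitrary choices, reusing g and local_constraint from above):

@ray.remote
def constraint_chunk(v, neighbors):
    # Partial sum of local_constraint over one chunk of v's neighbors.
    G = ray.get(g)
    return sum(local_constraint(G, v, n, 'weight') for n in neighbors)

def get_constraint_distributed(v, chunk_size=500):
    neighbors = list(set(nx.all_neighbors(G, v)))
    if not neighbors:
        return {v: float("nan")}
    chunks = [neighbors[i:i + chunk_size]
              for i in range(0, len(neighbors), chunk_size)]
    # One task per chunk; combine the partial sums at the end.
    partials = [constraint_chunk.remote(v, chunk) for chunk in chunks]
    return {v: sum(ray.get(partials))}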