ray.get doesn't return results even though all the tasks are finished

How severe does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.
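
from collections import defaultdict
from pathlib import Path

import pandas as pd
import ray

# convert_dict, standard_tokenizer, and db_path are defined elsewhere in the full script.
@ray.remote(max_retries=-1)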
def build_inverted_list(file_path):
    vocab = set()
    inverted_list = defaultdict(set)
    item_attr = pd.read_parquet(file_path)
    # limit = 0
    for _, row in item_attr.iterrows():
        # if limit > 3000:
            # break
        # limit += 1
        item_id = row[convert_dict["item_id"]]
        for token in standard_tokenizer(row[convert_dict["title"]]):
            if token:
                inverted_list[token].add(item_id.encode())
                vocab.add(token)
    assert len(vocab) == len(inverted_list)
    return (inverted_list, vocab)

# Only the relevant code is shown here, for brevity
files = sorted(Path(db_path).glob("*"))
lists_vocabs = [build_inverted_list.remote(file_path) for file_path in files]
print("collect inverted lists ...")
lists_vocabs = ray.get(lists_vocabs)
print("collect inverted lists finished ...")

I have 181 files under my db_path folder. When I uncomment the limit code above so that only 3000 rows of each file are read, the code works as expected, so the logic itself appears to be correct. But when I read all the rows (about 800K per file), the code hangs at lists_vocabs = ray.get(lists_vocabs) for 20 h, even though the dashboard shows that all the tasks finished within about 2.5 h.

I run the code on a GCP n1-standard-96 instance. Env: Python 3.8 and the latest Ray, 2.4.0. Does anyone know how to solve or debug this issue? Thanks.

@zhiyuanpeng Does the dashboard show any OOM errors, or do the logs show any OOM errors?

cc: @rickyyx @sangcho
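
Alongside checking for OOM errors, it may help to estimate how much data each task returns, since the driver has to deserialize every result that ray.get collects. The following is only a rough sketch: build_sample and estimate_result_size are hypothetical helpers, the sample data is synthetic, and plain pickle only approximates Ray's own serialization. It is not a confirmed diagnosis of the hang.

```python
import pickle
from collections import defaultdict


def estimate_result_size(result):
    """Return the pickled size of a task result in bytes (an approximation)."""
    return len(pickle.dumps(result, protocol=pickle.HIGHEST_PROTOCOL))


def build_sample(num_rows, tokens_per_row=5):
    """Build a synthetic (inverted_list, vocab) pair shaped like the one in the post.

    The token and item names are made up for illustration only.
    """
    inverted_list = defaultdict(set)
    vocab = set()
    for row in range(num_rows):
        item_id = f"item{row}".encode()
        for t in range(tokens_per_row):
            token = f"tok{(row * tokens_per_row + t) % 1000}"
            inverted_list[token].add(item_id)
            vocab.add(token)
    return inverted_list, vocab


# Compare a small input (like the 3000-row limit) with a larger one.
small = estimate_result_size(build_sample(3_000))
large = estimate_result_size(build_sample(30_000))
print(f"3000 rows: {small} bytes, 30000 rows: {large} bytes")
```

Running the same measurement on one real result from the full-size input, then multiplying by the 181 files, gives a ballpark figure for what the driver must hold in memory at once.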

Hi Jules, thanks for your reply. I am rerunning the code and will let you know later.

Hi Jules @Jules_Damji, I copied the error message from dashboard.err:

Traceback (most recent call last):
  File "/home/xx/install/miniconda3/envs/py38/lib/python3.8/site-packages/ray/_private/gcs_utils.py", line 124, in check_health
    resp = stub.CheckAlive(req, timeout=timeout)
  File "/home/xx/install/miniconda3/envs/py38/lib/python3.8/site-packages/grpc/_channel.py", line 946, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/home/xx/install/miniconda3/envs/py38/lib/python3.8/site-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
    raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
    status = StatusCode.DEADLINE_EXCEEDED
    details = "Deadline Exceeded"
    debug_error_string = "UNKNOWN:Deadline Exceeded {created_time:"2023-05-01T18:07:09.933881436+00:00", grpc_status:4}"

That seems unrelated, actually. Which version of Ray are you using? Also, how did you verify that all the tasks are done?

@sangcho I checked the dashboard and found that all the tasks are done. I run the code on a GCP n1-standard-96 instance. Env: Python 3.8 and the latest Ray, 2.4.0.

Hmm, that’s odd. Is there any useful information in the event view of the dashboard? See: Ray Dashboard (Ray 3.0.0.dev0 docs)

I’d also love to pair debug if that works for you