How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
```python
import ray
import pandas as pd
from collections import defaultdict
from pathlib import Path

@ray.remote(max_retries=-1)
def build_inverted_list(file_path):
    # convert_dict and standard_tokenizer are defined elsewhere; omitted for brevity.
    vocab = set()
    inverted_list = defaultdict(set)
    item_attr = pd.read_parquet(file_path)
    # limit = 0
    for _, row in item_attr.iterrows():
        # if limit > 3000:
        #     break
        # limit += 1
        item_id = row[convert_dict["item_id"]]
        for token in standard_tokenizer(row[convert_dict["title"]]):
            if token:
                inverted_list[token].add(item_id.encode())
                vocab.add(token)
    assert len(vocab) == len(inverted_list)
    return (inverted_list, vocab)

# Only the relevant code is copied here for brevity.
files = sorted(Path(db_path).glob("*"))
lists_vocabs = [build_inverted_list.remote(file_path) for file_path in files]
print("collect inverted lists ...")
lists_vocabs = ray.get(lists_vocabs)
print("collect inverted lists finished ...")
```
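For completeness, after `ray.get` returns I merge the per-file results on the driver. A minimal sketch of that merge (`merge_results` is a hypothetical helper name, not part of the trimmed code above):

```python
from collections import defaultdict

def merge_results(lists_vocabs):
    """Merge per-file (inverted_list, vocab) pairs into one global index."""
    merged = defaultdict(set)
    vocab = set()
    for inverted_list, file_vocab in lists_vocabs:
        # Union the posting sets token by token.
        for token, item_ids in inverted_list.items():
            merged[token] |= item_ids
        vocab |= file_vocab
    return merged, vocab
```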
I have 181 files under my `db_path` folder. When I uncomment the `limit` code above so that only 3000 rows of each file are read, everything works as expected, so the logic itself should have no bugs. But when I read all the rows (about 800K per file), the code gets stuck at `lists_vocabs = ray.get(lists_vocabs)` for over 20 hours, even though the dashboard shows that all the tasks finished within about 2.5 hours.

I run the code on a GCP n1-standard-96 instance. Env: Python 3.8 and the latest Ray, 2.4.0. Does anyone know how to solve or debug this issue? Thanks.