How severely does this issue affect your experience of using Ray?
- None: Just asking a question out of curiosity
- Low: It annoys or frustrates me for a moment.
- Medium: It contributes significant difficulty to completing my task, but I can work around it.
- High: It blocks me from completing my task.
I have over 400 GB of training data, which I manage in sqlite3 format.
Below is the code I am testing.
OOM occurs during preprocessing and the process is killed.
It seems the OOM happens because the entire dataset is preprocessed and placed in the object store before training starts.
Is there a way to preprocess the data at each training iteration instead of loading everything into memory up front? (A sketch of what I have in mind follows the code below.)
```python
import ray
from ray.data.preprocessors import BatchMapper, Chain

# Read the documents table from sqlite3 (limited to 10 rows while testing).
ray_dataset = ray.data.read_sql(
    "SELECT * FROM documents LIMIT 10", create_connection
)

def _preprocess(batch):
    # Concatenate title and text into a single column, then drop the originals.
    batch["total_text"] = batch[["title", "text"]].apply(
        lambda x: "{} {}".format(x[0] or "", x[1] or ""), axis=1
    )
    return batch.drop("title", axis=1).drop("text", axis=1)

def _tokenize(batch):
    tokenized_output = tokenizer(
        batch["total_text"].values.tolist(),
        add_special_tokens=True,
        max_length=cfg.data.max_length,
        truncation=True,
        return_length=True,
        return_overflowing_tokens=True,
    )
    # Causal-LM style labels: a copy of input_ids.
    tokenized_output["labels"] = tokenized_output["input_ids"].clone()
    return tokenized_output

preprocessing = BatchMapper(_preprocess, batch_format="pandas")
tokenizing = BatchMapper(_tokenize, batch_format="pandas")
preprocessor = Chain(preprocessing, tokenizing)
```
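
What I am hoping for is something like the minimal sketch below: tokenize lazily batch by batch while iterating, so the full 400 GB never has to sit in the object store at once. This is only an illustration of the behavior I want, not code I know to work; `create_connection`, `tokenizer`, and `cfg` are the same objects as above, `cfg.train.num_epochs` and `cfg.train.batch_size` are hypothetical config fields, and I am assuming `map_batches` / `iter_batches` can stream blocks in my Ray version.

```python
# Hypothetical streaming variant of the pipeline above: map_batches is lazy,
# and iter_batches would (assumption) only pull a window of blocks into the
# object store at a time instead of materializing the whole dataset.
import ray
import pandas as pd

ds = ray.data.read_sql("SELECT * FROM documents", create_connection)

def _preprocess_and_tokenize(batch: pd.DataFrame) -> pd.DataFrame:
    # Same logic as _preprocess + _tokenize above, fused into one pandas UDF.
    texts = [
        "{} {}".format(title or "", text or "")
        for title, text in zip(batch["title"], batch["text"])
    ]
    tokenized = tokenizer(
        texts,
        add_special_tokens=True,
        max_length=cfg.data.max_length,
        truncation=True,
    )
    return pd.DataFrame(
        {
            "input_ids": tokenized["input_ids"],
            "attention_mask": tokenized["attention_mask"],
            # Causal-LM labels as a plain copy of input_ids.
            "labels": [list(ids) for ids in tokenized["input_ids"]],
        }
    )

ds = ds.map_batches(_preprocess_and_tokenize, batch_format="pandas")

# Consume per training iteration instead of preprocessing everything up front.
for epoch in range(cfg.train.num_epochs):
    for batch in ds.iter_batches(batch_size=cfg.train.batch_size):
        ...  # feed `batch` to the training step
```

Is this per-iteration style possible with the `BatchMapper` / `Chain` preprocessor setup, or do I need to restructure the pipeline like this?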