OOM with a large Ray Dataset

How severely does this issue affect your experience of using Ray?

  • None: Just asking a question out of curiosity
  • Low: It annoys or frustrates me for a moment.
  • Medium: It contributes significant difficulty to completing my task, but I can work around it.
  • High: It blocks me from completing my task.

I have over 400 GB of training data, which I am managing in sqlite3 format.
Below is the code I am testing.
An OOM occurs during preprocessing and the job is killed.
It seems the OOM happens because the entire dataset is preprocessed and placed in the object store before training starts.
Is there a way to preprocess the data at each training iteration instead of loading it all into memory up front?

    import ray

    # create_connection is a factory that returns a sqlite3 connection
    ray_dataset = ray.data.read_sql(
        "SELECT * FROM documents LIMIT 10", create_connection
    )

    def _preprocess(batch):
        # Concatenate title and text into one column, then drop the originals
        batch["total_text"] = batch[["title", "text"]].apply(
            lambda x: "{} {}".format(x[0] or "", x[1] or ""), axis=1
        )
        return batch.drop(columns=["title", "text"])

    def _tokenize(batch):
        # tokenizer and cfg come from the enclosing training script
        tokenized_output = tokenizer(
            batch["total_text"].values.tolist(),
            add_special_tokens=True,
            max_length=cfg.data.max_length,
            truncation=True,
            return_length=True,
            return_overflowing_tokens=True,
        )
        # input_ids are plain Python lists here, so copy() rather than clone()
        tokenized_output["labels"] = tokenized_output["input_ids"].copy()

        return tokenized_output


    from ray.data.preprocessors import BatchMapper, Chain

    preprocessing = BatchMapper(_preprocess, batch_format="pandas")
    tokenizing = BatchMapper(_tokenize, batch_format="pandas")
    preprocessor = Chain(preprocessing, tokenizing)

Are you using an AIR Trainer? If so, could you try enabling Streaming Ingest?
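
If you are, streaming ingest is mostly a configuration change on the trainer. A minimal sketch, assuming a TorchTrainer on Ray ~2.5 (the worker count, window fraction, batch size, and training loop are placeholders; ray_dataset and preprocessor are the objects from your snippet):

    import ray
    from ray.air import session
    from ray.air.config import DatasetConfig, ScalingConfig
    from ray.train.torch import TorchTrainer

    def train_loop_per_worker(config):
        # Each worker iterates over its shard; preprocessing is applied
        # window-by-window instead of materializing the whole dataset.
        train_shard = session.get_dataset_shard("train")
        for _ in range(config["num_epochs"]):
            for batch in train_shard.iter_torch_batches(batch_size=32):
                ...  # your training step

    trainer = TorchTrainer(
        train_loop_per_worker,
        train_loop_config={"num_epochs": 1},
        scaling_config=ScalingConfig(num_workers=2),
        datasets={"train": ray_dataset},
        preprocessor=preprocessor,
        # Streaming ingest: keep only a window of blocks (here up to ~20% of
        # object store memory) materialized at a time.
        dataset_config={"train": DatasetConfig(max_object_store_memory_fraction=0.2)},
    )
    result = trainer.fit()

With max_object_store_memory_fraction set, the dataset is read, preprocessed, and fed to the workers in windows rather than being fully materialized in the object store up front.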

@bveeramani
Thank you for the reply!
I solved the problem using Configuring Training Datasets — Ray 2.5.1.