OOM with a large Ray Dataset

How severely does this issue affect your experience of using Ray?

  • None: Just asking a question out of curiosity
  • Low: It annoys or frustrates me for a moment.
  • Medium: It contributes significant difficulty to completing my task, but I can work around it.
  • High: It blocks me from completing my task.

I have over 400 GB of training data, which I am managing in sqlite3 format.
Below is the code I am testing.
An OOM occurs during preprocessing and the job is killed.
It seems the OOM happens because the entire dataset is preprocessed and placed in the object store before training starts.
Is there a way to preprocess the data at each training iteration instead of loading it all into memory up front?

    import ray

    # create_connection is a factory that returns a sqlite3 connection
    ray_dataset = ray.data.read_sql(
        "SELECT * FROM documents LIMIT 10", create_connection
    )

    def _preprocess(batch):
        # Concatenate title and text into one column, then drop the originals
        batch["total_text"] = batch[["title", "text"]].apply(
            lambda x: "{} {}".format(x[0] or "", x[1] or ""), axis=1
        )
        return batch.drop(columns=["title", "text"])

    def _tokenize(batch):
        # tokenizer and cfg come from the enclosing training script
        tokenized_output = tokenizer(
            batch["total_text"].values.tolist(),
            add_special_tokens=True,
            max_length=cfg.data.max_length,
            truncation=True,
            return_length=True,
            return_overflowing_tokens=True,
        )
        # input_ids are plain Python lists here, so copy() rather than clone()
        tokenized_output["labels"] = tokenized_output["input_ids"].copy()

        return tokenized_output


    from ray.data.preprocessors import BatchMapper, Chain

    preprocessing = BatchMapper(_preprocess, batch_format="pandas")
    tokenizing = BatchMapper(_tokenize, batch_format="pandas")
    preprocessor = Chain(preprocessing, tokenizing)

Are you using an AIR Trainer? If so, could you try enabling Streaming Ingest?
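
If you are, streaming ingest is mostly a configuration change on the trainer. A minimal sketch, assuming a TorchTrainer on Ray ~2.5 (the worker count, window fraction, batch size, and training loop are placeholders; ray_dataset and preprocessor are the objects from your snippet):

    import ray
    from ray.air import session
    from ray.air.config import DatasetConfig, ScalingConfig
    from ray.train.torch import TorchTrainer

    def train_loop_per_worker(config):
        # Each worker iterates over its shard; preprocessing is applied
        # window-by-window instead of materializing the whole dataset.
        train_shard = session.get_dataset_shard("train")
        for _ in range(config["num_epochs"]):
            for batch in train_shard.iter_torch_batches(batch_size=32):
                ...  # your training step

    trainer = TorchTrainer(
        train_loop_per_worker,
        train_loop_config={"num_epochs": 1},
        scaling_config=ScalingConfig(num_workers=2),
        datasets={"train": ray_dataset},
        preprocessor=preprocessor,
        # Streaming ingest: keep only a window of blocks (here up to ~20% of
        # object store memory) materialized at a time.
        dataset_config={"train": DatasetConfig(max_object_store_memory_fraction=0.2)},
    )
    result = trainer.fit()

With max_object_store_memory_fraction set, the dataset is read, preprocessed, and fed to the workers in windows rather than being fully materialized in the object store up front.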

@bveeramani
Thank you for the reply!
I solved the problem using Configuring Training Datasets — Ray 2.5.1.