ValueError: buffer source array is read-only with ds.map_batches and pandas as the batch format

harshit206 · November 23, 2022, 10:19pm

Hi
I am having trouble processing text data using ds.map_batches with pandas as the batch format: I get ValueError: buffer source array is read-only. My code is described below.
I am using the Ray Dataset API to read Parquet files stored in S3:

ds = ray.data.read_parquet("S3//PATH")

The schema looks like this:

schema={'col A': string, 'col B': string, 'col C': list<element: string>}

Load the spaCy model:

nlp = spacy.load("en_core_web_lg")

I am doing basic processing: lowercasing the text and converting it to a spaCy Doc. My transformation function:

def transform_batch(batch: pd.DataFrame) -> pd.DataFrame:
    batch = batch.copy(deep=True)
    batch['lower_text'] = batch['text'].map(str.lower)
    batch['spacy_docs'] = batch['lower_text'].map(nlp)
    return batch

Finally, I do:

transformed_ds = ds.map_batches(transform_batch, batch_format='pandas')

The transform_batch function above works fine as a standalone pandas function, but using it with Ray raises the error:

ValueError: buffer source array is read-only

I understand that Ray uses the plasma store to hold objects, and that these objects are immutable, so they cannot be mutated in place. The Ray docs and a Ray team member from the Slack community suggested creating a copy of the object, as shown in the transform_batch function. However, I am still getting the same error. Can someone suggest a workaround?
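For readers unfamiliar with this class of error, here is a minimal sketch of what a read-only buffer looks like. It uses NumPy alone rather than Ray, and the immutable bytes object is only a stand-in (an assumption for illustration) for memory that an object store hands out as read-only. Plain assignment raises a related ValueError; the exact message in this thread appears when a library asks NumPy for a writable buffer. The helper name try_write is hypothetical.

```python
import numpy as np

# Stand-in for a zero-copy view of shared memory: bytes are immutable,
# so the array built on top of them is not writeable.
shared = np.frombuffer(bytes(8), dtype=np.uint8)


def try_write(arr):
    """Return True if an in-place write succeeds, False if NumPy rejects it."""
    try:
        arr[0] = 1
        return True
    except ValueError:
        return False


# Copying allocates fresh memory owned by the new array, which is writeable.
writable_copy = shared.copy()

print(shared.flags.writeable, try_write(shared))                # False False
print(writable_copy.flags.writeable, try_write(writable_copy))  # True True
```

Note that copying a batch only makes the batch's own data writeable. It has no effect on arrays owned by other objects that the function references, which may be why the copy did not help here.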

jianxiao · November 30, 2022, 6:44pm

Hi @harshit206, welcome to the Ray community!

I tried to make a minimal repro based on what you are doing:

import ray
import pandas as pd

ds = ray.data.from_items([
    {
        "A": "hello",
        "B": "world",
    }
])
ds.show()

def transform_batch(batch: pd.DataFrame) -> pd.DataFrame:
    batch["C"] = "welcome"
    return batch
ds2 = ds.map_batches(transform_batch)
ds2.show()

As you can see, transform_batch mutates the batch, and this runs without issue:

2022-11-30 18:40:27,008	INFO worker.py:1529 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265
{'A': 'hello', 'B': 'world'}
Map_Batches: 100%|████████████████████████████████████████████████████| 1/1 [00:00<00:00,  1.43it/s]
{'A': 'hello', 'B': 'world', 'C': 'welcome'}

Would you mind sharing a repro script? Also, which Ray version are you using?

harshit206 · November 30, 2022, 9:50pm

Hi @jianxiao,
Thanks for your response.
I found that the problem is related to the spaCy model (the nlp object): when I ran transform_batch without the spaCy model, it succeeded, just as your example did. I wonder whether spaCy tries to mutate the 'text' in place when converting it to a spaCy Doc, which would explain the ValueError: buffer source array is read-only. I solved my problem by using Ray actors:

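import ray
import spacy
from ray.util import ActorPool

@ray.remote
class Textprocessor:
    def __init__(self):
        # Setup instructions for the spaCy model; loading it inside the
        # actor means each actor process owns its own copy.
        import pytextrank  # registers the "textrank" pipeline component
        self.nlp = spacy.load("en_core_web_lg")
        self.nlp.max_length = 1080000
        self.nlp.add_pipe("textrank")

    def process(self, args):
        ## some processing ###
        return

# One actor per CPU in the cluster.
actors = []
for actor in range(int(ray.cluster_resources()['CPU'])):
    actors.append(Textprocessor.remote())
pool = ActorPool(actors)

# args is the iterable of inputs to process (not shown here).
for output in pool.map_unordered(lambda a, v: a.process.remote(v), args):
    pass  ## processing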

I am using Ray version 2.0.0

jianxiao · November 30, 2022, 11:32pm

Note that you can also use actors through .map_batches(UDF, compute=ActorPoolStrategy(min, max), ...) if actors are what you need. This is the recommended approach because the Dataset actor pool can autoscale dynamically between min and max.
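The suggestion above can be sketched roughly as follows. This is an illustration under stated assumptions, not code from the thread: the class name TextProcessor, the helper run_on_actor_pool, the "text" column, and the str.split stand-in for the spaCy model are all hypothetical. The keyword arguments min_size and max_size match ActorPoolStrategy as of Ray 2.0; other Ray versions may expose different options, so check the documentation for your release.

```python
import pandas as pd


class TextProcessor:
    """Callable class UDF: state is built once per actor in __init__."""

    def __init__(self):
        # Hypothetical stand-in for an expensive model; in this thread it
        # would be something like spacy.load("en_core_web_lg").
        self.model = str.split

    def __call__(self, batch: pd.DataFrame) -> pd.DataFrame:
        batch = batch.copy()
        batch["lower_text"] = batch["text"].str.lower()
        batch["tokens"] = batch["lower_text"].map(self.model)
        return batch


def run_on_actor_pool(ds, min_size=1, max_size=4):
    """Apply TextProcessor on an autoscaling actor pool (requires Ray)."""
    from ray.data import ActorPoolStrategy

    return ds.map_batches(
        TextProcessor,
        batch_format="pandas",
        compute=ActorPoolStrategy(min_size=min_size, max_size=max_size),
    )


# Local check of the UDF itself, without Ray.
sample = pd.DataFrame({"text": ["Hello World", "Ray Data"]})
result = TextProcessor()(sample)
print(result["tokens"].tolist())  # [['hello', 'world'], ['ray', 'data']]
```

Because the model is created in __init__, each actor builds its own instance instead of receiving one captured from the driver, which mirrors the hand-written actor solution earlier in the thread.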
