Hi
I am having trouble processing text data with ds.map_batches using pandas as the batch format. I get ValueError: buffer source array is read-only. My code is described below.
I am using the Ray Datasets API to read Parquet files stored in S3:
ds = ray.data.read_parquet("S3//PATH")
The schema looks like this:
schema={'col A': string, 'col B': string, 'col C': list<element: string>}
Load the spaCy model:
nlp = spacy.load("en_core_web_lg")
I am doing basic things like lowercasing the text and converting it to a spaCy Doc. My transformation function:
def transform_batch(batch: pd.DataFrame) -> pd.DataFrame:
    batch = batch.copy(deep=True)
    batch['lower_text'] = batch['text'].map(str.lower)
    batch['spacy_docs'] = batch['lower_text'].map(nlp)
    return batch
Finally, I do:
transformed_ds = ds.map_batches(transform_batch, batch_format='pandas')
The transform_batch function above works fine as a standalone pandas function, but running it through Ray raises:
ValueError: buffer source array is read-only
I understand that Ray keeps objects in the Plasma object store, where they are immutable, so an object cannot be mutated in place. The Ray docs and a Ray team member on the community Slack suggested creating a copy of the object, as shown in the transform_batch function. However, I am still facing the same error. Can someone suggest a workaround?
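For reference, the general idea of a read-only buffer rejecting in-place writes can be shown with the standard library alone. This is only an analogy, not Ray's object store and not the code path that raises the error above: the `shared` bytes object and the `try_write` helper are hypothetical stand-ins, and the exception here is a TypeError rather than the ValueError from the question.

```python
# Stand-in for data held in shared, immutable storage. This is an assumption
# for illustration only; it is not how Ray's object store is implemented.
shared = bytes(b"hello")
view = memoryview(shared)  # zero-copy view over the immutable bytes


def try_write(buf) -> str:
    """Attempt an in-place write to the first byte and report the outcome."""
    try:
        buf[0] = ord("H")
        return "ok"
    except TypeError as exc:
        return f"failed: {exc}"


# Writing through the read-only view is rejected.
readonly_result = try_write(view)

# Copying into a bytearray gives a private, writable buffer.
private = bytearray(view)
writable_result = try_write(memoryview(private))

print(view.readonly, readonly_result, writable_result, bytes(private))
```

The copy only helps code that then works on the copy itself. If a library builds its own view of the original shared buffer somewhere else, copying at the call site may not be enough, which could be one reason a DataFrame-level copy does not always remove this kind of error.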
Hi @harshit206, welcome to the Ray community!
I tried to make a minimal repro based on what you are doing:
import ray
import pandas as pd

ds = ray.data.from_items([
    {
        "A": "hello",
        "B": "world",
    }
])
ds.show()

def transform_batch(batch: pd.DataFrame) -> pd.DataFrame:
    batch["C"] = "welcome"
    return batch

ds2 = ds.map_batches(transform_batch)
ds2.show()
As you can see, transform_batch is mutating the batch, and this runs without issue:
2022-11-30 18:40:27,008 INFO worker.py:1529 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265
{'A': 'hello', 'B': 'world'}
Map_Batches: 100%|████████████████████████████████████████████████████| 1/1 [00:00<00:00, 1.43it/s]
{'A': 'hello', 'B': 'world', 'C': 'welcome'}
Do you mind creating a repro script? And also which Ray version are you using?
Hi @jianxiao
Thanks for your response
I found that the problem has something to do with the spaCy model (the nlp object): when I ran transform_batch without the spaCy step, it completed successfully, just as your example did. I wonder whether spaCy tries to mutate the 'text' column in place while converting it to a spaCy Doc, which would explain the ValueError: buffer source array is read-only. I solved my problem by using Ray actors:
import ray
import spacy
from ray.util import ActorPool

@ray.remote
class Textprocessor:
    def __init__(self):
        # Setup instructions for the spaCy model
        import pytextrank  # importing registers the "textrank" pipeline component
        self.nlp = spacy.load("en_core_web_lg")
        self.nlp.max_length = 1080000
        self.nlp.add_pipe("textrank")

    def process(self, args):
        ## some processing ###
        return

# Create one actor per CPU in the cluster
actors = []
for actor in range(int(ray.cluster_resources()['CPU'])):
    actors.append(Textprocessor.remote())
pool = ActorPool(actors)
for output in pool.map_unordered(lambda a, v: a.process.remote(v), args):
    ## processing
    pass
I am using Ray version 2.0.0
Note that if actors are what you need, you can use them directly in .map_batches(UDF, compute=ActorPoolStrategy(min, max), ....). This is the recommended approach because the Dataset actor pool can autoscale dynamically between min and max.
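A minimal sketch of what that call could look like, assuming Ray 2.0.0 (the version mentioned in this thread; newer releases may expose a different API). The class name `SpacyBatchProcessor`, its optional `load_model` argument, the `n_tokens` column, and the `main` function are all hypothetical names introduced for illustration. Storing a token count instead of the spaCy Doc objects is a simplification of mine, and this sketch has not been confirmed to avoid the read-only error discussed above.

```python
class SpacyBatchProcessor:
    """Callable class UDF: each actor builds one instance, so the model loads once per actor."""

    def __init__(self, load_model=None):
        # `load_model` is a hypothetical hook that allows swapping in a stub model.
        if load_model is None:
            import spacy  # imported lazily, inside the actor process
            self.nlp = spacy.load("en_core_web_lg")
        else:
            self.nlp = load_model()

    def __call__(self, batch):
        # Work on a copy so the incoming batch is left untouched.
        batch = batch.copy()
        batch["lower_text"] = batch["text"].map(str.lower)
        # Assumption: keep a plain integer per row rather than the Doc object.
        batch["n_tokens"] = batch["lower_text"].map(lambda t: len(self.nlp(t)))
        return batch


def main():
    # Requires ray, pandas, spacy, and the en_core_web_lg model to be installed.
    import ray
    from ray.data import ActorPoolStrategy

    ds = ray.data.from_items([{"text": "Hello World"}, {"text": "Ray Datasets"}])
    out = ds.map_batches(
        SpacyBatchProcessor,  # pass the class itself; Ray instantiates it in each actor
        compute=ActorPoolStrategy(min_size=1, max_size=2),
        batch_format="pandas",
    )
    out.show()
```

Calling `main()` on a machine with those packages installed runs the pipeline. The pool sizes of 1 and 2 are placeholders, so pick bounds that match your cluster.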