ValueError: buffer source array is read-only

Hello folks

Trying to run a remote task on CPU machines, but I get the following ValueError:
buffer source array is read-only
The cropped bottom of the traceback:

output = model.ops.gather_add(vectors, keys)
  File "thinc/backends/numpy_ops.pyx", line 440, in thinc.backends.numpy_ops.NumpyOps.gather_add
  File "stringsource", line 660, in View.MemoryView.memoryview_cwrapper
  File "stringsource", line 350, in View.MemoryView.memoryview.__cinit__
ValueError: buffer source array is read-only

Attaching the code snippet:

import ray
import spacy


@ray.remote
def my_task(model, text):
    """Run the spaCy pipeline on a single text."""
    print(f"Processing: {text}")
    doc = model(text)
    return doc

if __name__ == '__main__':
    ray.init()
    model = spacy.load('en_core_web_lg')
    model_ref = ray.put(model)
    texts = ['I like you. What are you doing?', 'I am fine. What about you?']
    ref_ids = [my_task.remote(model_ref, text) for text in texts]

    while len(ref_ids):
        processed, unprocessed = ray.wait(ref_ids)
        ref_ids = unprocessed
        if processed:
            out = ray.get(processed)
            print(out)

However, if I run the same model on GPU machines it throws no error. The code changes slightly: @ray.remote(num_gpus=1) is used instead, and spacy.require_gpu() is called before loading the model.

ray version: 2.2.0
spacy version: 3.4.4

Please help me!
Thanks

@KMayank29 objects put into the Ray object store are immutable, which might be why you are running into this issue. Could you try copying the model inside my_task to see if that solves it?
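To illustrate with a minimal, spaCy-free sketch: when a worker receives a NumPy array backed by the shared object store, its writeable flag is cleared, and Cython memoryviews that expect a writable buffer (like the one in NumpyOps.gather_add) raise exactly this ValueError. Copying the array gives it its own writable buffer:

```python
import numpy as np

# Simulate what a worker gets from the Ray object store:
# an array whose underlying buffer is marked read-only.
arr = np.arange(6, dtype=np.float32)
arr.setflags(write=False)

print(arr.flags.writeable)  # False: a writable memoryview over this raises ValueError

# A copy owns a fresh, writable buffer, so downstream Cython code accepts it.
writable = arr.copy()
print(writable.flags.writeable)  # True
```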

Hello @KMayank29,

I’m currently running into exactly the same error when attempting to tokenize multiple texts in parallel (using spaCy).

Did you manage to find a solution?

import ray
import spacy

class Model:
    """Wrapper that serializes only the model path, so each worker reloads spaCy locally."""
    def __init__(self, model_path):
        self.model_path = model_path
        self.model = spacy.load(self.model_path)

    def __reduce__(self):
        # Pickle just the path: deserialization calls __init__ again and
        # loads a fresh (writable) model instead of shipping its buffers.
        serialized_data = (self.model_path, )
        return self.__class__, serialized_data

model = Model(model_path)
model_ref = ray.put(model)

The above code helped me.

After writing here I’ve found that, at least for tokenization, it’s easier to just use a
spaCy pipeline:

nlp = spacy.load("it_core_news_lg", disable=["tagger", "attribute_ruler", "lemmatizer"])
df['tokenized_strings'] = list(nlp.pipe(df['initial_strings'], batch_size=256, n_process=10))

But thank you anyway for the quick reply.