Large Read-Only Items for NLP task

Hi! I’m new to Ray and have been using it for some basic NLP tasks. I had a question about the general setup of memory in the system. Ideally, I’d like to have something like:

There’s a text array and some other read-only items that are fairly large but only need to be read by each worker. Then each worker does some computation and local writes that I can combine as an output later on. It’s fairly similar to map-reduce, and I’ve done the GitHub example and seen the pattern document. My confusion comes with whether or not text_array and the other read-only items are being copied over to the workers. From the readouts in !ray memory it seems like they’re not, but when I set up the futures it gives me the warning that the function has a large size when pickled, implying that they are. It would be great if they didn’t have to be copied over since they’re read-only and should be the same across all workers, so I was hoping someone could shed some light on this.

Hmmm, the large pickle warnings refer to issues serializing the remote function itself. They tend to come from accidentally including some large object in the function’s scope.

For example,

import numpy as np
import ray

big_obj = np.zeros(1000**3)

@ray.remote
def foo():
    return np.sum(big_obj)

has to capture and serialize the entire object along with the function, whereas ideally the object would not be part of the function definition at all.

Maybe it could help if you could share some sample code?

Hi Alex! I think you captured the heart of my issue pretty well. I created a simple example of the idea of what my code is doing below. The actual code is doing something similar to latent semantic analysis but is probably a lot harder to read. So it’s taking in a large text array and doing computation on it in parallel. You can imagine arr is the text array that ideally I don’t want to be copied over to each local worker since it’s read-only.

I’m also confused why the warning says the pickled function has size ~320MB (so almost entirely because of the array arr) while !ray memory says the much more modest ~80KB. I don’t think I’m understanding the difference and the implications of the two.

Yep, so the issue is that function definitions are stored in GCS (Redis) which is only suitable for small function definitions. In general, you should try to put the large object in the object store, then pass that object as an argument. In your case, that might look like:

import numpy as np
import ray

arr = np.zeros((100000, 4000))

@ray.remote
def foo(arr):
    x = np.zeros(shape=(1000,))
    x[0] = arr[0][0]
    return x

arr_ref = ray.put(arr)
result = ray.get(foo.remote(arr_ref))

You can find some more details here: Ray Design Patterns - Google Docs


Ah, I had tried something similar but was lacking some conceptual understanding. I see the issue now, thanks Alex! And to confirm my understanding: ray.put places the object in a shared-memory object store, and then all the workers can access it through the returned object reference?

Yup! Sounds like you understand now.
