Large Read-Only Items for an NLP Task

Hi! I’m new to Ray and have been using it for some basic NLP tasks. I have a question about how memory is set up in the system. Ideally, I’d like something like the following:

There’s a text array and some other read-only items that are fairly large but only need to be read by each worker. Each worker then does some computation and local writes whose results I combine into an output later on. It’s fairly similar to map-reduce, and I’ve worked through the GitHub example and seen the pattern document. My confusion is over whether text_array and the other read-only items are being copied over to the workers. From the readouts of !ray memory it seems like they’re not, but when I set up the futures I get a warning that the function has a large size when pickled, implying that they are. It would be great if they didn’t have to be copied over, since they’re read-only and should be the same across all workers, so I was hoping someone could shed some light on this.
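
Roughly, the structure is like this (a minimal sketch; the names and the computation are placeholders for my real code):

import ray
import numpy as np

ray.init()

# large read-only input, a stand-in for the real text array
text_array = np.random.randint(0, 100, size=100_000)

@ray.remote
def process_chunk(start, end):
    # each worker reads its slice of text_array and returns a partial result
    return int(np.sum(text_array[start:end]))

# map: fan out over chunks; reduce: combine the partial results
futures = [process_chunk.remote(i, i + 10_000) for i in range(0, 100_000, 10_000)]
combined = sum(ray.get(futures))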

Hmmm, the large pickle warnings refer to issues serializing the remote function itself. They usually come from accidentally including some large object in the function’s scope.

For example,

import ray
import numpy as np

big_obj = np.zeros(1000**3)  # ~8 GB, defined in the enclosing scope

@ray.remote
def foo():
    # big_obj isn't an argument, so it gets pickled along with foo
    return np.sum(big_obj)

has to capture and serialize the entire object along with the function, whereas ideally the object wouldn’t be part of the function definition at all.

It might help if you could share some sample code?

Hi Alex! I think you captured the heart of my issue pretty well. I put together a simple example of what my code is doing below. The actual code does something similar to latent semantic analysis but is probably a lot harder to read; in short, it takes in a large text array and does computation on it in parallel. You can imagine arr is the text array that I’d ideally not want copied to each worker, since it’s read-only.
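
Here’s the stripped-down version (the shapes and the computation inside are placeholders for the real thing):

import ray
import numpy as np

ray.init()

arr = np.zeros((100000, 4000))  # stand-in for the large read-only text array

@ray.remote
def foo():
    # arr is read from the enclosing scope rather than passed in
    x = np.zeros(shape=(1000,))
    x[0] = arr[0][0]
    return x

futures = [foo.remote() for _ in range(10)]
results = ray.get(futures)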

I’m also confused about why the warning says the pickled function has size ~320 MB (so almost entirely due to the array arr) while !ray memory reports a much more modest ~80 KB. I don’t think I understand the difference between the two numbers or what they imply.

Yep, so the issue is that function definitions are stored in the GCS (Redis), which is only suitable for small function definitions. In general, you should put the large object in the object store with ray.put and then pass the resulting reference as an argument. In your case, that might look like:

import ray
import numpy as np

arr = np.zeros((100000, 4000))  # note: the shape must be passed as a tuple

@ray.remote
def foo(arr):
    # arr is now an argument, so it isn't pickled with the function definition
    x = np.zeros(shape=(1000,))
    x[0] = arr[0][0]
    return x

arr_ref = ray.put(arr)                 # one copy in the object store
result = ray.get(foo.remote(arr_ref))
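
That also explains the two numbers you were seeing: the ~320 MB warning is about the pickled function (which had arr baked into it), while !ray memory only shows what’s in the object store. If you want to check the serialized size yourself, Ray uses cloudpickle under the hood, so in a script or notebook something like this shows the difference (sizes are rough):

import cloudpickle
import numpy as np

arr = np.zeros((1000, 1000))  # ~8 MB, small enough for a quick demo

def captures():
    return arr[0][0]   # refers to arr from the enclosing scope

def takes_arg(arr):
    return arr[0][0]   # arr is an argument instead

print(len(cloudpickle.dumps(captures)))   # ~8 MB: arr is pickled with the function
print(len(cloudpickle.dumps(takes_arg)))  # a few hundred bytes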

You can find some more details here: Ray Design Patterns - Google Docs


Ah, I had tried something similar but was lacking some conceptual understanding. I see the issue now, thanks Alex! And to confirm my understanding: ray.put puts the object in a shared-memory object store, and then all the workers can reference it using the returned object reference?

Yup! Sounds like you understand now.
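
One nice consequence: on the same node, workers read the object directly out of shared memory (numpy arrays are deserialized zero-copy), so you can fan the same ref out to many tasks cheaply. A quick sketch:

import ray
import numpy as np

ray.init()

arr = np.zeros((100000, 400))
arr_ref = ray.put(arr)  # one copy in the node's shared-memory object store

@ray.remote
def row_sum(arr, i):
    # arr arrives as a read-only view backed by shared memory
    return float(np.sum(arr[i]))

# many tasks share the same stored object; no per-task copies on this node
futures = [row_sum.remote(arr_ref, i) for i in range(8)]
total = sum(ray.get(futures))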
