Hi everyone! I have a question about serialization in Modin.
If I want to pass a Modin DataFrame between contexts, is there a way to do that? Ideally I’d want to avoid writing the whole DataFrame back to s3/GCS and instead just pass metadata to recreate the connection.
I’d also love to know whether Ray plans to continue using the Plasma backend, or whether there is a plan to migrate, since some in the pyarrow community seem to be discussing deprecation.
It is a somewhat manual process at the moment, but we are working on it.
Additionally, are you sure that you need a handle to the whole dataframe on each partition? Modin makes use of the whole cluster, so you can run into some memory issues if you are doing identical operations from multiple nodes within the cluster. If you only need a handle to the individual partitions, we are also building an API for that. The right choice mostly depends on what you want to do with the Modin dataframe.
So here’s my canonical example of what I’d want to do:
Task A pulls a dataframe from s3/GCS/etc. and pre-processes it.
Tasks B1-B5 pull that pre-processed dataframe and perform some embarrassingly parallel operation.
Task C reduces all of those results (so essentially a basic MapReduce).
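For concreteness, the A, then B1-B5, then C pipeline above can be sketched with plain Python’s `concurrent.futures`. This is only a conceptual illustration of the shape of the workload; `preprocess`, `parallel_op`, and `reduce_results` are hypothetical stand-ins, not Modin or Ray APIs:

```python
from concurrent.futures import ThreadPoolExecutor

def preprocess(raw):
    # Task A: stand-in for pulling raw data from s3/GCS and cleaning it
    return [x * 2 for x in raw]

def parallel_op(chunk):
    # Tasks B1-B5: the embarrassingly parallel map step, one chunk per task
    return sum(chunk)

def reduce_results(partials):
    # Task C: the reduce step over the partial results
    return sum(partials)

def chunked(data, n_chunks):
    # Split data into n_chunks roughly equal slices
    k = max(1, len(data) // n_chunks)
    return [data[i:i + k] for i in range(0, len(data), k)]

pre = preprocess(range(10))
with ThreadPoolExecutor() as pool:
    partials = list(pool.map(parallel_op, chunked(pre, 5)))
result = reduce_results(partials)  # 90
```

The point of the question in this thread is whether the "pulls that pre-processed dataframe" step can hand each B task a handle to the same Modin frame rather than a fresh copy.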
Looking at your example, am I to assume that the _query_compiler is kind of a “roadmap” to a dataframe similar to an RDD? With Modin does the data actually exist on the machine with the client or is it all taking place in the back-end ray cluster?
By what you were saying about Modin handling the whole cluster, am I to assume that Modin kind of handles a single dataframe instance by separating it into chunks across the cluster? So if I were to say… make 5 copies of that dataframe it would put a pretty heavy load on the cluster?
Yes, it is an internal object that serves as a unified internal API layer. Modin will distribute the data across the ray cluster.
Yes, partitions are split across the cluster.
Yes, this would potentially overload the cluster and significantly oversubscribe your resources.
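To illustrate why copies are expensive: a Modin frame is held as a grid of partitions spread over the cluster, so every full copy duplicates every partition. The partitioning idea can be mimicked locally with pandas and numpy; this is only a conceptual sketch, not Modin’s actual partition manager:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": range(8), "b": range(8, 16)})

# Conceptually, Modin splits a frame into row-wise partition blocks
row_parts = np.array_split(df, 4)

# A map-then-reduce over partitions touches each block exactly once...
partials = [part["a"].sum() for part in row_parts]
total = sum(partials)

# ...whereas making 5 full copies would materialize 5x the blocks
n_blocks_for_copies = 5 * len(row_parts)  # 20 blocks instead of 4
```

With real cluster-sized partitions, that multiplier is what oversubscribes memory.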
You should be able to do your data-parallel, MapReduce-style operations with Modin’s internal APIs, which avoids the issues with multiple copies and access. I am not sure we have native GCS support, but if we don’t, adding it shouldn’t be a big undertaking.
A few questions about this use case:
What is the overall scale of data you’re working with?
What is the granularity of parallelism on the embarrassingly parallel operations (cell-wise, row-wise, column-wise)?
@dimberman The Plasma store was originally contributed from Ray, and I think the deprecation talk is because we backported it in order to improve performance. We will definitely keep using it for our distributed object store.
@devin-petersohn Ohhh okay so basically if I were in a case where I wanted to create 10 copies of a dataframe across different contexts, I’d probably be better off serializing that dataframe to s3 once and then pulling it in those parallel functions (though also to your point, if I’m just doing a MapReduce and don’t need to perform DIFFERENT maps across the same data then I can do it all as a single step and allow Modin to handle the parallelization).
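That "serialize once, pull in each parallel task" trade-off can be sketched with stdlib pieces, using a local temp file and a pickled dict as stand-ins for s3 and the pre-processed dataframe:

```python
import os
import pickle
import tempfile
from concurrent.futures import ThreadPoolExecutor

# Stand-in for the pre-processed dataframe
frame = {"col": list(range(100))}

# Write it out once (in real life: a single write to s3)
path = os.path.join(tempfile.mkdtemp(), "frame.pkl")
with open(path, "wb") as f:
    pickle.dump(frame, f)

def mapper(mult):
    # Each parallel task pulls its own copy instead of
    # sharing one in-memory frame across tasks
    with open(path, "rb") as f:
        local = pickle.load(f)
    return sum(x * mult for x in local["col"])

with ThreadPoolExecutor() as pool:
    results = list(pool.map(mapper, [1, 2, 3]))
# results == [4950, 9900, 14850]
```

The single-step alternative described above would instead hand the whole frame to Modin once and let it schedule the map internally, with no re-reads.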
I would say at the moment I’m more just kind of trying to figure out what the limits are for doing parallel execution, as the scale can be anywhere from a few bytes to gigabytes. It’s hard to prescribe exact usages as there are a pretty wide range of potential use-cases.
I think even knowing that I can save and transfer the query compiler until something more explicit is exposed buys me a lot. I can test it with the caveat that this is not necessarily meant to be spread across many tasks.
Thank you @rliaw and sangcho (can’t @ you because I’m a new user)! I’m glad to hear that. Plasma seems like a really cool project, so I was pretty surprised to see that in the pyarrow threads. That makes total sense.