Hi everyone! I have a question about serialization in Modin.
If I want to pass a Modin DataFrame between contexts, is there a way to do that? Ideally I’d want to avoid writing the whole DataFrame back to s3/GCS and instead just pass metadata to recreate the connection.
I’d also love to know whether Ray plans to continue using the Plasma backend, or whether there is a plan to migrate, since some in the pyarrow community seem to be discussing deprecation.
It is a somewhat manual process at the moment, but we are working on it.
Additionally, are you sure that you need a handle to the whole dataframe on each partition? Modin makes use of the whole cluster, so you can run into some memory issues if you are doing identical operations from multiple nodes within the cluster. If you only need a handle to the individual partitions, we are also building an API for that. The right choice mostly depends on what you want to do with the Modin dataframe.
So here’s my canonical example of what I’d want to do:
Task A pulls a dataframe from s3/GCS/etc. and pre-processes it.
Tasks B1-B5 pull that pre-processed dataframe and perform some embarrassingly parallel operation.
Task C reduces all of those results (so essentially a basic MapReduce).
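For concreteness, the A, then B1-B5, then C pipeline above can be sketched with plain Python’s `concurrent.futures`. This is only a conceptual illustration of the shape of the workload; `preprocess`, `parallel_op`, and `reduce_results` are hypothetical stand-ins, not Modin or Ray APIs:

```python
from concurrent.futures import ThreadPoolExecutor

def preprocess(raw):
    # Task A: stand-in for pulling raw data from s3/GCS and cleaning it
    return [x * 2 for x in raw]

def parallel_op(chunk):
    # Tasks B1-B5: the embarrassingly parallel map step, one chunk per task
    return sum(chunk)

def reduce_results(partials):
    # Task C: the reduce step over the partial results
    return sum(partials)

def chunked(data, n_chunks):
    # Split data into n_chunks roughly equal slices
    k = max(1, len(data) // n_chunks)
    return [data[i:i + k] for i in range(0, len(data), k)]

pre = preprocess(range(10))
with ThreadPoolExecutor() as pool:
    partials = list(pool.map(parallel_op, chunked(pre, 5)))
result = reduce_results(partials)  # 90
```

The point of the question in this thread is whether the "pulls that pre-processed dataframe" step can hand each B task a handle to the same Modin frame rather than a fresh copy.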
Looking at your example, am I to assume that the _query_compiler is kind of a “roadmap” to a dataframe similar to an RDD? With Modin does the data actually exist on the machine with the client or is it all taking place in the back-end ray cluster?
By what you were saying about Modin handling the whole cluster, am I to assume that Modin kind of handles a single dataframe instance by separating it into chunks across the cluster? So if I were to say… make 5 copies of that dataframe it would put a pretty heavy load on the cluster?
Yes, it is an internal object that serves as a unified internal API layer. Modin will distribute the data across the ray cluster.
Yes, partitions are split across the cluster.
Yes, this would potentially overload the cluster and significantly oversubscribe your resources.
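To illustrate why copies are expensive: a Modin frame is held as a grid of partitions spread over the cluster, so every full copy duplicates every partition. The partitioning idea can be mimicked locally with pandas and numpy; this is only a conceptual sketch, not Modin’s actual partition manager:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": range(8), "b": range(8, 16)})

# Conceptually, Modin splits a frame into row-wise partition blocks
row_parts = np.array_split(df, 4)

# A map-then-reduce over partitions touches each block exactly once...
partials = [part["a"].sum() for part in row_parts]
total = sum(partials)

# ...whereas making 5 full copies would materialize 5x the blocks
n_blocks_for_copies = 5 * len(row_parts)  # 20 blocks instead of 4
```

With real cluster-sized partitions, that multiplier is what oversubscribes memory.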
You should be able to do your data-parallel, MapReduce-style operations with Modin’s internal APIs, which avoids the issues with multiple copies and access. I am not sure we have native GCS support, but if we don’t, adding it shouldn’t be a big undertaking.
A few questions about this use case:
What is the overall scale of data you’re working with?
What is the granularity of parallelism on the embarrassingly parallel operations (cell-wise, row-wise, column-wise)?
@dimberman The Plasma store was originally contributed from Ray, and I think the deprecation talk is because we backported it in order to improve performance. We will definitely keep using it for our distributed object store.
@devin-petersohn Ohhh okay so basically if I were in a case where I wanted to create 10 copies of a dataframe across different contexts, I’d probably be better off serializing that dataframe to s3 once and then pulling it in those parallel functions (though also to your point, if I’m just doing a MapReduce and don’t need to perform DIFFERENT maps across the same data then I can do it all as a single step and allow Modin to handle the parallelization).
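That "serialize once, pull in each parallel task" trade-off can be sketched with stdlib pieces, using a local temp file and a pickled dict as stand-ins for s3 and the pre-processed dataframe:

```python
import os
import pickle
import tempfile
from concurrent.futures import ThreadPoolExecutor

# Stand-in for the pre-processed dataframe
frame = {"col": list(range(100))}

# Write it out once (in real life: a single write to s3)
path = os.path.join(tempfile.mkdtemp(), "frame.pkl")
with open(path, "wb") as f:
    pickle.dump(frame, f)

def mapper(mult):
    # Each parallel task pulls its own copy instead of
    # sharing one in-memory frame across tasks
    with open(path, "rb") as f:
        local = pickle.load(f)
    return sum(x * mult for x in local["col"])

with ThreadPoolExecutor() as pool:
    results = list(pool.map(mapper, [1, 2, 3]))
# results == [4950, 9900, 14850]
```

The single-step alternative described above would instead hand the whole frame to Modin once and let it schedule the map internally, with no re-reads.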
I would say at the moment I’m more just kind of trying to figure out what the limits are for doing parallel execution, as the scale can be anywhere from a few bytes to gigabytes. It’s hard to prescribe exact usages as there are a pretty wide range of potential use-cases.
I think even knowing that I can save and transfer the query compiler until something more explicit is exposed buys me a lot. I can test it with the caveat that this is not necessarily meant to be spread across many tasks.
Thank you @rliaw and sangcho (can’t @ you because I’m a new user)! I’m glad to hear that. Plasma seems like a really cool project, so I was pretty surprised to see that in the pyarrow threads. That makes total sense.