[Core] Ray Cluster with Shared Global Pandas DataFrame

Is it possible to create a Ray configuration where 100 actors can read and write to a shared global pandas DataFrame? What is the fastest, most real-time way of accomplishing this? I'm trying to build a system for real-time analytics / OLAP, similar to Druid.

Previous responses suggested the approach from "Antipattern: Accessing Global Variable" (Ray Design Patterns).

This works well, but it forces the entire DataFrame to be copied to each actor, which is not as fast. Is there a real-time, OLAP-style way of sharing data between Ray actors?
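
For context, here is a minimal sketch of what I'm doing now (names are illustrative): the DataFrame is broadcast through the object store, and each task or actor that receives it materializes its own copy.

```python
import ray
import pandas as pd

ray.init()

# Build the DataFrame once on the driver and put it in the object store.
df = pd.DataFrame({"price": [1.0, 2.0, 3.0], "qty": [10, 20, 30]})
df_ref = ray.put(df)

@ray.remote
def revenue(df: pd.DataFrame) -> float:
    # Ray resolves df_ref to the DataFrame before this task runs,
    # so each worker deserializes its own copy of the data.
    return float((df["price"] * df["qty"]).sum())

# Fan the same object ref out to many workers.
print(ray.get([revenue.remote(df_ref) for _ in range(4)]))
```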

One way is to implement a separate actor that holds the DataFrame for you.

It also depends on what types of operations you want to support on the DataFrame.
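
A minimal sketch of that pattern, assuming a simple append/query interface (the class and method names are illustrative, not part of the Ray API):

```python
import ray
import pandas as pd

ray.init()

# A single actor owns the DataFrame; all reads and writes go through it,
# so updates are serialized and every caller sees a consistent view.
@ray.remote
class DataFrameHolder:
    def __init__(self, df: pd.DataFrame):
        self.df = df

    def append(self, rows: pd.DataFrame) -> None:
        self.df = pd.concat([self.df, rows], ignore_index=True)

    def query(self, expr: str) -> pd.DataFrame:
        # Returns a copy of the matching rows to the caller.
        return self.df.query(expr)

holder = DataFrameHolder.remote(pd.DataFrame({"x": [1, 2, 3]}))
holder.append.remote(pd.DataFrame({"x": [4]}))
print(ray.get(holder.query.remote("x > 2")))
```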

Thanks @rliaw, this is the solution I'm currently using (from the Ray Design Patterns doc), but I wanted to know if there is a higher-performance method.

You could:

  1. Shard the DataFrame across multiple actors, each holding a different shard (see the first sketch below).
  2. Use max_concurrency > 1 for your actor, allowing multiple concurrent calls into it. Note that you'll still be bottlenecked by the GIL (see the second sketch below).
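
A rough sketch of option 1, assuming rows are partitioned round-robin across shard actors and reads are aggregated on the driver (all names illustrative):

```python
import ray
import pandas as pd

ray.init()

@ray.remote
class Shard:
    """Holds one horizontal partition of the global DataFrame."""
    def __init__(self, df: pd.DataFrame):
        self.df = df

    def append(self, rows: pd.DataFrame) -> None:
        self.df = pd.concat([self.df, rows], ignore_index=True)

    def partial_sum(self, column: str) -> float:
        return float(self.df[column].sum())

NUM_SHARDS = 4
df = pd.DataFrame({"x": range(1000)})
# Split rows round-robin (or by key) across the shard actors.
shards = [Shard.remote(df.iloc[i::NUM_SHARDS]) for i in range(NUM_SHARDS)]

# Writes route to a single shard; reads fan out and aggregate on the driver.
total = sum(ray.get([s.partial_sum.remote("x") for s in shards]))
print(total)  # 499500
```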
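
And a sketch of option 2: with max_concurrency > 1, method calls may run on multiple threads inside the single actor process, so mutations of shared state need a lock (and the GIL still serializes Python bytecode):

```python
import threading

import ray
import pandas as pd

ray.init()

@ray.remote
class ConcurrentHolder:
    def __init__(self, df: pd.DataFrame):
        self.df = df
        # Calls may run on multiple threads, so guard mutations.
        self.lock = threading.Lock()

    def append(self, rows: pd.DataFrame) -> None:
        with self.lock:
            self.df = pd.concat([self.df, rows], ignore_index=True)

    def head(self, n: int = 5) -> pd.DataFrame:
        with self.lock:
            return self.df.head(n)

# Allow up to 8 calls to run concurrently in this actor
# (as threads sharing one GIL, not separate processes).
holder = ConcurrentHolder.options(max_concurrency=8).remote(
    pd.DataFrame({"x": [1, 2, 3]})
)
print(ray.get(holder.head.remote()))
```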