Maintain Dataset after terminating Spark

So I have this use case where I want to query data from Hive and put it into a Ray Dataset.

Since Ray does not natively interface with Hive (Deltacat when? :smiley: ), I am trying to use Spark on Ray (RayDP) to query Hive and convert the result to a Ray Dataset.
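
For reference, my setup looks roughly like this (the table name and resource numbers are placeholders, and I'm assuming RayDP's `enable_hive` flag here; adjust for your version):

```python
import ray
import raydp

ray.init()

# Spark-on-Ray session; the executors reserve CPU/memory from the
# Ray cluster for as long as the session is alive.
spark = raydp.init_spark(
    app_name="hive-to-ray",
    num_executors=2,
    executor_cores=2,
    executor_memory="4GB",
    enable_hive=True,  # assumption: RayDP's Hive support flag
)

# Query Hive through Spark, then hand the result to Ray Data.
df = spark.sql("SELECT * FROM my_db.my_table")
ds = ray.data.from_spark(df)
```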

Everything works fine! (nice). Now, the issue is that keeping Spark on Ray up requires having active Spark executors, which hold CPU and memory even when they are not doing anything. I have no use for Spark other than querying the data.

The problem is that if I shut down Spark, I lose my Ray Dataset too, because the dataset’s “owner” died:

```
OwnerDiedError: Failed to retrieve object 00bb3e4d5288eede0cb3f20f7e6a2ac76fd12e830200000001000000. To see information about where this ObjectRef was created in Python, set the environment variable RAY_record_ref_creation_sites=1 during `ray start` and `ray.init()`.

The object's owner has exited. This is the Python worker that first created the ObjectRef via `.remote()` or `ray.put()`. Check cluster logs (`/tmp/ray/session_latest/logs`) for more information about the Python worker failure.
```
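
In code, the failure path looks like this (continuing from the setup sketch above):

```python
# Convert the Spark DataFrame into a Ray Dataset as before.
ds = ray.data.from_spark(df)

# Stopping Spark tears down the executors/driver that own the
# object refs backing the dataset blocks...
raydp.stop_spark()

# ...so any subsequent access fails.
ds.take(1)  # raises ray.exceptions.OwnerDiedError
```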

Is there any way to move the dataset “fully” into Ray and remove the Spark ownership? Otherwise I would not be able to do much with my Ray cluster, as all the resources are taken up by Spark.

I tried running fully_executed(), hoping it would move the data away from Spark into Ray - but no luck unfortunately.

Looks like this should be possible - let me take a look and report back here.

Ok, this can be achieved, but not with from_spark, as it does not pass through RayDP's _use_owner keyword.
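
In case anyone hits this in the meantime, here is a sketch of the workaround, calling RayDP's conversion function directly instead of ray.data.from_spark (the module path and exact signature are from my reading of RayDP, so double-check against your version):

```python
import ray
import raydp
from raydp.spark import spark_dataframe_to_ray_dataset

ray.init()
spark = raydp.init_spark(
    app_name="hive-to-ray",
    num_executors=2,
    executor_cores=2,
    executor_memory="4GB",
    enable_hive=True,  # assumption: RayDP's Hive support flag
)
df = spark.sql("SELECT * FROM my_db.my_table")

# _use_owner=True asks RayDP to assign ownership of the dataset
# blocks to a long-lived RayDP-managed actor rather than the Spark
# workers, so the blocks can outlive the Spark session.
ds = spark_dataframe_to_ray_dataset(df, _use_owner=True)

# Note: depending on your RayDP version, stop_spark may take a flag
# controlling whether that object-holder actor is kept alive.
raydp.stop_spark()

print(ds.take(1))  # the dataset survives Spark's shutdown
```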

Would you be OK if I opened a small PR to add that parameter to from_spark? It would give users the flexibility to persist their dataset even after terminating the Spark process.
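
Hypothetically, with that parameter exposed, usage would collapse to something like this (the exact parameter name would be decided in the PR):

```python
# Hypothetical API: parameter name subject to the PR/review.
ds = ray.data.from_spark(df, _use_owner=True)
raydp.stop_spark()
ds.take(1)  # dataset persists after Spark terminates
```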

Hey @Andrea_Pisoni, thank you for pushing on this and figuring it out! And that contribution would be awesome; it looks like the addition of that parameter to RayDP fell off our radar, and we should definitely expose an equivalent parameter in ray.data.from_spark().

Please let me know if you want any help with the PR!