So I have this use case where I want to query data from Hive and load it into a Ray Dataset.
Since Ray does not natively interface with Hive (DeltaCAT when?), I use Spark on Ray to query Hive and convert the result to a Ray Dataset.
Everything works fine (nice!). The catch is that keeping Spark on Ray around requires active Spark executors, which hold CPU and memory even when they are idle. I have no use for Spark other than querying the data.
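For reference, my flow looks roughly like this (a trimmed sketch: I'm using RayDP for Spark on Ray, and the database/table name and resource numbers are just placeholders):

```python
import ray
import raydp

ray.init(address="auto")

# Spark executors run as Ray actors and hold their CPU/memory
# reservation for the lifetime of the Spark session.
spark = raydp.init_spark(
    app_name="hive_query",
    num_executors=4,
    executor_cores=2,
    executor_memory="4g",
    enable_hive=True,
)

# Query Hive through Spark, then hand the result to Ray.
df = spark.sql("SELECT * FROM my_db.my_table")  # placeholder table
ds = ray.data.from_spark(df)

# Shutting down Spark here kills the workers that own the dataset's
# object refs, so any later read on `ds` raises OwnerDiedError.
raydp.stop_spark()
```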
The issue is that if I shut down Spark, I lose my Ray Dataset too, because the dataset's “owner” died:
OwnerDiedError: Failed to retrieve object 00bb3e4d5288eede0cb3f20f7e6a2ac76fd12e830200000001000000. To see information about where this ObjectRef was created in Python, set the environment variable RAY_record_ref_creation_sites=1 during `ray start` and `ray.init()`.
The object's owner has exited. This is the Python worker that first created the ObjectRef via `.remote()` or `ray.put()`. Check cluster logs (`/tmp/ray/session_latest/logs`) for more information about the Python worker failure.
Is there any way to move the dataset “fully” into Ray and remove the Spark ownership? Otherwise I can't do much with my Ray cluster, since all the resources are taken up by Spark.
I tried running `fully_executed()` on the dataset, hoping it would move the data out of Spark and into Ray, but no luck unfortunately.
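The only workaround I can think of is staging the data through storage so that Ray re-reads it itself and owns the resulting blocks, but that costs an extra round trip through the filesystem, which I'd like to avoid (the path here is just a placeholder):

```python
# Fallback: persist from Spark, shut Spark down, then read with Ray.
# After read_parquet, the blocks are owned by Ray workers, not Spark.
df.write.parquet("hdfs://example/staging/my_table")  # placeholder path

raydp.stop_spark()  # safe now; the data lives in storage, not in Spark

ds = ray.data.read_parquet("hdfs://example/staging/my_table")
```

Is there a way to get the same effect without the intermediate write?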