Maintain Dataset after terminating Spark

So I have this use case where I want to query data from Hive and put it into a Ray Dataset.

Since Ray does not natively interface with Hive (Deltacat when? :smiley: ), I am trying to use Spark on Ray (RayDP) to query Hive and convert the result to a Ray Dataset.
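
For reference, my setup looks roughly like this (the table name and resource numbers are placeholders, and I'm assuming RayDP's `enable_hive` flag here; adjust for your version):

```python
import ray
import raydp

ray.init()

# Spark-on-Ray session; the executors reserve CPU/memory from the
# Ray cluster for as long as the session is alive.
spark = raydp.init_spark(
    app_name="hive-to-ray",
    num_executors=2,
    executor_cores=2,
    executor_memory="4GB",
    enable_hive=True,  # assumption: RayDP's Hive support flag
)

# Query Hive through Spark, then hand the result to Ray Data.
df = spark.sql("SELECT * FROM my_db.my_table")
ds = ray.data.from_spark(df)
```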

Everything works fine! (nice). Now, the issue is that keeping Spark on Ray up requires having active Spark executors, which hold CPU and memory even when they are not doing anything. I have no use for Spark other than querying the data.

The problem is that if I shut down Spark, I lose my Ray Dataset too, because the dataset’s “owner” died:

```
OwnerDiedError: Failed to retrieve object 00bb3e4d5288eede0cb3f20f7e6a2ac76fd12e830200000001000000. To see information about where this ObjectRef was created in Python, set the environment variable RAY_record_ref_creation_sites=1 during `ray start` and `ray.init()`.

The object's owner has exited. This is the Python worker that first created the ObjectRef via `.remote()` or `ray.put()`. Check cluster logs (`/tmp/ray/session_latest/logs`) for more information about the Python worker failure.
```
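
In code, the failure path looks like this (continuing from the setup sketch above):

```python
# Convert the Spark DataFrame into a Ray Dataset as before.
ds = ray.data.from_spark(df)

# Stopping Spark tears down the executors/driver that own the
# object refs backing the dataset blocks...
raydp.stop_spark()

# ...so any subsequent access fails.
ds.take(1)  # raises ray.exceptions.OwnerDiedError
```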

Is there any way to move the dataset “fully” into Ray and remove the Spark ownership? Otherwise I would not be able to do much with my Ray cluster, as all the resources are taken up by Spark.

I tried running fully_executed(), hoping it would move the data away from Spark into Ray - but no luck unfortunately.

Looks like this should be possible - let me take a look and report back here.

Ok, this can be achieved, but not with from_spark, as it does not pass through RayDP's _use_owner keyword.
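
In case anyone hits this in the meantime, here is a sketch of the workaround, calling RayDP's conversion function directly instead of ray.data.from_spark (the module path and exact signature are from my reading of RayDP, so double-check against your version):

```python
import ray
import raydp
from raydp.spark import spark_dataframe_to_ray_dataset

ray.init()
spark = raydp.init_spark(
    app_name="hive-to-ray",
    num_executors=2,
    executor_cores=2,
    executor_memory="4GB",
    enable_hive=True,  # assumption: RayDP's Hive support flag
)
df = spark.sql("SELECT * FROM my_db.my_table")

# _use_owner=True asks RayDP to assign ownership of the dataset
# blocks to a long-lived RayDP-managed actor rather than the Spark
# workers, so the blocks can outlive the Spark session.
ds = spark_dataframe_to_ray_dataset(df, _use_owner=True)

# Note: depending on your RayDP version, stop_spark may take a flag
# controlling whether that object-holder actor is kept alive.
raydp.stop_spark()

print(ds.take(1))  # the dataset survives Spark's shutdown
```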

Would you be OK if I opened a small PR to add that parameter to from_spark? It would give users the flexibility to persist their dataset even after terminating the Spark process.
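
Hypothetically, with that parameter exposed, usage would collapse to something like this (the exact parameter name would be decided in the PR):

```python
# Hypothetical API: parameter name subject to the PR/review.
ds = ray.data.from_spark(df, _use_owner=True)
raydp.stop_spark()
ds.take(1)  # dataset persists after Spark terminates
```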

Hey @Andrea_Pisoni, thank you for pushing on this and figuring it out! And that contribution would be awesome; it looks like the addition of that parameter to RayDP fell off our radar, and we should definitely expose an equivalent parameter in ray.data.from_spark().

Please let me know if you want any help with the PR!