Reading an Azure blob into a Spark DataFrame using ray.remote

Hi,

I am developing a Dash Python application that runs on a Databricks Spark cluster. I am trying to read a parquet file (around 200 million rows) from Azure blob storage into a Spark DataFrame, perform a distinct operation on selected columns, and convert the output to a dictionary. I am trying to run this with ray.remote, following the link (RayDemo - Databricks), to speed things up, but I am getting the error below:
It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.

Code:

import ray

@ray.remote
def test():
    # spark.read inside a Ray task is what triggers SPARK-5063:
    # the SparkContext is only available on the driver.
    df = spark.read.parquet(<Azure blob parquet file path>).select(cols)
    testdf = df.select(<column name>).distinct()
    return testdf

res = test.remote()
result = ray.get(res)
print(result)

I am not sure if I am doing it right. Please help me resolve the error.


Hi @Sumithra_S
Thanks for posting this question!

I want to understand a bit more about the Spark operation you are trying to do (you described it as “perform a distinct operation on selected columns and convert the output to dictionary”). I wonder if it makes sense to use Ray Dataset for “last mile” data processing. What do you do with the output dictionary? Is it then used as input to some other tasks?

As for the code snippet that you shared, the problem is that you cannot call spark.xxx from within anything wrapped in the ray.remote decorator, since the SparkContext can only be used on the driver, not on the workers.
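If you keep the Spark calls on the driver, you can still hand the reduced result to Ray as plain Python data. A minimal sketch, assuming a Databricks notebook where spark is already defined (the path, column name, and the to_payload helper are placeholders of mine, not from your code):

import ray

# Run the Spark query on the driver, where the SparkContext lives.
df = spark.read.parquet("<Azure blob parquet file path>")
values = [row[0] for row in df.select("<column name>").distinct().collect()]

@ray.remote
def to_payload(vals):
    # Only plain, picklable Python objects cross the Ray task boundary.
    # The dictionary shape here is a guess; adapt it to what you need.
    return {"<column name>": vals}

result = ray.get(to_payload.remote(values))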

Hi @xwjiang2010

Thank you for the response. I am running a Python app on an Azure Databricks Spark cluster that reads data from Azure blob storage into a DataFrame and performs various operations on it. The Python app uses React as its UI. In order to pass data to React, I convert the output of every query run on the Spark cluster into JSON. Converting the DataFrame to a dictionary after performing operations on it is what takes the time, so I was wondering if there is a better way of executing this logic on the Spark cluster from the Python app.
In my scenario, would I be able to run all the subsequent operations on the Spark DataFrame through ray.remote to make them execute faster?

Thanks
Sumithra Subramanian

@Sumithra_S Yeah, I think just decorating with @ray.remote won’t work, since the SparkContext can only be used on the driver.

Having a React frontend for a Spark application sounds like a pretty common pattern. Have you checked out how other people architect it?
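One thing that might help regardless of Ray: if the reduction (the distinct) happens in Spark first, only the small result has to be collected and serialized, so the dictionary/JSON conversion itself should stay cheap. A rough sketch, with the column name as a placeholder:

import json

small = df.select("<column name>").distinct()  # the reduction runs on the cluster
rows = small.collect()                         # only the small result reaches the driver
payload = json.dumps({"<column name>": [r[0] for r in rows]})  # JSON for the React UI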

Hi @xwjiang2010 ,

Got it. And yeah, I have been trying to get help on other forums as well, but haven’t found a solution yet. So I was thinking I could do the same operation with a Ray Dataset if Spark doesn’t work out.
Could you please point me to some Ray documentation on how to read files from Azure Blob Storage into a Ray Dataset and perform the same operations on it?

Thanks in advance

Yeah sure. I think that is a direction worth looking into.

There are some materials on Ray Datasets in the docs; please take a look.
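For example, here is a rough sketch of what reading that parquet file into a Ray Dataset could look like. This assumes adlfs and pyarrow are installed on the cluster; the account, key, container, path, and column name are all placeholders, not anything verified against your setup:

import ray
import adlfs
import pyarrow.fs

# Wrap the fsspec Azure filesystem so Arrow / Ray Data can use it.
abfs = adlfs.AzureBlobFileSystem(account_name="<account>", account_key="<key>")
fs = pyarrow.fs.PyFileSystem(pyarrow.fs.FSSpecHandler(abfs))

ds = ray.data.read_parquet(
    "<container>/<path to parquet>",
    filesystem=fs,
    columns=["<column name>"],  # read only the column you need
)

# Recent Ray releases expose Dataset.unique for the distinct step; on older
# versions you would go through a groupby or to_pandas() instead.
distinct_vals = ds.unique("<column name>")
result = {"<column name>": distinct_vals}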

Thank you @xwjiang2010 🙂 I will try this!