Reading an Azure blob into a Spark DataFrame using ray.remote

Hi,

I am developing a Dash Python application that runs on a Databricks Spark cluster. I am trying to read a parquet file (around 200 million rows) from Azure blob storage into a Spark DataFrame, perform a distinct operation on selected columns, and convert the output to a dictionary. I am trying to run this with ray.remote, following the link (RayDemo - Databricks), to speed things up, but I am getting the error below:
It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.

Code:

import ray

@ray.remote
def test():
    # spark.read inside a Ray task is what triggers SPARK-5063:
    # the SparkContext is only available on the driver.
    df = spark.read.parquet(<Azure blob parquet file path>).select(cols)
    testdf = df.select(<column name>).distinct()
    return testdf

res = test.remote()
result = ray.get(res)
print(result)

I am not sure if I am doing it right. Please help me resolve the error.


Hi @Sumithra_S
Thanks for posting this question!

I want to understand a bit more about the Spark operation you are trying to do (you described it as “perform a distinct operation on selected columns and convert the output to dictionary”). I wonder if it makes sense to use Ray Dataset for “last mile” data processing. What do you do with the output dictionary? Is it then used as input to some other tasks?

As for the code snippet that you shared, the problem is that you cannot call spark.xxx from within anything wrapped in the ray.remote decorator, since the SparkContext can only be used on the driver, not on the workers.
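If you keep the Spark calls on the driver, you can still hand the reduced result to Ray as plain Python data. A minimal sketch, assuming a Databricks notebook where spark is already defined (the path, column name, and the to_payload helper are placeholders of mine, not from your code):

import ray

# Run the Spark query on the driver, where the SparkContext lives.
df = spark.read.parquet("<Azure blob parquet file path>")
values = [row[0] for row in df.select("<column name>").distinct().collect()]

@ray.remote
def to_payload(vals):
    # Only plain, picklable Python objects cross the Ray task boundary.
    # The dictionary shape here is a guess; adapt it to what you need.
    return {"<column name>": vals}

result = ray.get(to_payload.remote(values))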

Hi @xwjiang2010

Thank you for the response. I am running a Python app on an Azure Databricks Spark cluster that reads data from Azure blob storage into a DataFrame and performs various operations on it. The Python app uses React as its UI. In order to pass data to React, I convert the output of every query run on the Spark cluster into JSON. Converting the DataFrame to a dictionary after performing operations on it is what takes the time, so I was wondering if there is a better way of executing this logic on the Spark cluster from the Python app.
In my scenario, would I be able to run all the subsequent operations on the Spark DataFrame through ray.remote to make them execute faster?

Thanks
Sumithra Subramanian

@Sumithra_S Yeah, I think just decorating with @ray.remote won’t work, since the SparkContext can only be used on the driver.

Having a React frontend for a Spark application sounds like a pretty common pattern. Have you checked out how other people architect it?
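One thing that might help regardless of Ray: if the reduction (the distinct) happens in Spark first, only the small result has to be collected and serialized, so the dictionary/JSON conversion itself should stay cheap. A rough sketch, with the column name as a placeholder:

import json

small = df.select("<column name>").distinct()  # the reduction runs on the cluster
rows = small.collect()                         # only the small result reaches the driver
payload = json.dumps({"<column name>": [r[0] for r in rows]})  # JSON for the React UI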

Hi @xwjiang2010 ,

Got it. And yeah, I have been trying to get help on other forums as well, but haven’t found a solution yet. So I was thinking I could do the same operation with a Ray Dataset if Spark doesn’t work out.
Could you please point me to some Ray documentation on how to read files from Azure Blob Storage into a Ray Dataset and perform the same operations on it?

Thanks in advance

Yeah sure. I think that is a direction worth looking into.

There are some materials on Ray Datasets in the docs; please take a look.
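For example, here is a rough sketch of what reading that parquet file into a Ray Dataset could look like. This assumes adlfs and pyarrow are installed on the cluster; the account, key, container, path, and column name are all placeholders, not anything verified against your setup:

import ray
import adlfs
import pyarrow.fs

# Wrap the fsspec Azure filesystem so Arrow / Ray Data can use it.
abfs = adlfs.AzureBlobFileSystem(account_name="<account>", account_key="<key>")
fs = pyarrow.fs.PyFileSystem(pyarrow.fs.FSSpecHandler(abfs))

ds = ray.data.read_parquet(
    "<container>/<path to parquet>",
    filesystem=fs,
    columns=["<column name>"],  # read only the column you need
)

# Recent Ray releases expose Dataset.unique for the distinct step; on older
# versions you would go through a groupby or to_pandas() instead.
distinct_vals = ds.unique("<column name>")
result = {"<column name>": distinct_vals}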

Thank you @xwjiang2010 🙂 I will try this!