hello:
I am using Ray to compute indicator statistics (summary statistics) on my data, but I found that Ray cannot rename the index column the way pandas can. How should I deal with this situation?
Moving to Ray AIR category.
Hi @839576266 - could you provide some example code for pandas? What do you want to achieve in Ray Data?
Thank you for your reply.
I want to use Ray to compute indicator statistics on my data, but when the data is saved, I find that a Ray Dataset cannot output a column containing the indicator names (such as sum and mean) the way pandas does:
import pandas as pd
dct = {'x_0': [1, 2, 3], 'x_1': [4, 5, 6]}  # renamed from `dict` to avoid shadowing the built-in
df = pd.DataFrame(dct)
print(df.describe())
Hi @839576266 , currently when creating a Ray Dataset from an existing Pandas DataFrame (ray.data.from_pandas), the resulting Dataset does not carry over the index column.
For the code you are writing, is it a requirement that the indicator names from the Pandas DataFrame must stay in the index? Or is it possible to move it out into a column with pd.DataFrame.reset_index()? If this is OK, then you could accomplish your desired result with something like:
>>> import pandas as pd
>>> dct = {'x_0': [1, 2, 3], 'x_1': [4, 5, 6]}
>>> df = pd.DataFrame(dct)
>>> df_summary = df.describe().reset_index()
>>> df_summary
   index  x_0  x_1
0  count  3.0  3.0
1   mean  2.0  5.0
2    std  1.0  1.0
3    min  1.0  4.0
4    25%  1.5  4.5
5    50%  2.0  5.0
6    75%  2.5  5.5
7    max  3.0  6.0
>>> import ray
>>> ds_summary = ray.data.from_pandas(df_summary)
>>> ds_summary.schema()
PandasBlockSchema(names=['index', 'x_0', 'x_1'], types=[dtype('O'), dtype('float64'), dtype('float64')])
>>> ds_summary.take_all()
[{'index': 'count', 'x_0': 3.0, 'x_1': 3.0}, {'index': 'mean', 'x_0': 2.0, 'x_1': 5.0}, {'index': 'std', 'x_0': 1.0, 'x_1': 1.0}, {'index': 'min', 'x_0': 1.0, 'x_1': 4.0}, {'index': '25%', 'x_0': 1.5, 'x_1': 4.5}, {'index': '50%', 'x_0': 2.0, 'x_1': 5.0}, {'index': '75%', 'x_0': 2.5, 'x_1': 5.5}, {'index': 'max', 'x_0': 3.0, 'x_1': 6.0}]
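If the default column name 'index' is not what you want, one possible variation is to name the index on the pandas side before resetting it. This is only a sketch using pandas; the column name "statistic" is a hypothetical choice, not something Ray requires:

```python
import pandas as pd

dct = {"x_0": [1, 2, 3], "x_1": [4, 5, 6]}
df = pd.DataFrame(dct)

# Give the index a name first, so reset_index() creates a column called
# "statistic" (hypothetical name) instead of the default "index".
df_summary = df.describe().rename_axis("statistic").reset_index()

print(df_summary.columns.tolist())
```

The resulting DataFrame can then be passed to ray.data.from_pandas in the same way as df_summary above.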
Thanks for your reply. I will try to solve my problem with your method.