RayDMatrix reordering dataframe columns

Arindam_Jati · October 29, 2021, 4:54am

RayDMatrix reorders the columns of a dataframe, particularly when the label argument is provided. Here is a minimal example:

(Pdb) df = pd.DataFrame(np.random.randn(5, 4), columns=['1','2','3','10'])

(Pdb) df
          1         2         3        10
0 -0.591416 -0.052763 -1.406966 -1.590726
1 -0.056861 -1.206818 -0.337770 -2.061194
2  0.716264 -0.934042  1.241450 -0.099843
3 -0.346618  0.603396 -1.095848 -0.758888
4 -1.293752 -0.684838  0.206635 -0.549543

(Pdb) dset = RayDMatrix(df, label='3', num_actors=1)

(Pdb) dset.get_data(0)
{'data':           1        10         2
0 -0.591416 -1.590726 -0.052763
1 -0.056861 -2.061194 -1.206818
2  0.716264 -0.099843 -0.934042
3 -0.346618 -0.758888  0.603396
4 -1.293752 -0.549543 -0.684838, 
'label': 0   -1.406966
1   -0.337770
2    1.241450
3   -1.095848
4    0.206635
Name: 3, dtype: float64, 'weight': None, 'base_margin': None, 'label_lower_bound': None, 'label_upper_bound': None}

The columns of data are reordered. They should be in the original order [1, 2, 10].
This is possibly happening because of the line x = x[x.columns.difference(exclude_cols)] in the xgboost_ray source.
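
The suspected cause can be reproduced with pandas alone, without Ray. The sketch below assumes that exclude_cols holds just the label column; that is a guess about what the library passes, not something confirmed from its source. pandas' Index.difference sorts its result by default, and because the column labels here are strings, the sort is lexicographic, which puts '10' before '2'.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(5, 4), columns=['1', '2', '3', '10'])

# Assumption for illustration: only the label column is excluded.
exclude_cols = ['3']

# Index.difference attempts to sort its result by default. The labels are
# strings, so they are compared character by character: '10' < '2'.
remaining = df.columns.difference(exclude_cols)
reordered = list(df[remaining].columns)

print(reordered)  # ['1', '10', '2']
```

This matches the column order seen in the dset.get_data(0) output above.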

Can someone please help me understand if I am doing something wrong here?

amogkam · November 5, 2021, 6:16am

Hey @Arindam_Jati, thanks for pointing this out!

This should be fixed on master once Pull Request #170 in ray-project/xgboost_ray ("Keep ordering of columns on conversion to RayDMatrix") is merged!
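
For anyone who needs to split features and label by hand in the meantime, here is a minimal order-preserving sketch in plain pandas. The helper name drop_columns_keep_order is hypothetical, and this is not necessarily how the pull request implements the fix; it only illustrates selecting columns without sorting them.

```python
import numpy as np
import pandas as pd


def drop_columns_keep_order(frame, exclude_cols):
    """Return frame without exclude_cols, keeping the original column order.

    Hypothetical helper for illustration; not part of xgboost_ray.
    """
    excluded = set(exclude_cols)
    # Iterate over the columns in their existing order instead of using a
    # set-style difference, which would sort the result.
    keep = [col for col in frame.columns if col not in excluded]
    return frame[keep]


df = pd.DataFrame(np.random.randn(5, 4), columns=['1', '2', '3', '10'])
features = drop_columns_keep_order(df, ['3'])
label = df['3']

print(list(features.columns))  # ['1', '2', '10']
```

The list comprehension walks df.columns in place, so the result keeps whatever order the dataframe already had, regardless of whether the labels are strings or integers.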
