RayDMatrix reorders the columns of a dataframe, particularly when the label argument is provided. Here is a minimal example:
(Pdb) df = pd.DataFrame(np.random.randn(5, 4), columns=['1','2','3','10'])
(Pdb) df
          1         2         3        10
0 -0.591416 -0.052763 -1.406966 -1.590726
1 -0.056861 -1.206818 -0.337770 -2.061194
2  0.716264 -0.934042  1.241450 -0.099843
3 -0.346618  0.603396 -1.095848 -0.758888
4 -1.293752 -0.684838  0.206635 -0.549543
(Pdb) dset = RayDMatrix(df, label='3', num_actors=1)
(Pdb) dset.get_data(0)
{'data':           1        10         2
0 -0.591416 -1.590726 -0.052763
1 -0.056861 -2.061194 -1.206818
2  0.716264 -0.099843 -0.934042
3 -0.346618 -0.758888  0.603396
4 -1.293752 -0.549543 -0.684838,
'label': 0   -1.406966
1   -0.337770
2    1.241450
3   -1.095848
4    0.206635
Name: 3, dtype: float64, 'weight': None, 'base_margin': None, 'label_lower_bound': None, 'label_upper_bound': None}
The columns of data are re-ordered to [1, 10, 2]. They should be in the original order [1, 2, 10].
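This is possibly happening because of this line: x = x[x.columns.difference(exclude_cols)]. pandas Index.difference tries to sort its result by default, and the string column names sort lexicographically as '1', '10', '2', which matches the order seen above.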
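To check the suspicion in isolation, here is a minimal sketch using pandas only, with no RayDMatrix involved. The variable exclude_cols is taken from the quoted line, but its value here (just the label column) is an assumption, as are the names sorted_cols and kept_cols. The boolean-mask selection is one possible order-preserving alternative, not necessarily how the library would choose to fix it.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(5, 4), columns=["1", "2", "3", "10"])
exclude_cols = ["3"]  # assumption: only the label column is excluded

# Index.difference attempts to sort its result by default (sort=None),
# so string column names come back in lexicographic order.
sorted_cols = list(df.columns.difference(exclude_cols))

# A boolean mask keeps the remaining columns in their original order.
kept_cols = list(df.columns[~df.columns.isin(exclude_cols)])

print(sorted_cols)
print(kept_cols)
```

If the first list comes back reordered and the second does not, the reordering comes from the column selection rather than from anything specific to the label handling.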
Can someone please help me understand if I am doing something wrong here?