How to use Ray Tune for Stratified K-Fold Cross-validation on my binary image classification model?

Hi, this is my first time trying to use Ray Tune to tune the hyperparameters of my binary image classification model. I’ve completed training with a stratified 5-fold cross-validation scheme, which leaves me with a total of five models, one per fold. For each fold, I train for about 10 epochs and select the best model for that fold based on the validation metric (F1 score); that’s how I end up with 5 models.
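
Roughly, my per-fold loop looks like the simplified sketch below (build_model, train_one_epoch, and evaluate_f1 are placeholders for my actual PyTorch code):

import torch

for fold in range(5):
    model = build_model()             # placeholder: builds my binary classifier
    best_f1 = 0.0
    for epoch in range(10):
        train_one_epoch(model, fold)  # placeholder: train on rows outside this fold
        f1 = evaluate_f1(model, fold) # placeholder: F1 on this fold's validation rows
        if f1 > best_f1:              # keep the best checkpoint per fold
            best_f1 = f1
            torch.save(model.state_dict(), f"best_fold_{fold}.pt")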

Now, how do I perform hyperparameter tuning with the same stratified 5-fold cross-validation scheme? I’ve looked everywhere for a PyTorch example, but couldn’t find any that covers k-fold cross-validation. I’d appreciate any help, thank you so much!

Hi, thanks for the question.
How are you building the datasets for each fold today?
Are they static datasets, or are they created dynamically at the start of each run?

Are the non-hp-tuning runs done using Ray Tune?

Just trying to understand the problem a bit more. Thanks.

Thanks for the reply. The folds come from stratified 5-fold CV, but they are fixed by setting the random seed, so they can be viewed as static datasets even though they’re “random”. I’m not really sure what you mean by “non-hp-tuning runs”; I haven’t used Ray Tune in any way yet. Any help would be appreciated, thanks!

Cool. This is exactly what I was curious about.
Which framework did you use for your 5-fold CV run then? Is it scikit-learn?
Before we start talking about using Tune to HP-tune your job, we need to replicate your single-run job with a Ray library somehow.
It could be through Ray AIR Trainers,
https://docs.ray.io/en/latest/train/api/api.html
or it could be a single trainable function that AIR can launch.
Once we get the single run working on Ray, we can easily adapt it into a multi-run tuning job; see the sketch below.
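
As a rough example, a minimal sketch of the trainable-function route could look like this (assuming Ray AIR ~2.x; train_one_fold is a stand-in for your real PyTorch training code):

from ray.air import session
from ray.air.config import ScalingConfig
from ray.train.torch import TorchTrainer

def train_one_fold(fold, lr, epoch):
    # Stand-in for your real per-epoch training + validation logic.
    # It should return the validation F1 for this epoch.
    return 0.0

def train_loop_per_worker(config):
    # config carries the hyperparameters: fixed values for now,
    # search spaces once we move to Tune.
    for epoch in range(config["epochs"]):
        f1 = train_one_fold(config["fold"], config["lr"], epoch)
        session.report({"f1": f1, "epoch": epoch})  # surface metrics to Ray

trainer = TorchTrainer(
    train_loop_per_worker,
    train_loop_config={"fold": 0, "lr": 1e-3, "epochs": 10},
    scaling_config=ScalingConfig(num_workers=1),
)
result = trainer.fit()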

Yes, I use the code below, which relies on sklearn’s StratifiedKFold. By a single-run job, do you mean I should first run the exact same training code I have, just with Ray AIR Trainers? I can go ahead and do that; are there any good examples out there I can follow? My code is based on PyTorch. Thanks for your quick replies and help!

import pandas as pd
from sklearn.model_selection import StratifiedKFold

new_df_train = train_df.copy(deep=True)
strat_kfold = StratifiedKFold(shuffle=True, random_state=42)  # default n_splits=5; random_state for reproducibility

# Split on white and non-white labels and add a new 'fold' column:
for each_fold, (train_idx, val_idx) in enumerate(strat_kfold.split(X=new_df_train, y=new_df_train['label'])):
    new_df_train.loc[val_idx, 'fold'] = int(each_fold)  # fold number for each row (0 to 4)

@gjoliver Hi, I’d appreciate it if you could get to this whenever you can 🙂 thank you so much!

Hi,

Here are some hints.

You should be able to write your sklearn job using Ray’s SklearnTrainer, described here:
https://docs.ray.io/en/latest/train/api/doc/ray.train.sklearn.SklearnTrainer.html
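
For instance, a rough sketch of what that could look like (a hedged example, not your exact setup: it assumes an sklearn estimator, a Ray Dataset built from a pandas DataFrame, and that SklearnTrainer's cv parameter accepts an sklearn CV splitter):

import pandas as pd
import ray
from ray.train.sklearn import SklearnTrainer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold

# Toy data standing in for your real DataFrame with a 'label' column.
df = pd.DataFrame({"feature": list(range(100)), "label": [0, 1] * 50})
train_ds = ray.data.from_pandas(df)

trainer = SklearnTrainer(
    estimator=RandomForestClassifier(),
    label_column="label",
    datasets={"train": train_ds},
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
)
result = trainer.fit()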

To create a stratified train/test split using Ray Datasets, you can take a look at the following example:

from typing import Tuple

import pandas as pd
from ray.data import Dataset

def split_stratify(
    ds: Dataset,
    stratify: str,
    test_size: float = 0.2,
) -> Tuple[Dataset, Dataset]:
    '''
    Splits a dataset into training and testing subsets according to a
    specified test size and a stratification column.

    Args:
        ds: The dataset to split.
        stratify: The name of a column in the dataset to use for stratification.
        test_size: The proportion of the dataset to use for testing, between 0 and 1. Defaults to 0.2.

    Returns:
        Tuple[Dataset, Dataset]: A tuple of two datasets: the training subset and the testing subset.
    '''
    def split(df: pd.DataFrame) -> pd.DataFrame:
        "Label the rows of one stratum as train or test."
        # Mark everything 'train' first.
        df['__c'] = "train"
        # Mark the tail of the stratum as 'test'.
        df.loc[df.tail(int(len(df) * test_size)).index, '__c'] = "test"
        return df

    splitted = ds.groupby(stratify).map_groups(split)

    def take(split_name: str) -> Dataset:
        return splitted.filter(
            lambda x: x['__c'] == split_name
        ).drop_columns(['__c'])

    return take("train"), take("test")

train, test = split_stratify(ds, 'F1')

print(train.to_pandas().groupby('F1').count())

Let’s start from here and see if you can get a single stratified training job up using Ray Train.
Once we do that, it would be relatively straightforward to introduce HP searching on top of it; the sketch below shows roughly where that is headed.
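
(A hedged sketch, assuming the Ray 2.x Tuner API; an AIR Trainer like the TorchTrainer above can also be passed to Tuner in place of the function:)

from ray import tune
from ray.air import session

def trainable(config):
    # Stand-in for the single-run training logic; it should report
    # the validation metric you care about.
    session.report({"f1": 0.0})

tuner = tune.Tuner(
    trainable,
    param_space={"lr": tune.loguniform(1e-5, 1e-2)},
    tune_config=tune.TuneConfig(metric="f1", mode="max", num_samples=20),
)
results = tuner.fit()
print(results.get_best_result().config)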

@gjoliver Please excuse me if I’m wrong or misunderstanding, but Ray Train’s SklearnTrainer runs the fit method of sklearn estimators, and it seems like the stratification can be done through the “cv” parameter. However, I’m not sure how that applies here, since I’m not using any sklearn estimators to train my DL model.

For the stratification: I already have the dataset stratified into five folds for cross-validation and saved from training (the folds are numbered 0 to 4, and for a given fold number I use the rows where fold == fold_number for validation and fold != fold_number for training, so a natural 80-20 split). Can I ideally just run the Ray Tuner five separate times, once per fold? Something like the sketch below is what I have in mind.
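
(A hypothetical sketch: train_fn stands in for my training function, which would use the saved fold column to pick the train/validation rows:)

from ray import tune
from ray.air import session

def train_fn(config, fold=0):
    # Stand-in for my real training: train on rows where df.fold != fold,
    # validate on rows where df.fold == fold, and report the validation F1.
    session.report({"f1": 0.0})

for fold in range(5):  # one tuning run per fold
    tuner = tune.Tuner(
        tune.with_parameters(train_fn, fold=fold),
        param_space={"lr": tune.loguniform(1e-5, 1e-2)},
        tune_config=tune.TuneConfig(metric="f1", mode="max", num_samples=10),
    )
    results = tuner.fit()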

@gjoliver Hi, I’m so sorry if I’m bothering you, but I would appreciate some guidance. Thank you so much!