Is `randomize_block_order` useful if you're doing a single training run?

I’ve been reading through the source code of Ray Train / Ray Data to get a clearer understanding of what is happening.

One thing I noticed is that, by default, the trainers set the `randomize_block_order` parameter to `True`.

From looking at what this does, my assumption is that it's useful if you are running multiple separate training runs on the same dataset.

But, presumably, if you are running a single training run (with, for example, 10 GPUs), this parameter will have no effect, since each block should only be fetched roughly once? (The dataset is split among the workers.)

Is my intuition correct?

Hey @Vedant_Roy,

It’s true that each block will only be fetched once. However, `randomize_block_order` still has an effect: it injects some randomness into your training job:

# without randomize_block_order
dataset: [b1, b2, b3, b4]
worker1: [b1, b2]
worker2: [b3, b4]

# with randomize_block_order
dataset: [b3, b1, b4, b2]
worker1: [b3, b1]
worker2: [b4, b2]

You can think of it as a lightweight, block-level shuffle.
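
If you want to see the effect for yourself outside of Ray Train, here is a minimal sketch using plain Ray Data. It relies on the public `Dataset.randomize_block_order()`, `repartition()`, `split()`, and `take_all()` methods (exact behavior may vary slightly across Ray versions); the 4-block layout is just for illustration and is an assumption, not something prescribed by the trainers.

```python
# Minimal sketch (not the Ray Train internals): shuffle block order, then
# split the dataset across two simulated workers.
import ray

# 8 rows repartitioned into 4 blocks of 2 rows each.
ds = ray.data.range(8).repartition(4)

# Without randomize_block_order: workers get blocks in the original order.
plain_shards = ds.split(2)

# With randomize_block_order: same blocks, randomly reordered before the split.
shuffled_shards = ds.randomize_block_order().split(2)

for i, shard in enumerate(shuffled_shards):
    print(f"worker{i + 1}:", shard.take_all())
```

Running the shuffled version repeatedly should print a different block-to-worker assignment each time, while the rows inside each block keep their original order; that's why it's only a block-level (not row-level) shuffle.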
