High GPU Memory (DeepSpeed+HuggingFace+Ray)

1. Severity of the issue: (select one)
None: I’m just curious or want clarification.
Low: Annoying but doesn’t hinder my work.
Medium: Significantly affects my productivity but can find a workaround.
High: Completely blocks me.

2. Environment:

  • Ray version: 2.50.0
  • Python version: 3.11.10
  • transformers: 4.57.1
  • trl: 0.25.1

3. What happened vs. what you expected:

  • Expected: I would like to know if this situation is normal.
  • Actual: When training with code similar to the official example (DeepSpeed + HuggingFace + Ray), GPU memory usage is higher than normal. With {num_workers=2, resources_per_worker={"GPU": 1, "CPU": 4}}, VRAM consumption is 23 GB on each of the 2 GPUs. Running the same code without Ray, VRAM consumption is 13 GB on each of the 2 GPUs. I initially suspected a data-loading issue, specifically that the data might be duplicated in memory during loading. I tried both the official data example train.get_dataset_shard("train") and the common load_dataset approach, but got the same result.
  • Meanwhile, training is also slower (4.5 minutes per epoch vs. 3.5 minutes per epoch without Ray). Given the added overhead from callbacks and communication, I can accept that for now.
```python
from datasets import load_dataset, Dataset
import ray.data


def load_data(self):
    # ...
    data: Dataset = load_dataset("json", data_files=self.data_path, split="train")  # type: ignore
    data = data.map(_format_conversations, remove_columns=data.column_names, batched=True, batch_size=16)
    # ...
    # I also tried this:
    ray_datasets = {"train": ray.data.from_huggingface(data)}
    return ray_datasets
```
```python
from trl import SFTTrainer
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer
from ray.train.huggingface.transformers import prepare_trainer


def train_func(self):
    # ...
    dataset = self.load_data()
    trainer = SFTTrainer(
        train_dataset=dataset,
        # ... other arguments elided ...
    )
    trainer = prepare_trainer(trainer)
    trainer.train()


def ray_train(self):
    trainer = TorchTrainer(
        train_loop_per_worker=self.train_func,
        scaling_config=ScalingConfig(
            num_workers=2,
            use_gpu=True,
            accelerator_type="A100",
            resources_per_worker={"GPU": 1, "CPU": 4},
            placement_strategy="STRICT_PACK",
        ),
        # ... other arguments elided ...
    )
```
  • I also tried the official data-loading example, train.get_dataset_shard("train"):
```python
from ray import train as raytrain

train_ds = raytrain.get_dataset_shard("train")
eval_ds = raytrain.get_dataset_shard("validation")
train_ds_iterable = train_ds.iter_torch_batches(
    batch_size=self.trainParameter.batch_size,
    local_shuffle_buffer_size=raytrain.get_context().get_world_size() * self.trainParameter.batch_size,
)
eval_ds_iterable = eval_ds.iter_torch_batches(batch_size=2)
# ...
trainer = SFTTrainer(
    self.model,
    train_dataset=train_ds_iterable,
    eval_dataset=eval_ds_iterable,
    # ... other arguments elided ...
)
```
  • This approach resulted in the same memory usage.
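For reference, here is a minimal sketch of how the Ray Datasets are typically attached to the TorchTrainer on the driver side so that get_dataset_shard("train") / get_dataset_shard("validation") can resolve inside each worker. The names train_func and ray_datasets are placeholders standing in for the train_loop_per_worker and the dict returned by load_data in the snippets above.

```python
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer


def build_trainer(train_func, ray_datasets):
    # Placeholders: `train_func` is the train_loop_per_worker and `ray_datasets`
    # is the {"train": ..., "validation": ...} dict from the snippets above.
    return TorchTrainer(
        train_loop_per_worker=train_func,
        scaling_config=ScalingConfig(
            num_workers=2,
            use_gpu=True,
            resources_per_worker={"GPU": 1, "CPU": 4},
        ),
        # Passing the Ray Datasets here is what lets
        # ray.train.get_dataset_shard(...) find them inside the workers.
        datasets=ray_datasets,
    )


# e.g. build_trainer(self.train_func, self.load_data()).fit()
```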

It seems likely that higher GPU memory usage when using Ray Train with DeepSpeed and HuggingFace, compared to standalone training, is not unexpected. Ray introduces additional overhead from distributed coordination, data sharding, and possibly duplicated model or data objects in each worker process. This can result in each worker consuming more VRAM than in a non-Ray setup, even when using recommended data loading patterns like train.get_dataset_shard("train") or ray.data.from_huggingface(data). The observed slower training speed is also consistent with the extra communication and callback overhead of distributed Ray jobs, as noted in your description.

There is no clear evidence in the official documentation or examples that this is a bug; rather, it appears to be a known tradeoff of distributed training with Ray, especially when using DeepSpeed and HuggingFace. If memory usage is a concern, you might try reducing batch size, tuning DeepSpeed config, or profiling memory to ensure no unnecessary duplication, but some increase in VRAM usage is expected. For more on Ray’s memory usage and troubleshooting, see Ray memory debugging guide and Ray Data performance tips. Would you like a step-by-step breakdown or suggestions for further memory optimization?
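To make the profiling suggestion concrete, below is a minimal sketch of a memory-logging callback (a hypothetical helper, not part of the original code) that prints PyTorch's allocated vs. reserved CUDA memory per rank. nvidia-smi reports roughly the reserved pool plus the CUDA context, so comparing allocated against reserved helps distinguish genuine duplication of model or optimizer state from caching-allocator overhead.

```python
import torch
from transformers import TrainerCallback


class GpuMemoryCallback(TrainerCallback):
    """Hypothetical helper: log per-rank CUDA memory every `log_every` steps."""

    def __init__(self, log_every: int = 50):
        self.log_every = log_every

    def on_step_end(self, args, state, control, **kwargs):
        if torch.cuda.is_available() and state.global_step % self.log_every == 0:
            gib = 1024 ** 3
            allocated = torch.cuda.memory_allocated() / gib   # live tensors
            reserved = torch.cuda.memory_reserved() / gib     # allocator pool (close to nvidia-smi)
            peak = torch.cuda.max_memory_allocated() / gib    # high-water mark of live tensors
            print(
                f"[rank {args.local_rank}] step {state.global_step}: "
                f"allocated={allocated:.1f} GiB, reserved={reserved:.1f} GiB, peak={peak:.1f} GiB"
            )
        return control
```

Attaching it with trainer.add_callback(GpuMemoryCallback()) before trainer.train() should work, since SFTTrainer inherits from the Transformers Trainer.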


My DeepSpeed config is as follows:

```json
{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "zero_allow_untested_optimizer": true,
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "bf16": {
    "enabled": "auto"
  },
  "zero_stage": 2,
  "deepspeed_multinode_launcher": "standard",
  "zero3_init_flag": false,
  "zero_optimization": {
    "stage": 2,
    "allgather_partitions": true,
    "allgather_bucket_size": 5e8,
    "overlap_comm": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 5e8,
    "contiguous_gradients": true,
    "round_robin_gradients": true
  }
}
```
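In case it is useful, here is a minimal sketch of how a config like the one above is typically handed to the trainer. It assumes the JSON is saved as ds_config.json (a hypothetical path) and uses the deepspeed argument of SFTConfig/TrainingArguments; the "auto" fields (batch sizes, gradient accumulation, fp16/bf16) are then resolved from these training arguments by the HuggingFace DeepSpeed integration.

```python
from trl import SFTConfig, SFTTrainer

# "ds_config.json" is a hypothetical path to the DeepSpeed config shown above.
training_args = SFTConfig(
    output_dir="./output",
    per_device_train_batch_size=2,   # fills train_micro_batch_size_per_gpu="auto"
    gradient_accumulation_steps=4,   # fills gradient_accumulation_steps="auto"
    bf16=True,                       # fills bf16.enabled="auto"
    deepspeed="ds_config.json",
)

trainer = SFTTrainer(
    model=self.model,                # placeholders from the snippets above
    args=training_args,
    train_dataset=train_ds_iterable,
)
```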