High GPU Memory (DeepSpeed+HuggingFace+Ray)

1. Severity of the issue: (select one)
None: I’m just curious or want clarification.
Low: Annoying but doesn’t hinder my work.
Medium: Significantly affects my productivity but can find a workaround.
High: Completely blocks me.

2. Environment:

  • Ray version: 2.50.0
  • Python version: 3.11.10
  • transformers: 4.57.1
  • trl: 0.25.1

3. What happened vs. what you expected:

  • Expected: I would like to know if this situation is normal.
  • Actual: When training with code similar to the official example (DeepSpeed + HuggingFace + Ray), GPU memory usage is higher than normal. With {num_workers=2, resources_per_worker={"GPU": 1, "CPU": 4}}, VRAM consumption is 23 GB on each of the 2 GPUs. Running the same code without Ray, VRAM consumption is 13 GB on each of the 2 GPUs. I initially suspected a data-loading issue, specifically that the data might be duplicated in memory during loading. I tried both the official data example train.get_dataset_shard("train") and the common load_dataset approach, but got the same result.
  • Meanwhile, training is also slower (4.5 minutes per epoch vs. 3.5 minutes per epoch without Ray). Given the added overhead from callbacks and communication, I can accept that for now.
```python
from datasets import load_dataset, Dataset
import ray.data


def load_data(self):
    # ...
    data: Dataset = load_dataset("json", data_files=self.data_path, split="train")  # type: ignore
    data = data.map(_format_conversations, remove_columns=data.column_names, batched=True, batch_size=16)
    # ...
    # I also tried this:
    ray_datasets = {"train": ray.data.from_huggingface(data)}
    return ray_datasets
```
```python
from trl import SFTTrainer
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer
from ray.train.huggingface.transformers import prepare_trainer


def train_func(self):
    # ...
    dataset = self.load_data()
    trainer = SFTTrainer(
        train_dataset=dataset,
        # ... other arguments elided ...
    )
    trainer = prepare_trainer(trainer)
    trainer.train()


def ray_train(self):
    trainer = TorchTrainer(
        train_loop_per_worker=self.train_func,
        scaling_config=ScalingConfig(
            num_workers=2,
            use_gpu=True,
            accelerator_type="A100",
            resources_per_worker={"GPU": 1, "CPU": 4},
            placement_strategy="STRICT_PACK",
        ),
        # ... other arguments elided ...
    )
```
  • I also tried the official data-loading example, train.get_dataset_shard("train"):
```python
from ray import train as raytrain

train_ds = raytrain.get_dataset_shard("train")
eval_ds = raytrain.get_dataset_shard("validation")
train_ds_iterable = train_ds.iter_torch_batches(
    batch_size=self.trainParameter.batch_size,
    local_shuffle_buffer_size=raytrain.get_context().get_world_size() * self.trainParameter.batch_size,
)
eval_ds_iterable = eval_ds.iter_torch_batches(batch_size=2)
# ...
trainer = SFTTrainer(
    self.model,
    train_dataset=train_ds_iterable,
    eval_dataset=eval_ds_iterable,
    # ... other arguments elided ...
)
```
  • This approach resulted in the same memory usage.
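For reference, here is a minimal sketch of how the Ray Datasets are typically attached to the TorchTrainer on the driver side so that get_dataset_shard("train") / get_dataset_shard("validation") can resolve inside each worker. The names train_func and ray_datasets are placeholders standing in for the train_loop_per_worker and the dict returned by load_data in the snippets above.

```python
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer


def build_trainer(train_func, ray_datasets):
    # Placeholders: `train_func` is the train_loop_per_worker and `ray_datasets`
    # is the {"train": ..., "validation": ...} dict from the snippets above.
    return TorchTrainer(
        train_loop_per_worker=train_func,
        scaling_config=ScalingConfig(
            num_workers=2,
            use_gpu=True,
            resources_per_worker={"GPU": 1, "CPU": 4},
        ),
        # Passing the Ray Datasets here is what lets
        # ray.train.get_dataset_shard(...) find them inside the workers.
        datasets=ray_datasets,
    )


# e.g. build_trainer(self.train_func, self.load_data()).fit()
```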

It seems likely that higher GPU memory usage when using Ray Train with DeepSpeed and HuggingFace, compared to standalone training, is not unexpected. Ray introduces additional overhead from distributed coordination, data sharding, and possibly duplicated model or data objects in each worker process. This can result in each worker consuming more VRAM than in a non-Ray setup, even when using recommended data loading patterns like train.get_dataset_shard("train") or ray.data.from_huggingface(data). The observed slower training speed is also consistent with the extra communication and callback overhead of distributed Ray jobs, as noted in your description.

There is no clear evidence in the official documentation or examples that this is a bug; rather, it appears to be a known tradeoff of distributed training with Ray, especially when using DeepSpeed and HuggingFace. If memory usage is a concern, you might try reducing batch size, tuning DeepSpeed config, or profiling memory to ensure no unnecessary duplication, but some increase in VRAM usage is expected. For more on Ray’s memory usage and troubleshooting, see Ray memory debugging guide and Ray Data performance tips. Would you like a step-by-step breakdown or suggestions for further memory optimization?
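To make the profiling suggestion concrete, below is a minimal sketch of a memory-logging callback (a hypothetical helper, not part of the original code) that prints PyTorch's allocated vs. reserved CUDA memory per rank. nvidia-smi reports roughly the reserved pool plus the CUDA context, so comparing allocated against reserved helps distinguish genuine duplication of model or optimizer state from caching-allocator overhead.

```python
import torch
from transformers import TrainerCallback


class GpuMemoryCallback(TrainerCallback):
    """Hypothetical helper: log per-rank CUDA memory every `log_every` steps."""

    def __init__(self, log_every: int = 50):
        self.log_every = log_every

    def on_step_end(self, args, state, control, **kwargs):
        if torch.cuda.is_available() and state.global_step % self.log_every == 0:
            gib = 1024 ** 3
            allocated = torch.cuda.memory_allocated() / gib   # live tensors
            reserved = torch.cuda.memory_reserved() / gib     # allocator pool (close to nvidia-smi)
            peak = torch.cuda.max_memory_allocated() / gib    # high-water mark of live tensors
            print(
                f"[rank {args.local_rank}] step {state.global_step}: "
                f"allocated={allocated:.1f} GiB, reserved={reserved:.1f} GiB, peak={peak:.1f} GiB"
            )
        return control
```

Attaching it with trainer.add_callback(GpuMemoryCallback()) before trainer.train() should work, since SFTTrainer inherits from the Transformers Trainer.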


My DeepSpeed config is as follows:

```json
{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "zero_allow_untested_optimizer": true,
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "bf16": {
    "enabled": "auto"
  },
  "zero_stage": 2,
  "deepspeed_multinode_launcher": "standard",
  "zero3_init_flag": false,
  "zero_optimization": {
    "stage": 2,
    "allgather_partitions": true,
    "allgather_bucket_size": 5e8,
    "overlap_comm": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 5e8,
    "contiguous_gradients": true,
    "round_robin_gradients": true
  }
}
```
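In case it is useful, here is a minimal sketch of how a config like the one above is typically handed to the trainer. It assumes the JSON is saved as ds_config.json (a hypothetical path) and uses the deepspeed argument of SFTConfig/TrainingArguments; the "auto" fields (batch sizes, gradient accumulation, fp16/bf16) are then resolved from these training arguments by the HuggingFace DeepSpeed integration.

```python
from trl import SFTConfig, SFTTrainer

# "ds_config.json" is a hypothetical path to the DeepSpeed config shown above.
training_args = SFTConfig(
    output_dir="./output",
    per_device_train_batch_size=2,   # fills train_micro_batch_size_per_gpu="auto"
    gradient_accumulation_steps=4,   # fills gradient_accumulation_steps="auto"
    bf16=True,                       # fills bf16.enabled="auto"
    deepspeed="ds_config.json",
)

trainer = SFTTrainer(
    model=self.model,                # placeholders from the snippets above
    args=training_args,
    train_dataset=train_ds_iterable,
)
```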