1. Severity of the issue: (select one)
- None: I’m just curious or want clarification.
- Low: Annoying but doesn’t hinder my work.
- Medium: Significantly affects my productivity but can find a workaround.
- High: Completely blocks me.
2. Environment:
- Ray version: 2.50.0
- Python version: 3.11.10
- transformers: 4.57.1
- trl: 0.25.1
3. What happened vs. what you expected:
- Expected: Roughly the same VRAM usage as the same code run without Ray; I mainly want to know whether this behavior is normal.
- Actual: When training with code similar to the official example (DeepSpeed + Hugging Face + Ray), GPU memory usage is higher than normal. With `{num_workers=2, resources_per_worker={"GPU": 1, "CPU": 4}}`, VRAM consumption is 23GB on each of the two GPUs, whereas the same code without Ray consumes 13GB on each GPU (a per-worker memory-logging sketch is included at the end of this issue).
- I initially suspected a data-loading issue, specifically that the data might be duplicated in memory during the import phase. I tried both the official data example `train.get_dataset_shard("train")` and the commonly used `load_dataset` method, but obtained the same results (a quick dataset-size check is sketched after the first code block below).
- Meanwhile, training speed (4.5 minutes per epoch) is also slower than the normal setup (3.5 minutes per epoch). However, considering the added overhead from callbacks and communication, I can temporarily accept this.
```python
from datasets import load_dataset, Dataset
import ray

def load_data(self):
    ...
    data: Dataset = load_dataset("json", data_files=self.data_path, split="train")  # type: ignore
    data = data.map(_format_conversations, remove_columns=data.column_names, batched=True, batch_size=16)
    ...
    # I also tried this
    ray_datasets = {"train": ray.data.from_huggingface(data)}
    return ray_datasets
```
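
To rule out the data itself being duplicated or inflated during the conversion, a quick size check can be added right after `load_data` (a minimal sketch; the JSON path is illustrative, the rest uses standard `datasets` / Ray Data calls):

```python
import ray
from datasets import load_dataset

# Sanity check: compare the Hugging Face dataset with the Ray Dataset built
# from it, to rule out row duplication or size inflation during conversion.
hf_data = load_dataset("json", data_files="train.json", split="train")  # illustrative path
ray_ds = ray.data.from_huggingface(hf_data)

print("HF rows:", len(hf_data))
print("Ray rows:", ray_ds.count())               # should equal len(hf_data)
print("Ray size (bytes):", ray_ds.size_bytes())  # approximate materialized size
```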
```python
from trl import SFTTrainer
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer
from ray.train.huggingface.transformers import prepare_trainer

def train_func(self):
    ...
    dataset = self.load_data()
    trainer = SFTTrainer(train_dataset=dataset)  # ...other args elided
    trainer = prepare_trainer(trainer)
    trainer.train()

def ray_train(self):
    trainer = TorchTrainer(
        train_loop_per_worker=self.train_func,
        scaling_config=ScalingConfig(
            num_workers=2,
            use_gpu=True,
            accelerator_type="A100",
            resources_per_worker={"GPU": 1, "CPU": 4},
            placement_strategy="STRICT_PACK",
        ),
        # ...other args elided
    )
```
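
For the `get_dataset_shard` variant below, the `ray_datasets` dict has to reach the workers through `TorchTrainer`. A minimal wiring sketch, assuming the standard `datasets=` argument (the arguments elided above with `...` may differ):

```python
import ray
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer


def train_func():
    # Inside the worker, this reads the shard of the dataset that was
    # passed to TorchTrainer via the `datasets` argument below.
    train_ds = ray.train.get_dataset_shard("train")
    ...


ray_datasets = {"train": ray.data.from_items([{"text": "example"}])}  # placeholder data

trainer = TorchTrainer(
    train_loop_per_worker=train_func,
    scaling_config=ScalingConfig(num_workers=2, use_gpu=True, resources_per_worker={"GPU": 1, "CPU": 4}),
    datasets=ray_datasets,  # this is what get_dataset_shard("train") reads
)
result = trainer.fit()
```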
- I also tried the official data example, `train.get_dataset_shard("train")`, but obtained the same results:

```python
from ray import train as raytrain

train_ds = raytrain.get_dataset_shard("train")
eval_ds = raytrain.get_dataset_shard("validation")

train_ds_iterable = train_ds.iter_torch_batches(
    batch_size=self.trainParameter.batch_size,
    local_shuffle_buffer_size=raytrain.get_context().get_world_size() * self.trainParameter.batch_size,
)
eval_ds_iterable = eval_ds.iter_torch_batches(batch_size=2)
...

trainer = SFTTrainer(
    self.model,
    train_dataset=train_ds_iterable,
    eval_dataset=eval_ds_iterable,
)  # ...other args elided
```
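
To narrow down where the extra ~10GB per GPU comes from (model/optimizer state vs. data), a per-worker memory log can be added inside `train_func`; a minimal sketch (the `log_gpu_memory` helper is my own addition, not from the official example):

```python
import torch
import ray.train


def log_gpu_memory(tag: str) -> None:
    # Print allocated / reserved CUDA memory for this worker's GPU so the
    # Ray run can be compared against the plain DeepSpeed + transformers run.
    rank = ray.train.get_context().get_world_rank()
    allocated = torch.cuda.memory_allocated() / 1024**3
    reserved = torch.cuda.memory_reserved() / 1024**3
    print(f"[rank {rank}] {tag}: allocated={allocated:.1f} GiB, reserved={reserved:.1f} GiB")


# Inside train_func, e.g.:
#   log_gpu_memory("before trainer.train()")
#   trainer.train()
#   log_gpu_memory("after trainer.train()")
```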