Using a Ray cluster on Kubernetes and connecting from an external Jupyter notebook.
While running my notebook to fine-tune a Hugging Face model, the kernel is killed at the following step:
from ray.train.huggingface import HuggingFaceTrainer
from ray.air.config import ScalingConfig
from ray.data.preprocessors import Chain

trainer = HuggingFaceTrainer(
    trainer_init_per_worker=trainer_init_per_worker,
    trainer_init_config={
        "batch_size": 16,
        "epochs": 1,
    },
    scaling_config=ScalingConfig(
        num_workers=num_workers,
        use_gpu=use_gpu,
        resources_per_worker={"GPU": 1, "CPU": cpus_per_worker},
    ),
    datasets={"train": ray_datasets["train"], "evaluation": ray_datasets["validation"]},
    preprocessor=Chain(splitter, tokenizer),
)

results = trainer.fit()
trainer.fit() trains the model successfully, but at the end the kernel is killed after emitting this warning:
UserWarning: Ray Client is attempting to retrieve a 5.53 GiB object over the network, which may be slow. Consider serializing the object to a file and using S3 or rsync instead
I'm unable to find any docs that show how to apply the serialization/S3 approach the warning suggests.
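My best guess is that I need to point the trainer at shared or cloud storage so the large checkpoint isn't pulled back to the notebook over Ray Client, something like the sketch below. The bucket path and run name are just placeholders, and I'm not sure SyncConfig is the right mechanism here:

from ray.air.config import RunConfig
from ray.tune import SyncConfig

# My guess (not verified): upload results/checkpoints to S3 so the large
# checkpoint object is not shipped back to the notebook over Ray Client.
run_config = RunConfig(
    name="hf-finetune",                                                # arbitrary run name
    sync_config=SyncConfig(upload_dir="s3://my-bucket/ray-results"),   # placeholder bucket
)

trainer = HuggingFaceTrainer(
    trainer_init_per_worker=trainer_init_per_worker,
    trainer_init_config={"batch_size": 16, "epochs": 1},
    scaling_config=ScalingConfig(
        num_workers=num_workers,
        use_gpu=use_gpu,
        resources_per_worker={"GPU": 1, "CPU": cpus_per_worker},
    ),
    datasets={"train": ray_datasets["train"], "evaluation": ray_datasets["validation"]},
    preprocessor=Chain(splitter, tokenizer),
    run_config=run_config,   # the only change from the snippet above
)
results = trainer.fit()

Is something like this the intended way to avoid retrieving the 5.53 GiB result over the client connection, or is there a different recommended pattern when using Ray Client?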
Any help would be much appreciated. Thanks!
Versions:
Kubernetes version: v1.25.6
Ray version: 2.3.1
Python version: 3.8