ray.train.Trainer will autoscale?

siva14guru · May 23, 2022, 12:52pm

I have set up a Ray cluster with the autoscaler enabled and max_workers set to 10, so it should scale up to a maximum of 10 workers when needed.

I am trying to train a model with num_workers set to 2 in ray.train.Trainer. Training starts with 1 head node and scales up to 1 worker node. When memory usage peaks, the workers die and the cluster does not scale any further, even though my Ray cluster is set up for up to 10 workers.

I need to know how the Trainer can utilize all the workers in the Ray cluster when there is a memory surge.

Thanks in advance

matthewdeng · May 23, 2022, 11:49pm

Hey @siva14guru,

Can you explain a little more about the memory peak you’re experiencing? Would it be possible to calculate this beforehand and set up the Trainer with the number of workers needed to handle this amount of memory?

siva14guru · May 24, 2022, 5:48am

Hey @matthewdeng,
Thank you for your reply!
I am trying to train on a dataset with a ResNet-101 architecture. I am using a crop size of 96x96, a batch size of 4 per worker, and 3 workers, each with 1 CPU and 3 GB of memory.

When training starts, memory fills up. I don't know how to calculate this beforehand.
Does Ray provide any API to do this?
If you could share any resources on calculating CPU, memory, and GPU memory requirements before training, it would be really helpful.

Thank you in advance

kai · May 24, 2022, 10:05am

Hi @siva14guru, Ray autoscaling is based on fixed resource requests that you specify. So if you specify 3 workers and 1 CPU + 3 GB per worker, Ray will autoscale the cluster to have at least 3 CPUs and 9 GB in the cluster (and go up to the next available unit in the node types you defined).
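To make the arithmetic above concrete, here is a minimal sketch in plain Python (no Ray calls). The helper names and the 2 CPU / 8 GB node type are hypothetical, chosen only for illustration. It also assumes each worker must fit entirely on one node and ignores the head node, so the real autoscaler's packing may differ.

```python
import math


def cluster_demand(num_workers, cpus_per_worker, mem_gb_per_worker):
    """Total CPUs and memory (GB) requested by all training workers."""
    return num_workers * cpus_per_worker, num_workers * mem_gb_per_worker


def nodes_needed(num_workers, cpus_per_worker, mem_gb_per_worker,
                 node_cpus, node_mem_gb):
    """Worker nodes of a single (hypothetical) node type needed to place all workers."""
    # A node can host as many workers as both its CPUs and its memory allow.
    workers_per_node = min(node_cpus // cpus_per_worker,
                           int(node_mem_gb // mem_gb_per_worker))
    if workers_per_node < 1:
        raise ValueError("one worker does not fit on this node type")
    return math.ceil(num_workers / workers_per_node)


# 3 workers with 1 CPU + 3 GB each, as in this thread.
total_cpus, total_mem_gb = cluster_demand(3, 1, 3)
print(total_cpus, total_mem_gb)

# Hypothetical node type with 2 CPUs and 8 GB of memory.
print(nodes_needed(3, 1, 3, node_cpus=2, node_mem_gb=8))
```

The point is that the request is fixed up front: the cluster grows to satisfy these numbers, not in response to how much memory the workers actually consume at runtime.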

Ray does not restrict memory usage of your workers on a technical level though. It is your responsibility to not use more memory than you allocated to your workers.

So if you’re running out of memory with 1 CPU + 3GB memory per worker, you can try allocating e.g. 1 CPU + 6 GB per worker instead.
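As a rough sketch of what that change might look like, the snippet below only builds the keyword arguments and does not import Ray. It assumes that `resources_per_worker` accepts a `"memory"` key given in bytes, and that the `"torch"` backend is in use (the thread does not say which framework is used). `worker_resources` and `trainer_kwargs` are hypothetical helper names.

```python
GIB = 1024 ** 3  # bytes per GiB


def worker_resources(cpus, memory_gib):
    """Per-worker resource request; the memory value is expressed in bytes."""
    return {"CPU": cpus, "memory": int(memory_gib * GIB)}


def trainer_kwargs(num_workers, cpus, memory_gib):
    """Keyword arguments one might pass to ray.train.Trainer (assumed usage)."""
    return {
        "backend": "torch",  # assumption: the framework is not stated in the thread
        "num_workers": num_workers,
        "resources_per_worker": worker_resources(cpus, memory_gib),
    }


before = trainer_kwargs(num_workers=3, cpus=1, memory_gib=3)
after = trainer_kwargs(num_workers=3, cpus=1, memory_gib=6)
print(after["resources_per_worker"])
```

With Ray installed, this would be used as `Trainer(**after)`. Whether the `"memory"` key is honored depends on the Ray version, so it is worth checking against the documentation for the release in use.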

siva14guru · May 24, 2022, 1:31pm

@kai thank you for the reply

I understand that if the workers do not have enough memory, we can increase the memory per worker and retry. Initially, I thought Ray would scale the number of workers based on memory usage. Since @matthewdeng mentioned that we can plan the number of workers before training, I would like to know how this is calculated so that I can design the worker resources.

kai · May 31, 2022, 8:36am

You can use memory as a resource string if that helps with the scaling configuration. The amount of memory needed depends completely on your training function and data, so this is something that only you can calculate.
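One pragmatic way to arrive at that number is to measure it: run a few training steps in a single process and read the peak resident memory. The sketch below is not a Ray API. It uses the standard-library `resource` module (Unix only), and `suggest_worker_memory_gib`, `dummy_step`, and the 1.5x headroom factor are hypothetical choices for illustration. It covers host memory only, not GPU memory.

```python
import resource
import sys


def peak_rss_gib():
    """Peak resident set size of this process so far, in GiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in KiB on Linux and in bytes on macOS.
    scale = 1 if sys.platform == "darwin" else 1024
    return peak * scale / 1024 ** 3


def suggest_worker_memory_gib(train_step, n_steps=3, headroom=1.5):
    """Run a few steps, then return the observed peak memory plus some headroom."""
    for _ in range(n_steps):
        train_step()
    return peak_rss_gib() * headroom


def dummy_step():
    """Stand-in for one real training step; allocates about 50 MiB."""
    data = bytearray(50 * 1024 * 1024)
    return len(data)


print(f"suggested memory per worker: {suggest_worker_memory_gib(dummy_step):.2f} GiB")
```

In practice, `dummy_step` would be replaced by one step of the real training loop at the intended batch size and crop size, and the result would feed into the per-worker memory request.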
