Will ray.train.Trainer autoscale?

I have set up a Ray cluster with the autoscaler and max_workers set to 10, so it should scale up to a maximum of 10 workers when needed.

I am trying to train a model with num_workers set to 2 in ray.train.Trainer. Training starts with 1 head node and scales up to 1 worker node. When memory peaks, the workers die, and the cluster does not scale any further, even though it is set up for up to 10 workers.

I need to know how Trainer can utilize all the workers in the Ray cluster when there is a memory surge.

Thanks in advance

Hey @siva14guru,

Can you explain a little more about the memory peak you’re experiencing? Would it be possible to calculate this beforehand and set up the Trainer with the number of workers needed to handle this amount of memory?

Hey @matthewdeng,
Thank you for your reply!
I am trying to train on a dataset with a ResNet-101 architecture. I am using a crop size of 96x96, a batch size of 4 per worker, and 3 workers, each with 1 CPU and 3 GB of memory.

When training starts, memory fills up. I don't know how to calculate this beforehand.
Does Ray provide an API for this?
If you could share any resources on calculating CPU, memory, and GPU memory requirements before training, it would be really helpful.

Thank you in advance

Hi @siva14guru, Ray autoscaling is based on fixed resource requests that you specify. So if you specify 3 workers and 1 CPU + 3 GB per worker, Ray will autoscale the cluster to have at least 3 CPUs and 9 GB in the cluster (and go up to the next available unit in the node types you defined).
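To make the arithmetic above concrete, here is a minimal sketch in plain Python (it does not call Ray). The function names and the node size of 4 CPUs + 8 GB are hypothetical, chosen only for illustration, and the node count is a simplification that ignores the head node and anything else running on the cluster.

```python
import math

def total_request(num_workers, cpus_per_worker, mem_gb_per_worker):
    """Aggregate resources requested by all training workers."""
    return {
        "CPU": num_workers * cpus_per_worker,
        "memory_gb": num_workers * mem_gb_per_worker,
    }

def nodes_needed(num_workers, cpus_per_worker, mem_gb_per_worker,
                 node_cpus, node_mem_gb):
    """Rough count of identical nodes needed to fit every worker."""
    # A worker must fit entirely on one node, so the tighter of the
    # two resources decides how many workers a node can hold.
    workers_per_node = min(node_cpus // cpus_per_worker,
                           node_mem_gb // mem_gb_per_worker)
    if workers_per_node == 0:
        raise ValueError("a single worker does not fit on this node type")
    return math.ceil(num_workers / workers_per_node)

# 3 workers with 1 CPU + 3 GB each, as in the question.
request = total_request(num_workers=3, cpus_per_worker=1, mem_gb_per_worker=3)
print(request)  # {'CPU': 3, 'memory_gb': 9}

# Hypothetical node type: 4 CPUs + 8 GB. Memory is the limit here.
nodes = nodes_needed(3, 1, 3, node_cpus=4, node_mem_gb=8)
print(nodes)  # 2
```

Note that the request is fixed up front: nothing in this calculation depends on how much memory the workers actually use at runtime.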

Ray does not restrict memory usage of your workers on a technical level though. It is your responsibility to not use more memory than you allocated to your workers.

So if you’re running out of memory with 1 CPU + 3GB memory per worker, you can try allocating e.g. 1 CPU + 6 GB per worker instead.
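As a rough sketch, the larger per-worker request could be built like this. The helper name is hypothetical, and the sketch assumes that your Ray version accepts "CPU" and "memory" keys in the Trainer's resources_per_worker argument and that memory is given in bytes, so please check the documentation for your version before relying on it.

```python
GIB = 1024 ** 3  # bytes per GiB (assumes the memory resource is in bytes)

def worker_resources(num_cpus, memory_gib):
    """Build a per-worker resource request (hypothetical helper)."""
    return {"CPU": num_cpus, "memory": memory_gib * GIB}

resources = worker_resources(num_cpus=1, memory_gib=6)
print(resources)  # {'CPU': 1, 'memory': 6442450944}

# Possible usage, under the assumptions stated above:
# from ray.train import Trainer
# trainer = Trainer(backend="torch", num_workers=3,
#                   resources_per_worker=resources)
```

With 3 workers this asks the autoscaler for at least 3 CPUs and 18 GiB in total.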

@kai Thank you for the reply.

I understand that if the workers do not have enough memory, we can increase the memory per worker and retry. Initially, I thought Ray would scale workers based on memory usage. Since @matthewdeng mentioned that we can plan the number of workers before training, I need to know how this is calculated in order to design the worker resources.

You can use memory as a resource string if that helps with the scaling configuration. The amount of memory needed depends completely on your training function and data, so this is something that only you can calculate.
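One way to get a starting number is to run a single training step locally and read the process's peak memory. Below is a minimal sketch using only the Python standard library (Unix only). The function names, the dummy step, and the 1.5x headroom factor are all assumptions for illustration; the measurement covers host (CPU) memory of the current process only, not GPU memory, and it includes memory already used by imports.

```python
import resource
import sys

def peak_rss_bytes():
    """Peak resident memory of this process so far, in bytes."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    return peak if sys.platform == "darwin" else peak * 1024

def measure_peak(step_fn, headroom=1.5):
    """Run one step, then return (peak bytes, peak with headroom)."""
    step_fn()
    peak = peak_rss_bytes()
    return peak, int(peak * headroom)

def dummy_step():
    """Stand-in for one real training step (hypothetical)."""
    buf = bytearray(50 * 1024 * 1024)  # allocate a 50 MiB buffer
    return len(buf)

peak, suggested = measure_peak(dummy_step)
print(f"peak RSS: {peak / 1024**2:.0f} MiB, "
      f"suggested request: {suggested / 1024**2:.0f} MiB")
```

Replace dummy_step with one real forward and backward pass at your actual batch size and crop size, then use the suggested value as a first guess for the per-worker memory request.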