ray.get() becomes very slow when I increase the number of epochs

Medkne · July 3, 2022, 12:54pm

I use ray.get(current_weights) to get the final weights of my model after all epochs have finished. The problem is that with 2 epochs, for example, ray.get(current_weights) takes 17s, while with 100 epochs it takes 1000s. I don’t know why.

Does ray.get take more time when remote functions are called many times?


    for epoch in range(n_epoch):
        start_epoch = time.time()
        for b in range(total_batch):
            gradients = [worker.compute_gradients.remote(current_weights) for worker in workers]
            current_weights = ps.apply_gradients.remote(*gradients)
        # True only on the last epoch, so the weights are fetched once at the end.
        if (epoch + 1) % n_epoch == 0:
            weights = ray.get(current_weights)  # This line is very slow if I increase n_epoch
            model.set_weights(weights)

I’m using a parameter server for gradient sharing.

How severely does this issue affect your experience of using Ray?

High: It blocks me from completing my task.

Stephanie_Wang · July 5, 2022, 5:45pm

In general, ray.get will take longer if passed more objects due to the time needed to fetch all of those objects to the current process.

However, from your code snippet, it appears that the slowdown is simply because your script is generating more work. The .remote calls return almost immediately because the actual functions execute asynchronously, so nearly all of the execution time shows up in the ray.get call. When n_epoch increases, the script submits more tasks to the worker and ps actors, and the ray.get call has to wait for all of these actor tasks to finish before returning.
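The same effect can be reproduced without a cluster. Below is a minimal sketch that uses the standard library's ThreadPoolExecutor as a stand-in for a single Ray actor; it is not Ray code. The names slow_step, run, and STEP_SECONDS are made up for illustration, and the sleep is an assumed per-step cost. Submitting the chained tasks takes almost no time, while the single blocking call at the end absorbs the full cost of every step that was queued.

```python
import time
from concurrent.futures import ThreadPoolExecutor

STEP_SECONDS = 0.05  # hypothetical cost of one training step


def slow_step(weights):
    """Stand-in for one compute_gradients + apply_gradients round."""
    time.sleep(STEP_SECONDS)
    return weights + 1


def run(n_steps):
    """Submit n_steps chained tasks, then block once for the final result."""
    # A single worker thread runs tasks in submission order, like one actor.
    with ThreadPoolExecutor(max_workers=1) as pool:
        t0 = time.perf_counter()
        future = pool.submit(slow_step, 0)
        for _ in range(n_steps - 1):
            # Each task consumes the previous result, similar to passing
            # the previous ObjectRef into the next .remote() call.
            future = pool.submit(lambda f=future: slow_step(f.result()))
        submit_time = time.perf_counter() - t0

        t1 = time.perf_counter()
        result = future.result()  # analogous to ray.get(current_weights)
        get_time = time.perf_counter() - t1
    return result, submit_time, get_time


result_small, submit_small, get_small = run(2)
result_large, submit_large, get_large = run(20)
print(f"2 steps:  submit {submit_small:.3f}s, get {get_small:.3f}s")
print(f"20 steps: submit {submit_large:.3f}s, get {get_large:.3f}s")
```

The time spent in the blocking call grows roughly in proportion to the number of steps submitted, even though the call itself only fetches one value. The blocking call is where the wait becomes visible; the steps themselves account for the time.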

Medkne · July 6, 2022, 11:29am

Thank you @Stephanie_Wang for your answer. Do you have any suggestions for speeding up the training? I’m using a parameter server to share gradients between the workers.

Stephanie_Wang · July 6, 2022, 6:50pm

Unfortunately it is very hard to give specific advice for this since performance debugging in general is a difficult problem.

However, let me point you to the docs for debugging and profiling on Ray core. I would also highly recommend that you look into Ray Train, a library for distributed training, instead of building directly over Ray core APIs.
