[Tune] Help Clarifying a tip in Trainable Class API

max_ronda · October 28, 2022, 8:07pm

Hi Ray Team, I found this statement in the Tune Trainable Class API docs:

> As a rule of thumb, the execution time of step should be large enough to avoid overheads (i.e. more than a few seconds), but short enough to report progress periodically (i.e. at most a few minutes).

I get the first part: a trial should take longer than just a second. But what about the second part? What will happen if my trial runs for hours? Will I not be able to use the Trainable Class API?

Thanks!

bveeramani · November 1, 2022, 2:11am

Hey @max_ronda, thanks for posting to the forum!

You should be able to use the Trainable API even if the execution time of step is long.

> …but short enough to report progress periodically (i.e. at most a few minutes).

My understanding is that there’s nothing wrong with long steps per se – it’s more that frequent reports are useful to provide observability. For example, if you report too infrequently, you wouldn’t know if your program froze.

@Max_Pumperla did I misinterpret this?

Max_Pumperla · November 1, 2022, 7:21am

Hey there!

I think the point is that “step” is designed to periodically report something. If something takes several hours to finish, you could still use it, but it would somewhat defy the purpose. In that case, you’d most likely just want to report your results once at the very end. Of course, if you train an LLM on the entire internet and each step does in fact take days, that’s ok too!

This is not a statement about Tune’s technical limitations, but rather about intended usage.
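If a long-running objective can be split up, one way to keep reports coming is to have each `step` call do a bounded slice of the work and return a metrics dict. The sketch below is plain Python, not Ray code: `ChunkedObjective`, `expensive_unit`, `run_trial`, and the config keys are hypothetical names, and the class only mirrors the `setup`/`step` method names of `tune.Trainable`. The `done` flag is simply how this sketch's own driver loop knows when to stop.

```python
def expensive_unit(i):
    # Hypothetical stand-in for one slice of a long-running objective.
    return float(i)


class ChunkedObjective:
    """Plain-Python stand-in that mirrors the setup/step method names."""

    def setup(self, config):
        self.total_units = config["total_units"]
        self.units_per_step = config["units_per_step"]
        self.done_units = 0
        self.running_sum = 0.0

    def step(self):
        # Do a bounded amount of work, then hand back a progress report.
        end = min(self.done_units + self.units_per_step, self.total_units)
        for i in range(self.done_units, end):
            self.running_sum += expensive_unit(i)
        self.done_units = end
        return {
            "score": self.running_sum,
            "progress": self.done_units / self.total_units,
            "done": self.done_units >= self.total_units,
        }


def run_trial(config):
    # Hypothetical driver loop: call step() until the objective says it is done.
    trial = ChunkedObjective()
    trial.setup(config)
    reports = []
    while True:
        result = trial.step()
        reports.append(result)
        if result["done"]:
            break
    return reports


if __name__ == "__main__":
    for report in run_trial({"total_units": 10, "units_per_step": 4}):
        print(report)
```

With Ray installed, the same shape would presumably carry over by subclassing `tune.Trainable` and letting Tune call `step` for you. Picking `units_per_step` so that one call lands between a few seconds and a few minutes matches the rule of thumb quoted above.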

max_ronda · November 1, 2022, 3:25pm

Thanks for the clarification @Max_Pumperla! In my case, I am using Tune slightly differently: not only to tune an ML model, but to optimize any objective function. So steps might take longer depending on what I am running.

Another quick question:

Should there be any noticeable difference between using the Function API and the Trainable API? From my testing, I found the Trainable API scaled better. Is that because the Trainable API uses Ray Actors and spawns one Actor per step, while the Function API uses threads within a Ray Actor? Could you clarify that?

bveeramani · November 1, 2022, 11:43pm

> Should there be any noticeable difference using Function API vs Trainable API?

There shouldn’t be. We convert functions to Trainables internally.
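To illustrate the general idea of presenting a function-style objective through a class with a `step` method, here is a minimal plain-Python sketch. It is not Ray's internal code: it uses a generator as a stand-in for whatever mechanism Tune actually uses, and `objective`, `FunctionAsClass`, and the config keys are hypothetical names.

```python
def objective(config):
    # Hypothetical function-style objective: yields one metrics dict per iteration.
    x = config["x"]
    for i in range(config["iterations"]):
        yield {"iteration": i + 1, "loss": (x - 3) ** 2 / (i + 1)}


class FunctionAsClass:
    """Wraps a reporting function so callers only ever see setup() and step()."""

    def __init__(self, fn):
        self._fn = fn
        self._reports = None

    def setup(self, config):
        # Creating the generator does not run any of the function body yet.
        self._reports = self._fn(config)

    def step(self):
        # Advance the wrapped function to its next report.
        result = next(self._reports, None)
        if result is None:
            return {"done": True}
        return {**result, "done": False}


if __name__ == "__main__":
    wrapped = FunctionAsClass(objective)
    wrapped.setup({"x": 5, "iterations": 2})
    while True:
        report = wrapped.step()
        print(report)
        if report["done"]:
            break
```

From the caller's side the wrapped function and a hand-written class look the same, which is why no large difference would be expected between the two styles.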

> From my testing, I found Trainable API scaled better.

I’m surprised to hear this. Could you tell me more? In particular, by which metric did the Trainable API scale better?

max_ronda · November 8, 2022, 7:14pm

Hey @bveeramani, I will post some benchmarks to showcase that.

bveeramani · November 9, 2022, 8:15pm

@max_ronda awesome, thanks!
