Hi Ray Team, I found this statement on the Tune Training Class API:
As a rule of thumb, the execution time of step should be large enough to avoid overheads (i.e. more than a few seconds), but short enough to report progress periodically (i.e. at most a few minutes).
I get the first part. A trial should take longer than just a second but what about the second part? What will happen if my trial runs for hours? Will I not be able to use the Trainable Class API?
Thanks!
Hey @max_ronda, thanks for posting to the forum!
You should be able to use the Trainable API even if the execution time of step is long.
…but short enough to report progress periodically (i.e. at most a few minutes).
My understanding is that there’s nothing wrong with long steps per se – it’s more that frequent reports are useful to provide observability. For example, if you report too infrequently, you wouldn’t know if your program froze.
@Max_Pumperla did I misinterpret this?
Hey there!
I think the point is that “step” is designed to periodically report something. If something takes several hours to finish, you could still use it, but it would somewhat defeat the purpose. In that case, you’d most likely just want to report your results once at the very end. Of course, if you train an LLM on the entire internet and each step does in fact take days, that’s ok too!
This is not a statement about Tune’s technical limitations, but rather about intended usage.
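To make the "step reports periodically" idea concrete, here is a minimal sketch of splitting a long-running objective into bounded chunks, each returning a metrics dict. It deliberately does not import Ray: ChunkedObjective is a hypothetical stand-in that only mirrors the setup/step shape of a Trainable, and the config keys (x0, lr, iters_per_step) and the toy objective (x - 3)^2 are assumptions made for illustration.

```python
class ChunkedObjective:
    """Hypothetical stand-in with a Trainable-like setup/step shape (no Ray import)."""

    def setup(self, config):
        # Keep all state on the instance so work can resume on the next step.
        self.x = config["x0"]
        self.lr = config["lr"]
        self.iters_per_step = config["iters_per_step"]
        self.total_iters = 0

    def step(self):
        # Do a bounded chunk of work, then return metrics so progress is visible.
        for _ in range(self.iters_per_step):
            grad = 2.0 * (self.x - 3.0)  # gradient of (x - 3)^2
            self.x -= self.lr * grad
        self.total_iters += self.iters_per_step
        return {"loss": (self.x - 3.0) ** 2, "total_iters": self.total_iters}


def run(config, num_steps):
    # Plain driver loop standing in for whatever calls step() repeatedly.
    obj = ChunkedObjective()
    obj.setup(config)
    return [obj.step() for _ in range(num_steps)]


reports = run({"x0": 0.0, "lr": 0.1, "iters_per_step": 50}, num_steps=4)
for report in reports:
    print(report)
```

Raising iters_per_step makes each step longer and the reports rarer, which is the trade-off the rule of thumb describes. If the work cannot be split at all, a single long step that reports once at the end is the degenerate case of the same pattern.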
Thanks for the clarification @Max_Pumperla! In my case, I am using Tune slightly differently: not only to tune an ML model, but to optimize any objective function. So steps might take longer depending on what I am running.
Another quick question:
Should there be any noticeable difference using Function API vs Trainable API? From my testing, I found Trainable API scaled better. Is that because Trainable API uses Ray Actors and spawns one Actor with one Step? While Function API uses threads within Ray Actor? Could you clarify that?
Should there be any noticeable difference using Function API vs Trainable API?
There shouldn’t be. We convert functions to Trainables internally.
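As a rough mental model only, and not Tune's actual implementation, wrapping a function in a step-based object could look like the sketch below. FunctionAdapter and objective are hypothetical names, and using a generator whose yield stands in for one progress report is an assumption chosen to keep the example self-contained without Ray.

```python
class FunctionAdapter:
    """Hypothetical adapter exposing a generator function through setup/step."""

    def __init__(self, fn):
        self._fn = fn
        self._gen = None

    def setup(self, config):
        # Calling a generator function creates the generator without running its body yet.
        self._gen = self._fn(config)

    def step(self):
        # Advance the wrapped function to its next reported result.
        try:
            return next(self._gen)
        except StopIteration:
            return {"done": True}


def objective(config):
    # Each yield stands in for one progress report from the function.
    total = 0
    for i in range(config["n"]):
        total += i
        yield {"iteration": i, "total": total}


adapter = FunctionAdapter(objective)
adapter.setup({"n": 3})
results = [adapter.step() for _ in range(4)]
print(results)
```

Under this model the caller only ever sees setup and step, whichever style the user wrote, which is why no large difference between the two APIs would be expected.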
From my testing, I found Trainable API scaled better.
I’m surprised to hear this. Could you tell me more? In particular, by which metric did the trainable API scale better?
Hey @bveeramani I will post some benchmarks to showcase that
@max_ronda awesome, thanks!