How severe does this issue affect your experience of using Ray?
- Medium: It contributes to significant difficulty to complete my task, but I can work around it.
Hello, I am trying to optimize my overall training process since I see these messages:
```
2022-09-08 13:09:45,416 WARNING util.py:220 -- The `callbacks.on_trial_result` operation took 16.056 s, which may be a performance bottleneck.
2022-09-08 13:09:45,539 WARNING util.py:220 -- The `process_trial_result` operation took 16.182 s, which may be a performance bottleneck.
2022-09-08 13:09:45,539 WARNING util.py:220 -- Processing trial results took 16.182 s, which may be a performance bottleneck. Please consider reporting results less frequently to Ray Tune.
2022-09-08 13:09:45,539 WARNING util.py:220 -- The `process_trial_result` operation took 19.160 s, which may be a performance bottleneck.
```
for a while now, and had just ignored them because they didn't seem like big numbers, until I scaled up my scenario and now training takes too long. I read that such operations should usually only take around 500 ms. To fix this, I was going through the Ray Tune FAQ (Ray 2.0.0), but I am unable to implement the solutions because I am confused by the terminology and, with my limited understanding, cannot find what they are talking about in the documentation.
For example:

> **You are reporting results too often**
>
> Each result is processed by the search algorithm, trial scheduler, and callbacks (including loggers and the trial syncer). If you're reporting a large number of results per trial (e.g. multiple results per second), this can take a long time.
>
> Solution: The solution here is obvious: Just don't report results that often. In class trainables, `step()` should maybe process a larger chunk of data. In function trainables, you can report only every n-th iteration of the training loop. Try to balance the number of results you really need to make scheduling or searching decisions. If you need more fine-grained metrics for logging or tracking, consider using a separate logging mechanism for this instead of the Ray Tune-provided progress logging of results.
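If I understand this right, for a plain function trainable it would look something like the sketch below (this is my own guess based on the FAQ wording, using the Ray 2.0 `session.report()` API and a made-up `REPORT_EVERY` interval), but I don't see how this maps to training PPO through `tune.run()`:

```python
from ray import tune
from ray.air import session

REPORT_EVERY = 10  # made-up interval: only hand results to Tune every 10 iterations

def train_fn(config):
    total_loss = 0.0
    for i in range(1000):
        total_loss += 0.0  # placeholder for one real training step
        # Reporting only every n-th iteration means the search algorithm,
        # scheduler, and callbacks run less often.
        if (i + 1) % REPORT_EVERY == 0:
            session.report({"iter": i + 1, "loss": total_loss})

tune.run(train_fn)
```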
or
> **The Trial result is very large**
>
> This is the case if you return objects, data, or other large objects via the return value of `step()` in your class trainable or to `session.report()` in your function trainable. The effect is the same as above: The results are repeatedly serialized and written to disk, and this can take a long time.
>
> Solution: Use checkpoints by writing data to the trainable's current working directory instead. There are various ways to do that depending on whether you are using the class or functional Trainable API.
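Again, my best guess (my own sketch, not taken from the docs) is that instead of putting large objects into the result dict, one would write them to the trial's working directory (I believe Tune changes into a per-trial directory by default) and report only small scalar metrics, roughly like this:

```python
import os
import pickle

from ray import tune
from ray.air import session

def train_fn(config):
    for i in range(100):
        big_artifact = [0.0] * 1_000_000  # placeholder for a large object
        # Write the large object into the trial's current working directory ...
        with open(os.path.join(os.getcwd(), f"artifact_{i}.pkl"), "wb") as f:
            pickle.dump(big_artifact, f)
        # ... and report only small scalar metrics to Tune.
        session.report({"iter": i, "score": 0.0})

tune.run(train_fn)
```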
The solutions seem to outline the steps, but I have no clue what actually needs to change in my code:
- I am using tune.run() to train with "PPO". Is that considered a function trainable or a class trainable?
- I am returning obs, rewards, dones, and infos from the step() function of my custom environment. Should I not be returning these? That was the signature of step() when I last checked; has this changed?
- There are multiple settings that configure how often results are reported. My current values are listed below (the sketch after this list shows roughly how I pass them). Which of these would help me reduce the frequency of reporting?
  - keep_per_episode_custom_metrics = True (default is False)
  - metrics_episode_collection_timeout_s = 60.0
  - metrics_num_episodes_for_smoothing = 100
  - min_time_s_per_iteration = None
  - min_train_timesteps_per_iteration = 0
  - min_sample_timesteps_per_iteration = 0
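For reference, this is roughly how I am launching training (heavily simplified; the environment name is a placeholder, most of the config is omitted, and only the reporting-related keys from the list above are shown):

```python
from ray import tune

tune.run(
    "PPO",
    config={
        "env": "MyCustomEnv-v0",  # placeholder; my real custom environment is registered here
        "keep_per_episode_custom_metrics": True,   # default is False
        "metrics_episode_collection_timeout_s": 60.0,
        "metrics_num_episodes_for_smoothing": 100,
        "min_time_s_per_iteration": None,
        "min_train_timesteps_per_iteration": 0,
        "min_sample_timesteps_per_iteration": 0,
    },
    stop={"training_iteration": 100},  # illustrative stopping criterion
)
```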
Please advise. Thank you.