Trying to optimize training but finding documentation insufficient

How severely does this issue affect your experience of using Ray?

  • Medium: It contributes to significant difficulty to complete my task, but I can work around it.

Hello, I am trying to optimize my overall training process since I see these messages:

2022-09-08 13:09:45,416 WARNING util.py:220 -- The `callbacks.on_trial_result` operation took 16.056 s, which may be a performance bottleneck.
2022-09-08 13:09:45,539 WARNING util.py:220 -- The `process_trial_result` operation took 16.182 s, which may be a performance bottleneck.
2022-09-08 13:09:45,539 WARNING util.py:220 -- Processing trial results took 16.182 s, which may be a performance bottleneck. Please consider reporting results less frequently to Ray Tune.
2022-09-08 13:09:45,539 WARNING util.py:220 -- The `process_trial_result` operation took 19.160 s, which may be a performance bottleneck.

for a while now. I ignored them because the numbers didn’t seem large, but now that I’ve scaled up my scenario, training is taking too long. I read that such operations should usually take only around 500 ms. To fix this, I was going through Ray Tune FAQ — Ray 2.0.0, but I am unable to implement the solutions: the terminology confuses me, and with my limited understanding I can’t find what the FAQ is talking about in the documentation.

For example:

You are reporting results too often

Each result is processed by the search algorithm, trial scheduler, and callbacks (including loggers and the trial syncer). If you’re reporting a large number of results per trial (e.g. multiple results per second), this can take a long time.

Solution: The solution here is obvious: Just don’t report results that often. In class trainables, step() should maybe process a larger chunk of data. In function trainables, you can report only every n-th iteration of the training loop. Try to balance the number of results you really need to make scheduling or searching decisions. If you need more fine grained metrics for logging or tracking, consider using a separate logging mechanism for this instead of the Ray Tune-provided progress logging of results.
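For what "report only every n-th iteration" could look like in practice, here is a minimal sketch of a function trainable, assuming Ray 2.0's `ray.air.session.report` API; the metric name, `report_every` value, and step count are illustrative, not from the thread:

```python
from typing import Dict

def should_report(step: int, report_every: int = 100) -> bool:
    """Return True only on every `report_every`-th iteration."""
    return step % report_every == 0

def train_fn(config: Dict):
    # Imported lazily so the sketch reads without Ray installed.
    from ray.air import session  # Ray 2.0 AIR reporting API
    for step in range(1, config.get("max_steps", 10_000) + 1):
        loss = 1.0 / step  # placeholder for real training work
        if should_report(step, config.get("report_every", 100)):
            # Only this call hands a result to Tune's search algorithm,
            # scheduler, and callbacks.
            session.report({"loss": loss})
```

With `report_every = 100`, Tune processes one result per 100 training steps instead of one per step.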

or

The Trial result is very large

This is the case if you return objects, data, or other large objects via the return value of step() in your class trainable or to session.report() in your function trainable. The effect is the same as above: The results are repeatedly serialized and written to disk, and this can take a long time.

Solution: Use checkpoint by writing data to the trainable’s current working directory instead. There are various ways to do that depending on whether you are using class or functional Trainable API.
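As an illustration of "write data to the trainable's current working directory": Tune changes the working directory to the trial's own directory, so relative paths land there. This sketch (file name, payload, and helper name are all hypothetical) saves a large object to disk and returns only a small reference that is cheap to include in a reported result:

```python
import json
import os

def save_artifact(payload, step, out_dir="."):
    """Write a large object to the trial's working directory and
    return only a small reference suitable for a result dict."""
    path = os.path.join(out_dir, f"rollout_{step}.json")
    with open(path, "w") as f:
        json.dump(payload, f)
    # Report the path, not the payload: the result dict stays small,
    # so serializing it each iteration is cheap.
    return {"artifact_path": path}

# Inside a function trainable you would then do something like:
# session.report({"loss": loss, **save_artifact(big_rollout, step)})
```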

The solution seems to outline the steps but I have no clue what needs to be changed in code.

  1. I am using tune.run() to train with ‘PPO’, is that considered a function trainable or a class trainable?

  2. I am returning the obs, rewards, dones, infos in the step() function of my custom environment - should I not be returning these? But that was the signature of the step() function when I last checked - has this changed?

  3. There are multiple settings that configure times for reporting, my current settings are:

keep_per_episode_custom_metrics = True, # default is False
metrics_episode_collection_timeout_s = 60.0,
metrics_num_episodes_for_smoothing = 100,
min_time_s_per_iteration = None,
min_train_timesteps_per_iteration = 0,
min_sample_timesteps_per_iteration = 0,

What would help me reduce frequency of reporting?

Please advise. Thank you.

Hi @hridayns ,

  1. PPO inherits from Trainable and is therefore a class trainable. You could write your own function that runs training and reports results, which would be a function trainable.

  2. The values you are returning in your env are correct. But they won’t be for long, because gym 0.26 was released yesterday and changes the step() API. This change is not part of RLlib today, but should be merged in the days to come.
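For context on the API change mentioned above, here is a sketch of the difference (an assumption based on gym's 0.26 release notes, not on RLlib code; the placeholder values are illustrative):

```python
# Before gym 0.26: step() returns a 4-tuple.
def step_old(action):
    obs, reward, done, info = 0, 1.0, False, {}
    return obs, reward, done, info

# From gym 0.26: `done` is split into `terminated` and `truncated`,
# so step() returns a 5-tuple.
def step_new(action):
    obs, reward = 0, 1.0
    terminated = False  # episode ended in a terminal state
    truncated = False   # episode was cut off, e.g. by a time limit
    info = {}
    return obs, reward, terminated, truncated, info
```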

  3. The reporting time and frequency do not change the course of the training. Therefore, reporting often (every couple of seconds) only makes sense while you are debugging or for very quick trainings. You can scale up the time between reports with min_time_s_per_iteration. Here is the docstring:

# Minimum time interval over which to accumulate within a single `train()` call.
# This value does not affect learning, only the number of times
# `self.step_attempt()` is called by `self.train()`.
# If - after one `step_attempt()`, the time limit has not been reached,
# will perform n more `step_attempt()` calls until this minimum time has been
# consumed. Set to 0 for no minimum time.
"min_time_s_per_iteration": 0,

Cheers

Hello, thank you for your response. I think I’ve understood 1. As for 2., I’m guessing I can’t make changes yet, but the “The Trial result is very large” issue should resolve itself once I conform to those changes when they come out.

Regarding 3.,

2022-09-08 13:09:45,539 WARNING util.py:220 -- Processing trial results took 16.182 s, which may be a performance bottleneck. Please consider reporting results less frequently to Ray Tune.

In order to deal with this issue, how can I report results less frequently? Please advise. Thank you.

To report less frequently, increase min_time_s_per_iteration.

1 Like

I set it to 0 (because it said “Set to 0 for no minimum time”) but it didn’t seem to change anything. I still get the same warnings.

If you set it to zero, it will report as soon as data is available to report.

Set to 0 for no minimum time.

Oh I’m so sorry, I feel quite stupid. I misinterpreted the wording somehow… Yes, so I just have to set min_time_s_per_iteration to a higher value. Understood! Thank you again.

1 Like