The `process_trial_save` operation took X s, which may be a performance bottleneck

I already posted a similar issue here and found other people with the same problem, but I'm still not able to solve it. In my training loop I save checkpoints as:

    import os
    import torch
    from ray.air import session
    from ray.air.checkpoint import Checkpoint

    # Write one torch file per (model, optimizer) pair into the trial directory.
    trial_dir = session.get_trial_dir()
    checkpoint = Checkpoint.from_directory(trial_dir)
    for i, (model, opt) in enumerate(zip(model_type, optimizer)):
        torch.save((model, opt.state_dict()), os.path.join(trial_dir, "checkpoint" + str(i)))
    # Materialize the metrics dict into the same directory and report it as the checkpoint.
    metrics_ckpt = Checkpoint.from_dict({"loss": val_loss, "running_loss": running_loss, "training_iteration": epoch})
    with checkpoint.as_directory() as ckpt_dir:
        metrics_ckpt.to_directory(ckpt_dir)
    session.report({"loss": val_loss, "running_loss": running_loss}, checkpoint=checkpoint)

My model training is taking a very long time because of this. How can I overcome it? Is this a known Ray problem? Please let me know; I'm stuck.
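
Would a leaner pattern along these lines help, where each epoch's files go into a fresh temporary directory (via tempfile) and only that directory is reported? This is just a sketch reusing my model_type / optimizer lists and metrics from above, not something I have verified:

    import os
    import tempfile
    import torch
    from ray.air import session
    from ray.air.checkpoint import Checkpoint

    # Sketch only: write this epoch's files into a fresh temporary directory
    # so that only the current epoch's data is attached to the report.
    tmp_dir = tempfile.mkdtemp()
    for i, (model, opt) in enumerate(zip(model_type, optimizer)):
        torch.save((model, opt.state_dict()), os.path.join(tmp_dir, f"checkpoint{i}"))
    checkpoint = Checkpoint.from_directory(tmp_dir)
    session.report({"loss": val_loss, "running_loss": running_loss}, checkpoint=checkpoint)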

Hi @0piero ,

Some of RLlib's workloads simply take much more time than other workloads tuned with Tune.
Luckily, setting up an experiment is a one-time cost per run, so unless you are running many very short experiments, it should not take up a significant share of your training time.
Because RLlib may rely on many Ray actors and set up many, possibly large, models, there is simply some work that takes time.
I hope this does not bar you from achieving your goals with RLlib!

Cheers