The `process_trial_save` operation took X s, which may be a performance bottleneck

I already posted a similar issue here and found other people with the same problem, but I'm still not able to solve it. In my training loop I save checkpoints as:

    import os
    import torch
    from ray.air import session
    from ray.air.checkpoint import Checkpoint

    # Write one torch file per (model, optimizer) pair into the trial directory.
    trial_dir = session.get_trial_dir()
    checkpoint = Checkpoint.from_directory(trial_dir)
    for i, (model, opt) in enumerate(zip(model_type, optimizer)):
        torch.save((model, opt.state_dict()), os.path.join(trial_dir, "checkpoint" + str(i)))
    # Materialize the metrics dict into the same directory and report it as the checkpoint.
    metrics_ckpt = Checkpoint.from_dict({"loss": val_loss, "running_loss": running_loss, "training_iteration": epoch})
    with checkpoint.as_directory() as ckpt_dir:
        metrics_ckpt.to_directory(ckpt_dir)
    session.report({"loss": val_loss, "running_loss": running_loss}, checkpoint=checkpoint)

My model training is taking a very long time because of this. How can I overcome it? Is this a known Ray problem? Please let me know; I'm stuck.
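
Would a leaner pattern along these lines help, where each epoch's files go into a fresh temporary directory (via tempfile) and only that directory is reported? This is just a sketch reusing my model_type / optimizer lists and metrics from above, not something I have verified:

    import os
    import tempfile
    import torch
    from ray.air import session
    from ray.air.checkpoint import Checkpoint

    # Sketch only: write this epoch's files into a fresh temporary directory
    # so that only the current epoch's data is attached to the report.
    tmp_dir = tempfile.mkdtemp()
    for i, (model, opt) in enumerate(zip(model_type, optimizer)):
        torch.save((model, opt.state_dict()), os.path.join(tmp_dir, f"checkpoint{i}"))
    checkpoint = Checkpoint.from_directory(tmp_dir)
    session.report({"loss": val_loss, "running_loss": running_loss}, checkpoint=checkpoint)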

Hi @0piero ,

Some of RLlib's workloads simply take much more time than other workloads tuned with Tune.
Luckily, setting up an experiment is a one-time cost per run, so unless you are running many very short experiments, it should not take up a significant share of your training time.
Because RLlib may rely on many Ray actors and set up many, possibly large, models, there is simply some work that takes time.
I hope this does not bar you from achieving your goals with RLlib!

Cheers