The `process_trial_save` operation took X s, which may be a performance bottleneck

I already posted a similar issue here and found other people with the same problem, but I still haven't been able to solve it. In my training loop I save checkpoints like this:

            dir = session.get_trial_dir()
            checkpoint = Checkpoint.from_directory(dir)
            for id, (model, opt) in enumerate(zip(model_type, optimizer)):
                torch.save((model.state_dict(), opt.state_dict()), os.path.join(dir, "checkpoint" + str(id)))
            chkpt = Checkpoint.from_dict({"loss": val_loss, "running_loss": running_loss, "training_iteration": epoch})
            with checkpoint.as_directory() as chkptdir:
                session.report({"loss": val_loss, "running_loss": running_loss}, checkpoint=checkpoint)

My model training is taking a very long time because of this. How can I overcome it? Is this a known Ray problem? Please let me know, I’m stuck.

Some of RLlib’s workloads simply take much more time than other things tuned with Tune.
Luckily, setting up an experiment is a one-time cost per run. So unless you are running many very short experiments, it should not take up a significant share of your training time.
Because RLlib may rely on many Ray actors and may set up many, possibly large, models, some of the things we do simply take time.
I hope this does not bar you from achieving your goals with RLlib!
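
If the time is going into the checkpoint save itself, one thing that might help is saving less often and measuring how long each save takes. Below is a minimal sketch of that idea, not a drop-in fix. The names `should_checkpoint`, `timed_save` and `every_n` are made up for illustration, `pickle` stands in for `torch.save`, and the dummy `state` stands in for the model and optimizer `state_dict`s. In the real loop you would presumably still report metrics every epoch and only attach a checkpoint on the epochs selected here.

```python
import os
import pickle
import tempfile
import time


def should_checkpoint(epoch, num_epochs, every_n=5):
    """Save on every `every_n`-th epoch and always on the final epoch."""
    return (epoch + 1) % every_n == 0 or epoch == num_epochs - 1


def timed_save(state, path):
    """Write `state` to `path` and return the elapsed wall-clock seconds."""
    start = time.perf_counter()
    with open(path, "wb") as f:
        pickle.dump(state, f)
    return time.perf_counter() - start


def run(num_epochs=12, every_n=5):
    """Toy training loop that records which epochs wrote a checkpoint."""
    saved_epochs = []
    with tempfile.TemporaryDirectory() as tmpdir:
        for epoch in range(num_epochs):
            # Hypothetical stand-in for the model/optimizer state_dicts.
            state = {"epoch": epoch, "weights": [0.0] * 1000}
            if should_checkpoint(epoch, num_epochs, every_n):
                path = os.path.join(tmpdir, "checkpoint" + str(epoch))
                elapsed = timed_save(state, path)
                saved_epochs.append(epoch)
                print(f"epoch {epoch}: checkpoint saved in {elapsed:.4f} s")
    return saved_epochs


saved = run()
```

Timing the save separately should also show whether the serialization is the slow part or whether the time is spent elsewhere, for example in copying a large trial directory.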