Checkpointing simulator folders with RLlib and Tune

How severe does this issue affect your experience of using Ray?

  • Medium: It contributes to significant difficulty to complete my task, but I can work around it.

I am using an external simulator that writes result records into a folder. I have wrapped this simulator into an RLlib ExternalEnv. Each environment instance writes into another folder (I use uuid.uuid4().hex to create unique folder names).

What I want to achieve
I would like to keep with each checkpoint of RLlib’s policy also the corresponding result files from the simulator as these show me a lot about what the policy has achieved in the simulator. For example, when I sync with tune everything to the cloud, I want the simulator folders synced together at the same point in time with checkpoints. This should work on a ray cluster.

The ExternalEnv knows the names of the folders, but does not know where the checkpoints go and when.
The algorithm (or tune) knows where the checkpoint goes, but does not know the folder names of the simulator results.

How should this be done?

What I do so far is the following. I create a tune.LoggerCallback and therein modify the log_trial_save() method:

class ExternalEnvLogger(LoggerCallback):  
    def __init__(self, compress=False):
        self.compress = compress        
    def log_trial_save(self, trial: "Trial"):
        checkpoint_path = trial.checkpoint.dir_or_data
        checkpoint_dir = pathlib.Path(TrainableUtil.find_checkpoint_dir(checkpoint_path))
        for path in pathlib.Path(".").glob("worker-*"):
            worker_dir = path
        env_checkpoint_dir = checkpoint_dir.joinpath("env_out")
        env_working_dir = worker_dir.joinpath("env_out")
        if self.compress:
            shutil.make_archive(env_checkpoint_dir.as_posix(), "zip", worker_dir.as_posix(), "env_out")
            shutil.copytree(env_working_dir.as_posix(), env_checkpoint_dir.as_posix())

I however do not know if this is a safe way to do it, especially when syncing to cloud storage and running on a Ray cluster.

@kai @Yard1 Maybe you know, if this should do the work?

@gjoliver for visibility regarding RLLib checkpoints

Now sure what exactly is the format of the simulator output, but a wild idea is that can you save your simulator results as custom metrics on result dicts using Callbacks.on_train_result()?
Simulator results can be written to a well-known location. And the callback can remember things that have been attached so far, and new data that still have to be attached.

@gjoliver Thanks Jun, for taking a look at the problem. So, the output of the simulator contains a bunch of files, like .csv, .html,.sql, and a couple of text files (zipped around 1.7 MB). These are very valuable information as it can be analysed what the simulator has done during a certain episode.

So, this information cannot be stored to the results dictionary, but has to be collected from disk. If this is happening on different nodes it has to be synced somewhere. From a discussion with @Yard1 I got some ideas how it could work (and I guess most people would use Tune for running RLlib):

  1. Have the Trainable create a temporary directory and send the path to it together with its IP to the workers.
  2. Each worker saves its output to its own temporary directory.
  3. When the workers are done (e.g. with a rollout) let them sync their files with the temporary directory of the Trainable passed to them in 1.
  4. In Trainable.save_checkpoint() move the files from the temporary directory to the checkpoint directory.

I think this would be a great enhancement for the ExternalEnv as I assume that many simulators have their own file outputs. I am open to make a PR for this as I have to make it anyway, but some guidance or direction would help.

I feel like there is still something weird about this proposed setup, where we will sync a bunch of files written by external simulators onto driver node (using a static IP addr that may or may not be reachable from the external simulator), and use Tune simply as a file syncer.
It’s probably easier to write directly from the external simulators to cloud storage, and just figure out a scheme to cross-reference between checkpoints and external data files, for example using timestamps or something.

The direct writing to cloud storage is at first a plausible solution.
The thing that friightens me about this approach is that I would need then to manually configure gsutil on any worker node (as I have to always do for the master when setting up a cluster on GCP Compute Engine with Ray’s autoscaler). This might possibly be due to my own lack of knowledge, though. So, I gonna reinvestigate this approach.

In my opinion more elegantly (especially for post-analysis) would be to provide the user with a save_env_checkpoint() in the ExternalEnv that manages storing and syncing also in the more elaborate cluster setup. As soon as a user provides an ExternalEnv he has usually an external simulator that outputs its own log, error and result files.