[Tune Class API + PyTorch] Possible to add more custom scalars+weights+biases to Tensorboard events file?

I’m using the ray tune class API. I see that the hyperparameters for all trials + some other metrics (e.g. time_this_iter_s) are passed to the tfevents file so that I can view them on Tensorboard.
However, I would like to pass more scalars (e.g. loss function value)/histograms (the network’s weights and biases) to Tensorboard.
When using PyTorch on its own, this can be done like this (source):

from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter()
writer.add_scalar('epoch_loss', epoch_loss, current_epoch)
for name, values in model.named_parameters():
    writer.add_histogram(name, values, current_epoch)
    writer.add_histogram(f'{name}.grad', values.grad, current_epoch)

Is it possible to combine this approach with Ray, e.g. by passing the writer object to Ray? Or is there another way to add more information to the tfevents file created by Ray Tune?

Ray version: 1.0.1.post1
PyTorch version: 1.7.0

Hello, I’m trying to do something similar. Did you find a solution?

Hi, not really, unfortunately. My current workaround is to use the function API instead and then create a “normal” PyTorch SummaryWriter (from torch.utils.tensorboard) in addition to the file created by Ray (which I’m not using). I then log train/validation loss, weights + gradients and hparams, but this creates multiple files per run/Ray trial, which gets messy when viewing the files in Tensorboard. So I’m not happy with that solution yet, because I want to have only a single tfevents file per trial.

I think the solution will be to customize ray tune’s logging, but I do not yet understand how to do this, also not after having read the (probably relevant) parts of the documentation:
How to customize logging: Loggers (tune.logger) — Ray v2.0.0.dev0
Sourcecode of ray’s Tensorboard logger class: ray.tune.logger — Ray v2.0.0.dev0

Based on that, do you have any idea how to change Ray Tune’s default Tensorboard logging behaviour (I assume by inheriting from and modifying the respective logger classes)?

Thank you for the logger links. The Trainable Logging section sounds like it could be the solution, but I don’t know how to use it and couldn’t find any example code for it. If you come across an example, please let me know.

On the same page there is a Wandb logger example that might help you. I think it hosts the weights/biases on their servers, so you don’t have everything in a single local location, but at least it is managed.

Regarding your PyTorch SummaryWriter: would you share some code showing how you did it? Does it run on a Ray cluster, or only locally?

[edit] Wandb is in the master docs, but not in the v1.1 docs.
https://docs.ray.io/en/master/tune/api_docs/logging.html#wandblogger
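
Based on the master docs, wiring it in could look roughly like this (the project name, API key path and the trainable are placeholders; I haven’t tried this on v1.1):

from ray import tune
from ray.tune.integration.wandb import WandbLoggerCallback  # per the master docs

tune.run(
    my_trainable,                          # placeholder: your trainable function/class
    config=config,
    callbacks=[WandbLoggerCallback(
        project="my_project",              # placeholder project name
        api_key_file="~/.wandb_api_key",   # placeholder path to your W&B API key
        log_config=True)])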

Yeah! Sorry for the late reply. You can disable Tune’s default tensorboard logging by doing something like:

import os
os.environ["TUNE_DISABLE_AUTO_CALLBACK_LOGGERS"] = "1"  # disable Tune's built-in logger callbacks
from ray import tune
from ray.tune.logger import CSVLoggerCallback, JsonLoggerCallback

# keep only the CSV and JSON loggers, leaving out the Tensorboard one
base_callbacks = [CSVLoggerCallback(), JsonLoggerCallback()]
tune.run(my_trainable, callbacks=base_callbacks)  # my_trainable = your trainable

Thanks for the link to weights & biases! Might be a solution for me, but I still want to try for a bit to get this to work with ray + tensorboard before switching tools.
Below is an example of how I’m currently doing it. My code creates an additional folder called runs containing everything I am adding to PyTorch’s writer_tb. The runs folder is located inside the folder created by Ray for each trial.
As I am using the function API, the function train_with_tune is what I pass to tune.run(). AFAIK it works both in local and non-local mode.
Let me know if you have questions/comments about this.

from ray import tune
from torch.utils.tensorboard import SummaryWriter

def train_with_tune(config, checkpoint_dir=None):
    # set up everything else for training (model, data, optimizer, ...)
    ...
    # set up logging to Tensorboard
    writer_tb = SummaryWriter(comment='_custom_writer', flush_secs=30)

    # training loop
    for epoch in range(config.get('number_of_epochs')):
        # do training + validation
        ...
        # report to tune
        tune.report(training_loss=train_epoch_loss,
                    reconstruction_loss=loss_dict_epoch.get('reconstruction_loss'),
                    accuracy_loss=loss_dict_epoch.get('accuracy_loss'),
                    anonymization_loss=loss_dict_epoch.get(temp_name),
                    validation_loss=validate_epoch_loss)

        # log to tensorboard
        # scalars
        loss_metric_dict = {'training_loss': train_epoch_loss,
                            'validation_loss': validate_epoch_loss,
                            'reconstruction_loss': loss_dict_epoch.get('reconstruction_loss'),
                            'accuracy_loss': loss_dict_epoch.get('accuracy_loss'),
                            'anonymization_loss': loss_dict_epoch.get(temp_name)}
        writer_tb.add_scalars('loss_values', loss_metric_dict, epoch)
        # weights + biases + their gradients
        for name, values in model.named_parameters():
            writer_tb.add_histogram(name, values, epoch)
            writer_tb.add_histogram(f'{name}.grad', values.grad, epoch)

    # add hparams after last epoch
    writer_tb.add_hparams(hparam_dict=config, metric_dict=loss_metric_dict)
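
For completeness, the corresponding tune.run() call then looks roughly like this (the search space values here are just placeholders, not my actual config):

from ray import tune

analysis = tune.run(
    train_with_tune,
    config={'number_of_epochs': 50,
            'learning_rate': tune.loguniform(1e-4, 1e-1)},
    num_samples=10)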

Thanks for your reply! I see what this does, but if I then use PyTorch’s SummaryWriter (see code excerpts above) I still have the PyTorch-specific problem (also discussed here: [Tensorboard] Problem with subfolders from SummaryWriter · Issue #32651 · pytorch/pytorch · GitHub) that I get multiple runs in Tensorboard, because multiple files are created inside the runs folder.
In my case it looks like this, because I am using add_scalars to group multiple metrics to create a better overview in Tensorboard:
[Screenshot: Tensorboard run list showing multiple runs per trial, created by the add_scalars subfolders]

So based on that, I’d have some specific questions:

  1. Is it possible to use tune.report() to pass hparams and weights + their gradients to the tfevents file created by TBXLogger?
  2. If this is done, will ray write all this information to a single tfevents file? If it is necessary to do this outside of tune.report(), how can this be achieved? (I assume by modifying the functions of the TBXLoggerCallback class, but maybe you could elaborate on how to do this in my case/example case?)
  3. Maybe taking a step back: Do you have another idea how to approach this with ray + maybe another visualization tool?

Thanks for taking a look at this!

Hey @lena-schwert sorry for the slow reply!

One option, instead of using the PyTorch Tensorboard writer, is to just instantiate a TBXLogger in your code. By doing that, you can access TBXLogger._file_writer to dump hparams/gradients into a single tfevents file. In my opinion this is probably the easiest thing to do.
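
A rough, untested sketch of that first option, assuming the Ray 1.x TBXLogger(config, logdir) constructor and that tune.get_trial_dir() is available inside a function-API trainable (model, train_epoch_loss etc. are placeholders from your example):

from ray import tune
from ray.tune.logger import TBXLogger

def train_with_tune(config, checkpoint_dir=None):
    # point the TBXLogger at the trial directory so the extra data lands
    # next to the trial's other output
    logger = TBXLogger(config=config, logdir=tune.get_trial_dir())

    for epoch in range(config.get('number_of_epochs')):
        # ... training + validation, then report scalars to Tune as usual ...
        tune.report(training_loss=train_epoch_loss)

        # _file_writer is a tensorboardX SummaryWriter under the hood,
        # so histograms can be written to it directly
        for name, values in model.named_parameters():
            logger._file_writer.add_histogram(name, values.detach().cpu(), epoch)
            logger._file_writer.add_histogram(f'{name}.grad', values.grad.detach().cpu(), epoch)

    logger.close()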

Another approach as you mentioned is to subclass TBXLoggerCallback. What you’d do here is subclass the Callback and then override log_trial_result:

from typing import Dict
from ray.tune.logger import TBXLoggerCallback

class LenaCallback(TBXLoggerCallback):
    def log_trial_result(self, iteration: int, trial: "Trial", result: Dict):
        # keep the default Tensorboard logging of the reported scalars
        super().log_trial_result(iteration, trial, result)
        # process gradients from the result

Then, in your trainable class you can try passing in weights/gradients via the tune.report() call.
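
To flesh that out, here is a sketch of what the report call plus the callback could look like, assuming TBXLoggerCallback keeps a per-trial tensorboardX writer in self._trial_writer; the "gradients" result key is just a name made up for this sketch:

from typing import Dict
from ray import tune
from ray.tune.logger import TBXLoggerCallback

# inside the trainable: attach detached gradients to the reported result
# ("gradients" is just a key name chosen for this sketch)
tune.report(training_loss=train_epoch_loss,
            gradients={name: p.grad.detach().cpu().numpy()
                       for name, p in model.named_parameters() if p.grad is not None})

class LenaCallback(TBXLoggerCallback):
    def log_trial_result(self, iteration: int, trial: "Trial", result: Dict):
        super().log_trial_result(iteration, trial, result)
        # assumption: TBXLoggerCallback keeps one tensorboardX SummaryWriter
        # per trial in self._trial_writer
        writer = self._trial_writer.get(trial)
        step = result.get("training_iteration", iteration)
        if writer is not None:
            for name, grad in result.get("gradients", {}).items():
                writer.add_histogram(f"{name}.grad", grad, step)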

Hi @rliaw,

If I’m running PPO using tune.run("PPO", config=config), is there any way to access the default TBXLogger._file_writer from within MyCustomCallback.on_trial_result() if MyCustomCallback is a subclass of ray.tune.callback.Callback?

Should I perhaps subclass LoggerCallback or TBXLoggerCallback instead?

In addition to keeping track of hparams, I’m looking to render the last frame of my environment to Tensorboard using tensorboardX’s writer.add_figure() method.

Thanks!

You could try this: [RLlib] Writing to tensorboard during custom evaluation - #3 by RickLan

I used a custom eval function. An image will be written to tensorboard every evaluation_interval.

Thanks for the reply @RickLan, using an eval function seems to be a great idea. I’m currently using the monitor=True flag to generate .mp4 files for eval episodes. Does eval_fn get called every step (like on_episode_step) or at the end of each eval episode (like on_episode_end)?

I think a custom_eval_function replaces the default one. It is called every evaluation_interval training iterations. I believe callbacks like on_episode_step or on_episode_end will still be executed during the custom eval (Line 823). I don’t know where the code for generating the mp4 lives; if it is not called during this process, perhaps copy and paste it into the custom eval. @sven1977?
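
Roughly, the wiring looks like the sketch below, following the pattern of RLlib’s custom_eval example. The get_last_frame() helper is made up — replace it with however you obtain the rendered frame — and the SummaryWriter placement is just my assumption about how to land the image next to the trial’s other tfevents data:

import ray
from ray import tune
from ray.rllib.evaluation.metrics import collect_episodes, summarize_episodes
from tensorboardX import SummaryWriter

def custom_eval_fn(trainer, eval_workers):
    # run one round of evaluation episodes on the remote eval workers
    ray.get([w.sample.remote() for w in eval_workers.remote_workers()])
    episodes, _ = collect_episodes(remote_workers=eval_workers.remote_workers())
    metrics = summarize_episodes(episodes)

    # get_last_frame() is a made-up placeholder for however you obtain the
    # rendered frame (HxWx3 uint8 array); it is not an RLlib API
    frame = get_last_frame()
    # assumption: writing into trainer.logdir puts the image next to the
    # trial's other tfevents data, so Tensorboard shows it under the same run
    writer = SummaryWriter(logdir=trainer.logdir)
    writer.add_image("eval/last_frame", frame, trainer.iteration, dataformats="HWC")
    writer.close()
    return metrics

config = {
    "env": "CartPole-v0",                  # placeholder env; add your other PPO settings
    "evaluation_interval": 5,              # custom_eval_fn runs every 5 training iterations
    "evaluation_num_workers": 1,
    "custom_eval_function": custom_eval_fn,
}
tune.run("PPO", config=config)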
