Train with Tune doesn't set the right logdir

How severely does this issue affect your experience of using Ray?

  • High: it blocks my work

In Tune,

with open("myfile.txt", "w") as f:
  f.write("hello world")

saves myfile.txt relative to the current trial's logdir. That is, Tune changes the current working directory to the trial's log directory by default.

However, in Train the same code saves myfile.txt to the $HOME directory. Because of this, multiple trials overwrite each other's files.

To unblock myself, how can I find the current trial's log directory inside trial_func so I can change to it manually?
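
For illustration, the kind of workaround I have in mind looks roughly like the following. Here "output_dir" is a hypothetical config key I would pass in myself, not something Train provides:

import os
from ray import train

def train_fn(config):
    # "output_dir" is a hypothetical key passed in via config; fall back to
    # the current working directory if it isn't set.
    base_dir = config.get("output_dir", os.getcwd())
    # Give each Train worker its own subdirectory so that workers landing on
    # the same node don't overwrite each other's files.
    worker_dir = os.path.join(base_dir, f"train_worker{train.world_rank()}")
    os.makedirs(worker_dir, exist_ok=True)
    with open(os.path.join(worker_dir, "myfile.txt"), "w") as f:
        f.write("hello world")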

Hey @Nitin_Pasumarthy, could you explain the use-case a little more? This gets tricky when combining Train and Tune because:

  1. The distributed Train workers may not exist on the same node as the Trial where the logs are found.
  2. There may be multiple Train workers on the same node, which could still overwrite each other.

Can whatever you’re trying to accomplish be handled with Callbacks instead?
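
For example, here is a rough sketch of a Tune Callback. It runs on the driver and only sees what the trials report back, so the artifact handling below is just a placeholder:

import json
import os
from ray import tune

class ArtifactCallback(tune.Callback):
    """Sketch: persist each reported result into the trial's logdir."""

    def on_trial_result(self, iteration, trials, trial, result, **info):
        # Runs on the driver; trial.logdir is the per-trial directory that
        # Tune manages, so files written here won't collide across trials.
        path = os.path.join(trial.logdir, f"result_{iteration}.json")
        with open(path, "w") as f:
            json.dump({k: v for k, v in result.items()
                       if isinstance(v, (int, float, str))}, f)

# tune.run(trainable, callbacks=[ArtifactCallback()])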

That’s a good point. I want to save some artifacts during training; I'll have to think about how to achieve this using callbacks. Regardless, why not use a folder structure like the one below when Tune and Train are used together? I think this combination will become more common with bigger models.

# Node 1
tune_lr=0.1
  tune_searcher_state.json
  tune_exp_state.json
  train_worker1
    myfile.txt
tune_lr=0.3
  ...
  train_worker1
    myfile.txt

# Node 2
tune_lr=0.1
  ...
  train_worker2
    myfile.txt
tune_lr=0.3
  ...
  train_worker2
    myfile.txt

TL;DR: change each worker's working directory to sit under the corresponding Tune trial folder. This seems like the more natural expectation for an end user coming from Tune's world.

Can this be achieved with checkpointing (train.save_checkpoint())?

Theoretically somewhat, based on the documentation. When I try it, I cannot find a file named train_stuff either on the head node (where the training job was launched from) or on the workers.

def train_fn(config):
    stdout = process.run(custom_model_train())  # user-defined step; saves file1 to disk
    train.save_checkpoint(train_stuff=stdout)
    stdout = process.run(custom_eval())  # user-defined step; reads the local file1 saved above
    train.save_checkpoint(eval_stuff=stdout)

trainer = Trainer(backend="tensorflow", num_workers=2,
                  resources_per_worker={"GPU": 1, "CPU": 8},
                  use_gpu=True, max_retries=0)
trainable = trainer.to_tune_trainable(train_fn)

Even if the above is resolved: to pipeline my work, I save files to disk in job1 and consume them in a later stage. The entire pipeline runs on the same worker, but in different processes, so if the checkpoints don't end up on that same worker, this approach will fail too.

I can see the checkpoints without Tune, i.e. if I just use Train to train on multiple workers.
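
For reference, the Train-only version looks roughly like this (the checkpoint contents are placeholders):

from ray import train
from ray.train import Trainer

def train_func():
    train.save_checkpoint(epoch=0, metric=123)

trainer = Trainer(backend="torch", num_workers=2)
trainer.start()
trainer.run(train_func)
trainer.shutdown()

# Without Tune, the run directory (and the checkpoints written under it) is
# easy to locate from the driver:
print(trainer.latest_run_dir)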

Hmm, I tried running the following simple example:

from ray.train import Trainer
from ray import train, tune

def train_func():
    train.save_checkpoint(metric=123)

trainer = Trainer(backend="torch", num_workers=2)
trainable = trainer.to_tune_trainable(train_func)

tune.run(trainable)

And the checkpoint was written to:
~/tune_function_2022-03-21_22-02-05/tune_function_37f72_00000_0_2022-03-21_22-02-05/checkpoint_000000/checkpoint

The file isn’t expected to be named train_stuff here.

Thank you, @matthewdeng. This is helpful.

I ended up wrapping train_func (the first argument of trainer.run) and trainer.run itself to:

  1. programmatically change the working directory on each worker to trainer.latest_run_dir, and
  2. sync any files saved to this directory to blob storage (to get the functionality of Tune's SyncConfig when using just Train); a rough sketch follows.
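
The upload_dir_to_blob_storage call below is a placeholder for my own sync helper, and trainer.latest_run_dir is only resolvable on the driver once the run has started, which is why trainer.run itself is wrapped as well:

import functools
import os
from ray import train

def with_workdir(train_func, run_dir):
    """Sketch: make each Train worker write under run_dir instead of $HOME."""
    @functools.wraps(train_func)
    def wrapped(config):
        worker_dir = os.path.join(run_dir, f"train_worker{train.world_rank()}")
        os.makedirs(worker_dir, exist_ok=True)
        os.chdir(worker_dir)  # relative writes now land under the run dir
        result = train_func(config)
        upload_dir_to_blob_storage(worker_dir)  # placeholder sync helper
        return result
    return wrapped

# Driver side (simplified):
#   trainer.start()
#   trainer.run(with_workdir(train_fn, trainer.latest_run_dir), config={...})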