Train with Tune doesn't set the right logdir

How severely does this issue affect your experience of using Ray?

  • High: it blocks my work

In Tune,

with open("myfile.txt", "w") as f:
  f.write("hello world")

saves myfile.txt relative to the current trial’s logdir. That is, Tune changes the current working directory to the trial’s log directory by default.
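For example, a minimal Tune setup like this (the trainable and the report call are just for illustration) ends up with one myfile.txt per trial directory:

from ray import tune

def trainable(config):
    # Tune has already changed the working directory to this trial's logdir, so
    # the relative path below resolves to <experiment_dir>/<trial_dir>/myfile.txt.
    with open("myfile.txt", "w") as f:
        f.write("hello world")
    tune.report(done=1)

tune.run(trainable, num_samples=2)  # each trial writes its own copy of myfile.txt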

However, in Train the same code saves myfile.txt to the $HOME directory. Because of this, multiple trials overwrite each other's files.

To unblock myself, how can I get the current trial’s log directory inside trial_func so I can change to it manually?

Hey @Nitin_Pasumarthy, could you explain the use-case a little more? This gets tricky when combining Train and Tune because:

  1. The distributed Train workers may not exist on the same node as the Trial where the logs are found.
  2. There may be multiple Train workers on the same node, which could still overwrite each other.

Can whatever you’re trying to accomplish be handled with Callbacks instead?
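For example, something along these lines (ArtifactCallback is just an illustrative name, and I'm assuming the TrainingCallback interface from Ray Train 1.x, so adjust to your version):

import json
import os

from ray.train import TrainingCallback

class ArtifactCallback(TrainingCallback):
    # The callback runs on the driver, so all artifacts end up in one place under the run's logdir.
    def start_training(self, logdir, **info):
        self._logdir = logdir

    def handle_result(self, results, **info):
        # `results` is a list with one dict per worker, populated by train.report().
        with open(os.path.join(self._logdir, "artifacts.jsonl"), "a") as f:
            for worker_result in results:
                f.write(json.dumps(worker_result) + "\n")

# Passed in as: trainer.run(train_func, callbacks=[ArtifactCallback()])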

That’s a good point. I want to save some artifacts during training, and I'll have to think about how to achieve this using callbacks. Regardless, why not use a folder structure like the one below when Tune and Train are used together, which I think will become more common with bigger models?

# Node 1
tune_lr=0.1
  tune_searcher_state.json
  tune_exp_state.json
  train_worker1
    myfile.txt
tune_lr=0.3
  ...
  train_worker1
    myfile.txt

# Node 2
tune_lr=0.1
  ...
  train_worker2
    myfile.txt
tune_lr=0.3
  ...
  train_worker2
    myfile.txt

TL;DR: change each worker's working directory to be under the corresponding Tune trial's folder. This seems like the more natural expectation for an end user coming from Tune's world.

Can this be achieved with checkpointing (train.save_checkpoint())?

Theoretically, somewhat, based on the documentation. When I try it, I cannot find a file named train_stuff either on the head node (where the training job was launched from) or on the workers.

from ray import train
from ray.train import Trainer

def train_fn(config):
    stdout = process.run(custom_model_train())  # saves file1 to disk
    train.save_checkpoint(train_stuff=stdout)
    stdout = process.run(custom_eval())  # uses the local file1 saved to disk above
    train.save_checkpoint(eval_stuff=stdout)

trainer = Trainer(backend="tensorflow", num_workers=2, resources_per_worker={"GPU": 1, "CPU": 8}, use_gpu=True, max_retries=0)
trainable = trainer.to_tune_trainable(train_fn)

Even if the above is resolved, to pipeline my work I save files to disk in one stage (job1) and consume them in a later stage. The entire pipeline runs on the same worker but in different processes, so if the checkpoints don't end up on that same worker, this approach will fail too.

I can see the checkpoints without Tune, i.e. if I just use Train to train on multiple workers.
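For reference, this is roughly the Train-only flow I mean (a minimal sketch against the Ray Train 1.x Trainer API):

from ray import train
from ray.train import Trainer

def train_func():
    train.save_checkpoint(metric=123)

trainer = Trainer(backend="torch", num_workers=2)
trainer.start()
trainer.run(train_func)
trainer.shutdown()

# The checkpoint shows up under the Trainer's own run directory on the driver node.
print(trainer.latest_run_dir)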

Hmm, I tried running the following simple example:

from ray.train import Trainer
from ray import train, tune

def train_func():
    train.save_checkpoint(metric=123)

trainer = Trainer(backend="torch", num_workers=2)
trainable = trainer.to_tune_trainable(train_func)

tune.run(trainable)

And the checkpoint was written to:
~/tune_function_2022-03-21_22-02-05/tune_function_37f72_00000_0_2022-03-21_22-02-05/checkpoint_000000/checkpoint

Note that the file isn’t expected to be named train_stuff here; the keyword arguments you pass to train.save_checkpoint() become keys inside the checkpoint data, not file names.
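If you want to read those values back, something like this should work inside the training function (a sketch, using the Ray Train 1.x API):

from ray import train

def train_func():
    # Returns the checkpoint dict this run was started from (e.g. when Tune restores
    # the trial or a checkpoint is passed to trainer.run), or None if there is none.
    checkpoint = train.load_checkpoint() or {}
    print(checkpoint.get("metric"))  # 123 when resuming from the checkpoint saved above
    train.save_checkpoint(metric=123)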

Thank you, @matthewdeng. This is helpful.

I ended up wrapping train_func (the first arg of trainer.run) and trainer.run itself to

  1. programmatically change the working directory on each worker to trainer.latest_run_dir, and
  2. sync any files saved to this directory to blob storage (to get the functionality of Tune’s SyncConfig when using just Train), as sketched below.
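This is roughly the shape of it (not my exact code; with_workdir, run_and_sync, and sync_to_blob are made-up names, and the run directory is passed to the workers explicitly through config since remote workers cannot see the driver-side Trainer object):

import functools
import os

from ray import train
from ray.train import Trainer

def sync_to_blob(local_dir):
    """Hypothetical helper: upload everything under local_dir to blob storage."""
    ...

def with_workdir(train_func):
    @functools.wraps(train_func)
    def wrapped(config):
        # Give each worker its own subdirectory so workers don't overwrite each other.
        workdir = os.path.join(config["run_dir"], f"worker_{train.world_rank()}")
        os.makedirs(workdir, exist_ok=True)
        os.chdir(workdir)  # relative writes inside train_func now land here
        result = train_func(config)
        sync_to_blob(workdir)  # upload this worker's artifacts
        return result
    return wrapped

def run_and_sync(trainer: Trainer, train_func, config=None, run_dir=None):
    config = dict(config or {})
    config["run_dir"] = run_dir or os.path.expanduser("~/ray_results/my_run")
    trainer.start()
    results = trainer.run(with_workdir(train_func), config=config)
    trainer.shutdown()
    return results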

Thanks @Nitin_Pasumarthy. You mentioned in another thread that this is still a blocker. Did @matthewdeng’s suggestion of using train.save_checkpoint() not work for you?

It doesn’t, @amogkam. We want to organize the trial folder structure.

trainable = trainer.to_tune_trainable...

class Trainable2(trainable):
    def setup(self, **kwargs):
        self.config["tune_dir"] = self.logdir
        return super().setup(**kwargs)

tune.run(Trainable2)

errors with

2022-06-22 07:31:18,972	ERROR trial_runner.py:920 -- Trial tune_function_4ba98_00000: Error processing event.
Traceback (most recent call last):
  File "/home/jobuser/.local/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 886, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/jobuser/.local/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 675, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/jobuser/.local/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/jobuser/.local/lib/python3.7/site-packages/ray/worker.py", line 1763, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(ValueError): ray::DRexTrainable.train() (pid=6894, ip=100.97.90.147, repr=tune_function)
  File "/home/jobuser/.local/lib/python3.7/site-packages/ray/tune/trainable.py", line 319, in train
    result = self.step()
  File "/tmp/ipykernel_6606/3586159017.py", line 33, in step
  File "/tmp/ipykernel_6606/3933378382.py", line 6, in train_fn
  File "/home/jobuser/.local/lib/python3.7/site-packages/ray/train/session.py", line 332, in report
    session = get_session()
  File "/home/jobuser/.local/lib/python3.7/site-packages/ray/train/session.py", line 240, in get_session
    raise ValueError("Trying to access a Train session that has not been "
ValueError: Trying to access a Train session that has not been initialized yet. Train functions like `train.report()` should only be called from inside the training function.

This approach also doesn't use the GPUs from all workers/nodes, leaving the problem unsolved.

  1. Is there any way I can extend this idea further to resolve my issue?
  2. Is there any API in Train that returns the current run directory? train.latest_run_dir returns None when used with Tune.