In Tune,

    with open("myfile.txt", "w") as f:
        f.write("hello world")

saves myfile.txt relative to the current trial’s logdir. That is, Tune changes the current working directory to the trial’s log directory by default.
However, in Train the same code saves myfile.txt to the $HOME directory. Because of this, multiple trials overwrite each other’s files.
To unblock myself: how can I find the current trial’s log directory inside the training function, so I can change to it manually?
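(For the plain Tune case, without Train, the trial’s log directory can be read with tune.get_trial_dir(); below is a minimal sketch, assuming the Ray 1.x Tune function API. The harder question in this thread is getting that path from inside the distributed Train workers.)

    import os
    from ray import tune

    def trainable(config):
        # Tune has already chdir'd into the trial's logdir, so a relative write
        # would land there; get_trial_dir() returns the same path explicitly.
        trial_dir = tune.get_trial_dir()
        with open(os.path.join(trial_dir, "myfile.txt"), "w") as f:
            f.write("hello world")
        tune.report(wrote_file=1)

    tune.run(trainable, config={"lr": tune.grid_search([0.1, 0.3])})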
Hey @Nitin_Pasumarthy, could you explain the use-case a little more? This gets tricky when combining Train and Tune because:
- The distributed Train workers may not exist on the same node as the Trial where the logs are found.
- There may be multiple Train workers on the same node, which could still overwrite each other.
Can whatever you’re trying to accomplish be handled with Callbacks instead?
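As a rough illustration, a callback here could look something like the following, assuming the Ray 1.x Train callback interface (ray.train.callbacks.TrainingCallback) and an arbitrary output file name; it runs on the driver and only sees values the workers pass to train.report(), so it sidesteps the per-worker filesystem question:

    import json
    from ray.train.callbacks import TrainingCallback

    class ArtifactLoggerCallback(TrainingCallback):
        """Collect whatever the workers pass to train.report() and append it
        to a single file on the driver."""

        def __init__(self, output_path="train_artifacts.jsonl"):
            self.output_path = output_path

        def handle_result(self, results, **info):
            # `results` is a list with one dict per worker for the latest
            # train.report() call.
            with open(self.output_path, "a") as f:
                f.write(json.dumps(results) + "\n")

    # Usage (sketch): trainer.run(train_fn, callbacks=[ArtifactLoggerCallback()])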
That’s a good point. I want to save some artifacts during training; I have to think about how to achieve this with callbacks. Regardless, why not use the folder structure below when Tune and Train are used together? I think this combination will become more common with bigger models.
    # Node 1
    tune_lr=0.1
        tune_searcher_state.json
        tune_exp_state.json
        train_worker1
            myfile.txt
    tune_lr=0.3
        ...
        train_worker1
            myfile.txt

    # Node 2
    tune_lr=0.1
        ...
        train_worker2
            myfile.txt
    tune_lr=0.3
        ...
        train_worker2
            myfile.txt
TL;DR: change each worker’s working directory to be under the corresponding Tune trial folder. This seems like the more natural expectation for an end user coming from Tune’s world.
Can this be achieved with checkpointing (train.save_checkpoint())?
Theoretically somewhat, based on the documentation. When I try it, I cannot find a file named train_stuff either on the head node (where the training job was launched from) or on the workers.
    def train_fn(config):
        stdout = process.run(custom_model_train())  # saves file1 to disk
        train.save_checkpoint(train_stuff=stdout)
        stdout = process.run(custom_eval())  # uses local file1 from disk from above
        train.save_checkpoint(eval_stuff=stdout)

    trainer = Trainer(
        backend="tensorflow",
        num_workers=2,
        resources_per_worker={"GPU": 1, "CPU": 8},
        use_gpu=True,
        max_retries=0,
    )
    trainable = trainer.to_tune_trainable(train_fn)
Even if the above is resolved: to pipeline my work, I save files to disk in one stage and consume them in a later stage. The entire pipeline runs on the same worker but in different processes, so if the checkpoints are not on the same worker, this approach will fail too.
I can see the checkpoints without Tune, i.e. if I just use Train to train on multiple workers.
Hmm, I tried running the following simple example:
    from ray.train import Trainer
    from ray import train, tune

    def train_func():
        train.save_checkpoint(metric=123)

    trainer = Trainer(backend="torch", num_workers=2)
    trainable = trainer.to_tune_trainable(train_func)
    tune.run(trainable)
And the checkpoint was written to:

    ~/tune_function_2022-03-21_22-02-05/tune_function_37f72_00000_0_2022-03-21_22-02-05/checkpoint_000000/checkpoint
The file isn’t expected to be named train_stuff here.
Thank you, @matthewdeng. This is helpful.
I ended up wrapping train_func (the first argument of trainer.run) and trainer.run itself to:
- programmatically change the working directory on each worker to trainer.latest_run_dir
- sync any files saved to this directory to some blob storage (to get the functionality of Tune’s SyncConfig when using just Train)

A rough sketch of this wrapping is shown below.
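This sketch is a minimal illustration rather than the exact code: it assumes the Ray 1.x Train API (train.world_rank()), a run directory path that is valid on every worker node, and a hypothetical sync_to_blob() helper; here the run directory is passed in explicitly via config instead of being read from trainer.latest_run_dir.

    import functools
    import os

    from ray import train

    def with_worker_dir(train_func):
        """Wrap a training function so each worker chdirs into its own
        subdirectory of the run directory passed in via config["run_dir"]."""
        @functools.wraps(train_func)
        def wrapped(config):
            worker_dir = os.path.join(
                config["run_dir"], f"train_worker{train.world_rank()}"
            )
            os.makedirs(worker_dir, exist_ok=True)
            os.chdir(worker_dir)  # relative writes like open("myfile.txt") land here
            return train_func(config)
        return wrapped

    def run_and_sync(trainer, train_func, config, run_dir):
        """Wrap trainer.run: chdir workers into per-worker dirs, then sync artifacts."""
        results = trainer.run(
            with_worker_dir(train_func), config={**config, "run_dir": run_dir}
        )
        sync_to_blob(run_dir)  # hypothetical helper: upload artifacts to blob storage
        return results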
Thanks @Nitin_Pasumarthy. You mentioned that this is still a blocker in another thread. Did @matthewdeng’s suggestion of using train.save_checkpoint() not work for you?
It doesn’t, @amogkam. We want to organize the trial folder structure.
    trainable = trainer.to_tune_trainable(...)

    class Trainable2(trainable):
        def setup(self, **kwargs):
            self.config["tune_dir"] = self.logdir
            return super().setup(**kwargs)

    tune.run(Trainable2)
errors with
    2022-06-22 07:31:18,972 ERROR trial_runner.py:920 -- Trial tune_function_4ba98_00000: Error processing event.
    Traceback (most recent call last):
      File "/home/jobuser/.local/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 886, in _process_trial
        results = self.trial_executor.fetch_result(trial)
      File "/home/jobuser/.local/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 675, in fetch_result
        result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
      File "/home/jobuser/.local/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
        return func(*args, **kwargs)
      File "/home/jobuser/.local/lib/python3.7/site-packages/ray/worker.py", line 1763, in get
        raise value.as_instanceof_cause()
    ray.exceptions.RayTaskError(ValueError): ray::DRexTrainable.train() (pid=6894, ip=100.97.90.147, repr=tune_function)
      File "/home/jobuser/.local/lib/python3.7/site-packages/ray/tune/trainable.py", line 319, in train
        result = self.step()
      File "/tmp/ipykernel_6606/3586159017.py", line 33, in step
      File "/tmp/ipykernel_6606/3933378382.py", line 6, in train_fn
      File "/home/jobuser/.local/lib/python3.7/site-packages/ray/train/session.py", line 332, in report
        session = get_session()
      File "/home/jobuser/.local/lib/python3.7/site-packages/ray/train/session.py", line 240, in get_session
        raise ValueError("Trying to access a Train session that has not been "
    ValueError: Trying to access a Train session that has not been initialized yet. Train functions like `train.report()` should only be called from inside the training function.
This approach also doesn’t use the GPUs from all workers/nodes, leaving the problem unsolved.
- Is there any way I can extend this idea further to resolve my issue?
- Is there any API in Train which returns the current run directory? train.latest_run_dir returns None when used with Tune.