In Tune,

    with open("myfile.txt", "w") as f:
        f.write("hello world")

saves myfile.txt relative to the current trial’s logdir. That is, Tune changes the current working directory to the trial’s log directory by default.
However, in Train the same code saves myfile.txt to the $HOME directory. Because of this, multiple trials overwrite each other’s files.
To unblock myself: how can I find the current trial’s log directory inside the training function, so I can change to it manually?
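(For the plain Tune case, without Train, the trial’s log directory can be read with tune.get_trial_dir(); below is a minimal sketch, assuming the Ray 1.x Tune function API. The harder question in this thread is getting that path from inside the distributed Train workers.)

    import os
    from ray import tune

    def trainable(config):
        # Tune has already chdir'd into the trial's logdir, so a relative write
        # would land there; get_trial_dir() returns the same path explicitly.
        trial_dir = tune.get_trial_dir()
        with open(os.path.join(trial_dir, "myfile.txt"), "w") as f:
            f.write("hello world")
        tune.report(wrote_file=1)

    tune.run(trainable, config={"lr": tune.grid_search([0.1, 0.3])})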
Hey @Nitin_Pasumarthy, could you explain the use-case a little more? This gets tricky when combining Train and Tune because:
- The distributed Train workers may not exist on the same node as the Trial where the logs are found.
- There may be multiple Train workers on the same node, which could still overwrite each other.
Can whatever you’re trying to accomplish be handled with Callbacks instead?
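As a rough illustration, a callback here could look something like the following, assuming the Ray 1.x Train callback interface (ray.train.callbacks.TrainingCallback) and an arbitrary output file name; it runs on the driver and only sees values the workers pass to train.report(), so it sidesteps the per-worker filesystem question:

    import json
    from ray.train.callbacks import TrainingCallback

    class ArtifactLoggerCallback(TrainingCallback):
        """Collect whatever the workers pass to train.report() and append it
        to a single file on the driver."""

        def __init__(self, output_path="train_artifacts.jsonl"):
            self.output_path = output_path

        def handle_result(self, results, **info):
            # `results` is a list with one dict per worker for the latest
            # train.report() call.
            with open(self.output_path, "a") as f:
                f.write(json.dumps(results) + "\n")

    # Usage (sketch): trainer.run(train_fn, callbacks=[ArtifactLoggerCallback()])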
That’s a good point. I want to save some artifacts during training; I have to think about how to achieve this with callbacks. Regardless, why not use the folder structure below when Tune and Train are used together? I think this combination will become more common with bigger models.
    # Node 1
    tune_lr=0.1
        tune_searcher_state.json
        tune_exp_state.json
        train_worker1
            myfile.txt
    tune_lr=0.3
        ...
        train_worker1
            myfile.txt

    # Node 2
    tune_lr=0.1
        ...
        train_worker2
            myfile.txt
    tune_lr=0.3
        ...
        train_worker2
            myfile.txt
TL;DR: change each worker’s working directory to be under the corresponding Tune trial folder. This seems like the more natural expectation for an end user coming from Tune’s world.
Can this be achieved with checkpointing (train.save_checkpoint())?
Theoretically somewhat, based on the documentation. When I try it, I cannot find a file named train_stuff either on the head node (where the training job was launched from) or on the workers.
    def train_fn(config):
        stdout = process.run(custom_model_train())  # saves file1 to disk
        train.save_checkpoint(train_stuff=stdout)
        stdout = process.run(custom_eval())  # uses local file1 from disk from above
        train.save_checkpoint(eval_stuff=stdout)

    trainer = Trainer(
        backend="tensorflow",
        num_workers=2,
        resources_per_worker={"GPU": 1, "CPU": 8},
        use_gpu=True,
        max_retries=0,
    )
    trainable = trainer.to_tune_trainable(train_fn)
Even if the above is resolved: to pipeline my work, I save files to disk in one stage and consume them in a later stage. The entire pipeline runs on the same worker but in different processes, so if the checkpoints are not on the same worker, this approach will fail too.
I can see the checkpoints without Tune, i.e. if I just use Train to train on multiple workers.
Hmm, I tried running the following simple example:
    from ray.train import Trainer
    from ray import train, tune

    def train_func():
        train.save_checkpoint(metric=123)

    trainer = Trainer(backend="torch", num_workers=2)
    trainable = trainer.to_tune_trainable(train_func)
    tune.run(trainable)
And the checkpoint was written to:

    ~/tune_function_2022-03-21_22-02-05/tune_function_37f72_00000_0_2022-03-21_22-02-05/checkpoint_000000/checkpoint
The file isn’t expected to be named train_stuff here.
Thank you, @matthewdeng. This is helpful.
I ended up wrapping train_func (the first argument of trainer.run) and trainer.run itself to:
- programmatically change the working directory on each worker to trainer.latest_run_dir
- sync any files saved to this directory to some blob storage (to get the functionality of Tune’s SyncConfig when using just Train)

A rough sketch of this wrapping is shown below.
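This sketch is a minimal illustration rather than the exact code: it assumes the Ray 1.x Train API (train.world_rank()), a run directory path that is valid on every worker node, and a hypothetical sync_to_blob() helper; here the run directory is passed in explicitly via config instead of being read from trainer.latest_run_dir.

    import functools
    import os

    from ray import train

    def with_worker_dir(train_func):
        """Wrap a training function so each worker chdirs into its own
        subdirectory of the run directory passed in via config["run_dir"]."""
        @functools.wraps(train_func)
        def wrapped(config):
            worker_dir = os.path.join(
                config["run_dir"], f"train_worker{train.world_rank()}"
            )
            os.makedirs(worker_dir, exist_ok=True)
            os.chdir(worker_dir)  # relative writes like open("myfile.txt") land here
            return train_func(config)
        return wrapped

    def run_and_sync(trainer, train_func, config, run_dir):
        """Wrap trainer.run: chdir workers into per-worker dirs, then sync artifacts."""
        results = trainer.run(
            with_worker_dir(train_func), config={**config, "run_dir": run_dir}
        )
        sync_to_blob(run_dir)  # hypothetical helper: upload artifacts to blob storage
        return results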
Thanks @Nitin_Pasumarthy. You mentioned that this is still a blocker in another thread. Did @matthewdeng’s suggestion of using train.save_checkpoint() not work for you?
It doesn’t, @amogkam. We want to organize the trial folder structure.
    trainable = trainer.to_tune_trainable(...)

    class Trainable2(trainable):
        def setup(self, **kwargs):
            self.config["tune_dir"] = self.logdir
            return super().setup(**kwargs)

    tune.run(Trainable2)
errors with
    2022-06-22 07:31:18,972 ERROR trial_runner.py:920 -- Trial tune_function_4ba98_00000: Error processing event.
    Traceback (most recent call last):
      File "/home/jobuser/.local/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 886, in _process_trial
        results = self.trial_executor.fetch_result(trial)
      File "/home/jobuser/.local/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 675, in fetch_result
        result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
      File "/home/jobuser/.local/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
        return func(*args, **kwargs)
      File "/home/jobuser/.local/lib/python3.7/site-packages/ray/worker.py", line 1763, in get
        raise value.as_instanceof_cause()
    ray.exceptions.RayTaskError(ValueError): ray::DRexTrainable.train() (pid=6894, ip=100.97.90.147, repr=tune_function)
      File "/home/jobuser/.local/lib/python3.7/site-packages/ray/tune/trainable.py", line 319, in train
        result = self.step()
      File "/tmp/ipykernel_6606/3586159017.py", line 33, in step
      File "/tmp/ipykernel_6606/3933378382.py", line 6, in train_fn
      File "/home/jobuser/.local/lib/python3.7/site-packages/ray/train/session.py", line 332, in report
        session = get_session()
      File "/home/jobuser/.local/lib/python3.7/site-packages/ray/train/session.py", line 240, in get_session
        raise ValueError("Trying to access a Train session that has not been "
    ValueError: Trying to access a Train session that has not been initialized yet. Train functions like `train.report()` should only be called from inside the training function.
This approach also doesn’t use the GPUs from all workers/nodes, leaving the problem unsolved.
- Is there any way I can extend this idea further to resolve my issue?
- Is there any API in Train which returns the current run directory? train.latest_run_dir returns None when used with Tune.