In Tune,
with open("myfile.txt", "w") as f:
f.write("hello world")
saves myfile.txt relative to the current trial’s logdir. That is, Tune changes the current working directory to the trial’s log directory by default.
However, in Train the above code saves myfile.txt to the $HOME directory, so multiple trials overwrite each other’s files.
To unblock myself, how can I find the current trial’s log directory inside trial_func so that I can change to it manually?
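(For context: in a plain Tune function trainable, the trial’s log directory can be queried directly. A minimal sketch using tune.get_trial_dir() from the Ray 1.x function API; this covers the Tune-only case rather than the Train workers asked about here.)

import os
from ray import tune

def trainable(config):
    # Inside a Tune function trainable, this returns the current trial's logdir.
    trial_dir = tune.get_trial_dir()
    with open(os.path.join(trial_dir, "myfile.txt"), "w") as f:
        f.write("hello world")
    tune.report(metric=1)

tune.run(trainable)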
Hey @Nitin_Pasumarthy, could you explain the use-case a little more? This gets tricky when combining Train and Tune because:
- The distributed Train workers may not exist on the same node as the Trial where the logs are found.
- There may be multiple Train workers on the same node, which could still overwrite each other.
Can whatever you’re trying to accomplish be handled with Callbacks instead?
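(To illustrate the Callback suggestion: a minimal sketch of a Tune Callback that copies an artifact into each trial’s logdir whenever a result is reported. The artifact path and the copy step are assumptions for illustration, not part of this thread.)

import shutil
from ray import tune

class ArtifactCallback(tune.Callback):
    # Runs on the driver process; trial.logdir points at the trial's log directory.
    def on_trial_result(self, iteration, trials, trial, result, **info):
        shutil.copy("/tmp/artifact.bin", trial.logdir)  # hypothetical artifact path

# Usage: tune.run(trainable, callbacks=[ArtifactCallback()])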
That’s a good point. I want to save some artifacts during training, and I have to think about how to achieve this using callbacks. Regardless, why not use the following folder structure when Tune and Train are used together? I think that combination will become more common with bigger models.
# Node 1
tune_lr=0.1/
    tune_searcher_state.json
    tune_exp_state.json
    train_worker1/
        myfile.txt
tune_lr=0.3/
    ...
    train_worker1/
        myfile.txt

# Node 2
tune_lr=0.1/
    ...
    train_worker2/
        myfile.txt
tune_lr=0.3/
    ...
    train_worker2/
        myfile.txt
TL;DR: change each worker’s working directory to be under the corresponding Tune trial folder. That seems like the more natural expectation for an end user coming from Tune’s world.
Can this be achieved with checkpointing (train.save_checkpoint())?
Theoretically, somewhat, based on the documentation. But when I try it, I cannot find a file named train_stuff either on the head node (where the training job was launched from) or on the workers.
def train_fn(config):
    stdout = process.run(custom_model_train())  # saves file1 to disk
    train.save_checkpoint(train_stuff=stdout)
    stdout = process.run(custom_eval())  # uses the local file1 saved to disk above
    train.save_checkpoint(eval_stuff=stdout)

trainer = Trainer(backend="tensorflow", num_workers=2, resources_per_worker={"GPU": 1, "CPU": 8}, use_gpu=True, max_retries=0)
trainable = trainer.to_tune_trainable(train_fn)
Even if the above is resolved: to pipeline my work, I save files to disk from job1 and consume them in a later stage. The entire pipeline runs on the same worker, but in different processes, so if the checkpoints are not on that same worker this approach will fail too.
I can see the checkpoints without Tune, i.e. if I just use Train to train on multiple workers.
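(For the Train-only case, a minimal sketch of how the values passed to train.save_checkpoint() can be read back on the driver via trainer.latest_checkpoint, assuming the Ray 1.x Trainer API.)

from ray.train import Trainer
from ray import train

def train_func():
    train.save_checkpoint(metric=123)

trainer = Trainer(backend="torch", num_workers=2)
trainer.start()
trainer.run(train_func)
# The last checkpoint dict saved by the workers, e.g. containing {"metric": 123}.
print(trainer.latest_checkpoint)
trainer.shutdown()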
Hmm, I tried running the following simple example:
from ray.train import Trainer
from ray import train, tune

def train_func():
    train.save_checkpoint(metric=123)

trainer = Trainer(backend="torch", num_workers=2)
trainable = trainer.to_tune_trainable(train_func)
tune.run(trainable)
And the checkpoint was written to:
~/tune_function_2022-03-21_22-02-05/tune_function_37f72_00000_0_2022-03-21_22-02-05/checkpoint_000000/checkpoint
The file isn’t expected to be named train_stuff here.
Thank you, @matthewdeng. This is helpful.
I ended up wrapping train_func (the first arg of trainer.run) and trainer.run itself to:
- programmatically change the working directory on each worker to trainer.latest_run_dir
- sync any files saved to this directory to some blob storage (to get the functionality of SyncConfig from Tune when using just Train), as sketched below
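(A minimal sketch of that wrapper, under the assumptions that trainer.latest_run_dir is populated by the time the wrapper runs, that the path is valid and writable on every worker node, and that sync_dir_to_blob_storage is a hypothetical helper you would supply.)

import functools
import os

from ray import train

def sync_dir_to_blob_storage(local_dir):
    # Hypothetical helper: upload everything under local_dir to blob storage,
    # standing in for Tune's SyncConfig when using Train on its own.
    ...

def with_run_dir(train_func, run_dir):
    # Wrap the user train function so each worker switches into a per-worker
    # subdirectory of the run dir and syncs whatever it wrote afterwards.
    @functools.wraps(train_func)
    def wrapped(config):
        worker_dir = os.path.join(run_dir, f"worker_{train.world_rank()}")
        os.makedirs(worker_dir, exist_ok=True)
        os.chdir(worker_dir)
        try:
            return train_func(config)
        finally:
            sync_dir_to_blob_storage(worker_dir)
    return wrapped

Depending on the Ray version, latest_run_dir may only exist once run() has created the run directory, which is presumably why trainer.run itself needed wrapping as well.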