How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
I use Tune for two things:
- Tuning models (Validation-set and K-fold approach)
- Running N models to measure mean and std performance.
In the second case, I want to see the model predictions from all N models, not just the test metric.
How do I fetch those predictions?
The Trainable framework supports step() and save_checkpoint(), but neither really fits the use case. I calculate metrics after each step, and also on the final step.
At the moment, I’ve resorted to weird workarounds like compressing the predictions DataFrame into a string and reporting it as a “metric”, but this honestly seems like a broken approach.
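Concretely, the workaround looks roughly like this (a sketch only; predictions and val_loss are placeholder names, and this runs inside step()):

    # Smuggle the predictions out of the trial as a fake "metric"
    predictions_as_string = predictions.to_json()
    return {"val_loss": val_loss, "predictions_json": predictions_as_string}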
You’re correct: a “metric” should be a small, lightweight summary value of your training, such as the training/validation loss.
You can save larger outputs like predictions as generic trial artifacts by writing to the current working directory inside a Tune Trainable.
For example:
import os

from ray import air, tune

class MyTrainable(tune.Trainable):
    def step(self):
        # Do training, then generate predictions
        # predictions = ...
        iteration = self.training_iteration
        with open(f"./predictions_for_iter={iteration}.pt", "wb") as f:
            # Dump your pandas DF or save in whatever way you want!
            # torch.save(predictions, f)
            pass
        # step() must still return a dict of small summary metrics
        return {"my_metric": 1.0}

tuner = tune.Tuner(
    MyTrainable,
    # Stop after a few iterations so the example terminates
    run_config=air.RunConfig(stop={"training_iteration": 3}),
)
results = tuner.fit()

for result in results:
    # All your artifacts are saved relative to the trial directory
    print("Trial directory:", result.log_dir)  # On ray<=2.4
    print(os.listdir(result.log_dir))  # All of your artifacts should be here
    # print("Trial directory:", result.path)  # On ray>2.4 (on nightly as of 6/5)
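To tie this back to the N-models use case, here is a rough sketch of reading those artifacts back out of each trial directory and aggregating them. It assumes each predictions file is a pandas DataFrame saved with torch.save (as in the commented-out line above) and that all DataFrames share the same index:

    import os

    import pandas as pd
    import torch

    all_predictions = []
    for result in results:
        trial_dir = result.log_dir  # result.path on ray>2.4
        for fname in os.listdir(trial_dir):
            if fname.startswith("predictions_for_iter="):
                # Load one trial's predictions DataFrame back from its artifact file
                with open(os.path.join(trial_dir, fname), "rb") as f:
                    all_predictions.append(torch.load(f))

    # Per-sample mean/std of predictions across all N trials
    stacked = pd.concat(all_predictions)
    print(stacked.groupby(level=0).agg(["mean", "std"]))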
This user guide section may also be helpful. It goes over how to get data out of Tune in the form of an AIR checkpoint: Getting Data in and out of Tune — Ray 2.4.0
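For completeness, here is a minimal sketch of that checkpoint-based route, assuming the class Trainable API and Ray 2.4-era AIR config objects (the file name predictions.pkl and the metric name are placeholders):

    import os
    import pickle

    from ray import air, tune

    class CheckpointingTrainable(tune.Trainable):
        def step(self):
            # self.predictions = ...  # produce predictions during training
            return {"my_metric": 1.0}

        def save_checkpoint(self, checkpoint_dir):
            # Put the predictions inside the checkpoint directory
            with open(os.path.join(checkpoint_dir, "predictions.pkl"), "wb") as f:
                pickle.dump(getattr(self, "predictions", None), f)
            return checkpoint_dir

    tuner = tune.Tuner(
        CheckpointingTrainable,
        run_config=air.RunConfig(
            stop={"training_iteration": 1},
            checkpoint_config=air.CheckpointConfig(checkpoint_at_end=True),
        ),
    )
    results = tuner.fit()

    for result in results:
        # result.checkpoint is an AIR Checkpoint pointing at the saved directory
        with result.checkpoint.as_directory() as checkpoint_dir:
            with open(os.path.join(checkpoint_dir, "predictions.pkl"), "rb") as f:
                predictions = pickle.load(f)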