Hi Tune community, I am testing out Ray Tune on different classical optimization functions but have been slowed down by the constant checkpointing Ray Tune is doing. Is there a way to turn off checkpointing? Thank you!
Hi @max_ronda, what happens if you set in your CheckpointConfig
`num_to_keep=0`? Like this:
from ray import air, tune

tuner = tune.Tuner(
    my_trainable,
    run_config=air.RunConfig(
        checkpoint_config=air.CheckpointConfig(
            num_to_keep=0,
        )
    ),
)
Hi @Lars_Simon_Zehnder, I tried that. It looks like it still wants to create a log path and then fails.
It throws something like this:
TuneError: Tune run failed. Please use tuner = Tuner.restore("/home/ray_results/problem_my_func_2022-10-10_08-42-30") to resume.
This is my code btw. Very simple setup.
tuner = tune.Tuner(
    my_func,
    tune_config=tune.TuneConfig(
        metric="mean_loss",
        mode="min",
        search_alg=algo,
        num_samples=num_samples,
    ),
    param_space=search_space,
    run_config=air.RunConfig(
        checkpoint_config=air.CheckpointConfig(num_to_keep=0),
        # local_dir=log_dir,
    ),
)
Any other thoughts? It looks like this might be a bug.
@max_ronda, interesting. Could you also try using checkpoint_frequency=0 instead of num_to_keep=0?
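Something along these lines (just a sketch, reusing the my_trainable placeholder from my earlier snippet):

from ray import air, tune

tuner = tune.Tuner(
    my_trainable,  # placeholder trainable from the snippet above
    run_config=air.RunConfig(
        checkpoint_config=air.CheckpointConfig(
            checkpoint_frequency=0,  # 0 disables iteration-based checkpointing
        )
    ),
)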
Hi @Lars_Simon_Zehnder, I tried that, and the framework still went ahead and created a log path under ~/ray_results/. I also tried both together:
checkpoint_config=air.CheckpointConfig(checkpoint_frequency=0, num_to_keep=0)
And that failed, because of num_to_keep=0 I assume.
Any other thoughts? Curious whether that works for you?
Thanks!
@max_ronda I might have been a little imprecise. I meant leaving out the num_to_keep argument and using the checkpoint_frequency argument instead.
But could you check whether there are still checkpoints under the log path? It is normal for Ray to create the log path, as it also writes error files and TensorBoard event files into it.
Hi @Lars_Simon_Zehnder, apologies, I meant to say that I had tried both methods. But now I have tried all permutations:

1. air.CheckpointConfig(num_to_keep=0): This fails. It wants to create a log dir within /ray_results/ but throws errors about wanting to restore from a checkpoint.
2. air.CheckpointConfig(checkpoint_frequency=0): This runs, but still creates log directories and checkpoints for each trial.
3. air.CheckpointConfig(checkpoint_frequency=0, num_to_keep=0): This fails, similar to (1).
4. air.CheckpointConfig(checkpoint_frequency=0, num_to_keep=None): This runs, but still creates all log directories, like (2).
I am curious, has this worked for you? And do you know whether Ray Tune has custom logging and checkpointing extensions?
@max_ronda Thanks for checking this! On my side it works, but could it be that you still have checkpoint_at_end=True set? This attribute defaults to True for Trainers that support it; custom Trainers or training functions might not support it at all.
Ray Tune does have custom logging. Take a look here and see if this helps you.
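For illustration, a minimal custom logger callback could look roughly like this (a sketch; the class name is made up, and callbacks are passed via air.RunConfig(callbacks=[...])):

from ray.tune.logger import LoggerCallback


class StdoutLoggerCallback(LoggerCallback):
    """Hypothetical minimal logger that only prints trial results."""

    def log_trial_result(self, iteration, trial, result):
        print(f"trial {trial.trial_id}: mean_loss={result.get('mean_loss')}")

    def log_trial_end(self, trial, failed=False):
        print(f"trial {trial.trial_id} finished (failed={failed})")


# e.g. run_config=air.RunConfig(callbacks=[StdoutLoggerCallback()])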
@Lars_Simon_Zehnder , I tried what you suggested but still no luck completely ignoring checkpointing. This is the code I am playing with if you’d like to test it out:
from ray import tune
from ray.air import session
from ray import air
import ray
import numpy as np
from ray.tune.search.optuna import OptunaSearch
import optuna


def problem_rosenbrock(config):
    x = config["x"]
    y = config["y"]
    z = (1 - x) ** 2 + 100 * (y - x**2) ** 2
    session.report({"mean_loss": z})


ray.init(num_cpus=5, ignore_reinit_error=True, log_to_driver=False)

log_dir = "./ray_logs"  # placeholder log directory

search_space = {
    "x": tune.uniform(-2, 2),
    "y": tune.uniform(-1, 4),
}

algo = OptunaSearch(mode="min")

tuner = tune.Tuner(
    problem_rosenbrock,
    tune_config=tune.TuneConfig(
        metric="mean_loss",
        mode="min",
        search_alg=algo,
        num_samples=1000,
    ),
    param_space=search_space,
    run_config=air.RunConfig(
        checkpoint_config=air.CheckpointConfig(
            checkpoint_at_end=False,
            checkpoint_frequency=0,
            num_to_keep=1,
        ),
        local_dir=log_dir,
    ),
)
results = tuner.fit()
Running about 1000 trials in native Optuna takes just a few seconds, compared to the ~50 secs it takes within Ray Tune. My goal was to at least match the speed of the native optimizer within Ray Tune, but no luck so far. I attribute it to checkpointing, but is there anything else I should be considering?
Let me know if you have other suggestions. And thanks for pointing me to custom logging!
Hi @max_ronda, non-Ray Tune expert here, but what if you do

checkpoint_config=air.CheckpointConfig(
    checkpoint_at_end=False,
    checkpoint_frequency=0,
    num_to_keep=0,
)
I checked the implementation of CheckpointConfig (config.py - ray-project/ray - Sourcegraph), which says:
checkpoint_at_end: If True, will save a checkpoint at the end of training.
checkpoint_frequency: Number of iterations between checkpoints.
If 0 this will disable checkpointing.
num_to_keep: The number of checkpoints to keep on disk for this run.
If this is ``0`` then no checkpoints will be persisted to disk.
With num_to_keep=1
I think this still implies we expect to have one checkpoint at the end.
@max_ronda I can reproduce your results. Running the experiment directly in optuna takes around 7.5s on my 4 CPUs. Running the same experiment in tune takes around 50.9s.
In addition, I ran an experiment with no logging by setting

import os

os.environ["TUNE_DISABLE_AUTO_CALLBACK_LOGGERS"] = "1"

tuner = tune.Tuner(
    problem_rosenbrock,
    tune_config=tune.TuneConfig(
        metric="mean_loss",
        mode="min",
        search_alg=algo,
        num_samples=1000,
    ),
    param_space=search_space,
    run_config=air.RunConfig(
        checkpoint_config=air.CheckpointConfig(
            checkpoint_at_end=False,
            checkpoint_frequency=0,
            num_to_keep=1,
        ),
        local_dir="log_dir",
        log_to_file=False,
        verbose=1,
    ),
)
and got to around 24.5s (so around half of the time before). This is still significantly slower than in optuna.
We should not forget that ray and its ecosystem are built to run large workloads on any hardware. This of course comes at a price for small workloads on small hardware. The tune developer team has benchmarked scalability and overhead. You can also run the benchmarks yourself on your machine using the files linked there. I ran the bookkeeping test using
ray.init(num_cpus=4)
num_samples = 1000
results_per_second = 1
trial_length_s = 1
max_runtime = 250
It took around 16.6s longer than the theoretical value. Relating back to the Optuna script you posted and I ran: 16.6s + 7.5s brings me close to the 24.5s.
Concluding, I would say the Rosenbrock example is perhaps not one that really pays off on Ray. If it is neural network training or something else with a large workload, you will see the benefits.
@Jiao_Dong Setting num_to_keep=1
in this case does not produce a checkpoint in any trial folder.
Hi,
I think there is some confusion here around the terminology.
The provided training function problem_rosenbrock does not save any checkpoints anywhere. Checkpointing is user-controlled in a training function, so as long as you don't call session.report(metrics, checkpoint=Checkpoint.from_dict({...})) or similar, no checkpoints are saved.
Thus, the checkpoint settings discussed above have no influence on the outcome at all.
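To make that concrete, here is a sketch of the two cases for a function trainable (illustrative only):

from ray.air import session, Checkpoint


def trainable_without_checkpoints(config):
    # Only metrics are reported, so Tune never writes checkpoint files.
    session.report({"mean_loss": 0.0})


def trainable_with_checkpoints(config):
    # Explicitly attaching a checkpoint is what creates checkpoint folders.
    session.report(
        {"mean_loss": 0.0},
        checkpoint=Checkpoint.from_dict({"step": 1}),
    )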
(Just as a side note, the checkpoint_at_end
and checkpoint_frequency
are generally not relevant for function trainables, as the training function itself decides when to store checkpoints. There is no way we can “request” a checkpoint - this only works for class trainables, such as rllib.)
Now back to the problem. I believe what you describe as “checkpointing” is actually just writing the results logs. This can be disabled as @Lars_Simon_Zehnder described. There is also a setting reuse_actors=True
in the TuneConfig, but it is set to True by default for function trainables, so you should already see that speedup.
Another thing you can do is to use session.report({"mean_loss": z, "done": True})
which will avoid one more round trip in tune processing by indicating that the function finished after the first reported result.
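Applied to the Rosenbrock trainable above, that would look like this (sketch):

from ray.air import session


def problem_rosenbrock(config):
    x, y = config["x"], config["y"]
    z = (1 - x) ** 2 + 100 * (y - x**2) ** 2
    # "done": True tells Tune the function is finished after this result,
    # saving one extra round trip per trial.
    session.report({"mean_loss": z, "done": True})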
On my machine, this gives a difference of 9 seconds for Optuna vs. 25 seconds for Ray Tune.
Generally, a pure python implementation to run thousands of functions with a millisecond runtime will always be faster than using Ray Tune. The communication overhead with the Ray actors is just too high for this use case.
On my laptop, running 1000 trials on Optuna takes less than 10 seconds, so each trial runs only for about 100ms. In pure python, we only have the function invocation and some minor bookkeeping. In Ray, we have a stateful actor, receiving communication to start the function, report back the result, wait for the next instruction to run the function again (with a new config), etc, as well as some bookkeeping.
So for tiny functions like this, raw python will likely always outperform Ray Tune.
@kai Thanks for these insights! That has also clarified some more things for me (especially the checkpointing for functions and done in session.report()). So around 16s of overhead appears to be a good rule of thumb for local machines.
Thank you @kai and @Lars_Simon_Zehnder for clarifying terms and the insightful messages! I tested all the suggested settings and these are the results I got:
Running the same experiment:

ray.init(num_cpus=5)
num_samples = 1000
trainable_func = problem_rosenbrock
Test 1 (fastest; full configuration sketched after Test 5)
Setting session.report({"mean_loss": z, "done": True}) within the trainable, plus:
os.environ["TUNE_DISABLE_AUTO_CALLBACK_LOGGERS"] = "1"
reuse_actors=True
log_to_file=False
Time: 22 secs
Test 2
Re-using actors and setting env var
os.environ["TUNE_DISABLE_AUTO_CALLBACK_LOGGERS"] = "1"
reuse_actors=True
log_to_file=False
Time: 28 seconds
Test 3
Logging to file stdout
and stderr
os.environ["TUNE_DISABLE_AUTO_CALLBACK_LOGGERS"] = "1"
reuse_actors=True
log_to_file=True
Time: 32 seconds
Test 4
Not setting env var but re-using actors
reuse_actors=True
log_to_file=False
Time: 42 secs
Test 5 (slowest)
Not re-using actors but setting env var
os.environ["TUNE_DISABLE_AUTO_CALLBACK_LOGGERS"] = "1"
reuse_actors=False
log_to_file=False
Time: ~10 mins
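For reference, the consolidated Test 1 setup looks roughly like this (a sketch; algo and search_space as defined in my earlier snippet):

import os

from ray import air, tune

# Disable Tune's built-in CSV/JSON/TensorBoard loggers.
os.environ["TUNE_DISABLE_AUTO_CALLBACK_LOGGERS"] = "1"

tuner = tune.Tuner(
    problem_rosenbrock,  # reports {"mean_loss": z, "done": True}
    tune_config=tune.TuneConfig(
        metric="mean_loss",
        mode="min",
        search_alg=algo,
        num_samples=1000,
        reuse_actors=True,
    ),
    param_space=search_space,
    run_config=air.RunConfig(
        log_to_file=False,
    ),
)
results = tuner.fit()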
I realize I would not get the same speed as running native Optuna, but I'm happy to see there are settings I can tweak to improve performance.
Questions
- @kai, is there a way to disable all logging completely? Similar to what this user asked.
- Is there a way to customize the trial name and dirname? I see there is a trial_dirname_creator parameter that seems to work with the tune.run() call, but I have not seen an example for the tune.Tuner API. Are there examples of how to pass trial_name_creator and trial_dirname_creator with the current Tuner API?
Again, thanks for all the responses! This has been very helpful.
@max_ronda Happy to help. For the trial_name_creator, take a look at this example.
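Newer Ray releases also expose trial_name_creator and trial_dirname_creator directly on tune.TuneConfig; if your version supports them, a sketch could look like this (the helper below is hypothetical):

from ray import tune


def short_dirname(trial):
    # Hypothetical helper: keep trial directory names short and deterministic.
    return f"trial_{trial.trial_id}"


tune_config = tune.TuneConfig(
    metric="mean_loss",
    mode="min",
    trial_name_creator=lambda trial: f"rosenbrock_{trial.trial_id}",
    trial_dirname_creator=short_dirname,
)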
Hi Team,
I am currently running into the same issue (Ray 2.2.0). I tried configuring the checkpoint configuration as follows.
Ray is initialized as ray.init(address="auto", log_to_driver=False)
I tried the suggestion from @Lars_Simon_Zehnder below,
checkpoint_config = air.CheckpointConfig(
    num_to_keep=1,
    checkpoint_at_end=False,
    checkpoint_frequency=0,
)
The above configuration throws the below error:
Traceback (most recent call last):
File "/home/jobuser/build/lipy-drex/environments/satellites/python/lib/python3.10/site-packages/ray/tune/tuner.py", line 272, in fit
return self._local_tuner.fit()
File "/home/jobuser/build/lipy-drex/environments/satellites/python/lib/python3.10/site-packages/ray/tune/impl/tuner_internal.py", line 420, in fit
analysis = self._fit_internal(trainable, param_space)
File "/home/jobuser/build/lipy-drex/environments/satellites/python/lib/python3.10/site-packages/ray/tune/impl/tuner_internal.py", line 520, in _fit_internal
**self._get_tune_run_arguments(trainable),
File "/home/jobuser/build/lipy-drex/environments/satellites/python/lib/python3.10/site-packages/ray/tune/impl/tuner_internal.py", line 469, in _get_tune_run_arguments
raise ValueError(
ValueError: You passed `checkpoint_at_end=False` to your CheckpointConfig, but this trainer does not support this argument. If the trainer takes in a training loop, you will need to trigger checkpointing yourself using `ray.air.session.report(metrics=..., checkpoint=...)`.
I updated the configuration as follows:
sync_config = tune.SyncConfig(
    upload_dir=hdfs_upload_dir,
    syncer=TunerSync(),
)

checkpoint_config = air.CheckpointConfig(
    num_to_keep=1,
)
This configuration throws the below error and the status of the Ray job is FAILED
. The Ray checkpoints are logged to HDFS.
2023-04-14 23:54:20,679 ERROR checkpoint_manager.py:137 -- The requested checkpoint is not available on this node, most likely because you are using Ray client or disabled checkpoint synchronization. To avoid this, enable checkpoint synchronization to cloud storage by specifying a `SyncConfig`. The checkpoint may be available on a different node - please check this location on worker nodes: ~/ray_results/test_experiment/HorovodTrainer_34373_00003_3_learning_rate=0.0000,num_train_epochs=5,per_device_eval_batch_size=16,per_device_train_batch_size=16,_2023-04-14_23-51-14/checkpoint_-00001
If I comment out the checkpoint_config, it still throws the checkpoint_manager.py:137 error above, but this time the status of the job is SUCCEEDED. The Ray checkpoints are logged to HDFS in this case as well.
Can I please get some guidance on how to turn off Ray checkpointing completely?
Also, if I want to limit the checkpoints to 1: the job fails even though the checkpoints are synced, so how can I get past this?
Thank you for your time.
Regards,
Vivek
Disabling checkpointing (i.e., not writing even to local disk) depends on your training function. Could you share your training function? If you are using the Session API, it is pretty much just not passing a Checkpoint to session.report and only reporting metrics.
To disable syncing (both to head and to cloud), you can specify
sync_config=SyncConfig(syncer=None)
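Placed in the RunConfig, that would look roughly like this (sketch):

from ray import air, tune

run_config = air.RunConfig(
    sync_config=tune.SyncConfig(syncer=None),  # disables syncing to head node and cloud
)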
Thank you for taking a look at my issue.
The training script is as follows:
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    GlueDataset,
    GlueDataTrainingArguments,
    TrainingArguments,
)
from ray.air import session
from ray.tune.examples.pbt_transformers.utils import build_compute_metrics_fn


def train_hf(...):
    # ...
    # Data handling & logic
    # ...
    trainer = Trainer(
        model_init=get_model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        compute_metrics=build_compute_metrics_fn(task_name),
    )
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
    eval_metrics = trainer.evaluate(eval_dataset=eval_dataset, metric_key_prefix="eval")
    trainer.log_metrics("eval", eval_metrics)
    trainer.save_metrics("eval", eval_metrics)
    session.report(eval_metrics, checkpoint=checkpoint)


hvd_trainer = HorovodTrainer(
    train_loop_per_worker=train_hf,
    train_loop_config=train_loop_config,
    scaling_config=scaling_config,
)

tuner = tune.Tuner(
    trainable=hvd_trainer,
    tune_config=tune_config,
    run_config=run_config,
    param_space=param_space,
)
Per your suggestion, I have updated the reporting to session.report(eval_metrics).
- I still see the checkpoint_00001 folder under every Tune trial.
- With or without the below checkpoint config, the checkpoint_manager.py:137 error still persists.
checkpoint_config = air.CheckpointConfig(
    num_to_keep=1,
)
Any pointers will be highly appreciated. Thank you.
Do you want to keep at most one checkpoint at any given time, or do you want to turn off checkpointing completely?
If you just want it off completely, you should probably try (see the sketch below):
- session.report(eval_metrics, checkpoint=checkpoint) → session.report(eval_metrics)
- sync_config=SyncConfig(syncer=None)
- remove the num_to_keep stuff
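Pulled together against the HorovodTrainer setup you posted, a rough sketch (reusing your names; the elided parts stay as in your script):

from ray import air, tune
from ray.air import session


def train_hf(config):
    # ... training / evaluation as in your script ...
    # report metrics only -- no checkpoint attached
    session.report(eval_metrics)


tuner = tune.Tuner(
    trainable=hvd_trainer,
    tune_config=tune_config,
    param_space=param_space,
    run_config=air.RunConfig(
        sync_config=tune.SyncConfig(syncer=None),  # no syncing to head/cloud
        # no CheckpointConfig / num_to_keep needed when nothing is checkpointed
    ),
)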
Got it, thanks for clarifying how to turn off checkpointing. I was trying to understand both configurations, as I have multiple use cases.
If I would like to keep at most one checkpoint, how do I configure this and get past the below error?
2023-04-14 23:54:20,679 ERROR checkpoint_manager.py:137 -- The requested checkpoint is not available on this node, most likely because you are using Ray client or disabled checkpoint synchronization. To avoid this, enable checkpoint synchronization to cloud storage by specifying a `SyncConfig`. The checkpoint may be available on a different node - please check this location on worker nodes: ~/ray_results/test_experiment/HorovodTrainer_34373_00003_3_learning_rate=0.0000,num_train_epochs=5,per_device_eval_batch_size=16,per_device_train_batch_size=16,_2023-04-14_23-51-14/checkpoint_-00001