Hi Tune community, I am testing out Ray Tune on different classical optimization functions but have been slowed down by the constant checkpointing Ray Tune is doing. Is there a way to turn off checkpointing? Thank you!
Hi @max_ronda, what happens if you set in your CheckpointConfig
`num_to_keep=0`? Like this:
from ray import air, tune

tuner = tune.Tuner(
    my_trainable,
    run_config=air.RunConfig(
        checkpoint_config=air.CheckpointConfig(
            num_to_keep=0,
        )
    ),
)
Hi @Lars_Simon_Zehnder, I tried that. It looks like it still wants to create a log path and then fails.
It throws something like this:
TuneError: Tune run failed. Please use tuner = Tuner.restore("/home/ray_results/problem_my_func_2022-10-10_08-42-30") to resume.
This is my code btw. Very simple setup.
tuner = tune.Tuner(
    my_func,
    tune_config=tune.TuneConfig(
        metric="mean_loss",
        mode="min",
        search_alg=algo,
        num_samples=num_samples,
    ),
    param_space=search_space,
    run_config=air.RunConfig(
        checkpoint_config=air.CheckpointConfig(num_to_keep=0),
        # local_dir=log_dir,
    ),
)
Any other thoughts? It looks like this might be a bug.
@max_ronda, interesting. Could you also try using checkpoint_frequency=0 instead of num_to_keep=0?
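Something along these lines (just a sketch, reusing the my_trainable placeholder from my earlier snippet):

from ray import air, tune

tuner = tune.Tuner(
    my_trainable,  # placeholder trainable from the snippet above
    run_config=air.RunConfig(
        checkpoint_config=air.CheckpointConfig(
            checkpoint_frequency=0,  # 0 disables iteration-based checkpointing
        )
    ),
)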
Hi @Lars_Simon_Zehnder, I tried that, and the framework still went ahead and created a log path under ~/ray_results/. I also tried both together:
checkpoint_config=air.CheckpointConfig(checkpoint_frequency=0, num_to_keep=0)
And that failed, because of num_to_keep=0 I assume.
Any other thoughts? Curious whether that works for you?
Thanks!
@max_ronda I might have been a little imprecise. I meant leaving out the num_to_keep argument and using the checkpoint_frequency argument instead.
But could you check whether there are still checkpoints under the log path? It is normal for Ray to create the log path, as it also writes error files and TensorBoard event files into it.
Hi @Lars_Simon_Zehnder, apologies, I meant to say that I had tried both methods. But now I have tried all permutations:

1. air.CheckpointConfig(num_to_keep=0): This fails. It wants to create a log dir within /ray_results/ but throws errors about wanting to restore from a checkpoint.
2. air.CheckpointConfig(checkpoint_frequency=0): This runs, but still creates log directories and checkpoints for each trial.
3. air.CheckpointConfig(checkpoint_frequency=0, num_to_keep=0): This fails, similar to (1).
4. air.CheckpointConfig(checkpoint_frequency=0, num_to_keep=None): This runs, but still creates all log directories, like (2).
I am curious, has this worked for you? And do you know whether Ray Tune has custom logging and checkpointing extensions?
@max_ronda Thanks for checking this! On my side it works, but could it be that you still have checkpoint_at_end=True set? This attribute defaults to True for Trainers that support it; custom Trainers or training functions might not support it at all.
Ray Tune does have custom logging. Take a look here and see if this helps you.
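For illustration, a minimal custom logger callback could look roughly like this (a sketch; the class name is made up, and callbacks are passed via air.RunConfig(callbacks=[...])):

from ray.tune.logger import LoggerCallback


class StdoutLoggerCallback(LoggerCallback):
    """Hypothetical minimal logger that only prints trial results."""

    def log_trial_result(self, iteration, trial, result):
        print(f"trial {trial.trial_id}: mean_loss={result.get('mean_loss')}")

    def log_trial_end(self, trial, failed=False):
        print(f"trial {trial.trial_id} finished (failed={failed})")


# e.g. run_config=air.RunConfig(callbacks=[StdoutLoggerCallback()])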
@Lars_Simon_Zehnder , I tried what you suggested but still no luck completely ignoring checkpointing. This is the code I am playing with if you’d like to test it out:
from ray import tune
from ray.air import session
from ray import air
import ray
import numpy as np
from ray.tune.search.optuna import OptunaSearch
import optuna


def problem_rosenbrock(config):
    x = config["x"]
    y = config["y"]
    z = (1 - x) ** 2 + 100 * (y - x**2) ** 2
    session.report({"mean_loss": z})


ray.init(num_cpus=5, ignore_reinit_error=True, log_to_driver=False)

log_dir = "./ray_logs"  # placeholder log directory

search_space = {
    "x": tune.uniform(-2, 2),
    "y": tune.uniform(-1, 4),
}

algo = OptunaSearch(mode="min")

tuner = tune.Tuner(
    problem_rosenbrock,
    tune_config=tune.TuneConfig(
        metric="mean_loss",
        mode="min",
        search_alg=algo,
        num_samples=1000,
    ),
    param_space=search_space,
    run_config=air.RunConfig(
        checkpoint_config=air.CheckpointConfig(
            checkpoint_at_end=False,
            checkpoint_frequency=0,
            num_to_keep=1,
        ),
        local_dir=log_dir,
    ),
)
results = tuner.fit()
Running about 1000 trials in native Optuna takes just a few seconds, compared to the ~50 secs it takes within Ray Tune. My goal was to at least match the speed of the native optimizer within Ray Tune, but no luck so far. I attribute it to checkpointing, but is there anything else I should be considering?
Let me know if you have other suggestions. And thanks for pointing me to custom logging!
Hi @max_ronda, non-Ray Tune expert here, but what if you do

checkpoint_config=air.CheckpointConfig(
    checkpoint_at_end=False,
    checkpoint_frequency=0,
    num_to_keep=0,
)
I checked the implementation of CheckpointConfig (config.py - ray-project/ray - Sourcegraph), which says:
checkpoint_at_end: If True, will save a checkpoint at the end of training.
checkpoint_frequency: Number of iterations between checkpoints.
If 0 this will disable checkpointing.
num_to_keep: The number of checkpoints to keep on disk for this run.
If this is ``0`` then no checkpoints will be persisted to disk.
With num_to_keep=1
I think this still implies we expect to have one checkpoint at the end.
@max_ronda I can reproduce your results. Running the experiment directly in optuna takes around 7.5s on my 4 CPUs. Running the same experiment in tune takes around 50.9s.
In addition, I ran an experiment with no logging by setting

import os

os.environ["TUNE_DISABLE_AUTO_CALLBACK_LOGGERS"] = "1"

tuner = tune.Tuner(
    problem_rosenbrock,
    tune_config=tune.TuneConfig(
        metric="mean_loss",
        mode="min",
        search_alg=algo,
        num_samples=1000,
    ),
    param_space=search_space,
    run_config=air.RunConfig(
        checkpoint_config=air.CheckpointConfig(
            checkpoint_at_end=False,
            checkpoint_frequency=0,
            num_to_keep=1,
        ),
        local_dir="log_dir",
        log_to_file=False,
        verbose=1,
    ),
)
and got to around 24.5s (so around half of the time before). This is still significantly slower than in optuna.
We should not forget that ray and its ecosystem are built to run large workloads on any hardware. This of course comes at a price for small workloads on small hardware. The tune developer team has benchmarked scalability and overhead. You can also run the benchmarks yourself on your machine using the files linked there. I ran the bookkeeping test using
ray.init(num_cpus=4)
num_samples = 1000
results_per_second = 1
trial_length_s = 1
max_runtime = 250
It took around 16.6s longer than the theoretical value. Relating back to the Optuna script you posted and I ran: 16.6s + 7.5s brings me close to the 24.5s.
Concluding, I would say the Rosenbrock example is perhaps not one that really pays off on Ray. If it is neural network training or something else with a large workload, you will see the benefits.
@Jiao_Dong Setting num_to_keep=1
in this case does not produce a checkpoint in any trial folder.
Hi,
I think there is some confusion here around the terminology.
The provided training function problem_rosenbrock does not save any checkpoints anywhere. Checkpointing is user-controlled in a training function, so as long as you don't call session.report(metrics, checkpoint=Checkpoint.from_dict({...})) or similar, no checkpoints are saved.
Thus, the checkpoint settings discussed above have no influence on the outcome at all.
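To make that concrete, here is a sketch of the two cases for a function trainable (illustrative only):

from ray.air import session, Checkpoint


def trainable_without_checkpoints(config):
    # Only metrics are reported, so Tune never writes checkpoint files.
    session.report({"mean_loss": 0.0})


def trainable_with_checkpoints(config):
    # Explicitly attaching a checkpoint is what creates checkpoint folders.
    session.report(
        {"mean_loss": 0.0},
        checkpoint=Checkpoint.from_dict({"step": 1}),
    )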
(Just as a side note, the checkpoint_at_end
and checkpoint_frequency
are generally not relevant for function trainables, as the training function itself decides when to store checkpoints. There is no way we can “request” a checkpoint - this only works for class trainables, such as rllib.)
Now back to the problem. I believe what you describe as “checkpointing” is actually just writing the results logs. This can be disabled as @Lars_Simon_Zehnder described. There is also a setting reuse_actors=True
in the TuneConfig, but it is set to True by default for function trainables, so you should already see that speedup.
Another thing you can do is to use session.report({"mean_loss": z, "done": True})
which will avoid one more round trip in tune processing by indicating that the function finished after the first reported result.
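Applied to the Rosenbrock trainable above, that would look like this (sketch):

from ray.air import session


def problem_rosenbrock(config):
    x, y = config["x"], config["y"]
    z = (1 - x) ** 2 + 100 * (y - x**2) ** 2
    # "done": True tells Tune the function is finished after this result,
    # saving one extra round trip per trial.
    session.report({"mean_loss": z, "done": True})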
On my machine, this gives a difference of 9 seconds for Optuna vs. 25 seconds for Ray Tune.
Generally, a pure python implementation to run thousands of functions with a millisecond runtime will always be faster than using Ray Tune. The communication overhead with the Ray actors is just too high for this use case.
On my laptop, running 1000 trials on Optuna takes less than 10 seconds, so each trial runs only for about 100ms. In pure python, we only have the function invocation and some minor bookkeeping. In Ray, we have a stateful actor, receiving communication to start the function, report back the result, wait for the next instruction to run the function again (with a new config), etc, as well as some bookkeeping.
So for tiny functions like this, raw python will likely always outperform Ray Tune.
@kai Thanks for these insights! That has also clarified some more things for me (especially the checkpointing for functions and done in session.report()). So around 16s of overhead appears to be a good rule of thumb for local machines.
Thank you @kai and @Lars_Simon_Zehnder for clarifying terms and the insightful messages! I tested all the suggested settings and these are the results I got:
Running the same experiment:

ray.init(num_cpus=5)
num_samples = 1000
trainable_func = problem_rosenbrock
Test 1 (fastest; full configuration sketched after Test 5)
Setting session.report({"mean_loss": z, "done": True}) within the trainable, plus:
os.environ["TUNE_DISABLE_AUTO_CALLBACK_LOGGERS"] = "1"
reuse_actors=True
log_to_file=False
Time: 22 secs
Test 2
Re-using actors and setting env var
os.environ["TUNE_DISABLE_AUTO_CALLBACK_LOGGERS"] = "1"
reuse_actors=True
log_to_file=False
Time: 28 seconds
Test 3
Logging to file stdout
and stderr
os.environ["TUNE_DISABLE_AUTO_CALLBACK_LOGGERS"] = "1"
reuse_actors=True
log_to_file=True
Time: 32 seconds
Test 4
Not setting env var but re-using actors
reuse_actors=True
log_to_file=False
Time: 42 secs
Test 5 (slowest)
Not re-using actors but setting env var
os.environ["TUNE_DISABLE_AUTO_CALLBACK_LOGGERS"] = "1"
reuse_actors=False
log_to_file=False
Time: ~10 mins
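For reference, the consolidated Test 1 setup looks roughly like this (a sketch; algo and search_space as defined in my earlier snippet):

import os

from ray import air, tune

# Disable Tune's built-in CSV/JSON/TensorBoard loggers.
os.environ["TUNE_DISABLE_AUTO_CALLBACK_LOGGERS"] = "1"

tuner = tune.Tuner(
    problem_rosenbrock,  # reports {"mean_loss": z, "done": True}
    tune_config=tune.TuneConfig(
        metric="mean_loss",
        mode="min",
        search_alg=algo,
        num_samples=1000,
        reuse_actors=True,
    ),
    param_space=search_space,
    run_config=air.RunConfig(
        log_to_file=False,
    ),
)
results = tuner.fit()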
I realize I would not get the same speed as running native Optuna, but I'm happy to see there are settings I can tweak to improve performance.
Questions
- @kai, is there a way to disable all logging completely? Similar to what this user asked.
- Is there a way to customize the trial name and dirname? I see there is a trial_dirname_creator parameter that seems to work with the tune.run() call, but I have not seen an example for the tune.Tuner API. Are there examples of how to pass trial_name_creator and trial_dirname_creator with the current Tuner API?
Again, thanks for all the responses! This has been very helpful.
@max_ronda Happy to help. For the trial_name_creator, take a look at this example.
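Newer Ray releases also expose trial_name_creator and trial_dirname_creator directly on tune.TuneConfig; if your version supports them, a sketch could look like this (the helper below is hypothetical):

from ray import tune


def short_dirname(trial):
    # Hypothetical helper: keep trial directory names short and deterministic.
    return f"trial_{trial.trial_id}"


tune_config = tune.TuneConfig(
    metric="mean_loss",
    mode="min",
    trial_name_creator=lambda trial: f"rosenbrock_{trial.trial_id}",
    trial_dirname_creator=short_dirname,
)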
Hi Team,
I am currently running into the same issue (Ray 2.2.0). I tried configuring the checkpoint configuration as follows.
Ray is initialized as ray.init(address="auto", log_to_driver=False)
I tried the suggestion from @Lars_Simon_Zehnder below,
checkpoint_config = air.CheckpointConfig(
    num_to_keep=1,
    checkpoint_at_end=False,
    checkpoint_frequency=0,
)
The above configuration throws the below error:
Traceback (most recent call last):
File "/home/jobuser/build/lipy-drex/environments/satellites/python/lib/python3.10/site-packages/ray/tune/tuner.py", line 272, in fit
return self._local_tuner.fit()
File "/home/jobuser/build/lipy-drex/environments/satellites/python/lib/python3.10/site-packages/ray/tune/impl/tuner_internal.py", line 420, in fit
analysis = self._fit_internal(trainable, param_space)
File "/home/jobuser/build/lipy-drex/environments/satellites/python/lib/python3.10/site-packages/ray/tune/impl/tuner_internal.py", line 520, in _fit_internal
**self._get_tune_run_arguments(trainable),
File "/home/jobuser/build/lipy-drex/environments/satellites/python/lib/python3.10/site-packages/ray/tune/impl/tuner_internal.py", line 469, in _get_tune_run_arguments
raise ValueError(
ValueError: You passed `checkpoint_at_end=False` to your CheckpointConfig, but this trainer does not support this argument. If the trainer takes in a training loop, you will need to trigger checkpointing yourself using `ray.air.session.report(metrics=..., checkpoint=...)`.
I updated the configuration as follows:
sync_config = tune.SyncConfig(
    upload_dir=hdfs_upload_dir,
    syncer=TunerSync(),
)

checkpoint_config = air.CheckpointConfig(
    num_to_keep=1,
)
This configuration throws the below error and the status of the Ray job is FAILED
. The Ray checkpoints are logged to HDFS.
2023-04-14 23:54:20,679 ERROR checkpoint_manager.py:137 -- The requested checkpoint is not available on this node, most likely because you are using Ray client or disabled checkpoint synchronization. To avoid this, enable checkpoint synchronization to cloud storage by specifying a `SyncConfig`. The checkpoint may be available on a different node - please check this location on worker nodes: ~/ray_results/test_experiment/HorovodTrainer_34373_00003_3_learning_rate=0.0000,num_train_epochs=5,per_device_eval_batch_size=16,per_device_train_batch_size=16,_2023-04-14_23-51-14/checkpoint_-00001
If I comment out the checkpoint_config, it still throws the checkpoint_manager.py:137 error above, but this time the status of the job is SUCCEEDED. The Ray checkpoints are logged to HDFS in this case as well.
Can I please get some guidance on how to turn off Ray checkpointing completely?
Also, if I want to limit the checkpoints to 1: the job fails even though the checkpoints are synced, so how can I get past this?
Thank you for your time.
Regards,
Vivek
Disabling checkpointing (i.e., not writing even to local disk) depends on your training function. Could you share your training function? If you are using the Session API, it is pretty much just not passing a Checkpoint to session.report and only reporting metrics.
To disable syncing (both to head and to cloud), you can specify
sync_config=SyncConfig(syncer=None)
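Placed in the RunConfig, that would look roughly like this (sketch):

from ray import air, tune

run_config = air.RunConfig(
    sync_config=tune.SyncConfig(syncer=None),  # disables syncing to head node and cloud
)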
Thank you for taking a look at my issue.
The training script is as follows:
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    GlueDataset,
    GlueDataTrainingArguments,
    TrainingArguments,
)
from ray.air import session
from ray.tune.examples.pbt_transformers.utils import build_compute_metrics_fn


def train_hf(...):
    # ...
    # Data handling & logic
    # ...
    trainer = Trainer(
        model_init=get_model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        compute_metrics=build_compute_metrics_fn(task_name),
    )
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
    eval_metrics = trainer.evaluate(eval_dataset=eval_dataset, metric_key_prefix="eval")
    trainer.log_metrics("eval", eval_metrics)
    trainer.save_metrics("eval", eval_metrics)
    session.report(eval_metrics, checkpoint=checkpoint)


hvd_trainer = HorovodTrainer(
    train_loop_per_worker=train_hf,
    train_loop_config=train_loop_config,
    scaling_config=scaling_config,
)

tuner = tune.Tuner(
    trainable=hvd_trainer,
    tune_config=tune_config,
    run_config=run_config,
    param_space=param_space,
)
Per your suggestion, I have updated the reporting to session.report(eval_metrics).
- I still see the checkpoint_00001 folder under every Tune trial.
- With or without the below checkpoint config, the checkpoint_manager.py:137 error still persists.
checkpoint_config = air.CheckpointConfig(
    num_to_keep=1,
)
Any pointers will be highly appreciated. Thank you.
Do you want to keep at most one checkpoint at any given time, or do you want to turn off checkpointing completely?
If you just want it off completely, you should probably try (see the sketch below):
- session.report(eval_metrics, checkpoint=checkpoint) → session.report(eval_metrics)
- sync_config=SyncConfig(syncer=None)
- remove the num_to_keep stuff
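Pulled together against the HorovodTrainer setup you posted, a rough sketch (reusing your names; the elided parts stay as in your script):

from ray import air, tune
from ray.air import session


def train_hf(config):
    # ... training / evaluation as in your script ...
    # report metrics only -- no checkpoint attached
    session.report(eval_metrics)


tuner = tune.Tuner(
    trainable=hvd_trainer,
    tune_config=tune_config,
    param_space=param_space,
    run_config=air.RunConfig(
        sync_config=tune.SyncConfig(syncer=None),  # no syncing to head/cloud
        # no CheckpointConfig / num_to_keep needed when nothing is checkpointed
    ),
)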
Got it, thanks for clarifying how to turn off checkpointing. I was trying to understand both configurations, as I have multiple use cases.
If I would like to keep at most one checkpoint, how do I configure this and get past the below error?
2023-04-14 23:54:20,679 ERROR checkpoint_manager.py:137 -- The requested checkpoint is not available on this node, most likely because you are using Ray client or disabled checkpoint synchronization. To avoid this, enable checkpoint synchronization to cloud storage by specifying a `SyncConfig`. The checkpoint may be available on a different node - please check this location on worker nodes: ~/ray_results/test_experiment/HorovodTrainer_34373_00003_3_learning_rate=0.0000,num_train_epochs=5,per_device_eval_batch_size=16,per_device_train_batch_size=16,_2023-04-14_23-51-14/checkpoint_-00001