I am running a Ray Tune job on a remote Kubernetes cluster, using S3 for persistent storage.
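For reference, the setup looks roughly like the sketch below (the trainable, experiment name, and bucket are placeholders for my actual code):

```python
from ray import train, tune
from ray.train import RunConfig

# Placeholder trainable standing in for my actual training function.
def train_fn(config):
    for step in range(100):
        train.report({"loss": 1.0 / (step + 1)})

tuner = tune.Tuner(
    tune.with_resources(train_fn, {"gpu": 1}),
    tune_config=tune.TuneConfig(num_samples=8, max_concurrent_trials=8),
    run_config=RunConfig(
        name="my-experiment",                       # placeholder experiment name
        storage_path="s3://my-bucket/ray-results",  # placeholder bucket
    ),
)
results = tuner.fit()
```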
The following output is from a run with 8 concurrent trials, distributed across 4 nodes with 2 GPUs each:
```
(TunerInternal pid=798) Saving the experiment state (which holds a global view of trial statuses and is used to restore the experiment) took ~50.96 seconds, which may be a performance bottleneck.
(TunerInternal pid=798) This could be due to a large number of trials, large logfiles from lots of reported metrics, or throttling from the remote storage if uploading too frequently.
(TunerInternal pid=798) You may want to consider switching the `RunConfig(storage_filesystem)` to a more performant storage backend such as s3fs for a S3 storage path.
(TunerInternal pid=798) You can suppress this error by setting the environment variable TUNE_WARN_SLOW_EXPERIMENT_CHECKPOINT_SYNC_THRESHOLD_S to a higher value than the current threshold (30.0).
```
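For context, my understanding is that the suggested `storage_filesystem` change would look something like the sketch below (the bucket, region, and choice of filesystem are assumptions on my part; I haven't verified that this actually helps):

```python
import pyarrow.fs
from ray.train import RunConfig

# Pass an explicit pyarrow S3 filesystem instead of letting Ray infer one
# from the URI (bucket and region are placeholders; credentials come from
# the pod's environment).
s3_fs = pyarrow.fs.S3FileSystem(region="us-east-1")

run_config = RunConfig(
    storage_filesystem=s3_fs,
    # When a filesystem is passed explicitly, the path is given without the
    # "s3://" scheme prefix.
    storage_path="my-bucket/ray-results",
)

# Alternative, if the warning is referring to the fsspec `s3fs` package:
# import s3fs
# fs = pyarrow.fs.PyFileSystem(pyarrow.fs.FSSpecHandler(s3fs.S3FileSystem()))
```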
How can I diagnose what exactly is causing this bottleneck?
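One thing I can check on my own is whether individual trial files are unusually large, since oversized logfiles are one of the causes the warning mentions. A quick sketch (the local results path is a guess at where Ray stages results on the head node):

```python
from pathlib import Path

# Guess at the local experiment directory on the head node; adjust as needed.
exp_dir = Path.home() / "ray_results" / "my-experiment"

# Print the 20 largest files to see whether big trial logfiles could
# explain the slow experiment-state sync.
files = sorted(
    (f for f in exp_dir.rglob("*") if f.is_file()),
    key=lambda f: f.stat().st_size,
    reverse=True,
)
for f in files[:20]:
    print(f"{f.stat().st_size / 1e6:8.1f} MB  {f.relative_to(exp_dir)}")
```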