Ray Tune Sync Threshold Bottleneck

I am running a Ray Tune job on a remote Kubernetes cluster, using S3 for persistent storage.
The following output is from a run with 8 concurrent trials distributed across 4 nodes, each with 2 GPUs.

(TunerInternal pid=798) Saving the experiment state (which holds a global view of trial statuses and is used to restore the experiment) took ~50.96 seconds, which may be a performance bottleneck.

(TunerInternal pid=798) This could be due to a large number of trials, large logfiles from lots of reported metrics, or throttling from the remote storage if uploading too frequently.
(TunerInternal pid=798) You may want to consider switching the `RunConfig(storage_filesystem)` to a more performant storage backend such as s3fs for a S3 storage path.

(TunerInternal pid=798) You can suppress this error by setting the environment variable TUNE_WARN_SLOW_EXPERIMENT_CHECKPOINT_SYNC_THRESHOLD_S to a higher value than the current threshold (30.0).

How can I diagnose what exactly the bottleneck is here?

@EthanMarx Could you show which files are showing up in S3? Are there any especially large files in the trial folders?
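
For example, here is a minimal sketch for listing the largest objects under the experiment prefix (the bucket/prefix below is hypothetical; substitute your actual `storage_path`, and credentials are assumed to come from the standard AWS environment variables):

import pyarrow.fs

fs = pyarrow.fs.S3FileSystem()  # picks up credentials from the environment
selector = pyarrow.fs.FileSelector("my-bucket/ray_results/experiment_name", recursive=True)
# Sort all objects by size and print the 20 largest.
for info in sorted(fs.get_file_info(selector), key=lambda i: i.size or 0, reverse=True)[:20]:
    print(f"{(info.size or 0):>12,d}  {info.path}")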

Just wanted to follow up that I am seeing the same behaviour on a Kubernetes cluster with MinIO S3 when running the default “CartPole-v1” RL example. The state is tiny, a couple of MB at most.

import pyarrow.fs
import ray
from ray import tune, train
from ray.rllib.algorithms.ppo import PPOConfig

# kubectl port-forward service/raycluster-kuberay-head-svc 8265:8265
# ray job submit --address http://localhost:8265     --runtime-env-json '{"working_dir": "./"}'     -- python rllib_test1.py 

if __name__ == "__main__":
    ray.init(address="auto")

    config = (
        PPOConfig()
        .environment("CartPole-v1")  # Set the environment
        .framework("torch")  # Use PyTorch framework
        .env_runners(num_env_runners=15)  # 15 parallel env runners (new API stack; replaces the deprecated num_rollout_workers)
        .resources(num_gpus=2)  # 2 GPUs for the learner
        .training(
            train_batch_size=4000,  # Adjust as needed
            num_sgd_iter=10,  # Number of SGD iterations per training batch
        )
        .api_stack(
            enable_rl_module_and_learner=True,
            enable_env_runner_and_connector_v2=True,
        )
    )

    # S3-compatible filesystem setup
    fs = pyarrow.fs.S3FileSystem(
        endpoint_override="http://192.168.1.140:30007"
    )

    # Tuning setup
    tuner = tune.Tuner(
        "PPO",
        param_space=config,
        run_config=train.RunConfig(
            stop={"env_runners/episode_return_mean": 150.0},  # Stop once mean episode return reaches 150
            storage_filesystem=fs,
            storage_path="rayexperiments/runs",
            name="experiment_name2",
        ),
    )

    tuner.fit()
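
One assumption in the script above: `pyarrow.fs.S3FileSystem` resolves credentials from the standard AWS environment variables (`AWS_ACCESS_KEY_ID` / `AWS_SECRET_ACCESS_KEY`) when none are passed, so the MinIO keys have to be available inside the Ray pods. They can also be passed explicitly (the key values below are placeholders):

fs = pyarrow.fs.S3FileSystem(
    access_key="minio-access-key",   # placeholder, not real credentials
    secret_key="minio-secret-key",   # placeholder, not real credentials
    endpoint_override="http://192.168.1.140:30007",
)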

Dockerfile for head/workers (the same for the CPU and GPU base images):

FROM rayproject/ray:2.40.0

RUN pip install torch torchvision torchaudio

Saving takes up to 160 seconds.

You may want to consider switching the `RunConfig(storage_filesystem)` to a more performant storage backend such as s3fs for a S3 storage path.
You can suppress this error by setting the environment variable TUNE_WARN_SLOW_EXPERIMENT_CHECKPOINT_SYNC_THRESHOLD_S to a higher value than the current threshold (30.0).
2024-12-25 15:39:12,101 WARNING experiment_state.py:233 -- Saving the experiment state (which holds a global view of trial statuses and is used to restore the experiment) has already taken 158.33 seconds, which may cause consistency issues upon restoration if your driver script ungracefully exits.
This could be due to a large number of trials, large logfiles from lots of reported metrics, or throttling from the remote storage if uploading too frequently.
You may want to consider switching the `RunConfig(storage_filesystem)` to a more performant storage backend such as s3fs for a S3 storage path.
You can suppress this error by setting the environment variable TUNE_WARN_SLOW_EXPERIMENT_CHECKPOINT_SYNC_THRESHOLD_S to a higher value than the current threshold (30.0).
Trial status: 1 RUNNING
Current time: 2024-12-25 15:39:12. Total running time: 9min 7s
Logical resource usage: 16.0/51 CPUs, 0/2 GPUs (0.0/1.0 accelerator_type:G)
╭──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ Trial name                    status       iter     total time (s)     ...lls_per_iteration     ..._sampled_lifetime │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ PPO_CartPole-v1_2b785_00000   RUNNING         9            14.1913                        1                    36000 │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
Trial status: 1 RUNNING
Current time: 2024-12-25 15:39:44. Total running time: 9min 39s
Logical resource usage: 16.0/51 CPUs, 0/2 GPUs (0.0/1.0 accelerator_type:G)
╭──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ Trial name                    status       iter     total time (s)     ...lls_per_iteration     ..._sampled_lifetime │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ PPO_CartPole-v1_2b785_00000   RUNNING         9            14.1913                        1                    36000 │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
2024-12-25 15:39:54,470 WARNING experiment_state.py:233 -- Saving the experiment state (which holds a global view of trial statuses and is used to restore the experiment) has already taken 32.29 seconds, which may cause consistency issues upon restoration if your driver script ungracefully exits.
This could be due to a large number of trials, large logfiles from lots of reported metrics, or throttling from the remote storage if uploading too frequently.
You may want to consider switching the `RunConfig(storage_filesystem)` to a more performant storage backend such as s3fs for a S3 storage path.
You can suppress this error by setting the environment variable TUNE_WARN_SLOW_EXPERIMENT_CHECKPOINT_SYNC_THRESHOLD_S to a higher value than the current threshold (30.0).
2024-12-25 15:40:05,662 WARNING experiment_state.py:233 -- Saving the experiment state (which holds a global view of trial statuses and is used to restore the experiment) has already taken 43.49 seconds, which may cause consistency issues upon restoration if your driver script ungracefully exits.
This could be due to a large number of trials, large logfiles from lots of reported metrics, or throttling from the remote storage if uploading too frequently.
You may want to consider switching the `RunConfig(storage_filesystem)` to a more performant storage backend such as s3fs for a S3 storage path.
You can suppress this error by setting the environment variable TUNE_WARN_SLOW_EXPERIMENT_CHECKPOINT_SYNC_THRESHOLD_S to a higher value than the current threshold (30.0).
Trial status: 1 RUNNING
Current time: 2024-12-25 15:40:14. Total running time: 10min 9s
Logical resource usage: 16.0/51 CPUs, 0/2 GPUs (0.0/1.0 accelerator_type:G)

Let me know if you have problems reproducing it.
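
In case it helps narrow this down, here is a hedged timing sketch I would run from the head pod to check whether raw transfers to the MinIO endpoint are slow on their own (the `_sync_probe` key is just a throwaway name). If a few-MiB round trip finishes in well under a second, the time is more likely spent on many small per-trial files or on Tune's sync frequency than on the storage backend itself:

import time
import pyarrow.fs

# Same endpoint as in the repro script above.
fs = pyarrow.fs.S3FileSystem(endpoint_override="http://192.168.1.140:30007")

payload = b"x" * (4 * 1024 * 1024)  # ~4 MiB, roughly the size of the experiment state

start = time.perf_counter()
with fs.open_output_stream("rayexperiments/runs/_sync_probe/test.bin") as f:
    f.write(payload)
print(f"upload:   {time.perf_counter() - start:.2f} s")

start = time.perf_counter()
with fs.open_input_stream("rayexperiments/runs/_sync_probe/test.bin") as f:
    f.read()
print(f"download: {time.perf_counter() - start:.2f} s")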