Just wanted to follow up that I am seeing the same behaviour on a Kubernetes cluster with MinIO S3 when running the default "CartPole-v1" RL example. The experiment state is tiny, a couple of MB at most.
import pyarrow.fs
import ray
from ray import tune, train
from ray.rllib.algorithms.ppo import PPOConfig
# kubectl port-forward service/raycluster-kuberay-head-svc 8265:8265
# ray job submit --address http://localhost:8265 --runtime-env-json '{"working_dir": "./"}' -- python rllib_test1.py
if __name__ == "__main__":
    ray.init(address="auto")

    config = (
        PPOConfig()
        .environment("CartPole-v1")  # Set the environment
        .framework("torch")  # Use the PyTorch framework
        .env_runners(num_env_runners=15)  # Corrected deprecated usage
        .resources(num_gpus=2)  # 2 GPUs for the learner
        .training(
            train_batch_size=4000,  # Adjust as needed
            num_sgd_iter=10,  # Number of SGD iterations per training batch
        )
        .api_stack(
            enable_rl_module_and_learner=True,
            enable_env_runner_and_connector_v2=True,
        )
    )

    # S3-compatible filesystem setup
    fs = pyarrow.fs.S3FileSystem(
        endpoint_override="http://192.168.1.140:30007"
    )

    # Tuning setup
    tuner = tune.Tuner(
        "PPO",
        param_space=config,
        run_config=train.RunConfig(
            stop={"env_runners/episode_return_mean": 150.0},  # Corrected metric
            storage_filesystem=fs,
            storage_path="rayexperiments/runs",
            name="experiment_name2",
        ),
    )

    tuner.fit()
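As a quick sanity check that the MinIO endpoint itself is not the bottleneck, something like the following can time a small upload through the same pyarrow.fs.S3FileSystem. This is only a minimal sketch: the bucket name comes from the storage path above, the object key is made up, and credentials are assumed to come from the usual AWS_* environment variables.

import time

import pyarrow.fs

# Sketch only: probes raw upload latency to the same MinIO endpoint used above.
# The object key "upload_probe.bin" is arbitrary; adjust bucket/credentials as needed.
fs = pyarrow.fs.S3FileSystem(endpoint_override="http://192.168.1.140:30007")

payload = b"x" * (2 * 1024 * 1024)  # ~2 MB, roughly the size of the experiment state
start = time.perf_counter()
with fs.open_output_stream("rayexperiments/upload_probe.bin") as out:
    out.write(payload)
print(f"Uploading 2 MB took {time.perf_counter() - start:.2f} s")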
Dockerfile for the head and worker nodes (the same for the CPU and GPU base images):
FROM rayproject/ray:2.40.0
RUN pip install torch torchvision torchaudio
Saving the experiment state takes up to 160 seconds:
You may want to consider switching the `RunConfig(storage_filesystem)` to a more performant storage backend such as s3fs for a S3 storage path.
You can suppress this error by setting the environment variable TUNE_WARN_SLOW_EXPERIMENT_CHECKPOINT_SYNC_THRESHOLD_S to a higher value than the current threshold (30.0).
2024-12-25 15:39:12,101 WARNING experiment_state.py:233 -- Saving the experiment state (which holds a global view of trial statuses and is used to restore the experiment) has already taken 158.33 seconds, which may cause consistency issues upon restoration if your driver script ungracefully exits.
This could be due to a large number of trials, large logfiles from lots of reported metrics, or throttling from the remote storage if uploading too frequently.
You may want to consider switching the `RunConfig(storage_filesystem)` to a more performant storage backend such as s3fs for a S3 storage path.
You can suppress this error by setting the environment variable TUNE_WARN_SLOW_EXPERIMENT_CHECKPOINT_SYNC_THRESHOLD_S to a higher value than the current threshold (30.0).
Trial status: 1 RUNNING
Current time: 2024-12-25 15:39:12. Total running time: 9min 7s
Logical resource usage: 16.0/51 CPUs, 0/2 GPUs (0.0/1.0 accelerator_type:G)
╭──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ Trial name status iter total time (s) ...lls_per_iteration ..._sampled_lifetime │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ PPO_CartPole-v1_2b785_00000 RUNNING 9 14.1913 1 36000 │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
Trial status: 1 RUNNING
Current time: 2024-12-25 15:39:44. Total running time: 9min 39s
Logical resource usage: 16.0/51 CPUs, 0/2 GPUs (0.0/1.0 accelerator_type:G)
╭──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ Trial name status iter total time (s) ...lls_per_iteration ..._sampled_lifetime │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ PPO_CartPole-v1_2b785_00000 RUNNING 9 14.1913 1 36000 │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
2024-12-25 15:39:54,470 WARNING experiment_state.py:233 -- Saving the experiment state (which holds a global view of trial statuses and is used to restore the experiment) has already taken 32.29 seconds, which may cause consistency issues upon restoration if your driver script ungracefully exits.
This could be due to a large number of trials, large logfiles from lots of reported metrics, or throttling from the remote storage if uploading too frequently.
You may want to consider switching the `RunConfig(storage_filesystem)` to a more performant storage backend such as s3fs for a S3 storage path.
You can suppress this error by setting the environment variable TUNE_WARN_SLOW_EXPERIMENT_CHECKPOINT_SYNC_THRESHOLD_S to a higher value than the current threshold (30.0).
2024-12-25 15:40:05,662 WARNING experiment_state.py:233 -- Saving the experiment state (which holds a global view of trial statuses and is used to restore the experiment) has already taken 43.49 seconds, which may cause consistency issues upon restoration if your driver script ungracefully exits.
This could be due to a large number of trials, large logfiles from lots of reported metrics, or throttling from the remote storage if uploading too frequently.
You may want to consider switching the `RunConfig(storage_filesystem)` to a more performant storage backend such as s3fs for a S3 storage path.
You can suppress this error by setting the environment variable TUNE_WARN_SLOW_EXPERIMENT_CHECKPOINT_SYNC_THRESHOLD_S to a higher value than the current threshold (30.0).
Trial status: 1 RUNNING
Current time: 2024-12-25 15:40:14. Total running time: 10min 9s
Logical resource usage: 16.0/51 CPUs, 0/2 GPUs (0.0/1.0 accelerator_type:G)
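As an aside, the warning's suggestion to switch to an s3fs-based backend can be tried by wrapping fsspec's s3fs filesystem for pyarrow. This is a hedged sketch, not something I have verified against MinIO; the endpoint_url handling is an assumption.

import pyarrow.fs
import s3fs
from ray import train

# Sketch of the s3fs backend suggested by the warning; untested here.
s3 = s3fs.S3FileSystem(client_kwargs={"endpoint_url": "http://192.168.1.140:30007"})
fs = pyarrow.fs.PyFileSystem(pyarrow.fs.FSSpecHandler(s3))

run_config = train.RunConfig(
    storage_filesystem=fs,
    storage_path="rayexperiments/runs",
    name="experiment_name2",
)

Raising TUNE_WARN_SLOW_EXPERIMENT_CHECKPOINT_SYNC_THRESHOLD_S as the message suggests would only silence the warning, not make the save itself any faster.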
Let me know if you have problems reproducing it.