Repeatedly getting GCS timeout

I am repeatedly getting the error below, and I could not find in the documentation how to programmatically increase the timeout period. The docs say a timeout can be configured with the TuneConfig parameter time_budget_s, but they do not say what I can do about this error. Help.

The system is Ubuntu Linux 22.04, Python 3.10.8, Ray 2.2.0, Optuna 3.1.0, and I am using the ASHAScheduler.

Below are
1: The Python method for building the Ray tuner
2: The error message, which appears after 500-1600 successfully terminated trials

1: PYTHON METHOD
def _set_ray_tuner(self, grid_search=False):
    """Set ray tuner."""

    # List of strings from the self.search_space dictionary which should be reported.
    # Include only the parameters which have more than one item listed in the search space.
    parameters_to_report = []
    for key, value in self.search_space.items():
        if key == "num_models":
            continue
        if len(value) > 1:
            parameters_to_report.append(key)

    print(f"parameters_to_report: {parameters_to_report}")
    reporter = CLIReporter(
        metric_columns=[
            "time_total_s",
            "iteration",
            "train_loss",
            "val_loss",
            "mse",
            "ssim",
            "kid_mean",
            "kid_std",
        ],
        parameter_columns=parameters_to_report,
    )

    trainable = tune.with_resources(TrainableVAE, {"gpu": self.gpu_fraction})
    trainable_with_parameters = tune.with_parameters(
        trainable,
        data_dict={
            "train_data": self.train_data,
            "train_labels": self.train_labels,
            "val_data": self.val_data,
            "val_labels": self.val_labels,
        },
        device=self.device,
        methods={
            "_train_epoch": self._train_epoch,
            "_validate_epoch": self._validate_epoch,
            "_augment_and_get_dataloader": self._augment_and_get_dataloader,
        },
    )

    if grid_search:
        param_space = {
            "lr": tune.grid_search(self.search_space["lr"]),
            "latent_dim": tune.grid_search(self.search_space["latent_dim"]),
            "ksp": tune.grid_search(self.search_space["ksp"]),
            "channels": tune.grid_search(self.search_space["channels"]),
            "batch_size": tune.grid_search(self.search_space["batch_size"]),
            "conv_layers": tune.grid_search(self.search_space["conv_layers"]),
            "batch_norm": tune.grid_search(self.search_space["batch_norm"]),
            "rotation": tune.grid_search(self.search_space["rotation"]),
            "translation": tune.grid_search(self.search_space["translation"]),
            "noise": tune.grid_search(self.search_space["noise"]),
            "model_id": tune.grid_search(
                [
                    "model_{}".format(i)
                    for i in range(self.search_space["num_models"])
                ]
            ),
        }

        # Efficient hyperparameter selection. Search algorithms are wrappers around
        # open-source optimization libraries; each library has its own way of defining the search space.
        # https://docs.ray.io/en/latest/ray-air/package-ref.html#ray.tune.tune_config.TuneConfig
        tune_config = tune.TuneConfig(
            search_alg=tune.search.basic_variant.BasicVariantGenerator(
                constant_grid_search=True,
            ),
        )
    else:

        initial_params = [
            {
                "lr": 0.0003,
                "latent_dim": 2,
                "ksp": "k7s1",
                "channels": 16,
                "batch_size": 64,
                "conv_layers": 3,
                "batch_norm": False,
                "rotation": 0,
                "translation": 0,
                "noise": 0.02,
                "model_id": "model_0",
            }
        ]

        # tune.loguniform / tune.uniform take lower and upper bounds, so pass the
        # first and last items of each search-space list.
        param_space = {
            "lr": tune.loguniform(
                self.search_space["lr"][0], self.search_space["lr"][-1]
            ),
            "latent_dim": tune.choice(self.search_space["latent_dim"]),
            "ksp": tune.choice(self.search_space["ksp"]),
            "channels": tune.choice(self.search_space["channels"]),
            "batch_size": tune.choice(self.search_space["batch_size"]),
            "conv_layers": tune.choice(self.search_space["conv_layers"]),
            "batch_norm": tune.choice(self.search_space["batch_norm"]),
            "rotation": tune.uniform(
                self.search_space["rotation"][0], self.search_space["rotation"][-1]
            ),
            "translation": tune.uniform(
                self.search_space["translation"][0],
                self.search_space["translation"][-1],
            ),
            "noise": tune.uniform(
                self.search_space["noise"][0], self.search_space["noise"][-1]
            ),
            "model_id": tune.choice(
                [
                    "model_{}".format(i)
                    for i in range(self.search_space["num_models"])
                ]
            ),
        }

        # Efficient hyperparameter selection. Search algorithms are wrappers around
        # open-source optimization libraries; each library has its own way of defining the search space.
        # https://docs.ray.io/en/latest/ray-air/package-ref.html#ray.tune.tune_config.TuneConfig
        tune_config = tune.TuneConfig(
            # Local optuna search will generate study name "optuna" indicating in-memory storage
            search_alg=OptunaSearch(
                sampler=TPESampler(),
                metric=self.multi_objective["metric"],
                mode=self.multi_objective["mode"],
                points_to_evaluate=initial_params,
            ),
            scheduler=ASHAScheduler(
                time_attr="training_iteration",
                metric=self.multi_objective["metric"][0],  # only the 1st metric is used for pruning
                mode=self.multi_objective["mode"][0],
                max_t=self.epochs,
                grace_period=50,
                reduction_factor=2,
            ),
            time_budget_s=self.time_budget,
            num_samples=-1,
        )

    # Runtime configuration that is specific to individual trials. Will overwrite the run config passed to the Trainer.
    # for API, see https://docs.ray.io/en/latest/ray-air/package-ref.html#ray.air.config.RunConfig
    run_config = air.RunConfig(
        stop={"training_iteration": self.epochs},
        progress_reporter=reporter,
        local_dir=self.ray_dir,
        # callbacks=[MyCallback()],
        checkpoint_config=air.CheckpointConfig(
            checkpoint_score_attribute=self.multi_objective["metric"][0],
            checkpoint_score_order=self.multi_objective["mode"][0],
            num_to_keep=1,
            checkpoint_at_end=False,
            checkpoint_frequency=0,
        ),
        verbose=1,
    )

    tuner = tune.Tuner(
        trainable_with_parameters,
        param_space=param_space,
        run_config=run_config,
        tune_config=tune_config,
    )

    return tuner
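
(For completeness, the tuner returned by this method is then run roughly as follows; a minimal sketch that assumes it is called from the same class, using the attribute names above:)

tuner = self._set_ray_tuner(grid_search=False)
results = tuner.fit()
# Pick the best trial by the first metric/mode pair, mirroring the ASHA setup.
best_result = results.get_best_result(
    metric=self.multi_objective["metric"][0],
    mode=self.multi_objective["mode"][0],
)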

2: ERROR MESSAGE
[2023-03-29 13:21:11,915 C 2107118 2107754] gcs_rpc_client.h:537: Check failed: absl::ToInt64Seconds(absl::Now() - gcs_last_alive_time_) < ::RayConfig::instance().gcs_rpc_server_reconnect_timeout_s() Failed to connect to GCS within 60 seconds
*** StackTrace Information ***
/opt2/software/miniconda3/envs/ret_pt/lib/python3.10/site-packages/ray/_raylet.so(+0xce4b2a) [0x7fd2c41a8b2a] ray::operator<<()
/opt2/software/miniconda3/envs/ret_pt/lib/python3.10/site-packages/ray/_raylet.so(+0xce6612) [0x7fd2c41aa612] ray::SpdLogMessage::Flush()
/opt2/software/miniconda3/envs/ret_pt/lib/python3.10/site-packages/ray/_raylet.so(_ZN3ray6RayLogD1Ev+0x37) [0x7fd2c41aa927] ray::RayLog::~RayLog()
/opt2/software/miniconda3/envs/ret_pt/lib/python3.10/site-packages/ray/_raylet.so(+0x73b46d) [0x7fd2c3bff46d] ray::rpc::GcsRpcClient::CheckChannelStatus()
/opt2/software/miniconda3/envs/ret_pt/lib/python3.10/site-packages/ray/_raylet.so(_ZN5boost4asio6detail12wait_handlerIZN3ray3rpc12GcsRpcClient15SetupCheckTimerEvEUlNS_6system10error_codeEE_NS0_9execution12any_executorIJNS9_12context_as_tIRNS0_17execution_contextEEENS9_6detail8blocking7never_tILi0EEENS9_11prefer_onlyINSG_10possibly_tILi0EEEEENSJ_INSF_16outstanding_work9tracked_tILi0EEEEENSJ_INSN_11untracked_tILi0EEEEENSJ_INSF_12relationship6fork_tILi0EEEEENSJ_INSU_14continuation_tILi0EEEEEEEEE11do_completeEPvPNS1_19scheduler_operationERKS7_m+0x303) [0x7fd2c3bff913] boost::asio::detail::wait_handler<>::do_complete()
/opt2/software/miniconda3/envs/ret_pt/lib/python3.10/site-packages/ray/_raylet.so(+0xcf57bb) [0x7fd2c41b97bb] boost::asio::detail::scheduler::do_run_one()
/opt2/software/miniconda3/envs/ret_pt/lib/python3.10/site-packages/ray/_raylet.so(+0xcf69f1) [0x7fd2c41ba9f1] boost::asio::detail::scheduler::run()
/opt2/software/miniconda3/envs/ret_pt/lib/python3.10/site-packages/ray/_raylet.so(+0xcf6c60) [0x7fd2c41bac60] boost::asio::io_context::run()
/opt2/software/miniconda3/envs/ret_pt/lib/python3.10/site-packages/ray/_raylet.so(_ZN3ray4core10CoreWorker12RunIOServiceEv+0xcd) [0x7fd2c3ae0fed] ray::core::CoreWorker::RunIOService()
/opt2/software/miniconda3/envs/ret_pt/lib/python3.10/site-packages/ray/_raylet.so(+0xe2aa10) [0x7fd2c42eea10] execute_native_thread_routine
/lib/x86_64-linux-gnu/libc.so.6(+0x94b43) [0x7fd4b7267b43]
/lib/x86_64-linux-gnu/libc.so.6(+0x126a00) [0x7fd4b72f9a00]

Hmm, the error message indicates that something failed in the Ray cluster rather than in the Tuner application itself. How long into the script does this error occur? Are you running on a single node or a multi-node cluster?

It is a single node, running from a few hours to about a day before the crash. I kept reducing the complexity of the model and increasing the GPU resources per process, with little help.

Cordially,
Simo

Hmm, I think this means the GCS may have crashed. By any chance do you know if your machine became memory/communication overloaded?

@ClarenceNg any advice on how to debug this? I found this ticket with a related error.
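
(For reference, the 60-second limit in the check comes from Ray's internal gcs_rpc_server_reconnect_timeout_s setting, not from TuneConfig, and the reason for a GCS crash usually ends up in the session logs. A minimal sketch of both ideas, assuming a single-node cluster and assuming that RAY_-prefixed environment variables override these internal settings when set before Ray starts:)

import os

# Assumption: Ray's internal RayConfig options can be overridden with RAY_-prefixed
# environment variables; this must be set before any Ray process starts
# (or exported before `ray start` when connecting to an existing cluster).
os.environ["RAY_gcs_rpc_server_reconnect_timeout_s"] = "300"

import ray  # imported after setting the override on purpose

ray.init()

# If the GCS actually crashed, the reason is usually in the session logs,
# e.g. /tmp/ray/session_latest/logs/gcs_server.out on a default local install.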

Hi, I am still consistently getting the same error. RAM is not an issue (some 10% of 128 GB used); GPU memory might be, although I am now running with a single process at a time. What would a communication overload be on a single Linux desktop? Between RAM and GPU?

I have re-installed the environment, and the latest error (below) came with grid search (no Optuna or ASHA). I wonder if the custom function calls (numpy-based data augmentation during image processing) might be problematic? This presumably causes communication between CPU and GPU processing.

[2023-04-19 01:56:06,318 E 1260317 1260891] gcs_rpc_client.h:533: Failed to connect to GCS within 60 seconds. GCS may have been killed. It’s either GCS is terminated by ray stop or is killed unexpectedly. If it is killed unexpectedly, see the log file gcs_server.out. Logging — Ray 3.0.0.dev0. The program will terminate.

Link to compressed last session log:
https://filesender.funet.fi/?s=download&token=539a85f9-c8a8-499d-af4c-870fee653aae
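
(On the numpy-augmentation idea above: one hedged thing to check is whether the large arrays are being re-serialized for every trial. Below is a minimal standalone sketch of the object-store pattern, with made-up array shapes and function names; tune.with_parameters already does something similar under the hood.)

import numpy as np
import ray

ray.init()

# Hypothetical stand-in for the training arrays in the code above.
train_data = np.random.rand(20_000, 64).astype(np.float32)

# Put the array into the object store once; tasks then read a shared, zero-copy
# view instead of receiving a fresh pickle of the whole array per call.
train_ref = ray.put(train_data)

@ray.remote
def augment_chunk(start: int, size: int) -> float:
    data = ray.get(train_ref)  # shared read of the stored array
    chunk = data[start:start + size]
    return float((chunk + np.random.normal(0.0, 0.02, chunk.shape)).mean())

print(ray.get([augment_chunk.remote(i * 5_000, 5_000) for i in range(4)]))
ray.shutdown()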

I am getting the same error on some tasks; it is a Stable Baselines3 DQN RL experiment with several models.


# Imports needed to run this snippet (ARDQN and DQNCallback are custom classes in my project):
import ray
from os import path
from stable_baselines3 import DQN


@ray.remote
def train_model(aspiration, make_env, log_path, learning_steps, tb_logger):
    env = make_env()
    if aspiration is None:
        model = DQN("MlpPolicy", env, learning_starts=1000)
        model.set_logger(tb_logger("DQN"))
        model.learn(learning_steps, callback=[DQNCallback()])
        model.save(path.join(log_path, "DQN", "models", str(learning_steps)))
    else:
        # ARDQN is a custom version of DQN but it's almost the same
        model = ARDQN("MlpPolicy", env, aspiration)
        model.set_logger(tb_logger(path.join("ARDQN", str(round(aspiration, 2)))))
        model.learn(learning_steps)
        model.save(path.join(log_path, "ARDQN", str(round(aspiration, 2)), "models", str(learning_steps)))

ray_models = [train_model.remote(a, make_env, log_path, LEARNING_STEPS, tb_logger) for a in aspirations] + (
    [train_model.remote(None, make_env, log_path, LEARNING_STEPS, tb_logger)] if USE_DQN else []
)
try:
    ray.get(ray_models)
finally:
    ray.shutdown()
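
(Not part of the original snippet, and only a hedged suggestion given that the earlier replies point at the machine becoming overloaded: declaring per-task CPU/GPU requirements on the remote function makes Ray limit how many trainings run concurrently on one node instead of launching them all at once. Tiny standalone sketch with placeholder resource numbers:)

import ray

ray.init()

# Placeholder resource numbers: with explicit reservations, Ray only schedules as
# many concurrent tasks as the declared resources allow, instead of all models at once.
@ray.remote(num_cpus=2, num_gpus=0)
def train_one(model_idx: int) -> int:
    return model_idx  # stand-in for model.learn(...) / model.save(...)

print(ray.get([train_one.remote(i) for i in range(4)]))
ray.shutdown()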

Ray version : 2.5.1
OS: Ubuntu 22.04
GCS log: Ray bug - Pastebin.com

What is the solution here? I was running hyperparameter sweeps on GPU and got this error, but the same sweeps run perfectly fine on CPU only.

[2023-11-28 10:54:57,687 E 1643667 1646813] gcs_rpc_client.h:552: Failed to connect to GCS within 60 seconds. GCS may have been killed. It’s either GCS is terminated by ray stop or is killed unexpectedly. If it is killed unexpectedly, see the log file gcs_server.out. Configuring Logging — Ray 3.0.0.dev0. The program will terminate.

Is there a solution to this issue?

In my case, in a Django environment, some basic parallel tasks are done using Ray, and about 2-3 minutes after the tasks are completed, Django crashes with the same issue…

[2024-03-21 07:18:02,128 E 8478 9840] gcs_rpc_client.h:552: Failed to connect to GCS within 60 seconds. GCS may have been killed. It’s either GCS is terminated by ray stop or is killed unexpectedly. If it is killed unexpectedly, see the log file gcs_server.out. Configuring Logging — Ray 3.0.0.dev0. The program will terminate.

Here are the library versions…

django 4.1 pypi_0 pypi
django-allauth 0.50.0 pypi_0 pypi
django-cors-headers 4.3.1 pypi_0 pypi
django-sslserver-v2 1.0 pypi_0 pypi
djangorestframework 3.15.0 pypi_0 pypi
grpcio 1.62.1 pypi_0 pypi
grpcio-status 1.62.1 pypi_0 pypi
ray 2.9.3 pypi_0 pypi
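
(For the Django case specifically, a hedged sketch of keeping the Ray lifecycle explicit in a long-lived server process: initialize when a batch of work starts and shut down when it finishes, so the web process is not left holding a connection to a GCS that later goes away. All names below are placeholders:)

import ray

def run_parallel_tasks(items):
    # Initialize lazily inside the view/management command that needs Ray.
    if not ray.is_initialized():
        ray.init()  # or ray.init(address="auto") to join an already-running cluster

    @ray.remote
    def task(x):
        return x * x

    try:
        return ray.get([task.remote(i) for i in items])
    finally:
        # Drop the driver's connection once the batch is complete, so the process
        # is not left attached to a GCS that may later go away.
        ray.shutdown()

print(run_parallel_tasks(range(8)))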