Hi, I’ve been trying to run mnist_pytorch_trainable.py with example_full.yaml on GCP, with two minor changes: I set resources_per_trial to 2 CPUs (since the YAML file does not define any nodes with 3 CPUs to begin with), and I set resources: {'CPU': 0} on the head node to force training on the workers only.
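For reference, the head-node change looks roughly like this. This is only a sketch assuming the available_node_types layout of the GCP example configs; the node-type name is illustrative and may differ in your file, and in older YAML layouts the equivalent is passing --resources to ray start in head_start_ray_commands:

# Relevant excerpt of the cluster YAML (rest of the file unchanged):
available_node_types:
  ray_head_default:            # illustrative node-type name
    resources: {"CPU": 0}      # head advertises no CPUs, so Tune
                               # schedules all trials on the workers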
Everything goes fine until the workers are instantiated. They are created properly, and I can ssh into them, but the dashboard log shows them stuck in the setting_up stage indefinitely. As a result, no training ever begins.
To rule out the script itself, I also tried running my own code as well as the very simple simple_script.py below, but could not get past this point either.
# simple_script.py
import argparse
import time

import ray
from ray import tune


def train_fn(config):
    # Stand-in training loop: sleep, then report a dummy metric to Tune.
    time.sleep(60)
    tune.report(mean_loss=1)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--ray-address", type=str)
    args = parser.parse_args()

    # Connect to the existing cluster instead of starting a local one.
    ray.init(address=args.ray_address)

    tune_config = {"batch_size": tune.grid_search([16, 32, 64])}
    analysis = tune.run(train_fn, config=tune_config, resources_per_trial={"cpu": 1})
    print("Best config is:", analysis.best_config)
Any help would be much appreciated.