Workers not initializing (it seems...)

Hi, I’ve been trying to run mnist_pytorch_trainable.py with example_full.yaml on GCP, with two minor changes: I set resources_per_trial to 2 CPUs (since the YAML file does not define any node type with 3 CPUs to begin with), and I set resources: {'CPU': 0} on the head node to force training to run on the workers only.
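For reference, the head-node change looks roughly like the excerpt below. The node type names, machineType, and worker counts are taken from the stock GCP example_full.yaml and may differ slightly from my exact file; the worker entry is shown only for context.

# example_full.yaml (relevant excerpt, names approximate)
available_node_types:
    ray_head_default:
        # Head node advertises zero CPUs so Tune cannot place trials on it
        resources: {"CPU": 0}
        node_config:
            machineType: n1-standard-2
    ray_worker_small:
        min_workers: 2
        max_workers: 2
        # Each worker offers 2 CPUs, matching the 2-CPU resources_per_trial
        resources: {"CPU": 2}
        node_config:
            machineType: n1-standard-2
head_node_type: ray_head_default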

Everything goes fine until the workers are instantiated. They are created properly and I can SSH into them, but the dashboard log shows them stuck in the setting_up stage indefinitely, so no training ever begins.

I also tried running my own code, as well as the very simple simple_script.py below, just to make sure the issue was not with the script itself, but could not get any further.

# simple_script.py
import time
import ray
from ray import tune

def train_fn(config):
    # Simulate one training step, then report a dummy metric back to Tune
    time.sleep(60)
    tune.report(mean_loss=1)
    
if __name__ == '__main__':
    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument("--ray-address", type=str)
    args = parser.parse_args()

    # Connect to the existing cluster instead of starting a new local Ray instance
    ray.init(address=args.ray_address)
    tune_config = {'batch_size': tune.grid_search([16, 32, 64])}
    analysis = tune.run(train_fn, config=tune_config, resources_per_trial={"cpu": 1})
    print("Best config is:", analysis.get_best_config(metric="mean_loss", mode="min"))

Any help would be much appreciated.