Hi, I’ve been trying to run mnist_pytorch_trainable.py with example_full.yaml on GCP, with two minor changes: I set resources_per_trial to 2 CPUs (since the YAML file does not define any nodes with 3 CPUs to begin with), and I set resources: {'CPU': 0} on the head node to force training on the workers only.
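For reference, the head-node change looks roughly like this. This is only a sketch assuming the available_node_types layout of the GCP example configs; the node-type name is illustrative and may differ in your file, and in older YAML layouts the equivalent is passing --resources to ray start in head_start_ray_commands:

# Relevant excerpt of the cluster YAML (rest of the file unchanged):
available_node_types:
  ray_head_default:            # illustrative node-type name
    resources: {"CPU": 0}      # head advertises no CPUs, so Tune
                               # schedules all trials on the workers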
Everything goes fine until the workers are instantiated. They are created properly, and I can ssh into them, but the dashboard log shows them stuck in the setting_up stage indefinitely. As a result, no training ever begins.
To rule out the script itself, I also tried running my own code as well as the very simple simple_script.py below, but could not get past this point either.
# simple_script.py
import argparse
import time

import ray
from ray import tune


def train_fn(config):
    # Stand-in training loop: sleep, then report a dummy metric to Tune.
    time.sleep(60)
    tune.report(mean_loss=1)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--ray-address", type=str)
    args = parser.parse_args()

    # Connect to the existing cluster instead of starting a local one.
    ray.init(address=args.ray_address)

    tune_config = {"batch_size": tune.grid_search([16, 32, 64])}
    analysis = tune.run(train_fn, config=tune_config, resources_per_trial={"cpu": 1})
    print("Best config is:", analysis.best_config)
Any help would be much appreciated.