Using Ray Tune to optimise a function called with subprocess

How severe does this issue affect your experience of using Ray?

  • High: It blocks me to complete my task.

Hi everyone! I’m trying to run a hyperparameter search for a very complex training script that I have. Since I’m not allowed to modify the training script itself, I’m working around that by calling it with subprocess inside my objective function and reading the performance metrics from our existing reporting framework.

My current simplified script for hyperparameter search looks like this:

import subprocess
import time

from ray import train, tune
from ray.tune import TuneConfig
from ray.tune.schedulers import ASHAScheduler
from ray.tune.search.hyperopt import HyperOptSearch


def objective_function(config):
    args = ["python", "script.py"]

    # Add the hyperparameters as command-line arguments
    for arg_name, arg_value in config.items():
        args.append(f"--{arg_name}")
        args.append(str(arg_value))

    # Run the training script as a child process
    process = subprocess.Popen(
        args, cwd=training_path
    )

    # Poll our reporting framework until the training process exits
    while process.poll() is None:
        metric = read_from_report()
        train.report({"score": metric})
        time.sleep(1)


search_space = {
    "learning-rate": tune.loguniform(1e-5, 1e-1),
    "batch-size": tune.choice([16, 32, 64, 128]),
    "optimizer": tune.choice(['sgd', 'adam']),
}

asha_scheduler = ASHAScheduler(
    time_attr='training_iteration',
    max_t=100,
)

tune_config = TuneConfig(
    max_concurrent_trials=1,
    num_samples=-1,  # keep generating samples until the experiment is stopped
    search_alg=HyperOptSearch(),
    metric="score",  # the key reported via train.report
    mode='max',
    scheduler=asha_scheduler,
)

# Use the objective function as the trainable and reserve resources for each trial
trainable = objective_function

trainable_with_gpu = tune.with_resources(trainable, {"gpu": 1, "cpu": 20})

tuner = tune.Tuner(
    trainable_with_gpu, param_space=search_space, tune_config=tune_config
)
results = tuner.fit()
print(results)

The problem I have has to do with resources and parallelism. My training script is already highly optimized for parallelism and automatically uses all the workers and GPUs available. However, when I run tune.Tuner without the tune.with_resources wrapper, the script doesn’t seem to use my GPU at all. Is this because Ray limits the devices a trial can see (e.g. by setting CUDA_VISIBLE_DEVICES) to the resources it has reserved? If I specify "gpu": 1 it correctly identifies my GPU and uses it, but it then seems to use 0/20 CPUs and runs really slowly (I'm not sure what 0 means in this case!), so I have to manually set "cpu": 20, or whatever number of CPUs I happen to have at the moment.
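To avoid hard-coding the CPU and GPU counts, I’ve been experimenting with something along these lines (just a sketch; I’m assuming the "CPU" and "GPU" keys reported by ray.cluster_resources() cover everything on my single machine):

import ray

ray.init()

# Reserve whatever the cluster reports for the single concurrent trial,
# instead of hard-coding "cpu": 20 and "gpu": 1.
cluster = ray.cluster_resources()
trainable_with_gpu = tune.with_resources(
    trainable,
    {"cpu": int(cluster.get("CPU", 1)), "gpu": int(cluster.get("GPU", 0))},
)

Is that the intended way to hand a trial the whole machine, or is there a cleaner option?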

The other problem I’ve experienced is that my training script seems to hang at random points for no apparent reason. Initially I was using subprocess.Popen with stdout=PIPE and reading the output with .read(), but I realised that the OS pipe buffer can fill up if the script produces too much output, which blocks the child process and looks like a hang, so for now I just let the script print everything to the console.
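As an alternative I’ve been considering (not what I’m doing right now, and the log file name below is just a placeholder), redirecting the child’s output to a file should also keep any pipe buffer from filling up:

# Write the child's output to a log file instead of capturing it with PIPE,
# so a full pipe buffer can never block the training script.
with open("training_stdout.log", "w") as log_file:
    process = subprocess.Popen(
        args,
        cwd=training_path,
        stdout=log_file,
        stderr=subprocess.STDOUT,
    )

Would that be preferable to letting it write straight to the console?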

Is using subprocess the best approach for tuning an existing training script without having to make any changes to it?

Thank you for your help and for the wonderful tool :slight_smile: