Dear Ray community,
I am trying to train a model while tuning hyperparameters such as batch size, layer size, and layer channels, so my trials have varying GPU memory requirements. To speed things up, I plan to start all trials with a logical GPU requirement of 0.4 and monitor the peak memory usage of each run. When reporting to Ray, I also log the memory requirements. I adopted the ResourceChangingScheduler to check whether the memory actually used was lower than what had been allocated.
def resources_allocation_function(tune_controller, trial, result, scheduler):
    """Adjusts the resource allocation for a trial based on its GPU memory utilization.

    Args:
        tune_controller: Trial runner for this Tune run.
        trial: The trial to allocate new resources to.
        result: The latest results of the trial.
        scheduler: The scheduler calling the function.
    """
    if result["training_iteration"] < 1:
        return None

    base_resources = scheduler._base_trial_resources or PlacementGroupFactory(
        [{"CPU": 1, "GPU": 0}]
    )
    current_gpu = base_resources.required_resources.get("GPU", 0)

    peak_gpu_memory = result.get("peak_gpu_mem", 0)
    total_gpu_memory = result.get("total_gpu_mem", 1)
    gpu_utilization = peak_gpu_memory / total_gpu_memory

    if gpu_utilization >= 0.75 * current_gpu:
        print("Not changing GPU resource allocation")
        return None

    new_gpu_allocation = min(round(gpu_utilization * 1.2, 3), 1)
    print(f"Changing GPU allocation from {current_gpu} to {new_gpu_allocation}")
    return PlacementGroupFactory(
        [{"CPU": base_resources.required_resources.get("CPU", 1), "GPU": new_gpu_allocation}]
    )
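For context, the peak_gpu_mem and total_gpu_mem values that this function reads from result are reported from inside my trainable, roughly along these lines (simplified sketch, assuming PyTorch and the Ray 2.x train.report API; the actual training step is omitted):

import torch
from ray import train  # older Ray versions would use tune.report(...) instead


def train_model(config):
    # Simplified sketch of the reporting loop, not the full training code.
    device = torch.device("cuda")
    total_gpu_mem = torch.cuda.get_device_properties(device).total_memory
    torch.cuda.reset_peak_memory_stats(device)

    for epoch in range(config["EPOCHS"]):
        # ... one epoch of training and validation happens here ...
        validation_loss = 0.0  # placeholder for the real validation loss

        # Report the keys that resources_allocation_function reads from `result`.
        train.report({
            "validation/loss": validation_loss,
            "peak_gpu_mem": torch.cuda.max_memory_allocated(device),
            "total_gpu_mem": total_gpu_mem,
        })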
optuna_search = OptunaSearch(
    metric="validation/loss",
    mode="min",
)

asha = ASHAScheduler(
    time_attr="training_iteration",
    max_t=config["EPOCHS"],
    metric="validation/loss",
    mode="min",
    grace_period=config["GRACE_PERIOD"],
    reduction_factor=config["REDUCTION_FACTOR"],
)

scheduler = ResourceChangingScheduler(
    base_scheduler=asha,
    resources_allocation_function=resources_allocation_function,
)

tuner = tune.Tuner(
    tune.with_resources(
        tune.with_parameters(trainable),
        resources={
            "cpu": config["CPUS_PER_TRIAL"],
            "gpu": config["GPUS_PER_TRIAL"],
        },
    ),
    tune_config=tune.TuneConfig(
        max_concurrent_trials=config["MAX_CONCURRENT"],
        search_alg=optuna_search,
        scheduler=scheduler,
        num_samples=config["NUM_EXPERIMENTS"],
        time_budget_s=3600 * 12,  # 12 hours
    ),
    run_config=RunConfig(
        storage_path=config["RAY_RESULTS_DIR"],
        failure_config=FailureConfig(max_failures=2),
        checkpoint_config=CheckpointConfig(
            num_to_keep=2,
            checkpoint_score_attribute="validation/loss",
            checkpoint_score_order="min",
        ),
    ),
    param_space=config,
)
results = tuner.fit()
The reallocation works, but here comes the first problem:
All trials are now being paused (they have to be checkpointed and restarted, as I understood from the docs). I end up with about 90% of my started trials sitting in the paused state until the last new trial has been started; only then do the paused trials continue to run. I would like to prioritize resuming paused trials over starting new ones. Is there an option for this?
Now to my second question:
I have been using the following code to wrap my trainable in order to prevent OOM errors:
if config["DEVICE"] == "cuda":
    def tune_func(*args, **kwargs):
        tune.utils.wait_for_gpu(
            target_util=1 - config["GPUS_PER_TRIAL"]  # utilization threshold to reach to unblock
        )
        train_model(*args, **kwargs)

    trainable = tune_func
else:
    trainable = train_model
Is it possible to change target_util to something like "1 - trial._required_gpu_resources", i.e. based on the GPU share the trial currently has? Otherwise the GPU would not be fully packed with jobs once allocations shrink.
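Conceptually, what I am after is something like the sketch below. I am not sure whether tune.get_trial_resources() is the right (or supported) way to read the current allocation from inside a function trainable, or whether it already reflects the updated allocation at that point, so please treat this only as an illustration of the idea:

from ray import tune


def tune_func(*args, **kwargs):
    # Sketch only: read the GPU share currently assigned to this trial instead
    # of the static config["GPUS_PER_TRIAL"] value.
    trial_resources = tune.get_trial_resources()
    required_gpu = trial_resources.required_resources.get("GPU", config["GPUS_PER_TRIAL"])

    # Block until enough GPU memory is free for this trial's share.
    tune.utils.wait_for_gpu(target_util=1 - required_gpu)

    train_model(*args, **kwargs)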
Best
Johann