Hi, I want to run a benchmark task with `ray.tune`. I implemented a very simple script that runs an algorithm with different hyper-parameters and random seeds. The idea of the script is shown below:
import ray
from ray import tune

SEEDS = [...]

def training_function(config):
    setup_seed(config['seed'])
    return training(config)

if __name__ == '__main__':
    ray.init('auto')
    config = {}
    grid_tune = ...
    for k, v in grid_tune.items():
        config[k] = tune.grid_search(v)
    config['seed'] = tune.grid_search(SEEDS)
    analysis = tune.run(
        training_function,
        name='benchmark',
        config=config,
        queue_trials=True,
        metric='reward',
        mode='max',
        resources_per_trial={
            "cpu": 1,
            "gpu": 0.5,
        },
    )
    upload_result()
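For what it's worth, here is a hypothetical stripped-down trainable I could swap in to see which GPUs each trial actually gets (the function and its names are placeholders, not part of my real script):

```python
import os

def log_trial_gpus(config):
    # Ray restricts each trial to its assigned GPUs by setting
    # CUDA_VISIBLE_DEVICES in the worker process; an empty value would
    # explain a "No CUDA GPUs are available" error inside the trial.
    visible = os.environ.get("CUDA_VISIBLE_DEVICES", "<unset>")
    print(f"seed={config.get('seed')} CUDA_VISIBLE_DEVICES={visible!r}")
    return visible

# Simulate what a GPU-starved trial would see:
os.environ["CUDA_VISIBLE_DEVICES"] = ""
assert log_trial_gpus({"seed": 0}) == ""
```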
In one of my experiments, I am running the algorithm with 16 configurations of hyper-parameters and 3 random seeds, so there are 48 trials in total. I have 3 nodes to run the experiment, with 4 GPUs on each node. Since each trial reserves 0.5 GPU, the 12 GPUs in the cluster can run 24 trials concurrently, so the task needs 2 rounds to complete, with 24 trials per round.
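A quick sanity check of that arithmetic:

```python
# Check the trial/round arithmetic from my setup.
n_configs = 16             # hyper-parameter combinations in the grid
n_seeds = 3                # random seeds per configuration
n_nodes = 3
gpus_per_node = 4
gpu_per_trial = 0.5        # fractional request in resources_per_trial

total_trials = n_configs * n_seeds                         # 16 * 3 = 48
concurrent = int(n_nodes * gpus_per_node / gpu_per_trial)  # 12 / 0.5 = 24
rounds = -(-total_trials // concurrent)                    # ceiling division = 2

print(total_trials, concurrent, rounds)  # 48 24 2
```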
In the first round, everything works smoothly. However, in the second round, the trials raise `RuntimeError: No CUDA GPUs are available`. I have tried sleeping for a short time (30 s) to give `ray` more time to clean up resources, but the error still shows up.
Does anyone know what causes this problem and how to fix it? Thanks in advance.