Tuner crashes Terminal when switching to next Trial

I’ve been trying to run hyperparameter tuning for a Transformer model. However, sometimes when a trial has finished all of its epochs and is switching to the next trial, I get this error four times before my terminal crashes:

10:34:24 kernel: [135194.702527] [drm:nv_drm_master_set [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00004100] Failed to grab modeset ownership

I’ve traced the issue to this thread.

I’m wondering if there is anything I can do from a Ray perspective? I’ve tried setting reuse_actors=False/True, but neither seems to work. I can’t pin the crash to a specific set of hyperparameters, only that it usually occurs within the first 3 or 4 trials. I’ve checked the output immediately before the crash, and everything looks normal.
Here is my code:

    from functools import partial

    import ray
    from ray import tune
    from ray.train import ScalingConfig
    from ray.tune.schedulers import ASHAScheduler
    from ray.tune.search.bohb import TuneBOHB

    cpus_per_trial = 26
    gpus_per_trial = 1
    num_workers = 1
    config = {
        'lr': tune.loguniform(5.0e-06, 5.0e-4),
        'gamma': tune.uniform(0.90, 0.99),
        'layers_nr': tune.randint(1, 6),
        'heads_nr': tune.randint(1, 8),
        'hidden_nr': tune.randint(256, 4096),
        'dropout0': tune.uniform(0.00, 0.5),
        'dropout1': tune.uniform(0.00, 0.3),
        'epochs': args.epochs,
        'batch_size': args.batch_size,
        'scaling_config': ScalingConfig(
            resources_per_worker={'cpu': cpus_per_trial, 'gpu': gpus_per_trial},
            num_workers=num_workers,
        ),
        'num_gpus': gpus_per_trial,
        'num_cpus': cpus_per_trial,
    }
    search_alg = TuneBOHB(metric='loss', mode='min')
    scheduler = ASHAScheduler(metric='loss', mode='min', max_t=args.epochs)
    ray.init(num_cpus=cpus_per_trial, num_gpus=gpus_per_trial, local_mode=True)
    tuner = ray.tune.Tuner(
        tune.with_resources(partial(main, **kwargs), {'cpu': cpus_per_trial, 'gpu': gpus_per_trial}),
        param_space=config,
        tune_config=tune.TuneConfig(search_alg=search_alg, scheduler=scheduler),
    )
    results = tuner.fit()

Here are some relevant specs:

GPU: NVIDIA GeForce RTX 4090
Driver Version: 535.104.05
CUDA Version:  12.2
Kernel Version: 6.2.0-31-generic

In cases like this you should definitely set reuse_actors=False, which forces re-initialization of the training process between trials.

In your case, you should also set local_mode=False (or rather, remove the local_mode setting completely). local_mode starts all of Ray in one process (with threads) and can thus incur the same problem. local_mode should only be used for debugging, and even for that it’s deprecated and the Ray debugger should be used instead.
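As a minimal sketch, the corrected init would look like this (the resource counts are just the ones from your snippet; the point is that local_mode is dropped entirely):

```python
import ray

# Start Ray normally: trials run in separate worker processes instead of
# threads inside a single process. No local_mode argument at all.
ray.init(num_cpus=26, num_gpus=1)
```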


Yep, setting local_mode=False is what fixed it. How much benefit would you expect to come from using reuse_actors=True vs reuse_actors=False?

It depends on how often trials are paused and how long an iteration takes. reuse_actors=True removes the actor scheduling overhead (machine-dependent, but roughly 100-200 ms), so it is impactful when you run a lot of short-running trials in parallel. If it’s more on the order of a few dozen or a few hundred trials with a runtime of >10 seconds per trial, the difference is negligible.
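As a back-of-envelope sketch of that trade-off (assuming a flat ~150 ms per-trial actor-startup cost, the midpoint of the range above; the helper name is made up for illustration):

```python
def actor_overhead_saved(num_trials, overhead_s=0.15):
    """Rough total scheduling time (seconds) avoided by reuse_actors=True,
    assuming a fixed per-trial actor-startup cost."""
    return round(num_trials * overhead_s, 3)

# Thousands of short trials: a couple of minutes saved, worth enabling.
print(actor_overhead_saved(1000))
# A few dozen long-running trials: a few seconds saved, negligible.
print(actor_overhead_saved(50))
```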
