Tuner crashes Terminal when switching to next Trial

I’ve been trying to run hyperparameter tuning for a Transformer model. However, sometimes when a trial has finished all of its epochs and is switching to the next trial, I get this error four times before my terminal crashes:

10:34:24 kernel: [135194.702527] [drm:nv_drm_master_set [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00004100] Failed to grab modeset ownership

I’ve traced the issue to this thread.

I’m wondering if there is anything I can do from a Ray perspective? I’ve tried setting reuse_actors=False/True, but neither seems to work. I can’t pin the crash to a specific set of hyperparameters, only that it usually occurs within the first 3 or 4 trials. I’ve checked the output immediately before the crash, and everything looks normal.
Here is my code:

    from functools import partial

    import ray
    from ray import tune
    from ray.train import ScalingConfig
    from ray.tune.schedulers import ASHAScheduler
    from ray.tune.search.bohb import TuneBOHB

    cpus_per_trial = 26
    gpus_per_trial = 1
    num_workers = 1
    config = {
        'lr': tune.loguniform(5.0e-06, 5.0e-4),
        'gamma': tune.uniform(0.90, 0.99),
        'layers_nr': tune.randint(1, 6),
        'heads_nr': tune.randint(1, 8),
        'hidden_nr': tune.randint(256, 4096),
        'dropout0': tune.uniform(0.00, 0.5),
        'dropout1': tune.uniform(0.00, 0.3),
        'epochs': args.epochs,
        'batch_size': args.batch_size,
        'scaling_config': ScalingConfig(
            resources_per_worker={'cpu': cpus_per_trial, 'gpu': gpus_per_trial},
            num_workers=num_workers,
        ),
        'num_gpus': gpus_per_trial,
        'num_cpus': cpus_per_trial,
    }
    search_alg = TuneBOHB(metric='loss', mode='min')
    scheduler = ASHAScheduler(metric='loss', mode='min', max_t=args.epochs)
    ray.init(num_cpus=cpus_per_trial, num_gpus=gpus_per_trial, local_mode=True)
    tuner = ray.tune.Tuner(
        tune.with_resources(partial(main, **kwargs), {'cpu': cpus_per_trial, 'gpu': gpus_per_trial}),
        param_space=config,
        tune_config=tune.TuneConfig(search_alg=search_alg, scheduler=scheduler),
    )
    results = tuner.fit()

Here are some relevant specs:

GPU: NVIDIA GeForce RTX 4090
Driver Version: 535.104.05
CUDA Version:  12.2
Kernel Version: 6.2.0-31-generic

In cases like this you should definitely set reuse_actors=False, which forces re-initialization of the training process between trials.

In your case, you should also set local_mode=False (or rather, remove the local_mode setting completely). local_mode starts all of Ray in one process (with threads) and can thus incur the same problem. local_mode should only be used for debugging, and even for that it’s deprecated and the Ray debugger should be used instead.
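As a minimal sketch, the corrected init would look like this (the resource counts are just the ones from your snippet; the point is that local_mode is dropped entirely):

```python
import ray

# Start Ray normally: trials run in separate worker processes instead of
# threads inside a single process. No local_mode argument at all.
ray.init(num_cpus=26, num_gpus=1)
```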


Yep, setting local_mode=False is what fixed it. How much benefit would you expect to come from using reuse_actors=True vs reuse_actors=False?

It depends on how often trials are paused and how long an iteration takes. reuse_actors=True removes the actor scheduling overhead (machine-dependent, but roughly 100-200 ms), so it is impactful when you run a lot of short-running trials in parallel. If it’s more on the order of a few dozen or a few hundred trials with a runtime of >10 seconds per trial, the difference is negligible.
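As a back-of-envelope sketch of that trade-off (assuming a flat ~150 ms per-trial actor-startup cost, the midpoint of the range above; the helper name is made up for illustration):

```python
def actor_overhead_saved(num_trials, overhead_s=0.15):
    """Rough total scheduling time (seconds) avoided by reuse_actors=True,
    assuming a fixed per-trial actor-startup cost."""
    return round(num_trials * overhead_s, 3)

# Thousands of short trials: a couple of minutes saved, worth enabling.
print(actor_overhead_saved(1000))
# A few dozen long-running trials: a few seconds saved, negligible.
print(actor_overhead_saved(50))
```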
