Different trials on CPU and GPU separately?

I have access to a machine with 8 CPU cores and 1 GPU, and I am wondering whether there is a way to run different trials on the CPU and GPU separately. I don’t want all the trials sharing the 1 GPU; I want to distribute the workload across the idling CPU cores while having the GPU dedicated to a single trial at a time…

I am quite new to Ray Tune; I think this is a very simple problem, but I wasn’t able to find a solution by searching…

Unfortunately there isn’t great support for this yet :slight_smile: It’s going to be a bit hard to design a good API for that.

As a workaround, you can do something like this:

import filelock
import os

from ray import tune

def gpu_wrapper(...):
    a = filelock.FileLock("/tmp/gpu.lock")
    try:
        # Ensures that only one trial uses the GPU at a time.
        a.acquire(timeout=1)
        result = training_run(...)
    except filelock.Timeout:
        # If the lock could not be acquired, fall back to CPU and disable GPU access.
        os.environ.pop("CUDA_VISIBLE_DEVICES")
        result = training_run(...)
    finally:
        # Release the lock after training is done.
        a.release()
    return result

def training_run(...):
    ...

tune.run(gpu_wrapper, resources_per_trial={"cpu": 1, "gpu": 0.1})

Hi @rliaw, although this looks a bit clumsy, it serves the purpose. Let me try it later, thanks a lot!

Okay, I got it working. Just a small note to add: I think the following line works for TensorFlow but not for PyTorch:

os.environ.pop("CUDA_VISIBLE_DEVICES")

Therefore, in order to make it work, I had to add an input parameter force_cpu to my training_run:

import filelock

def gpu_wrapper(...):
    a = filelock.FileLock("/tmp/gpu.lock")
    try:
        # Ensures that only one trial uses the GPU at a time.
        a.acquire(timeout=1)
        result = training_run(..., force_cpu=False)
    except filelock.Timeout:
        # If the lock could not be acquired, fall back to CPU.
        result = training_run(..., force_cpu=True)
    finally:
        # Release the lock after training is done.
        a.release()
    return result

def training_run(..., force_cpu=False):
    # Pick the device based on the force_cpu flag.
    if not force_cpu:
        device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    else:
        device = torch.device("cpu")
    
    ...

It is now running super fast, thanks again to @rliaw :grin: :+1: :sparkles:

@rliaw Couldn’t you define 2 different experiments with separate configs, one with GPU and one without, and then use tune.run_experiments?
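Rough sketch of what I have in mind (the experiment names and lr values are just illustrative, and it assumes training_run takes a config dict and reads force_cpu from it rather than as a keyword argument):

from ray import tune

experiments = {
    "gpu_experiment": {
        "run": training_run,  # same training function as above, reading config["force_cpu"]
        "resources_per_trial": {"cpu": 1, "gpu": 1},
        "config": {"force_cpu": False, "lr": tune.grid_search([1e-2, 1e-3])},
    },
    "cpu_experiment": {
        "run": training_run,
        "resources_per_trial": {"cpu": 1},
        "config": {"force_cpu": True, "lr": tune.grid_search([1e-2, 1e-3])},
    },
}
tune.run_experiments(experiments)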

@mannyv This way, you still have to define a force_cpu parameter in your training procedure, like what I did above.

If you separate them by experiment, you lose the single Ray Tune search over all trials, since they are treated as separate experiments, I believe.

Once one of them is finished, the core(s) will sit idle waiting for the other experiment to finish; I don’t think this is the best use of resources. Not to mention that you have to decide ahead of time which set goes to the CPU and which set goes to the GPU. It seems to me that too much hand-picking is required.

Yep! Think of it as 2 queues. In the “2 experiments” case, each experiment would be placed in a separate queue, and it’s very easy for one to finish before the other. Calvin has a great explanation above :slight_smile:

@rliaw
Hi Richard, I am trying to open-source my package. Due to company policy, it has to pass a Bandit security check, and the filelock solution above does not pass the test. Do you have any suggestions on how else this could be implemented?

Hi @Calvin, what errors are you running into?

I think generally you could just do something like

tune.run(
    ...,
    resources_per_trial={"cpu": 1, "gpu": 0.01},
    config={
        "use_gpu": tune.grid_search([True, False, False, ...])
    }
)

And only use the GPU if config["use_gpu"] is True. Ray doesn’t enforce any memory limits; the "gpu": 0.01 is just there to make sure that CUDA_VISIBLE_DEVICES is set correctly and that trials are only started on a machine that has a GPU available.
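For example, a minimal sketch of a trainable that respects that flag (the function name and the reported metric are just placeholders):

import torch
from ray import tune

def training_run(config):
    # Only touch the GPU when this trial was assigned use_gpu=True.
    if config["use_gpu"] and torch.cuda.is_available():
        device = torch.device("cuda:0")
    else:
        device = torch.device("cpu")

    # ... build the model, move it to `device`, and train ...
    tune.report(loss=0.0)  # placeholder metric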