Different trials on CPU and GPU separately?

I have access to a machine with 8 CPU cores and 1 GPU, and I am wondering whether there is a way to run different trials on the CPU and GPU separately. I don’t want all the trials sharing the 1 GPU; I want to distribute the workload across the idling CPU cores while having the GPU dedicated to a single trial at a time…

I am quite new to Ray Tune; I think this is a very simple problem, but I wasn’t able to find a solution by searching…

Unfortunately there isn’t great support for this yet :slight_smile: It’s going to be a bit hard to design a good API for that.

As a workaround, you can do something like this:

import filelock
import os

from ray import tune

def gpu_wrapper(...):
    a = filelock.FileLock("/tmp/gpu.lock")
    try:
        # Ensures that only one trial uses the GPU at a time.
        a.acquire(timeout=1)
        result = training_run(...)
    except filelock.Timeout:
        # If the lock could not be acquired, fall back to CPU and disable GPU access.
        os.environ.pop("CUDA_VISIBLE_DEVICES")
        result = training_run(...)
    finally:
        # Release the lock after training is done.
        a.release()
    return result

def training_run(...):
    ...

tune.run(gpu_wrapper, resources_per_trial={"cpu": 1, "gpu": 0.1})

Hi @rliaw, although this looks a bit clumsy, it serves the purpose. Let me try it later, thanks a lot!

Okay, I got it working. Just a small note to add: I think the following line works for TensorFlow but not for PyTorch:

os.environ.pop("CUDA_VISIBLE_DEVICES")

Therefore, in order to make it work, I had to add an input parameter force_cpu to my training_run:

import filelock

def gpu_wrapper(...):
    a = filelock.FileLock("/tmp/gpu.lock")
    try:
        # Ensures that only one trial uses the GPU at a time.
        a.acquire(timeout=1)
        result = training_run(..., force_cpu=False)
    except filelock.Timeout:
        # If the lock could not be acquired, fall back to CPU.
        result = training_run(..., force_cpu=True)
    finally:
        # Release the lock after training is done.
        a.release()
    return result

def training_run(..., force_cpu=False):
    # Pick the device based on the force_cpu flag.
    if not force_cpu:
        device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    else:
        device = torch.device("cpu")
    
    ...

It is now running super fast, thanks again to @rliaw :grin: :+1: :sparkles:

@rliaw Couldn’t you define 2 different experiments with separate configs, one with GPU and one without, and then use tune.run_experiments?
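Rough sketch of what I have in mind (the experiment names and lr values are just illustrative, and it assumes training_run takes a config dict and reads force_cpu from it rather than as a keyword argument):

from ray import tune

experiments = {
    "gpu_experiment": {
        "run": training_run,  # same training function as above, reading config["force_cpu"]
        "resources_per_trial": {"cpu": 1, "gpu": 1},
        "config": {"force_cpu": False, "lr": tune.grid_search([1e-2, 1e-3])},
    },
    "cpu_experiment": {
        "run": training_run,
        "resources_per_trial": {"cpu": 1},
        "config": {"force_cpu": True, "lr": tune.grid_search([1e-2, 1e-3])},
    },
}
tune.run_experiments(experiments)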

@mannyv This way, you still have to define a force_cpu parameter in your training procedure, like what I did above.

If you separate them by experiment, you lose the single Ray Tune search over all trials, since they are treated as separate experiments, I believe.

Once one of them is finished, the core(s) will sit idle waiting for the other experiment to finish; I don’t think this is the best use of resources. Not to mention that you have to decide ahead of time which set goes to the CPU and which set goes to the GPU. It seems to me that too much hand-picking is required.

Yep! Think of it as 2 queues. In the “2 experiments” case, each experiment would be placed in a separate queue, and it’s very easy for one to finish before the other. Calvin has a great explanation above :slight_smile:

@rliaw
Hi Richard, I am trying to open-source my package. Due to company policy, it has to pass a Bandit security check, and the filelock solution above does not pass the test. Do you have any suggestions on how else this could be implemented?

Hi @Calvin, what errors are you running into?

I think generally you could just do something like

tune.run(
    ...,
    resources_per_trial={"cpu": 1, "gpu": 0.01},
    config={
        "use_gpu": tune.grid_search([True, False, False, ...])
    }
)

And only use the GPU if config["use_gpu"] is True. Ray doesn’t enforce any memory limits; the "gpu": 0.01 is just there to make sure that CUDA_VISIBLE_DEVICES is set correctly and that trials are only started on a machine that has a GPU available.
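For example, a minimal sketch of a trainable that respects that flag (the function name and the reported metric are just placeholders):

import torch
from ray import tune

def training_run(config):
    # Only touch the GPU when this trial was assigned use_gpu=True.
    if config["use_gpu"] and torch.cuda.is_available():
        device = torch.device("cuda:0")
    else:
        device = torch.device("cpu")

    # ... build the model, move it to `device`, and train ...
    tune.report(loss=0.0)  # placeholder metric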