How can I get the `gpu_id` assigned to the trial using the `trial_id`?

Hello, Ray!

I’m performing HPO using Tune with concurrent trials. Ray Tune automatically detects GPUs and assigns them to each trial. How can I get the `gpu_id` assigned to a trial using its `trial_id`?

Hey @marload, glad to see you here! Big fan of your work on Hyperopt :slight_smile:

Can you describe what you’re trying to achieve?

Inside each training function, you can retrieve the assigned GPU IDs by calling `ray.get_gpu_ids()`.
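For example, a minimal sketch (the trainable and resource request here are placeholders, not from this thread):

```python
import ray
from ray import tune


def trainable(config):
    # Inside a trial, ray.get_gpu_ids() returns the IDs of the GPUs
    # Ray has reserved for this worker.
    gpu_ids = ray.get_gpu_ids()
    print(f"Trial running on GPU(s): {gpu_ids}")
    # ... training loop ...


tune.run(trainable, resources_per_trial={"gpu": 1})
```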

Hi, Richard! @rliaw

Our company plans to actively utilize the Ray ecosystem.
I am building our on-premise research GPU cluster on Ray. One of the goals I’m trying to achieve: while Tune is running, if an external request arrives specifying a GPU ID, the trial assigned to that GPU should stop immediately, and no trial should run on that GPU until another request re-enables it. Which approach do you think would be good?

FYI, I am also a really huge fan of Ray and Anyscale. :grinning_face_with_smiling_eyes:

Thank you for your kind words!

Hmm, this sounds a bit tricky. Can I ask you a bit more about the business use case for why you want to do this?

As for implementation, I think one possibility is to do this entirely within the trainable. That is, every time a request with a GPU ID arrives on a node, a small server records that notification in a file.

In the trainable function/class that you are using (`tune.run(trainable)`), you would run a separate thread that checks that “notification file”, and pauses execution and releases GPU memory while the file is marked.

Does this make sense? It is a bit hacky, but I think it should meet your requirements. You can use something like https://github.com/Stonesjtu/pytorch_memlab#courtesy to release GPU memory.
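A rough sketch of that pattern, assuming a hypothetical notification-file path and a `num_steps` config key (neither is part of any Ray API):

```python
import os
import threading
import time

from ray import tune

# Hypothetical path your tiny server writes to; the file's presence
# means "pause trials using this node's GPU".
NOTIFICATION_FILE = "/tmp/gpu_disabled"


def trainable(config):
    pause_event = threading.Event()

    def watch_notification_file():
        # Poll the notification file and toggle the pause flag.
        while True:
            if os.path.exists(NOTIFICATION_FILE):
                pause_event.set()
            else:
                pause_event.clear()
            time.sleep(1)

    threading.Thread(target=watch_notification_file, daemon=True).start()

    for step in range(config["num_steps"]):
        while pause_event.is_set():
            # Release GPU memory here first, e.g. torch.cuda.empty_cache()
            # or pytorch_memlab's courtesy context, then wait.
            time.sleep(1)
        # ... one training step ...
        tune.report(step=step)
```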

@rliaw
As you suggested, I solved this problem by creating a tiny server.

I’m working on gradually migrating the company’s existing ML research process to the Ray ecosystem, and this issue was the first task. :grinning_face_with_smiling_eyes:

It’s not an issue-related question, but where should I ask questions about Anyscale’s solutions?

Awesome :slight_smile: I’ll reach out via Slack DM to chat about Anyscale!