Accessing used resources per trial

ascillitoe · July 27, 2022, 5:30pm

Hi all,

For the models we are trying to tune, an important metric is their resource requirements (i.e. training time and memory usage). I’m familiar with the resources_per_trial kwarg to set available resources per trial, but am interested to know if we can get information on the resources that are actually used.

I’ve noticed Ray Tune prints out Memory usage on this node and Resources requested to the terminal. Is there a way to get similar information on a per-trial basis? I’ve thought about calling a memory profiler such as memray inside of the ray.tune experiment, but this seems like overkill since we are only interested in the maximum memory usage per trial, and in any case I haven’t found a memory profiler that plays nicely when running trials concurrently. Has anyone got any experience with doing something similar?

Thanks in advance!
Ashley

kai · July 28, 2022, 9:03am

Hi @ascillitoe,

There’s some work on this in the Ray observability section of the docs. Does that help? See “Exporting Metrics” (Ray 1.13.0 docs).
If not, please let us know what would make it clearer for you. Also cc @sangcho, who is familiar with the observability work.

ascillitoe · July 28, 2022, 12:15pm

Hi @kai, many thanks for the fast response and the suggestion to look at the “Exporting Metrics” section. I hadn’t noticed this. Are the metrics logged to Prometheus the same as those reported by ray.nodes()? I ask for two reasons:

1. It doesn’t look to me like ray.nodes() gives resources used or consumed on a per-trial basis. From what I can see, the memory and CPU fields reported in Resources are the totals available on the node, rather than the amounts used by each individual trial on that node.
2. We’d ideally like to access the reported resource metrics via the Python API rather than going via Prometheus. We log all our experiments to Aim via a custom callback, and we’d like the maximum memory and CPU usage per trial recorded as metrics in Aim, i.e. reported like any other value passed to tune.report().

Thanks again!
Ashley
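
For illustration, here is a minimal sketch of reporting peak memory and training time as ordinary per-trial metrics, as described in the second point above. It is not a Ray API. It uses only the Python standard library (the Unix-only resource module) and assumes each trial runs in its own worker process, so the process-wide peak RSS approximates the trial's peak. That assumption may not hold if worker processes are reused between trials. The names peak_memory_mb and train_with_resource_metrics are hypothetical, and the report argument is a stand-in for tune.report.

```python
import resource
import sys
import time


def peak_memory_mb() -> float:
    """Return the peak resident set size of the current process in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def train_with_resource_metrics(config, report):
    """Hypothetical trainable; `report` stands in for tune.report."""
    start = time.perf_counter()

    # Placeholder for the real training work.
    data = [0.0] * config["n"]
    loss = sum(data)

    # Resource usage is reported alongside the usual metrics.
    report(
        loss=loss,
        train_time_s=time.perf_counter() - start,
        peak_memory_mb=peak_memory_mb(),
    )


if __name__ == "__main__":
    # Stand-alone run that prints the metrics instead of sending them to Tune.
    train_with_resource_metrics({"n": 1_000_000}, report=lambda **m: print(m))
```

Inside a Tune function trainable, one would presumably pass tune.report as report, for example by wrapping the function as lambda config: train_with_resource_metrics(config, tune.report). A logger callback such as the Aim one mentioned above should then see peak_memory_mb and train_time_s like any other reported value. Note that this measures the peak for the whole worker process, not just the training code.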
