Accessing used resources per trial

Hi all,

For the models we are trying to tune, an important metric is their resource requirements (i.e. training time and memory usage). I’m familiar with the resources_per_trial kwarg to set available resources per trial, but am interested to know if we can get information on the resources that are actually used.

I’ve noticed Ray Tune prints out Memory usage on this node and Resources requested to the terminal. Is there a way to get similar information on a per-trial basis? I’ve thought about calling a memory profiler such as memray inside of the ray.tune experiment, but this seems like overkill since we are only interested in the maximum memory usage per trial, and in any case I haven’t found a memory profiler that plays nicely when running trials concurrently. Has anyone got any experience with doing something similar?

Thanks in advance!
Ashley

Hi @ascillitoe,

there’s some work on this in the Ray observability section of the docs. Does that help? Exporting Metrics — Ray 1.13.0
If not, please let us know what would make it clearer for you. Also cc @sangcho who is familiar with the observability work :slight_smile:

Hi @kai, many thanks for the fast response and the suggestion to look at the “Exporting Metrics” section, which I hadn’t noticed. Are the metrics logged to Prometheus the same as those reported by ray.nodes()? I ask for two reasons:

  1. It doesn’t look to me like ray.nodes() gives the resources used/consumed on a per-trial basis. From what I can see, the memory and CPU fields reported in Resources are the totals available on the node, rather than what each individual trial on that node uses.
  2. We’d ideally like to access the reported resource metrics via the Python API, rather than going via Prometheus. We log all our experiments to Aim via a custom callback, and we’d like the maximum memory and CPU usage per trial to be recorded as metrics in Aim, i.e. the resource metrics would be reported like any other parameter in tune.report().

Thanks again!
Ashley
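
For the "maximum memory per trial, reported like any other metric" part, one option that avoids a profiler is to read the process's peak resident set size from inside the trainable and pass it to the reporter. The following is only a minimal sketch under some assumptions: it uses the standard-library resource module (Unix only), the names peak_memory_mb and train_fn and the config keys steps and chunk_bytes are hypothetical, and it assumes each trial runs in its own worker process that is not reused across trials (otherwise the high-water mark would carry over from an earlier trial). Inside a Tune function trainable, the report argument would be tune.report.

```python
import resource
import sys


def peak_memory_mb() -> float:
    """Return the peak resident set size of the current process in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux but in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def train_fn(config, report):
    """Hypothetical trainable: does some work and reports peak memory each step.

    `report` is any callable accepting keyword metrics; with Ray Tune this
    would be `tune.report` (assumption: one trial per worker process).
    """
    buffers = []
    for step in range(config["steps"]):
        # Stand-in for real training work that allocates memory.
        buffers.append(bytearray(config["chunk_bytes"]))
        report(step=step, peak_memory_mb=peak_memory_mb())


if __name__ == "__main__":
    # Stand-alone usage: collect the reported metrics in a list instead of Tune.
    records = []
    train_fn({"steps": 3, "chunk_bytes": 1024 * 1024},
             lambda **metrics: records.append(metrics))
    for record in records:
        print(record)
```

Because the value is a per-process high-water mark, the last reported peak_memory_mb of a trial is its maximum, so a logging callback can treat it like any other reported metric. It covers only the trial's own process, not child processes or GPU memory.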