How to programmatically do real-time monitoring of actor/task resource usage (heap memory/obj store memory/cpu)?

  • High: It blocks me from completing my task.

I want to be able to see how much memory/cpu a given actor/task is currently using (and possibly log this data or make application/scheduling decisions based on it). I would also like to programmatically track shared obj store usage. Is there a python API for this?

cc: @sangcho @rickyyx @ericl

btw, can someone move this post under “Monitoring and Debugging” category?

@dirtyValera
Unfortunately, we currently don’t support cpu/memory usage per actor/task, but this is something we are looking into. One of the blockers is the cardinality of such data, given that the number of tasks/actors in Ray can be rather large.

I would also like to programmatically track shared obj store usage. Is there a python API for this?

AFAIK, you could probably do the below (kind of hacky unfortunately):

  • For cluster-level resource usage, you could probably parse the obj store usage from the autoscaler’s status. See the example of querying usage from ray status here
  • Or, if you have prometheus set up, you could also scrape the ray_object_store_memory metric programmatically
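To make the Prometheus option a bit more concrete, here is a minimal sketch that reads a metrics endpoint in the Prometheus text exposition format and sums the samples of ray_object_store_memory. The helper names (`sum_metric`, `fetch_metric`) and the URL in the usage comment are hypothetical placeholders, not part of Ray; use whatever host and port your node actually exports metrics on. Summing across all label sets is also an assumption, so you may want to filter by label instead.

```python
import urllib.request


def sum_metric(exposition_text: str, metric_name: str) -> float:
    """Sum every sample of `metric_name` in Prometheus text-format output."""
    total = 0.0
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # A sample line looks like: name{label="v",...} value [timestamp]
        name_end = len(line)
        for i, ch in enumerate(line):
            if ch in "{ ":
                name_end = i
                break
        if line[:name_end] != metric_name:
            continue
        # The value is the first token after the (optional) label block.
        rest = line[line.rfind("}") + 1:] if "{" in line else line[name_end:]
        total += float(rest.split()[0])
    return total


def fetch_metric(url: str,
                 metric_name: str = "ray_object_store_memory",
                 timeout: float = 5.0) -> float:
    """Fetch a metrics endpoint and return the summed value of one metric."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return sum_metric(resp.read().decode("utf-8"), metric_name)


# Hypothetical usage (the address is a placeholder, not a Ray default):
# used = fetch_metric("http://127.0.0.1:8080/metrics")
```

Calling `fetch_metric` in a loop (or from a small monitoring actor) would give you a value you can log or act on. If you already run a Prometheus server, querying its HTTP API instead of the raw endpoint is probably the more robust route.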

If you could share a bit more about your use case, that would be great. We are actively working on resource observability for the coming releases, so knowing the use cases would help us prioritize.


Also, this is the documentation on how to set up prometheus metrics: Metrics — Ray 3.0.0.dev0.

We recommend using Ray 2.1+ for this feature.