The Ray dashboard shows information about many aspects of your cluster while it's running, including actors, nodes, logs, and memory use. One thing it currently lacks is information about Ray tasks. Today you can see which tasks are executing right now on a given worker, but there is no aggregate information about which tasks have run, task throughput, task lineage, etc.
We want to address this by adding more information about tasks to the dashboard. We're planning to track and expose the following metrics on a per-task-name basis:
- Number executed (counter) - The number of tasks with a given task name that have been executed
- Number currently executing (gauge) - The number of tasks currently executing in the cluster
- Task queue time (histogram) - The amount of time a task spends between a task spec being submitted to a local raylet for scheduling and a worker being leased for the task
- Task execution time (histogram) - The amount of time a task spends between a worker being leased for a given task spec and the execution of the task completing
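To make the four metrics above concrete, here is a rough sketch in plain Python of how they could be aggregated per task name. This is not how Ray implements it: `TaskStats`, its method names, and the bucket boundaries are all hypothetical, and I'm assuming a task counts as "executing" from the moment a worker is leased for it.

```python
import bisect
from collections import defaultdict

# Hypothetical histogram bucket upper bounds, in seconds (not taken from Ray).
BUCKETS = [0.01, 0.1, 1.0, 10.0]


class TaskStats:
    """Per-task-name aggregates: one counter, one gauge, two histograms."""

    def __init__(self):
        self.executed = defaultdict(int)   # counter: only ever goes up
        self.executing = defaultdict(int)  # gauge: goes up and down
        # One slot per bucket, plus a final overflow slot.
        self.queue_time = defaultdict(lambda: [0] * (len(BUCKETS) + 1))
        self.exec_time = defaultdict(lambda: [0] * (len(BUCKETS) + 1))

    def on_worker_leased(self, name, submitted_at, leased_at):
        # Queue time: task spec submitted to the raylet -> worker leased.
        self.executing[name] += 1
        self._observe(self.queue_time[name], leased_at - submitted_at)

    def on_finished(self, name, leased_at, finished_at):
        # Execution time: worker leased -> execution completed.
        self.executing[name] -= 1
        self.executed[name] += 1
        self._observe(self.exec_time[name], finished_at - leased_at)

    @staticmethod
    def _observe(hist, value):
        # Index of the first bucket whose upper bound is >= value.
        hist[bisect.bisect_left(BUCKETS, value)] += 1
```

Keying everything by task name keeps the table small even when a cluster runs millions of fine-grained tasks, at the cost of losing per-invocation detail.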
In addition, we’ve had requests to add task lineage: the ability to see which tasks are generating which other tasks, and thus to see where the bottlenecks in one’s program may lie. Given the number of fine-grained tasks that may execute in a Ray cluster, there would need to be a way to drill down based on, for example, job. It may also be most useful for longer-running tasks.
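As one possible way to keep a lineage view manageable, the sketch below collapses individual tasks into edges between task names and filters by job. The event format (`task_id`, `name`, `job_id`, `parent_id`) and the `lineage_edges` function are assumptions made up for illustration, not an existing Ray API.

```python
from collections import Counter


def lineage_edges(events, job_id):
    """Count (parent task name -> child task name) edges for one job."""
    name_of = {e["task_id"]: e["name"] for e in events}
    edges = Counter()
    for e in events:
        parent = e.get("parent_id")
        # Skip tasks from other jobs and tasks with no known parent.
        if e["job_id"] == job_id and parent in name_of:
            edges[(name_of[parent], e["name"])] += 1
    return edges


# Hypothetical task events; a parent_id of None means no parent task.
events = [
    {"task_id": "t1", "name": "main", "job_id": "j1", "parent_id": None},
    {"task_id": "t2", "name": "load", "job_id": "j1", "parent_id": "t1"},
    {"task_id": "t3", "name": "load", "job_id": "j1", "parent_id": "t1"},
    {"task_id": "t4", "name": "parse", "job_id": "j1", "parent_id": "t2"},
    {"task_id": "t5", "name": "load", "job_id": "j2", "parent_id": None},
]
j1_edges = lineage_edges(events, "j1")
```

Here `j1_edges` holds two `main -> load` edges and one `load -> parse` edge, which is the kind of aggregated graph a dashboard could draw instead of one node per task.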
Ultimately, I’d like to use this post as a place to discuss:
- What other task stats, if any, people would like to see in a tabular view. What questions would you like these stats to answer?
- Feedback on the idea of displaying a task graph. If you find the idea useful, I’d really appreciate it if you could take some time to describe how you’d like to use it and what you would like to accomplish.