Task Information in Ray Dashboard

The Ray dashboard contains information about various aspects of your cluster as it's running, including information about actors, nodes, logs, memory use, etc. However, one thing it currently lacks is information about Ray tasks. Currently, you can see tasks that are executing right now on a given worker, but there is no aggregate information about what tasks have run, task throughput, task lineage, etc.

We want to address this by adding more information about tasks to the dashboard. There are some metrics that we’re planning to track and expose on a per-task-name basis:

  1. Number executed (counter) - The number of tasks with a given name that have been executed
  2. Number currently executing (gauge) - The number of tasks currently executing in the cluster
  3. Task queue time (histogram) - The amount of time a task spends between a task spec being submitted to a local raylet for scheduling and a worker being leased for the task
  4. Task execution time (histogram) - The amount of time a task spends between a worker being leased for a given task spec and the execution of the task completing
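To make the four metrics concrete, here is a minimal plain-Python sketch of how they could be tracked per task name. This is not the dashboard's implementation: `TaskStats`, `TaskMetrics`, and the three `on_*` event hooks are hypothetical names, and the histograms are kept as raw sample lists rather than bucketed.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class TaskStats:
    executed: int = 0                                     # counter
    executing: int = 0                                    # gauge
    queue_times: list = field(default_factory=list)      # histogram samples (seconds)
    execution_times: list = field(default_factory=list)  # histogram samples (seconds)


class TaskMetrics:
    """Hypothetical per-task-name aggregator driven by three lifecycle events."""

    def __init__(self):
        self.stats = defaultdict(TaskStats)
        self._in_flight = {}  # task_id -> [name, submitted_at, leased_at]

    def on_submitted(self, task_id, name, now):
        # Task spec handed to the local raylet for scheduling.
        self._in_flight[task_id] = [name, now, None]

    def on_worker_leased(self, task_id, now):
        # Queue time ends and execution time starts when a worker is leased.
        entry = self._in_flight[task_id]
        entry[2] = now
        stats = self.stats[entry[0]]
        stats.queue_times.append(now - entry[1])
        stats.executing += 1

    def on_finished(self, task_id, now):
        name, _, leased_at = self._in_flight.pop(task_id)
        stats = self.stats[name]
        stats.execution_times.append(now - leased_at)
        stats.executing -= 1
        stats.executed += 1


metrics = TaskMetrics()
metrics.on_submitted("t1", "FuncA", now=0.0)
metrics.on_worker_leased("t1", now=1.5)
metrics.on_finished("t1", now=4.0)
print(metrics.stats["FuncA"])
```

In a real deployment the sample lists would presumably be replaced by bucketed histograms so that memory stays bounded.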

In addition, we’ve had requests to add task lineage: the ability to see which tasks are generating which other tasks, and thus to see where bottlenecks in one’s program may lie. Given the number of fine-grained tasks that may execute in a Ray cluster, there would need to be a way to drill down based on, for example, job. It also may be most useful for longer-running tasks.
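As a rough illustration of the lineage idea, the sketch below builds a parent-to-children map from task records and renders it as an indented tree, filtered by job. The record layout `(task_id, parent_task_id, job_id, name)` and the helper names are assumptions made for this example, not an existing Ray API.

```python
from collections import defaultdict


def build_lineage(records, job_id):
    """Group task ids under their parent task id, keeping only one job."""
    children = defaultdict(list)
    names = {}
    for task_id, parent_id, job, name in records:
        if job != job_id:
            continue  # drill down to a single job
        names[task_id] = name
        children[parent_id].append(task_id)
    return children, names


def render(children, names, parent=None, depth=0):
    """Return the lineage tree as indented lines, starting from root tasks."""
    lines = []
    for task_id in children.get(parent, []):
        lines.append("  " * depth + names[task_id])
        lines.extend(render(children, names, task_id, depth + 1))
    return lines


# Hypothetical records: (task_id, parent_task_id, job_id, name)
records = [
    ("t1", None, "job1", "driver_task"),
    ("t2", "t1", "job1", "FuncA"),
    ("t3", "t1", "job1", "FuncB"),
    ("t4", None, "job2", "Other"),
]
children, names = build_lineage(records, "job1")
tree = render(children, names)
print("\n".join(tree))
```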

Ultimately, I’d like to use this post as a place to discuss:

  1. What other task stats, if any, people would like to see in a tabular view. What questions would you like these stats to answer?
  2. Feedback on the idea of displaying a task graph. If you find the idea useful, I’d really appreciate it if you could take some time to describe how you’d like to use it and what you would like to accomplish.

For 3, it is probably better to call it task scheduling time (since a task can be queued on many nodes, multiple times)?

Also, we currently allow users to name tasks (thanks Clark for the contribution :slight_smile: ). So we should display both function names and task names.

Other useful info could be something like pid, worker id, job id, access to worker logs, etc.


I think a really good one would be understanding the current backlog: in addition to the tasks that are running, which tasks aren’t being scheduled right now, and what are the reasons they aren’t running? Perhaps it’d be useful to display this at a high level like:

Running:
- FuncA: 100 instances
- FuncB: 20 instances
Waiting:
- No available CPU resource:
  - FuncC: 10 instances
- Blocked waiting for FuncA:
  - FuncB: 80 instances

I’m trying to think: “what display most quickly captures an understanding of the current task mechanics?”
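A summary like the one above could be produced by grouping tasks by state and, for waiting tasks, by reason. The sketch below assumes each task is available as a `(name, state, reason)` tuple; that tuple shape, the state strings, and the `summarize` helper are hypothetical, chosen only to illustrate the grouping.

```python
from collections import Counter, defaultdict


def summarize(tasks):
    """Build the Running/Waiting summary lines from (name, state, reason) tuples."""
    running = Counter()
    waiting = defaultdict(Counter)  # reason -> task name -> count
    for name, state, reason in tasks:
        if state == "RUNNING":
            running[name] += 1
        else:
            waiting[reason][name] += 1

    lines = ["Running:"]
    lines += [f"- {name}: {count} instances" for name, count in running.items()]
    lines.append("Waiting:")
    for reason, counts in waiting.items():
        lines.append(f"- {reason}:")
        lines += [f"  - {name}: {count} instances" for name, count in counts.items()]
    return lines


# Hypothetical snapshot of cluster tasks.
tasks = [
    ("FuncA", "RUNNING", None),
    ("FuncA", "RUNNING", None),
    ("FuncC", "WAITING", "No available CPU resource"),
    ("FuncB", "WAITING", "Blocked waiting for FuncA"),
]
summary = summarize(tasks)
print("\n".join(summary))
```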


Not sure if this is something you are already planning, but it would be really nice to have logs related to tasks.
One issue I had with remote functions is that once they finish running, it is hard to find the logs they produced. If it were possible to see all logs related to one task, that would be awesome.


It’s helpful to know where tasks are executing and how many tasks remain in the queue.
I recently found that some tasks are not scheduled if they need more than 30 minutes, but it’s hard to know why.


@crystalww +1 for this! I’m assuming the tasks’ info is somewhere in the cluster’s Redis instance, but I can’t figure out where.