RAM utilization logs for multi-node Tune

How severely does this issue affect your experience of using Ray?

  • None: Just asking a question out of curiosity
  • Low: It annoys or frustrates me for a moment.
  • Medium: It causes significant difficulty in completing my task, but I can work around it.
  • High: It blocks me from completing my task.

#######################

What does ray/tune/perf/ram_util_percent in TensorBoard represent for a multi-node Tune session?
Is that the head node's memory? If so, is there a way to get information about the memory on the other nodes?

One of the nodes crashed during training and I’m trying to figure out why. I suspect an OOM situation, but the Slurm node just rebooted unexpectedly, so it’s hard to tell.

Hi @vakker00,

ram_util_percent is the memory utilization on the node where the trainable is scheduled. So if you have single-worker trainables (e.g. multiple parallel trials, where each trial only uses one node), this will be the machine utilization of the respective nodes. Note that it reports the full node utilization: if, for example, 2 trials run on the same node and each occupies 300 MB of RAM, both will report a usage of 600 MB.
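
If you want a per-node memory signal in the same TensorBoard plots, one option is to report it yourself from inside the trainable. This is a minimal sketch, assuming psutil is installed on the nodes; the metric name node_ram_percent and the dummy training loop are only illustrative:

```python
import psutil
from ray import tune


def train_fn(config):
    for step in range(100):
        loss = 1.0 / (step + 1)  # stand-in for real training work
        # Report this node's overall RAM usage next to the training metric,
        # so it shows up as a per-trial curve in TensorBoard.
        tune.report(
            loss=loss,
            node_ram_percent=psutil.virtual_memory().percent,
        )


tune.run(train_fn, num_samples=4)
```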

Also, if your trainables start distributed workers (e.g. with RLlib or Ray Train), the memory utilization of the nodes those workers run on will not be captured.
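
For that distributed-worker case, one workaround is to probe every node in the cluster directly. This is a minimal sketch, assuming a Ray version that ships NodeAffinitySchedulingStrategy and that psutil is installed cluster-wide; the probe task and its name are made up for illustration:

```python
import psutil
import ray
from ray.util.scheduling_strategies import NodeAffinitySchedulingStrategy

ray.init(address="auto")


@ray.remote(num_cpus=0)
def node_memory():
    # Return the IP of the node this task landed on and its RAM usage.
    return ray.util.get_node_ip_address(), psutil.virtual_memory().percent


# Pin one probe task to every alive node so no node is skipped.
probes = [
    node_memory.options(
        scheduling_strategy=NodeAffinitySchedulingStrategy(
            node_id=node["NodeID"], soft=False
        )
    ).remote()
    for node in ray.nodes()
    if node["Alive"]
]

for ip, pct in ray.get(probes):
    print(f"{ip}: {pct:.1f}% RAM used")
```

Running something like this periodically (or on a trial callback) gives you a cluster-wide view that the built-in metric doesn't cover.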

Does that help?

Hi @kai

Thanks, that helps!