Nsight enabled in Ray, but no CUDA-related events are collected; the log shows: metric_exporter.cc:105: [1] Export metrics to agent failed: RpcError: RPC Error message: failed to connect to all addresses; last error: UNKNOWN: ipv4:127.0.0.1:57105: Fa

I enabled Nsight in Ray, but the *rep file contains no CUDA-related events, and the log shows this error: “metric_exporter.cc:105: [1] Export metrics to agent failed: RpcError: RPC Error message: failed to connect to all addresses; last error: UNKNOWN: ipv4:127.0.0.1:57105: Failed to connect to remote host: Connection refused; RPC Error details: . This won’t affect Ray, but you can lose metrics from the cluster.” The port in the error, 57105, is not the port I passed when starting the Ray head node with “ray start --block --head --metrics-export-port=62407”. Why is that? The port also changes on every run. How can I set the port used for this connection? Thanks.

How severely does this issue affect your experience of using Ray?

  • None: Just asking a question out of curiosity
  • Low: It annoys or frustrates me for a moment.
  • Medium: It contributes to significant difficulty to complete my task, but I can work around it.
  • High: It blocks me to complete my task.

Can you try not setting the port and using the default one that Ray starts with, to see whether the connection succeeds that way?

Yes, I tried that, and using the default port doesn’t resolve the issue. I’m not sure what the mechanism behind it is. Could you explain more? Thanks!

This error message may be unrelated to your problem of trying to get “CUDA related events”.

Can you explain more about what you mean by CUDA related events?

In the meantime, I’ll briefly explain the metric_exporter.cc class. It gathers system and application metrics and forwards them to a Prometheus export endpoint. This is used to power time-series metrics like these.
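Since the log reports “Connection refused”, a quick way to see whether anything is listening on a given local port is a plain TCP connect. This is a minimal sketch using only the Python standard library; `port_is_open` is a hypothetical helper, and the two port numbers are simply the ones quoted in this thread, so they will differ on another run.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "connection refused", timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # 57105 is the port from the error message; 62407 is the value passed to
    # --metrics-export-port. Both come from this thread and are only examples.
    for port in (57105, 62407):
        if port_is_open("127.0.0.1", port):
            state = "accepting connections"
        else:
            state = "refused or unreachable"
        print(f"127.0.0.1:{port} -> {state}")
```

This only tells you whether a listener exists on the port; it does not say which Ray component owns it.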

Thanks Aguo. I enabled Nsight profiling in Ray for multiple cards (or one card) on one node. With it enabled, the profiling results appear in the /tmp/ray/session_*/logs/{profiler_name} directory, as described in the docs (Profiling — Ray 2.35.0). But when I open the profiling rep file in the Nsight GUI app, I always get the error “No CUDA events collected. Does the process use CUDA?”, even though the workload does use CUDA. If I don’t use Ray as the backend and profile directly with Nsight, the report contains CUDA events. That is why I suspected the problem was on the Ray side and looked at the error above (though from your explanation it seems unrelated :) ). Could you explain why we can’t get CUDA events when using Ray as the backend?
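For reference, the setup described above can be sketched roughly as follows. This is an illustration, not a confirmed fix: it assumes Nsight is enabled through the `nsight` key of `runtime_env` as in the Ray profiling docs, that the dict keys are passed through as Nsight Systems CLI options (`t` selecting the traced APIs), and that `{profiler_name}` resolves to `nsight`. `make_nsight_runtime_env`, `find_reports`, and `GpuActor` are hypothetical names.

```python
import glob
import os


def make_nsight_runtime_env(trace: str = "cuda,cudnn,cublas,nvtx") -> dict:
    """Build a runtime_env that asks Nsight to trace the given APIs.

    Assumption: Ray forwards these keys to the Nsight Systems CLI as options.
    """
    return {"nsight": {"t": trace}}


def find_reports(log_root: str = "/tmp/ray", profiler_name: str = "nsight") -> list:
    """List report files under <log_root>/session_*/logs/<profiler_name>."""
    pattern = os.path.join(log_root, "session_*", "logs", profiler_name, "*rep")
    return sorted(glob.glob(pattern))


if __name__ == "__main__":
    import ray  # imported here so the helpers above work without Ray installed

    ray.init()

    @ray.remote(num_gpus=1, runtime_env=make_nsight_runtime_env())
    class GpuActor:
        def run(self):
            # Placeholder: the real CUDA workload goes here.
            return "done"

    actor = GpuActor.remote()
    print(ray.get(actor.run.remote()))
    ray.shutdown()

    # Print whatever report files exist after the run.
    print(find_reports())
```

Explicitly listing `cuda` in the trace option makes it easy to rule out a configuration that omits CUDA tracing before looking for other causes.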