I enabled Nsight in Ray, but I can't get any CUDA-related events in the *rep file, and the log contains this error: “metric_exporter.cc:105: [1] Export metrics to agent failed: RpcError: RPC Error message: failed to connect to all addresses; last error: UNKNOWN: ipv4:127.0.0.1:57105: Failed to connect to remote host: Connection refused; RPC Error details: . This won’t affect Ray, but you can lose metrics from the cluster.” However, port “57105” is not the port I specified when starting the Ray head node: “ray start --block --head --metrics-export-port=62407”. Why is that? The port also changes every time I run it again. Could you suggest how to set the port used for this connection? Thanks.
How severe does this issue affect your experience of using Ray?
None: Just asking a question out of curiosity
Low: It annoys or frustrates me for a moment.
Medium: It contributes to significant difficulty to complete my task, but I can work around it.
This error message may be unrelated to your problem of trying to get “CUDA related events”.
Can you explain more about what you mean by CUDA related events?
In the meantime, I’ll briefly explain what metric_exporter.cc does. It gathers system and application metrics and forwards them to a Prometheus export endpoint. This is used to power Ray’s time-series metrics.
Thanks, Aguo. I enabled Nsight profiling in Ray for multiple cards (or one card) on a single node. After enabling it, the profiling results appear in the /tmp/ray/session_*/logs/{profiler_name} directory, as described in the docs (Profiling — Ray 2.35.0). But when I open the profiling rep file in the Nsight GUI, I always get the error: “No CUDA events collected. Does the process use CUDA?” The workload does in fact use CUDA. If we don't use Ray as the backend and profile directly with Nsight, the report contains CUDA events. That is why I suspected the issue comes from Ray and noticed the error above (although from your explanation it seems unrelated). Could you explain why we can't get CUDA events when using Ray as the backend?
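For context, here is a minimal sketch of the kind of setup described above. It assumes the `nsight` field of `runtime_env` documented on the Ray profiling page, where the keys are `nsys` flags written without leading dashes. The helper names `nsight_runtime_env`, `run_profiled_task`, and `cuda_work` are hypothetical and only for illustration. It also assumes `torch` is installed and one GPU is available.

```python
def nsight_runtime_env(trace=("cuda", "nvtx")):
    """Build a runtime_env dict requesting Nsight profiling for a Ray worker.

    Assumption: keys under "nsight" are nsys CLI flags without the leading
    dashes, so {"t": "cuda,nvtx"} corresponds to `nsys profile -t cuda,nvtx`.
    """
    return {"nsight": {"t": ",".join(trace)}}


def run_profiled_task():
    """Run one GPU task under Nsight (needs ray, torch, nsys, and a GPU)."""
    import ray

    # Attach the Nsight runtime_env to the task that issues the CUDA calls.
    @ray.remote(num_gpus=1, runtime_env=nsight_runtime_env())
    def cuda_work():
        import torch

        # A trivial CUDA workload so the trace has something to record.
        x = torch.ones(1024, device="cuda")
        return float(x.sum())

    ray.init()
    try:
        return ray.get(cuda_work.remote())
    finally:
        # Shut down so the worker exits and the report can be finalized.
        ray.shutdown()
```

Calling `run_profiled_task()` on a GPU node should leave a report under the session logs directory mentioned above. As I understand it, the `runtime_env` needs to be attached to the task or actor that actually makes the CUDA calls, since that is the worker process that gets profiled.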