Nsight enabled in Ray, but no CUDA-related events are captured; error in log: metric_exporter.cc:105: [1] Export metrics to agent failed: RpcError: RPC Error message: failed to connect to all addresses; last error: UNKNOWN: ipv4:127.0.0.1:57105: Fa

paladin2000_cn · August 28, 2024, 7:45am

I enabled Nsight in Ray, but the *rep file contains no CUDA-related events, and the log shows this error: “metric_exporter.cc:105: [1] Export metrics to agent failed: RpcError: RPC Error message: failed to connect to all addresses; last error: UNKNOWN: ipv4:127.0.0.1:57105: Failed to connect to remote host: Connection refused; RPC Error details: . This won’t affect Ray, but you can lose metrics from the cluster.” The port in the error, 57105, is not the port I passed when starting the Ray head node with “ray start --block --head --metrics-export-port=62407”. Why is that? The port also changes every time I run again. How can I set the port used for this connection? Thanks.

Sam_Chan · August 28, 2024, 11:43pm

Can you try not setting the port, using the default one that Ray starts with, and see whether the connection succeeds that way?

paladin2000_cn · August 29, 2024, 12:33am

Yes, I tried that; using the default port doesn’t resolve the issue. I’m not sure what the mechanism behind this is. Could you explain more? Thanks!

aguo · September 4, 2024, 9:48pm

This error message may be unrelated to your problem of trying to get “CUDA related events”.

Can you explain more about what you mean by CUDA related events?

In the meantime, I’ll briefly explain the class in metric_exporter.cc. It gathers system and application metrics and forwards them to a Prometheus export endpoint. This is used to power time-series metrics like these.
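To narrow down the "Connection refused" part of the log message, one option is to check whether anything is listening on the address it names. The following is a minimal sketch, not part of Ray: the helper name `port_is_open` is hypothetical, and the two ports are simply the ones quoted earlier in this thread (57105 from the error, 62407 from the `ray start` command).

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds.

    Hypothetical helper for illustration; it only tests TCP reachability
    and says nothing about which Ray component owns the port.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "connection refused", timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Ports taken from the log message and the `ray start` command above.
    for port in (57105, 62407):
        state = "open" if port_is_open("127.0.0.1", port) else "closed"
        print(f"127.0.0.1:{port} is {state}")
```

If the port from the error is closed while the one passed to `--metrics-export-port` is open, that would suggest the two ports belong to different components, but this check alone cannot confirm which.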

paladin2000_cn · September 5, 2024, 1:06am

Thanks, Aguo. I enabled Nsight profiling in Ray for multiple cards (or one card) on a single node. After enabling it, the profiling results were written to the /tmp/ray/session_*/logs/{profiler_name} directory, as described in the docs (Profiling — Ray 2.35.0). But when I open the profiling *rep file in the Nsight GUI app, I always get this error: “No CUDA events collected. Does the process use CUDA?” The workload definitely uses CUDA. If we don’t use Ray as the backend and profile directly with Nsight, the report does contain CUDA events. That is why I suspected the issue was on the Ray side and looked at the error above (though from your explanation, it seems unrelated). So could you explain why we can’t get CUDA events when using Ray as the backend?
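For reference, the setup described above can be sketched as follows, assuming the `nsight` field of `runtime_env` described in the Ray profiling docs. The option values, the helper `nsight_runtime_env`, and the function `profile_example` are illustrative assumptions, not the poster's actual code, and explicitly requesting CUDA tracing is not confirmed to resolve the missing events.

```python
from typing import Optional

# Assumed option set for illustration: keys are `nsys profile` flags
# without the leading dashes, as in the Ray profiling docs.
NSIGHT_OPTIONS = {
    "t": "cuda,cudnn,cublas",      # APIs to trace; "cuda" requests CUDA events
    "cuda-memory-usage": "true",   # also track GPU memory usage
}


def nsight_runtime_env(options: Optional[dict] = None) -> dict:
    """Build a runtime_env dict that enables Nsight profiling.

    Hypothetical helper: with no options it falls back to Ray's
    built-in "default" Nsight configuration.
    """
    return {"nsight": dict(options) if options is not None else "default"}


def profile_example() -> float:
    """Run one GPU task under Nsight; requires ray, torch, nsys, and a GPU."""
    import ray
    import torch

    ray.init()

    @ray.remote(num_gpus=1, runtime_env=nsight_runtime_env(NSIGHT_OPTIONS))
    def gpu_sum() -> float:
        # A small CUDA workload so the trace has kernels to record.
        return torch.ones(1024, device="cuda").sum().item()

    return ray.get(gpu_sum.remote())
```

Calling `profile_example()` on a GPU node should leave a report under the logs directory mentioned above; comparing that report with one produced by running `nsys` directly on the same workload is one way to isolate whether Ray is the variable.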
