Ray Monitor Not Connecting to Grafana and Prometheus

I am running Ray 2.3.1 on my Mac Pro. I also have Grafana and Prometheus running on this machine. I have verified that both are working by checking localhost:3000 and localhost:9090, respectively. I launch a local Ray cluster like so

export RAY_GRAFANA_HOST=http://127.0.0.1:3000
export RAY_PROMETHEUS_HOST=http://127.0.0.1:9090
ray start --head

Ray starts. The Ray monitor at 127.0.0.1:8265 shows broken cluster monitoring windows. The screen looks like this

If I hover over one of the windows I see the message “127.0.0.1 refused to connect”.

The Ray cluster itself works correctly, as does the Recent jobs tab of the monitor.

I have tried adding export RAY_GRAFANA_IFRAME_HOST=http://127.0.0.1:3000, as well as not setting any of these environment variables, and see the same result.

I watched the web traffic with Chrome developer tools while refreshing the Ray monitor web page. The following things looked wrong:

  • Two calls to roboto-latin.500 on the Ray monitor port failed with the message “Failed to load response data. No data found for resource for given identifier” in the Response tab.
  • Two calls to default-dashboard?... on the Grafana port showed the message “Failed to load response data: No content available because this request was redirected” in the Response tab.
  • Two calls to login on the Grafana port showed the message “Failed to load response data. No resource with the given identifier found” in the Response tab.

How do I get Grafana and Prometheus to integrate with Ray?


@rickyyx @sangcho Do we have specific instructions on how to install Grafana and Prometheus on localhost, and how the Ray dashboard can discover their configs?

Does @wpm have to use the dashboard command: ray dashboard [-p <port, 8265 by default>] <cluster config file>

To the best of my knowledge I followed the documentation instructions you linked to correctly.

I’ll try running ray dashboard <cluster config file>, but I don’t know where my cluster config file is. I’m just having Ray create a local cluster by default.

ray dashboard is not needed for a local Ray cluster. Hmm. I just tried to set these up on my MacBook Pro, and it worked fine.

  • Grafana 9.4.7
  • Prometheus 2.43.0
  • Grafana started with brew services start grafana
  • Prometheus started with docker run -p 9090:9090 prom/prometheus
  • I don’t think I have any dashboards in Grafana. I poked around the UI and didn’t see anything.

I have a dashboards page that looks like this

According to: Metrics — Ray 2.3.1


You need to start Prometheus and Grafana with the config files provided by Ray so that:

  • Prometheus can scrape the metrics from the Ray cluster properly
  • Grafana can talk to Prometheus and visualize the metrics with the template dashboard provided by Ray

Can you give it a try?

That worked. The Cluster Utilization and Node Count windows now display data.

For the reference of anybody else who hits this, here is exactly how I made this work on my Mac.

  1. brew install grafana

  2. brew install prometheus

  3. Change the --config-file line in /usr/local/etc/prometheus.args to read --config.file /tmp/ray/session_latest/metrics/prometheus/prometheus.yml.

  4. Uncomment the appropriate lines in /usr/local/etc/grafana/grafana.ini so that it matches the contents of /tmp/ray/session_latest/metrics/grafana/grafana.ini.

  5. brew services start grafana

  6. brew services start prometheus

  7. ray start --head

Thanks for your help.


Glad that it worked out!

@aguo We probably should add some guides for homebrew-based workflows ^. Added it to our backlog.

Command line mode could work like this.

./prometheus --config.file=/tmp/ray/session_latest/metrics/prometheus/prometheus.yml

grafana-server --config /tmp/ray/session_latest/metrics/grafana/grafana.ini web

My first question: on a worker node, the embedded metrics view in the dashboard cannot display charts, but on the head node it can. How do I set things up so the worker node displays them correctly?

My second question: when I use docker-compose to bring up the Grafana and Prometheus containers, the embedded metrics part of the Ray dashboard cannot find any charts. The docker-compose.yml file I used is listed below. It seems like the network isn't set up right.

version: '3'

networks:
    ray_dashboard:
        driver: bridge

services:
    prometheus:
        image: prom/prometheus
        container_name: prometheus
        hostname: prometheus
        restart: always
        volumes:
            - /tmp/ray/session_latest/metrics/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
        ports:
            - "9090:9090"
        networks:
            - ray_dashboard
        # network_mode: "host"

    grafana:
        image: grafana/grafana
        container_name: grafana
        hostname: grafana
        restart: always
        # environment:
          # GF_PATHS_CONFIG: /tmp/ray/session_latest/metrics/grafana/grafana.ini
        volumes:
            - /tmp/ray/session_latest/metrics/grafana/grafana.ini:/etc/grafana/grafana.ini
        ports:
            - "3000:3000"
        networks:
            - ray_dashboard

The official user guide doesn't give any information about this. Docker Compose is more convenient than downloading Grafana and Prometheus separately.
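One likely gap in the Compose file above (an assumption on my part, not something verified in this thread): Ray's generated prometheus.yml points Prometheus at /tmp/ray/prom_metrics_service_discovery.json, and the generated grafana.ini points Grafana at a provisioning directory under /tmp/ray/session_latest, but neither of those paths is mounted into the containers, so the containers cannot see them. Mounting the whole /tmp/ray directory into both containers at the same path would cover this. A sketch of the relevant volumes sections:

```yaml
services:
    prometheus:
        image: prom/prometheus
        volumes:
            # Mount the whole Ray session directory so the file-based
            # service-discovery path referenced by prometheus.yml resolves
            # inside the container.
            - /tmp/ray:/tmp/ray
            - /tmp/ray/session_latest/metrics/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml

    grafana:
        image: grafana/grafana
        volumes:
            # Same idea: make the provisioning directory referenced by
            # grafana.ini visible inside the container.
            - /tmp/ray:/tmp/ray
            - /tmp/ray/session_latest/metrics/grafana/grafana.ini:/etc/grafana/grafana.ini
```

Note that /tmp/ray/session_latest is a symlink Ray maintains on the host, so mounting the parent /tmp/ray directory (rather than the symlinked path alone) keeps it resolvable from inside the container.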

My first question: on a worker node, the embedded metrics view in the dashboard cannot display charts, but on the head node it can. How do I set things up so the worker node displays them correctly?

I’m not sure I understand your questions. Can you elaborate? All the env variables need to be set on the head node, and the dashboard process runs on the head node.

My second question: when I use docker-compose to bring up the Grafana and Prometheus containers.

Check out the setup guide and the requirements here: Configuring and Managing Ray Dashboard — Ray 2.8.0. We cannot cover every way to install/run Grafana/Prometheus, but as long as the setup meets the requirements listed in the documentation, it should work. Let us know if you still run into issues.

On the head node, started so that child nodes can access the dashboard:

ray start --head --dashboard-host="0.0.0.0"

Grafana and Prometheus run on the head node:

grafana-server --config /tmp/ray/session_latest/metrics/grafana/grafana.ini web

./prometheus --config.file=/tmp/ray/session_latest/metrics/prometheus/prometheus.yml

On the head node, everything works right. On a child node, the Grafana part doesn't display.

For the latest Grafana OSS version 10.0.0:

sudo grafana server --config /tmp/ray/session_latest/metrics/grafana/grafana.ini web

On a child node, the Grafana part doesn't display.

You mean a worker node? What do you mean by “grafana couldn’t display”?

I have hit that problem on Ray 2.4.0. On a LAN, every machine other than the head node that can reach the head node can access the dashboard in a web browser, but the embedded Grafana part cannot display. I will downgrade later to reproduce the problem.

I don’t know what Ray 2.5.0 changed on the dashboard. A new error. Please look at my screenshot.

Let’s go back to the dashboard issue.

On the child node (192.168.0.104):

On the head node (192.168.0.102):

This page mentions changing the environment variables RAY_GRAFANA_HOST and RAY_PROMETHEUS_HOST:
https://docs.ray.io/en/latest/cluster/configure-manage-dashboard.html#embed-grafana-in-dashboard

But this document's CLI-mode section on starting a cluster doesn't give an example of how to set the environment variables at runtime:
https://docs.ray.io/en/latest/ray-core/starting-ray.html#start-ray-cli

@funk_Jz the error in 2.5 indicates that prometheus.yml is a directory. Do you know why? It's not related to Ray but more to your system.

  1. Please use the head node to access the Ray Dashboard. I don’t think it works on a worker node… cc: @sangcho
  2. If you don’t use VMs (Ray on Cloud VMs — Ray 2.5.1) or K8s (Ray on Kubernetes — Ray 2.5.1) to start the Ray cluster, try setting the env variables on the head node before starting the cluster manually.

Haha, I think it is a metaphysics issue (prometheus.yml changing into a directory). So I cleared out /tmp/ray and restarted the cluster. Everything works now.

  1. Please use the head node to access the Ray Dashboard. I don’t think it works on a worker node… cc: @sangcho

For the issue itself: we embed the Grafana page in the dashboard, so your child node probably is not able to access the embedded Grafana. As @Huaiwei_Sun said, this is not a very well supported use case (but if you'd like to fix it, you should make sure the child node can access the Grafana UI).
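One way to check reachability from a child node is a plain TCP connection test. This is a sketch, not from the thread; the helper name is mine, and 192.168.0.102:3000 is the head-node Grafana address taken from the earlier posts:

```python
import socket

def can_reach(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds (hypothetical helper)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Run this from the child node: if it prints False, the browser on that node
# cannot load the embedded Grafana panels either.
print(can_reach("192.168.0.102", 3000))
```

If this returns False from the child node, the fix is a network one (firewall, bind address, or Grafana's http_addr setting), not a Ray one.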

I thought this was solved, but it’s not.

I did the same steps I outlined in my post from April 1, 2023, on a different machine, and I am again seeing the “localhost refused to connect” problem in the Cluster Utilization and Node Count windows.

I am running Prometheus and Grafana from a Homebrew install. They are both working.

% brew services
Name       Status  User    File
grafana    started mcneill ~/Library/LaunchAgents/homebrew.mxcl.grafana.plist
prometheus started mcneill ~/Library/LaunchAgents/homebrew.mxcl.prometheus.plist

I can see web pages at http://localhost:9090 and http://localhost:3000.

The Prometheus log at /opt/homebrew/var/log/prometheus.err.log shows that /tmp/ray/session_latest/metrics/prometheus/prometheus.yml has been loaded.

ts=2024-01-14T01:47:20.741Z caller=main.go:539 level=info msg="No time or size retention was set so using the default time retention" duration=15d
ts=2024-01-14T01:47:20.741Z caller=main.go:583 level=info msg="Starting Prometheus Server" mode=server version="(version=2.48.1, branch=non-git, revision=non-git)"
ts=2024-01-14T01:47:20.741Z caller=main.go:588 level=info build_context="(go=go1.21.5, platform=darwin/arm64, user=brew@Sonoma-arm64.local, date=20231208-09:22:46, tags=netgo,builtinassets,stringlabels)"
ts=2024-01-14T01:47:20.741Z caller=main.go:589 level=info host_details=(darwin)
ts=2024-01-14T01:47:20.741Z caller=main.go:590 level=info fd_limits="(soft=61440, hard=unlimited)"
ts=2024-01-14T01:47:20.741Z caller=main.go:591 level=info vm_limits="(soft=unlimited, hard=unlimited)"
ts=2024-01-14T01:47:20.743Z caller=web.go:566 level=info component=web msg="Start listening for connections" address=127.0.0.1:9090
ts=2024-01-14T01:47:20.743Z caller=main.go:1024 level=info msg="Starting TSDB ..."
ts=2024-01-14T01:47:20.743Z caller=repair.go:56 level=info component=tsdb msg="Found healthy block" mint=1705177341653 maxt=1705183200000 ulid=01HM2TA7KYTCCA3NT2A39Z3GM4
ts=2024-01-14T01:47:20.743Z caller=repair.go:56 level=info component=tsdb msg="Found healthy block" mint=1705185811653 maxt=1705190400000 ulid=01HM2TA7XPS25VJ6W7K6MZ7QCN
ts=2024-01-14T01:47:20.743Z caller=tls_config.go:274 level=info component=web msg="Listening on" address=127.0.0.1:9090
ts=2024-01-14T01:47:20.743Z caller=tls_config.go:277 level=info component=web msg="TLS is disabled." http2=false address=127.0.0.1:9090
ts=2024-01-14T01:47:20.745Z caller=head.go:601 level=info component=tsdb msg="Replaying on-disk memory mappable chunks if any"
ts=2024-01-14T01:47:20.746Z caller=head.go:682 level=info component=tsdb msg="On-disk memory mappable chunks replay completed" duration=385.208µs
ts=2024-01-14T01:47:20.746Z caller=head.go:690 level=info component=tsdb msg="Replaying WAL, this may take a while"
ts=2024-01-14T01:47:20.753Z caller=head.go:726 level=info component=tsdb msg="WAL checkpoint loaded"
ts=2024-01-14T01:47:20.754Z caller=head.go:761 level=info component=tsdb msg="WAL segment loaded" segment=68 maxSegment=72
ts=2024-01-14T01:47:20.755Z caller=head.go:761 level=info component=tsdb msg="WAL segment loaded" segment=69 maxSegment=72
ts=2024-01-14T01:47:20.767Z caller=head.go:761 level=info component=tsdb msg="WAL segment loaded" segment=70 maxSegment=72
ts=2024-01-14T01:47:20.767Z caller=head.go:761 level=info component=tsdb msg="WAL segment loaded" segment=71 maxSegment=72
ts=2024-01-14T01:47:20.767Z caller=head.go:761 level=info component=tsdb msg="WAL segment loaded" segment=72 maxSegment=72
ts=2024-01-14T01:47:20.767Z caller=head.go:798 level=info component=tsdb msg="WAL replay completed" checkpoint_replay_duration=7.654417ms wal_replay_duration=13.646042ms wbl_replay_duration=42ns total_replay_duration=21.713917ms
ts=2024-01-14T01:47:20.769Z caller=main.go:1045 level=info fs_type=1a
ts=2024-01-14T01:47:20.769Z caller=main.go:1048 level=info msg="TSDB started"
ts=2024-01-14T01:47:20.769Z caller=main.go:1230 level=info msg="Loading configuration file" filename=/tmp/ray/session_latest/metrics/prometheus/prometheus.yml
ts=2024-01-14T01:47:20.787Z caller=main.go:1267 level=info msg="Completed loading of configuration file" filename=/tmp/ray/session_latest/metrics/prometheus/prometheus.yml totalDuration=17.422708ms db_storage=625ns remote_storage=791ns web_handler=208ns query_engine=458ns scrape=17.257833ms scrape_sd=27.292µs notify=791ns notify_sd=1.125µs rules=1.459µs tracing=7.208µs
ts=2024-01-14T01:47:20.787Z caller=main.go:1009 level=info msg="Server is ready to receive web requests."
ts=2024-01-14T01:47:20.787Z caller=manager.go:1012 level=info component="rule manager" msg="Starting rule manager..."

My /usr/local/etc/prometheus.args file looks like this:

--config.file /tmp/ray/session_latest/metrics/prometheus/prometheus.yml

My /tmp/ray/session_latest/metrics/prometheus/prometheus.yml file looks like this:

# my global config
global:
  scrape_interval: 10s # Set the scrape interval to every 10 seconds. Default is every 1 minute.
  evaluation_interval: 10s # Evaluate rules every 10 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

scrape_configs:
# Scrape from each ray node as defined in the service_discovery.json provided by ray.
- job_name: 'ray'
  file_sd_configs:
  - files:
    - '/tmp/ray/prom_metrics_service_discovery.json'
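That file_sd_configs entry means Prometheus discovers scrape targets from a JSON file Ray writes, rather than from a static list. A quick sketch of what that file's shape looks like and how to pull the targets out of it (the sample content below is illustrative, not copied from my machine):

```python
import json

# Illustrative example of the Prometheus file_sd format that Ray writes
# to /tmp/ray/prom_metrics_service_discovery.json: a list of entries,
# each with "targets" (host:port strings) and optional "labels".
sample = '[{"labels": {"job": "ray"}, "targets": ["127.0.0.1:44217"]}]'

entries = json.loads(sample)
targets = [t for e in entries for t in e["targets"]]
print(targets)
```

If the embedded dashboards stay empty, checking that this file exists and lists reachable host:port targets is a reasonable first debugging step.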

My /usr/local/etc/grafana/grafana.ini file looks like this:

[security]
allow_embedding = true

[auth.anonymous]
enabled = true
org_name = Main Org.
org_role = Viewer

[paths]
provisioning = /tmp/ray/session_latest/metrics/grafana/provisioning
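As a quick self-check (a sketch of mine, not from the thread), the keys the Ray dashboard embedding depends on can be read back with Python's configparser; the inline string below mirrors the grafana.ini shown above:

```python
import configparser

# Inline copy of the grafana.ini shown above (for illustration).
ini_text = """
[security]
allow_embedding = true

[auth.anonymous]
enabled = true
org_name = Main Org.
org_role = Viewer

[paths]
provisioning = /tmp/ray/session_latest/metrics/grafana/provisioning
"""

cfg = configparser.ConfigParser()
cfg.read_string(ini_text)
# Embedding the panels in an iframe requires allow_embedding,
# and viewing them without a login requires anonymous auth.
print(cfg.getboolean("security", "allow_embedding"))
print(cfg.getboolean("auth.anonymous", "enabled"))
```

Both should print True; if Grafana was started with a different config file than this one, these settings may silently not be in effect, which produces exactly the “refused to connect” iframes.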

What am I doing wrong? Is there anything I can do to debug this? Are there error messages anywhere?