Viewing Prometheus metrics in the dashboard of the VM cluster head-node

Hello

What are the exact steps and addresses required to enable Prometheus metrics in the dashboard for the head node of a VM cluster?

I found this page, which mentions the environment variables RAY_PROMETHEUS_HOST and RAY_PROMETHEUS_NAME.
However, the page doesn’t explain what addresses those variables should point to.

For example, the page says:

Set RAY_PROMETHEUS_HOST to an address the head node can use to access Prometheus.

If I’m starting a Prometheus server on my local machine, that would mean I would somehow have to grant the head node running in the VM access to that server. The steps to do that are not documented.
It might also mean that the easiest way is to run a Prometheus server on the head node itself. I checked whether the recommended Ray Docker image (rayproject/ray-ml:latest-gpu) contains one, but I couldn’t find anything.
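For reference, here is a sketch of what I imagine the head-node-local approach would look like as additions to my cluster YAML. This is untested and based on my reading of the docs; the Prometheus version, download URL, and the /tmp/ray/session_latest/metrics path are my assumptions, not anything confirmed by the documentation:

```
# Sketch only: download a Prometheus release during head-node setup...
head_setup_commands:
  - wget -q https://github.com/prometheus/prometheus/releases/download/v2.48.0/prometheus-2.48.0.linux-amd64.tar.gz
  - tar xzf prometheus-2.48.0.linux-amd64.tar.gz

# ...set RAY_PROMETHEUS_HOST before `ray start` so the dashboard picks it up,
# then start Prometheus with the scrape config Ray writes for the session.
head_start_ray_commands:
  - ray stop
  - RAY_PROMETHEUS_HOST=http://localhost:9090 ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml --dashboard-host=0.0.0.0
  - nohup ./prometheus-2.48.0.linux-amd64/prometheus --config.file=/tmp/ray/session_latest/metrics/prometheus/prometheus.yml > prometheus.log 2>&1 &
```

Note that Prometheus is started after `ray start` here, since I assume the session directory only exists once Ray is running. Is something along these lines what the docs intend?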

Could you indicate the best way to enable those plots?
For reference, this is the cluster YAML file I’m using:

auth:
  ssh_user: ptrochim
  ssh_public_key: /home/ptrochim/.ssh/id_ed25519.pub
  ssh_private_key: /home/ptrochim/.ssh/id_ed25519

cluster_name: ray-demo

# Cloud-provider specific configuration.
provider:
  type: gcp
  region: europe-west4
  availability_zone: europe-west4-c
  project_id: piotr-vm

idle_timeout_minutes: 5

# Without this docker image, SSH connects, but file syncing fails with the message: SSH command failed. Failed to setup head node
docker:
  image: "rayproject/ray-ml:latest-gpu" # rayproject/ray:latest-cpu
  container_name: "ray_container"
  pull_before_run: True
  run_options:
    - --ulimit nofile=65536:65536

# NOTE: Installing ray using pip instead of using the docker image doesn't work and breaks the file-syncing command
# head_setup_commands:
#   - pip install ray[data,train,tune,serve]

available_node_types:
  ray_head_default:
    resources: {"CPU": 1}
    node_config:
      machineType: n1-standard-2
      disks:
        - boot: true
          autoDelete: true
          type: PERSISTENT
          initializeParams:
            diskSizeGb: 50
            sourceImage: projects/deeplearning-platform-release/global/images/family/common-cpu
  ray_worker_small:
    min_workers: 1
    max_workers: 2
    resources: {"CPU": 2, "GPU": 1}
    node_config:
        machineType: g2-standard-4
        disks:
          - boot: true
            autoDelete: true
            type: PERSISTENT
            initializeParams:
              diskSizeGb: 50
              sourceImage: projects/deeplearning-platform-release/global/images/family/pytorch-latest-gpu-v20231105-ubuntu-2004-py310
        scheduling:
          - preemptible: true

head_node_type: ray_head_default
head_start_ray_commands:
  - ray stop
  - ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml --dashboard-host=0.0.0.0
worker_start_ray_commands:
    - ray stop
    - ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076

# CODE DEPENDENCY: We need 2 workers for the demos
max_workers: 2

All the best,
Piotr

Hey Piotr,

Sorry for the late reply.

If I’m starting a Prometheus server on my local machine, that would mean that I would somehow grant the head node running in the VM access to that server. The steps to do that are not documented.

I understand your frustration. However, the reason we didn’t document it is that Prometheus is a separate OSS project outside of Ray, and we don’t intend to, nor have the bandwidth to, cover documentation for how to use it or other OSS software.

Here is what GPT-4 told me about your question.

To access a Prometheus server running on your local laptop from a virtual machine (VM) on AWS, you’ll need to set up a secure and reliable connection between the two. Here’s a general guide on how you can do this:

  1. Public IP and Port Forwarding:

    • Ensure your local laptop has a public IP address. If you’re behind a router, you might need to set up port forwarding to forward the Prometheus port (usually 9090) to your laptop.
    • Be aware of the security implications of exposing your local machine to the internet.
  2. VPN (Virtual Private Network):

    • Set up a VPN between your local network and the AWS VM. This is more secure than exposing your local machine directly to the internet.
    • Tools like OpenVPN or WireGuard can be used to create a secure tunnel.
  3. Reverse SSH Tunnel:

    • This is a technique where you establish a secure SSH connection from your local machine to the AWS VM and then tunnel the Prometheus port through this connection.
    • Run a command like ssh -R <REMOTE_PORT>:localhost:9090 <AWS_VM_USER>@<AWS_VM_IP> on your local machine.
  4. AWS VPC (Virtual Private Cloud) Peering:

    • If you have a VPC set up for your local network, you can establish VPC Peering with the AWS VPC where your VM resides.
    • This is more complex and typically used in enterprise environments.
  5. Security Groups and Firewall Settings:

    • Modify the security groups in AWS to allow traffic on the required ports (e.g., Prometheus port, VPN ports, SSH port).
    • Ensure your local firewall allows inbound connections on the Prometheus port.
  6. Testing the Connection:

    • Once the setup is complete, test the connection by accessing the Prometheus UI or API from the AWS VM using the configured method (e.g., http://localhost:<REMOTE_PORT> for SSH tunneling).
  7. Monitoring and Maintenance:

    • Regularly monitor the connection for any issues and keep your security settings updated.
  8. Using Cloud Services:

    • As an alternative, consider using cloud services like AWS Direct Connect for a more stable and secure connection, although this can be more costly.

Remember, exposing your local machine to the internet can be risky. It’s crucial to ensure that all security measures are properly configured and maintained. If you’re not comfortable with these steps or the security implications, you might want to consult with a network security professional.
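Of these, the reverse SSH tunnel (step 3) is probably the lightest-weight option, and it is cloud-agnostic, so it applies to your GCP cluster just as well as the AWS setup GPT-4 assumed. A sketch of the commands involved, where the user name, IP, and the choice of remote port 9091 are placeholders rather than values from your setup:

```
# On your local machine, where Prometheus listens on 9090:
# expose it on the head node as port 9091 via a reverse tunnel.
ssh -R 9091:localhost:9090 <SSH_USER>@<HEAD_NODE_IP>

# On the head node, verify the tunnel works:
curl http://localhost:9091/-/healthy

# Then make sure RAY_PROMETHEUS_HOST points at the tunnelled
# address when the head node's Ray processes start:
export RAY_PROMETHEUS_HOST=http://localhost:9091
```

One caveat: the tunnel only lives as long as the SSH session, so for anything beyond a quick experiment you would want to keep it alive (e.g. with autossh) or prefer running Prometheus on the head node itself.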

Hi Huaiwei,

Thank you for your response. While I appreciate your point of view, it’s very customer-unfriendly.
You made a design choice, while creating the dashboard, to depend on external software. Without that software, the dashboard is incomplete.

Wouldn’t you agree that making the user experience as seamless as possible is in your best interest?

Just a thought.

Best regards
Piotr Trochim

Thanks for the feedback. I agree that we want to make Ray’s user experience as seamless as possible. With enough resources we could document everything, but unfortunately we currently don’t have them. If we did, we would also consider using a custom graph library to render the graphs directly in the Ray Dashboard in the future.

If you are interested, I encourage you to contribute to this part to make the experience as seamless as possible for all the Ray users. This is the charm of an open source project. Everyone can contribute to making the experience better for everyone.

Yes, a very good point. Would you be happy with my adding the steps from ChatGPT to the documentation, after confirming they work?

Best regards
Piotr Trochim

Definitely. You’re more than welcome to contribute to the docs. Some other pieces of the docs were contributed by other Ray users, like the Docker Compose part.