
4. Cluster Monitoring Setup Guide

INFO

A cluster built on PODsys makes it easier to configure a parallel environment. This section walks through one example: setting up a Docker-based cluster monitoring environment. Running the monitoring stack in Docker offers the following advantages:

  • Consistent Environment
  • Rapid Deployment
  • Resource Isolation
  • Easy to Scale, Update, and Maintain

4.1 Cluster environment

Docker Image     Management Node   Compute Node   Role                                                                  Port Number
grafana          Yes               No             Displays dashboards for the metrics stored in Prometheus              3000
Prometheus       Yes               No             Main server of Prometheus                                             9090
dcgm-exporter    Yes               Yes            Collects GPU hardware information and metrics                         9400
node-exporter    Yes               Yes            Collects host hardware information and operating system information   9100

Add the nexus user to the docker group

To avoid permission errors when running Docker commands later, add the nexus user to the docker group on the management node and on every compute node.

shell
# Run on the management node
sudo usermod -aG docker nexus
newgrp docker
shell
# Run on the management node; adds nexus to the docker group on every compute node
pdsh -l root -R ssh -w ^hosts.txt "usermod -aG docker nexus"

TIP

The hosts.txt file lists the IP addresses of all reachable compute nodes.
Generate it by running sudo ./config_server.sh -pre.
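
Since every pdsh and pdcp command in this guide reads its host list from hosts.txt, it can help to sanity-check the file before using it. The following is a minimal sketch, assuming the file holds one IPv4 address per line; the function name invalid_hosts is hypothetical and is not part of PODsys.

```shell
# Hypothetical helper (not part of PODsys): print every line of a hosts file
# that is not a dotted-quad IPv4 address, prefixed with its line number.
# Returns non-zero if at least one such line is found.
invalid_hosts() {
  awk '
    /^[[:space:]]*$/ { next }     # skip blank lines
    {
      n = split($0, octet, ".")
      ok = (n == 4)
      for (i = 1; ok && i <= 4; i++)
        ok = (octet[i] ~ /^[0-9]+$/ && octet[i] + 0 <= 255)
      if (!ok) { print NR ": " $0; bad = 1 }
    }
    END { exit bad }
  ' "$1"
}
```

Running invalid_hosts hosts.txt prints nothing and returns 0 when every line looks like an IP address. Lines with trailing whitespace are reported as invalid.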

4.2 Installing and Configuring node-exporter

Introduction to node-exporter

In the Prometheus monitoring system, node_exporter is the service that collects and exposes machine-level metrics such as CPU utilization, memory usage, disk space, and network traffic. The Docker image used below, prom/node-exporter, packages this service.
node_exporter runs as a separate process on each machine and serves its metrics over an HTTP endpoint (port 9100 in this guide). Prometheus sends requests to that endpoint at a regular interval and uses the results for monitoring, alerting, analysis, and visualization.
Because node_exporter is open source, it can also be customized and extended to meet specific monitoring requirements.

Execute the following commands on the management node:

  • Pull the Docker image and save it as a tar archive.
shell
docker pull prom/node-exporter
docker save -o /home/nexus/podsys/node-exporter.tar prom/node-exporter
  • Distribute the Image to Compute Nodes
shell
pdcp -R ssh -w ^hosts.txt /home/nexus/podsys/node-exporter.tar /home/nexus/podsys/node-exporter.tar
  • Load Images on Compute Nodes
shell
pdsh -R ssh -w ^hosts.txt "docker load -i /home/nexus/podsys/node-exporter.tar"
  • Start Container
shell
# Run on the management node
docker run -d --name=node_exporter --net="host" --pid=host -v "/:/host:ro,rslave" prom/node-exporter:latest --path.rootfs=/host
shell
# Run on the management node; starts the container on every compute node
pdsh -R ssh -w ^hosts.txt "docker run -d --name=node_exporter --net=host --pid=host -v /:/host:ro,rslave prom/node-exporter:latest --path.rootfs=/host"
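
Once the containers are running, each node should answer HTTP requests on port 9100. The following is a minimal sketch of a reachability check, assuming curl is installed on the management node; the function names metrics_url and check_exporter are hypothetical.

```shell
# Hypothetical helpers for checking an exporter endpoint.

# Build the metrics URL for a given host and port.
metrics_url() {
  printf 'http://%s:%s/metrics\n' "$1" "$2"
}

# Print "up" if the endpoint answers with a successful HTTP status
# within 5 seconds, otherwise print "down".
check_exporter() {
  if curl -fsS --max-time 5 "$(metrics_url "$1" "$2")" >/dev/null 2>&1; then
    echo "up"
  else
    echo "down"
  fi
}
```

For example, check_exporter 192.168.1.2 9100 tests node-exporter on one compute node. The same function works for any other exporter port.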

4.3 Installing and Configuring dcgm-exporter

Introduction to dcgm-exporter

DCGM (Data Center GPU Manager) is a set of tools for managing and monitoring NVIDIA GPUs in data center cluster environments, offering proactive health monitoring, comprehensive diagnostics, system alerts, and power/clock management. The dcgm-exporter integrates with Prometheus to collect and expose GPU metrics from NVIDIA devices.

  • Pull the Docker image and save it as a tar archive.
shell
docker pull nvidia/dcgm-exporter:4.4.2-4.7.1-ubuntu22.04
docker save -o /home/nexus/podsys/dcgm-exporter.tar nvidia/dcgm-exporter:4.4.2-4.7.1-ubuntu22.04
  • Distribute the Image to Compute Nodes
shell
pdcp -R ssh -w ^hosts.txt /home/nexus/podsys/dcgm-exporter.tar /home/nexus/podsys/dcgm-exporter.tar
  • Load Images on Compute Nodes
shell
pdsh -R ssh -w ^hosts.txt "docker load -i /home/nexus/podsys/dcgm-exporter.tar"
  • Start Container
shell
# Run on the management node
docker run -d --gpus all --cap-add SYS_ADMIN --rm --name=dcgm-exporter -p 9400:9400 nvidia/dcgm-exporter:4.4.2-4.7.1-ubuntu22.04
shell
# Run on the management node; starts the container on every compute node
pdsh -R ssh -w ^hosts.txt "docker run -d --gpus all --cap-add SYS_ADMIN --rm --name=dcgm-exporter -p 9400:9400 nvidia/dcgm-exporter:4.4.2-4.7.1-ubuntu22.04"
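
dcgm-exporter serves its metrics on port 9400 in the Prometheus text format, one sample per line with the GPU index in a gpu label. The following is a minimal sketch for reading one metric from that output. The function name gpu_metric is hypothetical, and it assumes the exporter's default metric set, which includes DCGM_FI_DEV_GPU_UTIL.

```shell
# Hypothetical helper: read Prometheus text-format metrics on stdin and print
# "<gpu index> <value>" for every sample of the given metric.
# Defaults to DCGM_FI_DEV_GPU_UTIL (GPU utilization).
gpu_metric() {
  metric="${1:-DCGM_FI_DEV_GPU_UTIL}"
  grep "^${metric}{" |
    sed -E 's/^[^{]*\{.*gpu="([^"]*)".*\} *([^ ]+).*$/\1 \2/'
}
```

A typical use is curl -s http://localhost:9400/metrics | gpu_metric, which prints one line per GPU. If the output is empty, either the container is not running or the metric is not enabled.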

4.4 Installing and Configuring Prometheus

  • Pull the Docker image.
shell
docker pull prom/prometheus
  • Create the prometheus.yml configuration file
shell
sudo mkdir -p /data/prometheus
sudo touch /data/prometheus/prometheus.yml
sudo chmod 777 -R /data/prometheus
prometheus.yml
yaml
# Global configuration, defining the default scraping interval and timeout.
global:
  scrape_interval:     60s 
  scrape_timeout:      30s 

# List of scrape configurations, one job per metric source.
# Replace the example addresses below with the IP addresses of your own nodes.
scrape_configs:
  # Prometheus self-scraping job configuration to monitor its own metrics.
  - job_name: 'prometheus'
    static_configs:
      - targets: ['192.168.1.1:9090']
  # DCGM Exporter scraping job configuration to collect GPU metrics.
  - job_name: 'dcgm_export'
    static_configs:
      - targets: ['192.168.1.1:9400', '192.168.1.2:9400']   
  # Node Exporter scraping job configuration to collect system metrics.
  - job_name: 'node_export'
    static_configs:
      - targets: ['192.168.1.1:9100', '192.168.1.2:9100']
  • Start Container on Management Node
shell
docker run -d --restart=always -p 9090:9090 --name=prometheus -v /data/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml prom/prometheus
  • Check the Port
shell
sudo lsof -nP -i:9090
# Access the web page by entering:
# http://localhost:9090/graph
# http://localhost:9090/targets
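
On a larger cluster, typing the targets lists in prometheus.yml by hand becomes error-prone. The following is a minimal sketch that builds a targets list from hosts.txt, assuming one IP address per line; the function name targets_for is hypothetical.

```shell
# Hypothetical helper: print a YAML flow sequence of 'ip:port' targets,
# one entry per non-blank line of the hosts file.
targets_for() {
  hosts_file="$1"
  port="$2"
  awk -v port="$port" '
    # \047 is a single quote; the separator is empty for the first entry.
    NF { printf "%s\047%s:%s\047", (n++ ? ", " : ""), $1, port }
    END { print "" }
  ' "$hosts_file" | sed 's/^/[/; s/$/]/'
}
```

For example, targets_for hosts.txt 9100 prints a line that can be pasted after targets: in the node_export job. Because hosts.txt lists only compute nodes, the management node address still has to be added by hand.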

4.5 Installing and Configuring Grafana

  • Pull the Docker image.
shell
docker pull grafana/grafana
  • Start Container on Management Node
shell
sudo mkdir -p /data/grafana-storage
sudo chmod 777 -R /data/grafana-storage
shell
docker run -d --restart=always -p 3000:3000 --name=grafana -v /data/grafana-storage/:/var/lib/grafana grafana/grafana
  • Check the Port
shell
sudo lsof -nP -i:3000

Grafana Configuration

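  • Open Grafana (http://localhost:3000) in your browser and log in with the default username and password, admin/admin.
  • Go to Connections -> Data Sources -> Add Data Source -> Prometheus, enter the Prometheus server URL, then click Save & Test. Because Grafana runs in its own container, use the management node's IP address (for example http://192.168.1.1:9090) rather than localhost.
  • Go to Dashboards -> New Dashboard -> Import, enter a dashboard template ID (1860 for node-exporter, 15117 for dcgm-exporter), then click Load.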
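
The data source step can also be scripted through Grafana's HTTP API instead of the web interface. The following is a minimal sketch, assuming the admin/admin login has not been changed and curl is available; the function names datasource_payload and add_datasource are hypothetical.

```shell
# Hypothetical helpers for registering Prometheus as a Grafana data source.

# Build the JSON body for Grafana's POST /api/datasources endpoint.
datasource_payload() {
  prometheus_url="$1"
  printf '{"name":"Prometheus","type":"prometheus","url":"%s","access":"proxy","isDefault":true}\n' "$prometheus_url"
}

# Send the request. Assumes the default admin/admin credentials still apply.
add_datasource() {
  grafana_url="$1"
  prometheus_url="$2"
  curl -fsS -u admin:admin \
    -H 'Content-Type: application/json' \
    -X POST "${grafana_url}/api/datasources" \
    -d "$(datasource_payload "$prometheus_url")"
}
```

For example, add_datasource http://localhost:3000 http://192.168.1.1:9090 registers the Prometheus server from the example configuration. If the admin password was changed at first login, adjust the credentials accordingly.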

Copyright © 2025 The PODsys Project. All rights reserved.