4. Cluster Monitoring Setup Guide

INFO

A cluster built on PODsys simplifies the configuration of a parallel computing environment. This section walks through one example: a cluster monitoring environment based on Docker. Running the monitoring components in Docker offers the following advantages:

Consistent Environment
Rapid Deployment
Resource Isolation
Easy to Scale, Update, and Maintain


4.1 Cluster environment

Docker Image	Management Node	Compute Node	Role	Port
grafana	✓		Displays dashboards built on the data collected by Prometheus	3000
prometheus	✓		Main Prometheus server	9090
dcgm-exporter	✓	✓	Collects GPU hardware information and GPU metrics	9400
node-exporter	✓	✓	Collects host hardware information and operating system information	9100

Add the nexus user to the docker group

To avoid permission errors when running Docker commands later, add the nexus user to the docker group.

Management Node

shell

# Run on the management node
sudo usermod -aG docker nexus
newgrp docker

Compute Node

shell

# Run on the management node; applies the change on every compute node listed in hosts.txt
pdsh -l root -R ssh -w ^hosts.txt "usermod -aG docker nexus"

TIP

The hosts.txt file lists the IP addresses of all reachable compute nodes.
Generate it by running sudo ./config_server.sh -pre.
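
Before passing hosts.txt to pdsh, it can help to confirm that the file looks the way the commands in this section expect. The sketch below assumes the file holds one IPv4 address per line; that layout and the helper name check_hosts_file are assumptions for illustration, not part of PODsys.

```shell
# Hypothetical helper: report lines of a hosts file that are not plain IPv4 addresses.
# Assumes one address per line; blank lines are ignored.
# Prints each offending line with its line number and returns 1 if any were found.
check_hosts_file() {
    awk '
        /^[[:space:]]*$/ { next }                    # skip blank lines
        !/^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/ {        # not a dotted-quad address
            printf "line %d: %s\n", NR, $0
            bad = 1
        }
        END { exit bad + 0 }
    ' "$1"
}
```

A typical call would be check_hosts_file hosts.txt; no output and a zero exit status mean every line passed the check.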

4.2 Installing and Configuring node-exporter

Introduction to node-exporter

In the Prometheus monitoring system, node_exporter is the service that collects machine-level system data and exposes it over an HTTP endpoint that Prometheus scrapes at a regular interval. The metrics cover CPU utilization, memory usage, disk space, network traffic, and more.
node_exporter runs as a separate process on each monitored machine. Prometheus periodically sends requests to the node_exporter endpoint to fetch current values, then uses the data for monitoring, alerting, analysis, and visualization.
Because node_exporter is an open-source project, it can also be customized and extended to meet specific monitoring requirements.
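
To make the scrape model concrete: the exporter serves its metrics as plain text, one sample per line, and anything that speaks HTTP can read them. The sketch below filters that text for a single metric. The helper name metric_lines is hypothetical; node_load1 is used as an example metric name.

```shell
# Hypothetical helper: print the sample lines of one metric from
# Prometheus text-format output supplied on stdin.
# $1 = metric name. Matches "name{labels} value" and "name value",
# and skips "# HELP" / "# TYPE" comment lines and longer metric names.
metric_lines() {
    grep -E "^$1(\{| )"
}
```

Once the container from this section is running, a call such as curl -s http://localhost:9100/metrics | metric_lines node_load1 would show the current value reported by the local exporter.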

Execute the following commands on the management node:

Pull the Docker image and save it as a tar archive.

shell

docker pull prom/node-exporter
docker save -o /home/nexus/podsys-bxx/node-exporter.tar prom/node-exporter

Distribute the Image to Compute Nodes

shell

pdcp -R ssh -w ^hosts.txt /home/nexus/podsys-bxx/node-exporter.tar /podsys/node-exporter.tar

Load Images on Compute Nodes

shell

pdsh -R ssh -w ^hosts.txt "docker load -i /podsys/node-exporter.tar"

Start Container

Management Node

shell

# Run on the management node
docker run -d --name=node_exporter --net="host" --pid=host -v "/:/host:ro,rslave" prom/node-exporter:latest --path.rootfs=/host

Compute Node

shell

# Run on the management node; starts the container on every compute node listed in hosts.txt
pdsh -R ssh -w ^hosts.txt "docker run -d --name=node_exporter --net=host --pid=host -v /:/host:ro,rslave prom/node-exporter:latest --path.rootfs=/host"
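
After the containers start, a quick reachability check from the management node can catch nodes where the exporter did not come up. The following is a minimal sketch, assuming curl is installed and hosts.txt holds one address per line; the helper name probe_exporters is hypothetical.

```shell
# Hypothetical helper: probe the /metrics endpoint of every host in a hosts file.
# $1 = hosts file (one address per line), $2 = exporter port.
# Prints "<host>:<port> up" or "<host>:<port> down" for each host.
probe_exporters() {
    hosts_file=$1
    port=$2
    while IFS= read -r host || [ -n "$host" ]; do
        [ -n "$host" ] || continue               # skip blank lines
        # -s: quiet, -f: treat HTTP errors as failure, --max-time: give up after 5 s
        if curl -sf -o /dev/null --max-time 5 "http://$host:$port/metrics"; then
            echo "$host:$port up"
        else
            echo "$host:$port down"
        fi
    done < "$hosts_file"
}
```

For node-exporter the call would be probe_exporters hosts.txt 9100; the same helper works for any exporter that serves /metrics, given its port.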

4.3 Installing and Configuring dcgm-exporter

Introduction to dcgm-exporter

DCGM (Data Center GPU Manager) is a set of tools for managing and monitoring NVIDIA GPUs in data center cluster environments, offering proactive health monitoring, comprehensive diagnostics, system alerts, and power/clock management. The dcgm-exporter integrates with Prometheus to collect and expose GPU metrics from NVIDIA devices.

Pull the Docker image and save it as a tar archive.

shell

docker pull nvidia/dcgm-exporter:4.4.2-4.7.1-ubuntu22.04
docker save -o /home/nexus/podsys-bxx/dcgm-exporter.tar nvidia/dcgm-exporter:4.4.2-4.7.1-ubuntu22.04

Distribute the Image to Compute Nodes

shell

pdcp -R ssh -w ^hosts.txt /home/nexus/podsys-bxx/dcgm-exporter.tar /podsys/dcgm-exporter.tar

Load Images on Compute Nodes

shell

pdsh -R ssh -w ^hosts.txt "docker load -i /podsys/dcgm-exporter.tar"

Start Container

Management Node

shell

# Run on the management node
docker run -d --gpus all --cap-add SYS_ADMIN --rm --name=dcgm-exporter -p 9400:9400 nvidia/dcgm-exporter:4.4.2-4.7.1-ubuntu22.04

Compute Node

shell

# Run on the management node; starts the container on every compute node listed in hosts.txt
pdsh -R ssh -w ^hosts.txt "docker run -d --gpus all --cap-add SYS_ADMIN --rm --name=dcgm-exporter -p 9400:9400 nvidia/dcgm-exporter:4.4.2-4.7.1-ubuntu22.04"
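
One way to confirm that dcgm-exporter is reporting every GPU is to read its metrics endpoint and list the utilization per device. The sketch below assumes the exporter publishes DCGM_FI_DEV_GPU_UTIL samples carrying a gpu="<index>" label and no trailing timestamp; the exact label set depends on the exporter version and configuration, and the helper name gpu_util is hypothetical.

```shell
# Hypothetical helper: read Prometheus text-format output on stdin and print
# "<gpu index> <utilization>" for each DCGM_FI_DEV_GPU_UTIL sample.
# Assumes each sample has a gpu="<index>" label and ends with the value.
gpu_util() {
    awk '
        /^DCGM_FI_DEV_GPU_UTIL\{/ && match($0, /gpu="[0-9]+"/) {
            # RSTART points at g in gpu="N"; skip the 5 characters of gpu="
            # and drop the closing quote to keep only the digits.
            idx = substr($0, RSTART + 5, RLENGTH - 6)
            print idx, $NF
        }
    '
}
```

On a node where the container is running, curl -s http://localhost:9400/metrics | gpu_util should print one line per GPU.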

4.4 Installing and Configuring Prometheus

Pull the Docker image.

shell

docker pull prom/prometheus

Create prometheus.yml

shell

sudo mkdir -p /data/prometheus
sudo touch /data/prometheus/prometheus.yml
sudo chmod 777 -R /data/prometheus

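Write the following to /data/prometheus/prometheus.yml. In this example, 192.168.1.1 is the management node and 192.168.1.2 is a compute node; replace them with the addresses of your own nodes.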

yaml

# Global configuration, defining the default scraping interval and timeout.
global:
  scrape_interval:     60s 
  scrape_timeout:      30s 

# List of scrape configurations for different metric jobs.
scrape_configs:
  # Prometheus self-scraping job configuration to monitor its own metrics.
  - job_name: 'prometheus'
    static_configs:
      - targets: ['192.168.1.1:9090']
  # DCGM Exporter scraping job configuration to collect GPU metrics.
  - job_name: 'dcgm_export'
    static_configs:
      - targets: ['192.168.1.1:9400', '192.168.1.2:9400']   
  # Node Exporter scraping job configuration to collect system metrics.
  - job_name: 'node_export'
    static_configs:
      - targets: ['192.168.1.1:9100', '192.168.1.2:9100']
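
With more than a couple of compute nodes, typing each target by hand becomes error-prone. The sketch below builds a targets line in the same layout as the file above from the management node address plus the entries in hosts.txt. It assumes one address per line in hosts.txt; the helper name targets_line is hypothetical.

```shell
# Hypothetical helper: print a "- targets: [...]" line for prometheus.yml.
# $1 = management node IP, $2 = hosts file (one address per line), $3 = exporter port.
targets_line() {
    { echo "$1"; cat "$2"; } | awk -v port="$3" '
        # \047 is a single quote; entries are joined with ", "
        NF { list = list (list ? ", " : "") "\047" $1 ":" port "\047" }
        END { printf "      - targets: [%s]\n", list }
    '
}
```

For example, targets_line 192.168.1.1 hosts.txt 9400 produces the line for the dcgm_export job, and the same call with 9100 produces the one for node_export.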

Start Container on Management Node

shell

docker run -d --restart=always -p 9090:9090 --name=prometheus -v /data/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml prom/prometheus

Check the Port

shell

sudo lsof -nP -i:9090
# Access the web page by entering:
# http://localhost:9090/graph
# http://localhost:9090/targets
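
The information on the targets page is also available from the Prometheus HTTP API at /api/v1/targets, which is convenient on a node without a browser. The sketch below counts targets by health state with plain text matching; it assumes the API returns compact JSON without spaces around the colon, and the helper name count_health is hypothetical.

```shell
# Hypothetical helper: count targets in a given health state.
# $1 = up | down | unknown; reads the /api/v1/targets JSON response on stdin.
# Uses text matching rather than a JSON parser, so it assumes compact JSON.
count_health() {
    grep -o "\"health\":\"$1\"" | wc -l | tr -d ' '
}
```

A typical call would be curl -s http://localhost:9090/api/v1/targets | count_health up; comparing the result with the number of configured targets shows whether any exporter is unreachable.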

4.5 Installing and Configuring Grafana

Pull the Docker image.

shell

docker pull grafana/grafana

Start Container on Management Node

shell

sudo mkdir -p /data/grafana-storage
sudo chmod 777 -R /data/grafana-storage

shell

docker run -d --restart=always -p 3000:3000 --name=grafana -v /data/grafana-storage/:/var/lib/grafana grafana/grafana

shell

sudo lsof -nP -i:3000

Grafana Configuration

Open Grafana (http://localhost:3000) in your browser and log in with the default username and password: admin/admin.
Add the data source: Connections -> Data Sources -> Add Data Source -> Prometheus (enter the Prometheus server URL, for example http://192.168.1.1:9090) -> Save & Test.
Import the dashboards: Dashboards -> New Dashboard -> Import -> enter the dashboard template ID (1860 for node-exporter, 15117 for dcgm-exporter) -> Load.
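
As an alternative to adding the data source by hand, Grafana can read data source definitions from provisioning files at startup. The sketch below writes such a file. The helper name write_datasource, the output directory, and the data source name are assumptions for illustration; the URL should point at your own Prometheus server.

```shell
# Hypothetical helper: write a Grafana data source provisioning file.
# $1 = output directory, $2 = Prometheus server URL.
write_datasource() {
    mkdir -p "$1"
    cat > "$1/prometheus.yaml" <<EOF
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: $2
    isDefault: true
EOF
}
```

After running, for example, write_datasource /data/grafana-provisioning/datasources http://192.168.1.1:9090, the directory would be mounted into the container at /etc/grafana/provisioning/datasources with an additional -v option on the docker run command.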
