4. Cluster Monitoring Setup Guide
INFO
A cluster built on PODsys makes it easier to configure a parallel environment. This section walks through one example: setting up a cluster monitoring environment based on Docker. Using Docker for cluster monitoring offers the following advantages:
- Consistent Environment
- Rapid Deployment
- Resource Isolation
- Easy to Scale, Update, and Maintain
4.1 Cluster environment
| Docker Image | Management Node | Compute Node | Role | Port Number |
|---|---|---|---|---|
| grafana | ✓ | | Responsible for displaying the Prometheus monitoring interface | 3000 |
| Prometheus | ✓ | | Main server of Prometheus | 9090 |
| dcgm-exporter | ✓ | ✓ | Collects GPU hardware information and operating system information | 9400 |
| node-exporter | ✓ | ✓ | Collects host hardware information and operating system information | 9100 |
Add the nexus user to the docker group
To avoid permission issues when using Docker later, add the nexus user to the docker group on the management node and on every compute node.
shell
# Run on the management node
sudo usermod -aG docker nexus
newgrp docker
shell
# Run on the management node to update all compute nodes
pdsh -l root -R ssh -w ^hosts.txt "usermod -aG docker nexus"
TIP
The hosts.txt file records the IP addresses of all accessible compute nodes.
It needs to be generated through sudo ./config_server.sh -pre.
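The same host list can also feed the targets entries in prometheus.yml (section 4.4). The following is a minimal sketch, assuming hosts.txt holds one IP address per line; targets_for_port is a hypothetical helper name and not part of PODsys. Because hosts.txt lists only compute nodes, the management node address still has to be added by hand.

```shell
# Hypothetical helper: print a Prometheus targets array for one port.
# $1 = exporter port, $2 = host list with one IP address per line.
targets_for_port() {
    awk -v port="$1" '
        # Skip blank lines; \047 is a single quote.
        NF { list = list (list ? ", " : "") "\047" $1 ":" port "\047" }
        END { print "[" list "]" }
    ' "$2"
}

# Example: targets_for_port 9100 hosts.txt
```

The output can be pasted after "targets:" in the matching scrape job.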
4.2 Installing and Configuring node-exporter
Introduction to node-exporter
In the Prometheus monitoring system, node_exporter is a service that collects and exposes machine-level system data. It publishes an HTTP endpoint that Prometheus scrapes at regular intervals, covering a wide range of system-level metrics such as CPU utilization, memory usage, disk space, and network traffic.
node_exporter runs as a separate process on each machine. Prometheus sends regular requests to the node_exporter endpoint (port 9100 in this guide) to fetch updated metric data, which it can then use for monitoring, alerting, analysis, and visualization.
With node_exporter you can collect and monitor system-level metrics from every machine and gain insight into its operational status and performance. Because node_exporter is an open-source project, you can also customize and extend it to meet specific monitoring requirements.
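To see what Prometheus receives on a scrape, you can fetch the endpoint yourself once the container from the steps below is running. This is a minimal sketch, assuming node_exporter listens on its default port 9100 and serves metrics at /metrics; metric_value is a hypothetical helper, and node_load1 (the 1-minute load average) is used only as an example metric.

```shell
# Hypothetical helper: read the Prometheus text format on stdin and print
# the value of the first sample whose name matches $1 exactly.
# Only handles samples without labels, such as node_load1.
metric_value() {
    awk -v name="$1" '$1 == name { print $2; exit }'
}

# Example (requires a running node_exporter):
#   curl -s http://localhost:9100/metrics | metric_value node_load1
```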
Execute the following commands on the management node:
- Pull the Docker image and save it as a tar archive.
shell
docker pull prom/node-exporter
docker save -o /home/nexus/podsys-bxx/node-exporter.tar prom/node-exporter
- Distribute the Image to Compute Nodes
shell
pdcp -R ssh -w ^hosts.txt /home/nexus/podsys-bxx/node-exporter.tar /podsys/node-exporter.tar
- Load Images on Compute Nodes
shell
pdsh -R ssh -w ^hosts.txt "docker load -i /podsys/node-exporter.tar"
- Start Container
shell
# Run on the management node
docker run -d --name=node_exporter --net="host" --pid=host -v "/:/host:ro,rslave" prom/node-exporter:latest --path.rootfs=/host
shell
# Run on the management node to start the container on all compute nodes
pdsh -R ssh -w ^hosts.txt 'docker run -d --name=node_exporter --net=host --pid=host -v /:/host:ro,rslave prom/node-exporter:latest --path.rootfs=/host'
4.3 Installing and Configuring dcgm-exporter
Introduction to dcgm-exporter
DCGM (Data Center GPU Manager) is a set of tools for managing and monitoring NVIDIA GPUs in data center cluster environments, offering proactive health monitoring, comprehensive diagnostics, system alerts, and power/clock management. The dcgm-exporter integrates with Prometheus to collect and expose GPU metrics from NVIDIA devices.
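dcgm-exporter serves its metrics in the Prometheus text format, on port 9400 in this guide. The sketch below prints per-GPU utilization from that output. It assumes the DCGM_FI_DEV_GPU_UTIL metric is enabled in the exporter's metric list, that each sample carries a gpu label, and that the endpoint path is /metrics; gpu_util is a hypothetical helper name.

```shell
# Hypothetical helper: read dcgm-exporter output on stdin and print
# "<gpu index> <utilization>" for each DCGM_FI_DEV_GPU_UTIL sample.
# The value is taken from the last field because label values may contain spaces.
gpu_util() {
    awk '/^DCGM_FI_DEV_GPU_UTIL[{]/ {
        if (match($0, /gpu="[0-9]+"/)) {
            # Skip the 5 characters of gpu=" and drop the closing quote.
            print substr($0, RSTART + 5, RLENGTH - 6), $NF
        }
    }'
}

# Example (requires a running dcgm-exporter):
#   curl -s http://localhost:9400/metrics | gpu_util
```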
- Pull the Docker image and save it as a tar archive.
shell
docker pull nvidia/dcgm-exporter:4.4.2-4.7.1-ubuntu22.04
docker save -o /home/nexus/podsys-bxx/dcgm-exporter.tar nvidia/dcgm-exporter:4.4.2-4.7.1-ubuntu22.04
- Distribute the Image to Compute Nodes
shell
pdcp -R ssh -w ^hosts.txt /home/nexus/podsys-bxx/dcgm-exporter.tar /podsys/dcgm-exporter.tar
- Load Images on Compute Nodes
shell
pdsh -R ssh -w ^hosts.txt "docker load -i /podsys/dcgm-exporter.tar"
- Start Container
shell
# Run on the management node
docker run -d --gpus all --cap-add SYS_ADMIN --rm --name=dcgm-exporter -p 9400:9400 nvidia/dcgm-exporter:4.4.2-4.7.1-ubuntu22.04
shell
# Run on the management node to start the container on all compute nodes
pdsh -R ssh -w ^hosts.txt "docker run -d --gpus all --cap-add SYS_ADMIN --rm --name=dcgm-exporter -p 9400:9400 nvidia/dcgm-exporter:4.4.2-4.7.1-ubuntu22.04"
4.4 Installing and Configuring Prometheus
- Pull the Docker image.
shell
docker pull prom/prometheus
- Create prometheus.yml
shell
sudo mkdir -p /data/prometheus
sudo touch /data/prometheus/prometheus.yml
sudo chmod 777 -R /data/prometheus
prometheus.yml
yaml
# Global configuration, defining the default scraping interval and timeout.
global:
  scrape_interval: 60s
  scrape_timeout: 30s
# List of scrape configurations for different metric jobs.
scrape_configs:
  # Prometheus self-scraping job configuration to monitor its own metrics.
  - job_name: 'prometheus'
    static_configs:
      - targets: ['192.168.1.1:9090']
  # DCGM Exporter scraping job configuration to collect GPU metrics.
  - job_name: 'dcgm_export'
    static_configs:
      - targets: ['192.168.1.1:9400', '192.168.1.2:9400']
  # Node Exporter scraping job configuration to collect system metrics.
  - job_name: 'node_export'
    static_configs:
      - targets: ['192.168.1.1:9100', '192.168.1.2:9100']
- Start Container on Management Node
shell
docker run -d --restart=always -p 9090:9090 --name=prometheus -v /data/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml prom/prometheus
- Check the Port
shell
sudo lsof -nP -i:9090
# Access the web page by entering:
# http://localhost:9090/graph
# http://localhost:9090/targets
4.5 Installing and Configuring grafana
- Pull the Docker image.
shell
docker pull grafana/grafana
- Start Container on Management Node
shell
sudo mkdir -p /data/grafana-storage
sudo chmod 777 -R /data/grafana-storage
shell
docker run -d --restart=always -p 3000:3000 --name=grafana -v /data/grafana-storage/:/var/lib/grafana grafana/grafana
shell
sudo lsof -nP -i:3000
Grafana Configuration
- Open the Grafana web page (http://localhost:3000) in your browser and log in with the default username and password: admin/admin.
- Connections -> Data Sources -> Add Data Source -> Prometheus (enter the Prometheus server URL) -> Save & Test.
- Dashboards -> New Dashboard -> Import -> ... (template IDs: 1860 for node-exporter, 15117 for dcgm-exporter) -> Load.
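If an imported dashboard shows no data, it can help to check whether Prometheus is scraping every exporter successfully. The following is a minimal sketch, assuming the Prometheus HTTP API at /api/v1/targets returns compact JSON with a "health" field for each active target; target_health is a hypothetical helper that only counts those fields and is not a full JSON parser.

```shell
# Hypothetical helper: read the /api/v1/targets response on stdin and print
# one line per health state, in the form "<state> <count>".
target_health() {
    grep -o '"health":"[a-z]*"' \
        | cut -d'"' -f4 \
        | sort \
        | uniq -c \
        | awk '{ print $2, $1 }'
}

# Example (requires a running Prometheus server):
#   curl -s http://localhost:9090/api/v1/targets | target_health
```

Any state other than up points at an exporter that is unreachable or not yet scraped; the http://localhost:9090/targets page shows the corresponding error message.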