6. PODsys Metrics
podsys-metrics is a high-performance system monitoring tool for real-time collection of CPU, memory, disk, network (Ethernet/InfiniBand), and NVIDIA GPU performance metrics.
6.1 Features
- CPU Monitoring: Utilization statistics for user mode, system mode, I/O wait, and idle time
- Memory Monitoring: Memory usage statistics
- Disk Monitoring: Disk read/write I/O statistics
- Network Monitoring:
  - Ethernet interface send/receive statistics
  - InfiniBand port traffic statistics
- GPU Monitoring: NVIDIA GPU utilization, memory usage, temperature, power consumption, frequency, etc.
- CUDA Information: GPU device details and compute capability detection
6.2 Usage on Compute Nodes
After deploying compute nodes using PODsys, the podsys-metrics binary is located at /podsys/scripts/podsys-metrics on all compute nodes.
Note
Daemon mode is not enabled by default on compute nodes after installation. You can run the tool directly on individual compute nodes, or start it in daemon mode for cluster monitoring.
Standalone Usage on Compute Node
Log in to any compute node and run the binary directly:
```bash
# Monitor GPU, JSON output, run 100 times
/podsys/scripts/podsys-metrics gpu -j -d 100
# Monitor CPU and memory simultaneously
/podsys/scripts/podsys-metrics cpu mem
# Monitor all modules
/podsys/scripts/podsys-metrics all
```
Command Options
| Option | Description |
|---|---|
| `-j, --json` | Output in JSON format |
| `-d, --duration <num>` | Specify monitoring iterations (default: 9999) |
| `-h, --help` | Display help information |
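
The options above can also be driven from a script. The following is a minimal sketch, assuming Python 3 is available on the compute node; `build_command` and `run_metrics` are hypothetical helper names, not part of PODsys, and only the binary path and flags documented in this section are used.

```python
import subprocess

# Binary path as documented above; adjust if your deployment differs.
PODSYS_METRICS = "/podsys/scripts/podsys-metrics"


def build_command(modules, json_output=False, duration=None):
    """Build the argument list for one podsys-metrics run (hypothetical helper)."""
    if not modules:
        raise ValueError("at least one module is required")
    cmd = [PODSYS_METRICS, *modules]
    if json_output:
        cmd.append("-j")                 # -j, --json
    if duration is not None:
        cmd += ["-d", str(duration)]     # -d, --duration <num>
    return cmd


def run_metrics(modules, json_output=False, duration=None):
    """Run podsys-metrics and return its standard output as text."""
    result = subprocess.run(
        build_command(modules, json_output, duration),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout


# Example (only works on a node where the binary is installed):
# print(run_metrics(["gpu"], json_output=True, duration=1))
```

Passing an explicit `duration` is worth considering in scripts, since the documented default of 9999 iterations would keep the call blocked for a long time.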
6.3 Cluster Monitoring
podsys-metrics also supports a cluster mode, in which a server on the management node collects metrics from multiple client nodes.
Start Daemon on Compute Nodes
Use PDSH to start the daemon on all compute nodes from the management node:
```bash
# Start daemon on all nodes (background mode)
pdsh -R ssh -w ^hosts.txt "/podsys/scripts/podsys-metrics daemon -p 9001 -f"
# Or start without background (for testing)
pdsh -R ssh -w ^hosts.txt "/podsys/scripts/podsys-metrics daemon -p 9001"
```
Note
- hosts.txt contains the IP addresses of all compute nodes (generated by install_progress.sh)
- Default daemon port is 9001
- Use the -f flag to run in background (fork mode)
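
Before starting the server, it can help to confirm that the daemons are actually reachable. The sketch below is one possible check, under two assumptions that this section does not state outright: hosts.txt holds one IP address per line, and the daemon accepts plain TCP connections on its port. `read_hosts` and `daemon_listening` are hypothetical helper names.

```python
import ipaddress
import socket


def read_hosts(text):
    """Return the IP addresses in hosts-file text (assumed: one address per line)."""
    hosts = []
    for line in text.splitlines():
        entry = line.strip()
        if not entry:
            continue  # skip blank lines
        ipaddress.ip_address(entry)  # raises ValueError on a malformed address
        hosts.append(entry)
    return hosts


def daemon_listening(host, port=9001, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds (assumes a TCP daemon)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Example, run from the management node:
# with open("hosts.txt") as f:
#     for host in read_hosts(f.read()):
#         print(host, "up" if daemon_listening(host) else "DOWN")
```

A successful connection only shows that something is listening on the port; it does not verify that the listener is podsys-metrics.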
Server Startup
The server-side Python script is located at podsys-gbxx/scripts/metrics_server.py on the management node.
Start the server from the management node:
```bash
cd podsys-gbxx/
python3 scripts/metrics_server.py -i hosts.txt <modules> [-d duration]
```
Parameters:
| Parameter | Description |
|---|---|
| `-i, --iplist` | Required, node IP list file path |
| `<modules>` | Required, monitoring modules (cpu, mem, gpu, disk, eth, ib), multiple can be specified |
| `-d, --duration` | Monitoring iteration count (default: 1) |
| `--server-port` | Server listening port (default 9000, for receiving client connections) |
| `--client-port` | Client listening port (default 9001, for server to actively connect to clients) |
| `--host` | Server bind address (default 0.0.0.0) |
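
To make the required arguments and defaults in the table concrete, here is an illustrative `argparse` definition that mirrors them. It is not the source of metrics_server.py, whose actual implementation is not shown here; it is only a sketch of how the documented parameters combine on one command line.

```python
import argparse

# Module names as listed in the parameter table.
MODULES = ["cpu", "mem", "gpu", "disk", "eth", "ib"]


def build_parser():
    """Illustrative parser mirroring the documented parameters (not the real script)."""
    parser = argparse.ArgumentParser(prog="metrics_server.py")
    parser.add_argument("-i", "--iplist", required=True,
                        help="node IP list file path")
    parser.add_argument("modules", nargs="+", choices=MODULES,
                        help="one or more monitoring modules")
    parser.add_argument("-d", "--duration", type=int, default=1,
                        help="monitoring iteration count")
    parser.add_argument("--server-port", type=int, default=9000,
                        help="server listening port")
    parser.add_argument("--client-port", type=int, default=9001,
                        help="port the client daemons listen on")
    parser.add_argument("--host", default="0.0.0.0",
                        help="server bind address")
    return parser


# Example: the same arguments as a typical invocation.
# args = build_parser().parse_args(["-i", "hosts.txt", "cpu", "mem", "gpu", "-d", "10"])
# args.modules, args.duration, args.server_port
```

Note that `--client-port` should match the `-p` value the daemons were started with on the compute nodes.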
Usage Examples
```bash
# Monitor CPU, memory, GPU for 10 iterations
python3 metrics_server.py -i hosts.txt cpu mem gpu -d 10
# Monitor GPU
python3 metrics_server.py -i hosts.txt gpu -d 10
```
Output Format
Monitoring data is displayed as a table. After each round, the data collected from all nodes is printed together:
```
[00:37:08] Metrics from 8 nodes
---------------------------------------------------------------------------
Node    GPU            ID   Util   Mem-Used        Temp   Power
---------------------------------------------------------------------------
cu11    NVIDIA B300    0    0%     949/294912MB    36C    300.0W
cu11    NVIDIA B300    1    0%     946/294912MB    35C    300.0W
cu11    NVIDIA B300    2    0%     946/294912MB    35C    300.0W
...
```
Features:
- Table-aligned display for easy comparison of node data
- Nodes with multiple GPUs show one row per GPU
- Rows are sorted by node and GPU ID
- Supports simultaneous monitoring of multiple modules (CPU, memory, disk, network, GPU)
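
If the table output needs to be post-processed, the GPU rows can be parsed with a regular expression. The sketch below is based only on the sample rows shown above; the column layout for other modules, or for other versions of the tool, may differ, and `parse_gpu_rows` is a hypothetical helper name.

```python
import re

# One GPU row, e.g. "cu11 NVIDIA B300 0 0% 949/294912MB 36C 300.0W".
# The GPU name may contain spaces, so it is matched lazily up to the numeric ID.
ROW = re.compile(
    r"^(?P<node>\S+)\s+(?P<gpu>.+?)\s+(?P<id>\d+)\s+(?P<util>\d+)%\s+"
    r"(?P<mem_used>\d+)/(?P<mem_total>\d+)MB\s+(?P<temp>\d+)C\s+(?P<power>[\d.]+)W$"
)


def parse_gpu_rows(text):
    """Extract GPU rows from table output; non-matching lines are skipped."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match is None:
            continue  # timestamp, header, separator, or "..." line
        rows.append({
            "node": match["node"],
            "gpu": match["gpu"],
            "id": int(match["id"]),
            "util_percent": int(match["util"]),
            "mem_used_mb": int(match["mem_used"]),
            "mem_total_mb": int(match["mem_total"]),
            "temp_c": int(match["temp"]),
            "power_w": float(match["power"]),
        })
    return rows
```

For anything beyond quick inspection, the `-j` JSON output of podsys-metrics on the compute node is likely a more robust source than the formatted table.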