6. PODsys Metrics

podsys-metrics is a high-performance system monitoring tool for real-time collection of CPU, memory, disk, network (Ethernet/InfiniBand), and NVIDIA GPU performance metrics.

6.1 Features

CPU Monitoring: Utilization statistics for user mode, system mode, I/O wait, and idle time
Memory Monitoring: Memory usage statistics
Disk Monitoring: Disk read/write I/O statistics
Network Monitoring:
- Ethernet interface send/receive statistics
- InfiniBand port traffic statistics
GPU Monitoring: NVIDIA GPU utilization, memory usage, temperature, power consumption, clock frequency, and other metrics
CUDA Information: GPU device details and compute capability detection

6.2 Usage on Compute Nodes

After deploying compute nodes using PODsys, the podsys-metrics binary is located at /podsys/scripts/podsys-metrics on all compute nodes.

Note

By default, daemon mode is not enabled on compute nodes after installation. You can run podsys-metrics directly on an individual compute node, or start it in daemon mode for cluster monitoring.

Standalone Usage on Compute Node

bash

# Monitor GPU, JSON output, run 100 times
/podsys/scripts/podsys-metrics gpu -j -d 100

# Monitor CPU and memory simultaneously
/podsys/scripts/podsys-metrics cpu mem

# Monitor all modules
/podsys/scripts/podsys-metrics all

Command Options

Option	Description
`-j, --json`	Output in JSON format
`-d, --duration <num>`	Specify monitoring iterations (default: 9999)
`-h, --help`	Display help information
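
The options above can also be assembled programmatically, for example when a wrapper script launches podsys-metrics. The following is a minimal Python sketch and not part of PODsys: the function name build_metrics_command and the KNOWN_MODULES set are hypothetical, and it is an assumption that the standalone binary accepts every module name listed in this chapter. Only the binary path and the -j and -d flags come from the documentation above.

```python
import shlex

# Binary path as documented in section 6.2.
BINARY = "/podsys/scripts/podsys-metrics"

# Module names mentioned in this chapter. Assumption: the standalone binary
# accepts all of them; only gpu, cpu, mem and all appear in the examples.
KNOWN_MODULES = {"cpu", "mem", "disk", "eth", "ib", "gpu", "all"}


def build_metrics_command(modules, json_output=False, iterations=None):
    """Return the argv list for a standalone podsys-metrics run."""
    if not modules:
        raise ValueError("at least one module is required")
    unknown = [m for m in modules if m not in KNOWN_MODULES]
    if unknown:
        raise ValueError(f"unknown module(s): {', '.join(unknown)}")
    argv = [BINARY, *modules]
    if json_output:
        argv.append("-j")  # JSON output
    if iterations is not None:
        if iterations < 1:
            raise ValueError("iterations must be a positive integer")
        argv += ["-d", str(iterations)]  # number of monitoring iterations
    return argv


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch works anywhere.
    command = build_metrics_command(["gpu"], json_output=True, iterations=100)
    print(shlex.join(command))
```

The returned list can be passed directly to subprocess.run on a compute node. Building an argv list rather than a shell string avoids quoting problems.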

6.3 Cluster Monitoring

Cluster mode is also supported: a server on the management node collects metrics from multiple client (compute) nodes.

Start Daemon on Compute Nodes

Use PDSH to start the daemon on all compute nodes from the management node:

bash

# Start daemon on all nodes (background mode)
pdsh -R ssh -w ^hosts.txt "/podsys/scripts/podsys-metrics daemon -p 9001 -f"

# Or start without background (for testing)
pdsh -R ssh -w ^hosts.txt "/podsys/scripts/podsys-metrics daemon -p 9001"

Note

hosts.txt contains the IP addresses of all compute nodes (generated by install_progress.sh)
The default daemon port is 9001
Use the -f flag to run the daemon in the background (fork mode)
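
Before starting the server, it can help to confirm that each daemon is accepting connections. The following is a minimal Python sketch and not part of PODsys. It assumes that hosts.txt holds one address per line and that the daemon listens on TCP port 9001. The function names parse_hosts, daemon_reachable, and unreachable_hosts are hypothetical.

```python
import socket


def parse_hosts(lines):
    """Return non-empty, stripped host entries from an iterable of lines."""
    stripped = (line.strip() for line in lines)
    return [host for host in stripped if host]


def daemon_reachable(host, port=9001, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution errors.
        return False


def unreachable_hosts(hosts, port=9001, timeout=2.0):
    """Return the hosts whose daemon port does not accept connections."""
    return [h for h in hosts if not daemon_reachable(h, port, timeout)]
```

On the management node, opening hosts.txt and passing the file object to parse_hosts yields the host list, and unreachable_hosts then returns the nodes on which the daemon still needs to be started. A successful connection only shows that something is listening on the port, not that it is podsys-metrics.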

Server Startup

The server-side Python script is located at podsys-gbxx/scripts/metrics_server.py on the management node.

Start the server from the management node:

bash

cd podsys-gbxx/
python3 scripts/metrics_server.py -i hosts.txt <modules> [-d duration]

Parameters:

Parameter	Description
`-i, --iplist`	Required, node IP list file path
`<modules>`	Required, monitoring modules (cpu, mem, gpu, disk, eth, ib), multiple can be specified
`-d, --duration`	Monitoring iteration count (default: 1)
`--server-port`	Server listening port (default 9000, for receiving client connections)
`--client-port`	Client listening port (default 9001, for server to actively connect to clients)
`--host`	Server bind address (default 0.0.0.0)

Usage Examples

bash

# Monitor CPU, memory, and GPU for 10 iterations (run from podsys-gbxx/)
python3 scripts/metrics_server.py -i hosts.txt cpu mem gpu -d 10

# Monitor GPU only for 10 iterations
python3 scripts/metrics_server.py -i hosts.txt gpu -d 10

Output Format

Monitoring data is displayed as a table. After each round, the data collected from all nodes is printed together:

[00:37:08] Metrics from 8 nodes
---------------------------------------------------------------------------
Node       GPU                    ID  Util   Mem-Used     Temp    Power
---------------------------------------------------------------------------
cu11       NVIDIA B300            0    0%  949/294912MB    36C   300.0W
cu11       NVIDIA B300            1    0%  946/294912MB    35C   300.0W
cu11       NVIDIA B300            2    0%  946/294912MB    35C   300.0W
...

Features:

Table-aligned display makes it easy to compare data across nodes
Multi-GPU nodes list every GPU
Rows are sorted by node and GPU ID
Multiple modules (CPU, memory, disk, network, GPU) can be monitored simultaneously
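
If the GPU table needs to be post-processed, for example to flag hot GPUs, its rows can be parsed with a regular expression. The following is a minimal Python sketch and not part of PODsys. It assumes that the column layout matches the sample output shown in this section exactly; tables produced by other modules are not covered. The names ROW, SAMPLE, and parse_gpu_rows are hypothetical.

```python
import re

# One data row: node, GPU name, GPU ID, utilization, memory, temperature, power.
ROW = re.compile(
    r"^(?P<node>\S+)\s+(?P<gpu>.+?)\s+(?P<id>\d+)\s+(?P<util>\d+)%\s+"
    r"(?P<used>\d+)/(?P<total>\d+)MB\s+(?P<temp>\d+)C\s+"
    r"(?P<power>\d+(?:\.\d+)?)W$"
)

# Sample copied from the output shown in this section.
SAMPLE = """\
[00:37:08] Metrics from 8 nodes
---------------------------------------------------------------------------
Node       GPU                    ID  Util   Mem-Used     Temp    Power
---------------------------------------------------------------------------
cu11       NVIDIA B300            0    0%  949/294912MB    36C   300.0W
cu11       NVIDIA B300            1    0%  946/294912MB    35C   300.0W
cu11       NVIDIA B300            2    0%  946/294912MB    35C   300.0W
"""


def parse_gpu_rows(text):
    """Return one dict per GPU row; headers and separators are skipped."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match is None:
            continue  # timestamp, header, separator or unrelated line
        rows.append({
            "node": match["node"],
            "gpu": match["gpu"],
            "id": int(match["id"]),
            "util_pct": int(match["util"]),
            "mem_used_mb": int(match["used"]),
            "mem_total_mb": int(match["total"]),
            "temp_c": int(match["temp"]),
            "power_w": float(match["power"]),
        })
    return rows


if __name__ == "__main__":
    for row in parse_gpu_rows(SAMPLE):
        print(row["node"], row["id"], row["temp_c"], row["power_w"])
```

For machine consumption on a single node, the -j option of podsys-metrics is likely a more robust source than scraping the table, since the table layout may change between releases.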
