
6. PODsys Metrics

podsys-metrics is a high-performance system monitoring tool for real-time collection of CPU, memory, disk, network (Ethernet/InfiniBand), and NVIDIA GPU performance metrics.

6.1 Features

  • CPU Monitoring: Utilization statistics for user mode, system mode, I/O wait, and idle time
  • Memory Monitoring: Memory usage statistics
  • Disk Monitoring: Disk read/write I/O statistics
  • Network Monitoring:
    • Ethernet interface send/receive statistics
    • InfiniBand port traffic statistics
  • GPU Monitoring: NVIDIA GPU utilization, memory usage, temperature, power consumption, frequency, etc.
  • CUDA Information: GPU device details and compute capability detection

6.2 Usage on Compute Nodes

After deploying compute nodes using PODsys, the podsys-metrics binary is located at /podsys/scripts/podsys-metrics on all compute nodes.

Note

Daemon mode is not enabled by default on compute nodes after installation. You can run the tool directly on an individual compute node, or start it in daemon mode for cluster monitoring.

Standalone Usage on Compute Node

Log in to any compute node and run directly:

bash
# Monitor GPU, JSON output, run 100 times
/podsys/scripts/podsys-metrics gpu -j -d 100

# Monitor CPU and memory simultaneously
/podsys/scripts/podsys-metrics cpu mem

# Monitor all modules
/podsys/scripts/podsys-metrics all

Command Options

Option                  Description
-j, --json              Output in JSON format
-d, --duration <num>    Specify monitoring iterations (default: 9999)
-h, --help              Display help information
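
The JSON output makes the tool easy to drive from a script. The following is a minimal Python sketch, not part of PODsys: it builds the command from the documented `-j` and `-d` options and decodes whatever JSON values arrive on stdout. The helper names (`build_command`, `decode_json_stream`, `collect`) are hypothetical, and because the JSON schema is not described here, the sketch assumes only that stdout contains one or more plain JSON values and does not interpret their fields.

```python
import json
import subprocess

# Binary location as stated in section 6.2.
PODSYS_METRICS = "/podsys/scripts/podsys-metrics"


def build_command(modules, iterations=1):
    """Build a podsys-metrics command line using the documented -j and -d options."""
    return [PODSYS_METRICS, *modules, "-j", "-d", str(iterations)]


def decode_json_stream(text):
    """Decode one or more JSON values that appear back to back in text.

    Assumption: the tool prints plain JSON values, possibly several in a row.
    Nothing is assumed about the fields inside them.
    """
    decoder = json.JSONDecoder()
    values = []
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        value, pos = decoder.raw_decode(text, pos)
        values.append(value)
    return values


def collect(modules, iterations=1):
    """Run podsys-metrics once and return the decoded JSON values."""
    result = subprocess.run(
        build_command(modules, iterations),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the tool exits non-zero
    )
    return decode_json_stream(result.stdout)
```

On a compute node, `collect(["gpu"], iterations=1)` would return the decoded values for a single GPU sample, which can then be inspected to learn the actual field names.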

6.3 Cluster Monitoring

podsys-metrics also supports a cluster mode, in which a server on the management node collects metrics from multiple client nodes.

Start Daemon on Compute Nodes

Use PDSH to start the daemon on all compute nodes from the management node:

bash
# Start daemon on all nodes (background mode)
pdsh -R ssh -w ^hosts.txt "/podsys/scripts/podsys-metrics daemon -p 9001 -f"

# Or start without background (for testing)
pdsh -R ssh -w ^hosts.txt "/podsys/scripts/podsys-metrics daemon -p 9001"

Note

  • hosts.txt contains the IP addresses of all compute nodes (generated by install_progress.sh)
  • Default daemon port is 9001
  • Use -f flag to run in background (fork mode)
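
Before starting the server, it can help to confirm that the daemon is reachable on every node. The following Python sketch is an illustration, not a PODsys utility: it reads hosts.txt (one address per line, the format pdsh expects for `-w ^file`) and attempts a TCP connection to the daemon port. It assumes the daemon accepts plain TCP connections on port 9001; the function names are hypothetical.

```python
import socket


def read_hosts(path):
    """Return the non-empty lines of a hosts file, one address per line."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def port_is_open(host, port=9001, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def unreachable_nodes(hosts, port=9001, timeout=2.0):
    """Return the hosts on which the daemon port could not be reached."""
    return [host for host in hosts if not port_is_open(host, port, timeout)]
```

Calling `unreachable_nodes(read_hosts("hosts.txt"))` from the management node returns an empty list when every daemon is listening; any address it does return is a node where the daemon should be checked or restarted.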

Server Startup

The server-side Python script is located at podsys-gbxx/scripts/metrics_server.py on the management node.

Start the server from the management node:

bash
cd podsys-gbxx/
python3 scripts/metrics_server.py -i hosts.txt <modules> [-d duration]

Parameters:

Parameter         Description
-i, --iplist      Required, node IP list file path
<modules>         Required, monitoring modules (cpu, mem, gpu, disk, eth, ib), multiple can be specified
-d, --duration    Monitoring iteration count (default: 1)
--server-port     Server listening port (default 9000, for receiving client connections)
--client-port     Client listening port (default 9001, for server to actively connect to clients)
--host            Server bind address (default 0.0.0.0)
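
When the server is launched from automation, validating the arguments first avoids a failed run across the whole cluster. The sketch below is a hypothetical Python helper, not part of PODsys: it assembles the command line from the parameters listed above and rejects module names outside the documented set. It assumes the port and host options each take a single value and that the command is run from the podsys-gbxx/ directory.

```python
# Module names as documented for metrics_server.py.
VALID_MODULES = ("cpu", "mem", "gpu", "disk", "eth", "ib")


def build_server_command(iplist, modules, duration=None,
                         server_port=None, client_port=None, host=None):
    """Build the metrics_server.py command line as a list of arguments.

    Optional parameters are left out when None, so the server's own
    defaults apply.
    """
    if not modules:
        raise ValueError("at least one monitoring module is required")
    unknown = [m for m in modules if m not in VALID_MODULES]
    if unknown:
        raise ValueError(f"unknown modules: {', '.join(unknown)}")

    cmd = ["python3", "scripts/metrics_server.py", "-i", iplist, *modules]
    if duration is not None:
        cmd += ["-d", str(duration)]
    if server_port is not None:
        cmd += ["--server-port", str(server_port)]
    if client_port is not None:
        cmd += ["--client-port", str(client_port)]
    if host is not None:
        cmd += ["--host", host]
    return cmd
```

The returned list can be passed directly to `subprocess.run`, which avoids shell quoting issues with file paths.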

Usage Examples

bash
# Monitor CPU, memory, and GPU for 10 iterations (run from podsys-gbxx/)
python3 scripts/metrics_server.py -i hosts.txt cpu mem gpu -d 10

# Monitor GPU only for 10 iterations
python3 scripts/metrics_server.py -i hosts.txt gpu -d 10

Output Format

Monitoring data is displayed as a table. After each round, the data collected from all nodes is printed together:

[00:37:08] Metrics from 8 nodes
---------------------------------------------------------------------------
Node       GPU                    ID  Util   Mem-Used     Temp    Power
---------------------------------------------------------------------------
cu11       NVIDIA B300            0    0%  949/294912MB    36C   300.0W
cu11       NVIDIA B300            1    0%  946/294912MB    35C   300.0W
cu11       NVIDIA B300            2    0%  946/294912MB    35C   300.0W
...

Features:

  • Table-aligned display for easy comparison of node data
  • All GPUs on a multi-GPU node are listed
  • Rows are sorted by node, then by GPU ID
  • Supports simultaneous monitoring of multiple modules (CPU, memory, disk, network, GPU)
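
If the table needs to be post-processed, for example to flag hot GPUs, the GPU rows can be parsed back into structured records. The following Python sketch is an illustration based only on the sample output shown above: the regular expression assumes the column order Node, GPU, ID, Util, Mem-Used, Temp, Power with the units from that sample, and the function name and dictionary keys are hypothetical. Rows for other modules would need their own patterns.

```python
import re

# Pattern derived from the sample GPU table; header, separator, and
# timestamp lines do not match and are skipped.
GPU_ROW = re.compile(
    r"^(?P<node>\S+)\s+(?P<gpu>.+?)\s+(?P<id>\d+)\s+(?P<util>\d+)%\s+"
    r"(?P<used>\d+)/(?P<total>\d+)MB\s+(?P<temp>\d+)C\s+"
    r"(?P<power>\d+(?:\.\d+)?)W$"
)


def parse_gpu_rows(text):
    """Return one dict per GPU row found in the table text."""
    rows = []
    for line in text.splitlines():
        match = GPU_ROW.match(line.strip())
        if match is None:
            continue
        rows.append({
            "node": match["node"],
            "gpu": match["gpu"],
            "id": int(match["id"]),
            "util_percent": int(match["util"]),
            "mem_used_mb": int(match["used"]),
            "mem_total_mb": int(match["total"]),
            "temp_c": int(match["temp"]),
            "power_w": float(match["power"]),
        })
    return rows
```

A filter such as `[r for r in parse_gpu_rows(output) if r["temp_c"] > 80]` would then list the GPUs above a chosen temperature threshold.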

Copyright © 2025 The PODsys Project. All rights reserved.