User Guide for podsys-gbxx

TIP

podsys-gbxx-2204 is based on Ubuntu Server 22.04.5 LTS (ARM64).
podsys-gbxx-2404 is based on Ubuntu Server 24.04.4 LTS (ARM64).
podsys-gbxx supports GB200 or later (NVIDIA Grace Blackwell).

1. Environment Requirements for the Management Node

TIP

Select one machine as the management node and the rest as compute nodes.
All PODsys deployment operations are performed on the management node.

1.1 Use a Node with an Existing Operating System

If the management node already has an operating system installed, perform the following steps:

  • Create a new user: nexus.
shell
sudo adduser nexus
sudo usermod -aG sudo nexus
sudo -su nexus
  • Install Docker Engine.
  • Install PDSH.
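
The three prerequisites above can be checked before going further. The following is a minimal sketch, assuming the Docker and PDSH command-line tools are installed under the names docker and pdsh; the helper names have_cmd and check_prereqs are hypothetical and not part of PODsys.

```shell
#!/usr/bin/env bash
# Succeeds if the given command is available on PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# Prints one status line per prerequisite and returns non-zero
# if the nexus user, docker, or pdsh is missing.
check_prereqs() {
    local cmd missing=0
    if id nexus >/dev/null 2>&1; then
        echo "user nexus: ok"
    else
        echo "user nexus: missing"
        missing=1
    fi
    for cmd in docker pdsh; do
        if have_cmd "$cmd"; then
            echo "$cmd: ok"
        else
            echo "$cmd: missing"
            missing=1
        fi
    done
    return "$missing"
}

# Usage: check_prereqs || echo "install the missing prerequisites first"
```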

1.2 Use a freshly installed management node

  • The management node supports both ARM and x86 CPU architectures.

  • Manually install Ubuntu Server on the management node through BMC or USB, with the following settings:

yaml
version : 22.04.5 or 24.04.4
username : nexus

1.3 Download the podsys software package

Download podsys-gbxx-2404.tgz or podsys-gbxx-2204.tgz from the Download page, then verify the MD5 checksum and extract the archive:

shell
echo "ee0c5d4ad4d828db270c6d3d8babaef8  podsys-gbxx-2404.tgz" | md5sum --check
sudo tar -xzvf podsys-gbxx-2404.tgz -C /home/nexus/
shell
echo "4b6e9b5f8e2faec18a8a25660615bc55  podsys-gbxx-2204.tgz" | md5sum --check
sudo tar -xzvf podsys-gbxx-2204.tgz -C /home/nexus/
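
The commands above print the check result but still extract the archive if the check fails. A minimal sketch of a guard that extracts only after a successful check might look like this; verify_md5 is a hypothetical helper, not part of PODsys.

```shell
#!/usr/bin/env bash
# verify_md5 FILE EXPECTED_HASH
# Succeeds only if FILE exists and its MD5 digest equals EXPECTED_HASH.
verify_md5() {
    local file=$1 expected=$2 actual
    if [ ! -f "$file" ]; then
        echo "missing: $file" >&2
        return 2
    fi
    actual=$(md5sum "$file" | awk '{print $1}')
    if [ "$actual" = "$expected" ]; then
        echo "OK: $file"
    else
        echo "MISMATCH: $file (got $actual)" >&2
        return 1
    fi
}

# Usage (hash taken from the command above):
# verify_md5 podsys-gbxx-2404.tgz ee0c5d4ad4d828db270c6d3d8babaef8 \
#     && sudo tar -xzvf podsys-gbxx-2404.tgz -C /home/nexus/
```
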
  • Download the Ubuntu ISO and the CUDA installer.
shell
# Download the ISO to podsys-gbxx-2404/workspace/
cd /home/nexus/podsys-gbxx-2404/workspace/
wget https://cdimage.ubuntu.com/releases/24.04.4/release/ubuntu-24.04.4-live-server-arm64.iso

# Download the CUDA installer to podsys-gbxx-2404/workspace/drivers/
cd /home/nexus/podsys-gbxx-2404/workspace/drivers/
wget https://developer.download.nvidia.com/compute/cuda/13.0.3/local_installers/cuda_13.0.3_580.126.20_linux_sbsa.run
sudo chmod 755 cuda_13.0.3_580.126.20_linux_sbsa.run
shell
# Download the ISO to podsys-gbxx-2204/workspace/
cd /home/nexus/podsys-gbxx-2204/workspace/
wget https://cdimage.ubuntu.com/releases/22.04.5/release/ubuntu-22.04.5-live-server-arm64.iso

# Download the CUDA installer to podsys-gbxx-2204/workspace/drivers/
cd /home/nexus/podsys-gbxx-2204/workspace/drivers/
wget https://developer.download.nvidia.com/compute/cuda/13.0.2/local_installers/cuda_13.0.2_580.95.05_linux_sbsa.run
sudo chmod 755 cuda_13.0.2_580.95.05_linux_sbsa.run
  • Run install_manager.sh (if the management node is also a GB200 or GB300 tray).
shell
cd /home/nexus/podsys-gbxx-2404/
sudo ./install_manager.sh
  • After the management node reboots, run the following command to check the health of the software stack.
shell
sudo ./scripts/health-checks.sh
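
Before running install_manager.sh, it can help to confirm that the ISO and the CUDA installer from the download step landed in the expected folders. The sketch below assumes the file name patterns used in the wget commands above; check_workspace is a hypothetical helper, not part of PODsys.

```shell
#!/usr/bin/env bash
# check_workspace WORKSPACE_DIR
# Confirms that an Ubuntu ARM64 server ISO sits in WORKSPACE_DIR and that an
# executable CUDA .run installer sits in WORKSPACE_DIR/drivers.
check_workspace() {
    local ws=$1 f found=0
    for f in "$ws"/ubuntu-*-live-server-arm64.iso; do
        if [ -f "$f" ]; then echo "iso:  $f"; found=1; fi
    done
    if [ "$found" -eq 0 ]; then
        echo "no Ubuntu ISO found in $ws" >&2
        return 1
    fi
    found=0
    for f in "$ws"/drivers/cuda_*_linux_sbsa.run; do
        if [ -x "$f" ]; then echo "cuda: $f"; found=1; fi
    done
    if [ "$found" -eq 0 ]; then
        echo "no executable CUDA installer found in $ws/drivers" >&2
        return 1
    fi
}

# Usage: check_workspace /home/nexus/podsys-gbxx-2404/workspace
```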

2. Steps to Deploy the Compute Nodes

All installation steps for the compute nodes are run on the management node.

2.1 Modifying config.yaml

shell
cd podsys-gbxx-2404
sudo vim workspace/config.yaml

Fields that need to be modified:

yaml
manager_ip: 192.168.0.1
manager_nic: ens6f0
dhcp_s: 192.168.0.1
dhcp_e: 192.168.0.200
compute_passwd: your_passwd
compute_storage: MIN_NVME
compute_username: nexus
  • manager_ip: The IP address of the management node.
  • manager_nic: The NIC identifier of the management node.
  • dhcp_s and dhcp_e: The start and end of the IP address range allocated to compute nodes during network installation.

WARNING

The DHCP range must contain more IP addresses than there are compute nodes.

Example

  • Management node IP: 192.168.0.201/16.
  • The 200 compute nodes are assigned 192.168.0.1/24 to 192.168.0.200/24 in ct_os_iplist.txt.
  • You can set dhcp_s to 192.168.1.1 and dhcp_e to 192.168.3.254.
  • This DHCP range does not conflict with the static IPs assigned to the compute nodes, stays reachable from the management node (both fall within 192.168.0.0/16), and provides enough addresses.
  • After installation, you can change the management node's address from 192.168.0.201/16 to 192.168.0.201/24.
  • compute_passwd: The user password for the compute node.
  • compute_storage: The disk on which the compute node's system is installed (e.g., sda, sdb, nvme0n1).
    You can also use MIN_SATA, MAX_SATA, MIN_NVME, or MAX_NVME.
    For GBxx, use MIN_NVME.

WARNING

Installation overwrites all existing data on the selected disk.

  • compute_username: The user name for the compute node.
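
The DHCP range rules above (enough addresses, no overlap with the static IPs) can be checked with a little arithmetic. The following is a minimal sketch using the values from the example; ip_to_int, range_size, and in_range are hypothetical helpers, and octets with leading zeros are not handled.

```shell
#!/usr/bin/env bash
# ip_to_int A.B.C.D -> prints the address as a 32-bit integer.
ip_to_int() {
    local a b c d
    IFS=. read -r a b c d <<< "$1"
    echo $(( (a << 24) + (b << 16) + (c << 8) + d ))
}

# range_size START END -> number of addresses in the inclusive range.
range_size() {
    echo $(( $(ip_to_int "$2") - $(ip_to_int "$1") + 1 ))
}

# in_range IP START END -> succeeds if IP lies inside the inclusive range.
in_range() {
    local ip s e
    ip=$(ip_to_int "$1")
    s=$(ip_to_int "$2")
    e=$(ip_to_int "$3")
    [ "$ip" -ge "$s" ] && [ "$ip" -le "$e" ]
}

# The example range holds 766 addresses, more than the 200 compute nodes.
range_size 192.168.1.1 192.168.3.254    # prints 766

# The last static compute-node IP must not fall inside the DHCP range.
if in_range 192.168.0.200 192.168.1.1 192.168.3.254; then
    echo "conflict: static IP inside DHCP range"
else
    echo "no conflict"
fi
```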

2.2 Modifying ct_os_iplist.txt

shell
sudo vim workspace/ct_os_iplist.txt

Example of "ct_os_iplist.txt":

Serial Number   hostname   IP address       gateway       DNS
24X01           node01     192.168.0.1/24   192.168.0.1   8.8.8.8
24X02           node02     192.168.0.2/24   none          none
24X03           node03     192.168.0.3/24   none          none

TIP

You can first prepare the data in a consistent format in an Excel spreadsheet, then copy it directly into a blank ct_os_iplist.txt file opened in Vim. To confirm that the fields are separated by tabs, run:

shell
cat -A ct_os_iplist.txt

The correct output should resemble:

24X01^Inode01^I192.168.0.1/24^Inone^Inone$
24X02^Inode02^I192.168.0.2/24^Inone^Inone$

Here, ^I represents a tab character used to separate different fields within each line, and $ denotes the end of a line.

DANGER

Every IP address in ct_os_iplist.txt must include its subnet mask (for example, 192.168.0.1/24).
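
The two format rules above (tab-separated fields, and a subnet mask on every IP) lend themselves to an automated check. This is a minimal sketch, assuming each line carries exactly the five fields shown in the example and no header row; validate_iplist is a hypothetical helper, not part of PODsys.

```shell
#!/usr/bin/env bash
# validate_iplist FILE
# Checks that every line has 5 tab-separated fields and that the third field
# (the IP address) ends in /<prefix length>. Prints each problem found and
# returns non-zero if any line fails.
validate_iplist() {
    awk -F'\t' '
        NF != 5 {
            printf "line %d: expected 5 tab-separated fields, found %d\n", NR, NF
            bad = 1
            next
        }
        $3 !~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+\/[0-9]+$/ {
            printf "line %d: IP \"%s\" has no subnet mask\n", NR, $3
            bad = 1
        }
        END { exit bad }
    ' "$1"
}

# Usage: validate_iplist workspace/ct_os_iplist.txt && echo "format looks fine"
```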

2.3 Running install_compute.sh

WARNING

Running the following script starts several services based on the configuration you filled in earlier. If you modify config.yaml or ct_os_iplist.txt afterwards, exit the script and run it again.

shell
sudo ./install_compute.sh

The script then opens an interactive command-line interface inside a running Docker container, for example:

 Welcome to the cluster deployment software
Configuring, please wait...
  ____     ___    ____    ____   __   __  ____
 |  _ \   / _ \  |  _ \  / ___|  \ \ / / / ___|
 | |_) | | | | | | | | | \___ \   \ V /  \___ \
 |  __/  | |_| | | |_| |  ___) |   | |    ___) |
 |_|      \___/  |____/  |____/    |_|   |____/

dhcp-config : /etc/dnsmasq.conf
user-data   : /user-data/user-data

starting services:
 * Starting DNS forwarder and DHCP server dnsmasq   [ OK ]
 * Starting nginx nginx                             [ OK ]

Running Monitor on following URLs:
 * http://192.168.0.201:5000

root@podsys:/ $

Preparation is now complete. Power on the compute nodes that are awaiting system installation.

TIP

  • Start the compute nodes without an installed operating system to initiate automatic installation.
  • If the compute nodes already have an operating system, they need to be placed in PXE (Preboot Execution Environment) mode.
  • If the compute node fails to enter PXE installation, check whether the firewall on the management node is disabled.
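
For the firewall check mentioned in the last tip, one option is to inspect the output of ufw status. This sketch assumes ufw is the firewall in use on the management node, which the guide does not state; firewall_inactive is a hypothetical helper.

```shell
#!/usr/bin/env bash
# firewall_inactive: reads `ufw status` output on stdin and succeeds only if
# it reports the firewall as inactive.
firewall_inactive() {
    grep -q '^Status: inactive'
}

# Usage on the management node:
# sudo ufw status | firewall_inactive || echo "ufw is active; PXE traffic may be blocked"
```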

2.4 Monitor the installation progress

You can check the installation progress by accessing http://manager_ip:5000.
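
Here manager_ip stands for the value set in config.yaml. A minimal sketch that builds the monitor URL from that file, assuming the flat `key: value` layout shown in section 2.1, could look like this; monitor_url is a hypothetical helper.

```shell
#!/usr/bin/env bash
# monitor_url CONFIG_YAML -> prints http://<manager_ip>:5000
monitor_url() {
    local ip
    ip=$(awk -F': *' '$1 == "manager_ip" { print $2; exit }' "$1")
    if [ -z "$ip" ]; then
        echo "manager_ip not found in $1" >&2
        return 1
    fi
    echo "http://${ip}:5000"
}

# Usage:
# curl -sf -o /dev/null "$(monitor_url workspace/config.yaml)" && echo "monitor is up"
```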

TIP

  • Each item shows a status icon: one icon indicates that detection is normal, while the other signifies that the hardware component is missing.
  • When a specific log name appears in the logs, you can click to view the log details.
  • An exclamation mark (!) in the disk column suggests that the disk specified in the config.yaml file does not exist on the node.

After the installation is complete, the compute nodes will reboot automatically.

If you want to test the performance of the compute nodes (see 3. Performance Testing), do not exit the command-line interface.

The test output appears in the logs on the monitor interface.

To exit the command-line interface, type "exit" and press Enter. This completes the installation of the compute nodes.

2.5 Check the installation result

Once all compute nodes have restarted, run the following command. It can be run repeatedly until all compute nodes are reachable over SSH.

shell
sudo ./config_server.sh -pre

INFO

  • The command generates a hosts.txt file in the podsys-gbxx folder. It contains the IP addresses of all nodes that can be reached over SSH and is needed for the subsequent parallel environment configuration.
  • It also sets up passwordless SSH login for the nexus user across all nodes.
  • After the hosts.txt file is fully generated, restart all installed compute nodes (otherwise, the 3.8 deployment check may fail).
shell
pdsh -l root -R ssh -w ^hosts.txt "date"
pdsh -l root -R ssh -w ^hosts.txt "reboot"
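
To judge whether hosts.txt is fully generated, you can compare it against ct_os_iplist.txt. The sketch below assumes hosts.txt holds one bare IP address per line, which the guide does not specify; missing_hosts is a hypothetical helper.

```shell
#!/usr/bin/env bash
# missing_hosts IPLIST HOSTS
# Prints every IP from IPLIST (third tab-separated field, mask stripped)
# that does not appear as a line in HOSTS.
missing_hosts() {
    awk -F'\t' '
        FILENAME == ARGV[1] { reachable[$1] = 1; next }
        {
            ip = $3
            sub(/\/.*/, "", ip)
            if (!(ip in reachable)) print ip
        }
    ' "$2" "$1"
}

# Usage: missing_hosts workspace/ct_os_iplist.txt hosts.txt
# No output means every listed node is present in hosts.txt.
```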

Run the following command to check the installation result:

shell
sudo ./config_client.sh -health-check

Example output:

shell
172.16.0.1: PODsys deployment successful
172.16.0.2: PODsys deployment successful
172.16.0.3: PODsys deployment successful
...
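
On a large cluster it is easier to list only the nodes that did not report success. The following is a minimal sketch, assuming every output line has the `IP: message` shape shown above; since the guide does not show a failure message, any line without the success text is reported. failed_nodes is a hypothetical helper.

```shell
#!/usr/bin/env bash
# failed_nodes: reads health-check output on stdin and prints the address of
# every non-empty line whose message is not "PODsys deployment successful".
failed_nodes() {
    awk -F': ' 'NF && $2 != "PODsys deployment successful" { print $1 }'
}

# Usage: sudo ./config_client.sh -health-check | failed_nodes
```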

Copyright © 2025 The PODsys Project. All rights reserved.