7. Known Issues and Troubleshooting
Important Note
If you encounter issues while operating PODsys, consult this section for troubleshooting guidance.
7.1 Compute Node Failure to Initiate PXE Installation
1. Enter the interactive command line in the Docker container

    cat /workspace/log/dnsmasq.log

Check whether the DHCP service has assigned IP addresses to the compute nodes.
2. Compute nodes have not been assigned IP addresses
Check whether the firewall on the management node is enabled. If it is, disable it.
The address range from dhcp_s to dhcp_e specified in config.yaml might be too small. Exit the container, modify config.yaml, and restart the container.
3. Compute nodes have been assigned IPs but file download errors occur
Check whether the IP address of the network interface on the management node is in the same subnet as the dhcp_s to dhcp_e range.
If the compute node shows "NBP filesize is 0 Bytes", possible causes include VLAN partitioning or other switch configuration that prevents file transfer, or a network card that is damaged or does not support PXE.
Checking Method
Connect the problematic compute node directly to the management node and retry.
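The DHCP check in step 1 can be narrowed down with a small helper. The following is a minimal sketch, not part of PODsys: the function name `list_dhcp_acks` is hypothetical, and it assumes the log uses the usual dnsmasq line layout, in which a `DHCPACK(interface)` token is followed by the assigned IP address and the client MAC address.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with PODsys): list the IP/MAC pairs that
# dnsmasq acknowledged in a log file.
# Assumed line layout: ... dnsmasq-dhcp[PID]: DHCPACK(iface) <ip> <mac> [hostname]
list_dhcp_acks() {
    local log_file="$1"
    awk '{
        # Find the DHCPACK(...) token and print the two fields after it.
        for (i = 1; i + 2 <= NF; i++) {
            if ($i ~ /^DHCPACK\(/) { print $(i + 1), $(i + 2); break }
        }
    }' "$log_file" | sort -u    # a node that renews its lease appears once
}

# Usage inside the container:
#   list_dhcp_acks /workspace/log/dnsmasq.log
```

If the output is empty, no compute node has completed a DHCP exchange, which points to the causes listed under step 2. If a node's MAC address appears here but the installation still fails, the causes under step 3 are more likely.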
7.2 Compute Node Installation Halt
View the installation status of the compute nodes through the PODsys monitoring interface. If a compute node does not detect a hard disk (Disk shows ✗), check the machine.
If the hard disk already shows ✓ but the installation gets stuck, there may be another system on a non-target disk of the node. For example, the target is sda, but there is a system on sdb. The cause could be a name conflict with ubuntu-vg-1.
Solution
- Delete the system on the non-target disk of the node.
- Change ubuntu-vg-1 to ubuntu-vg-2 in /user-data/user-data.
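To confirm the conflict before applying either fix, one option is to check whether the name is already taken on the node. This is a minimal sketch under two assumptions that the text above does not state: that the conflict is an LVM volume group name clash, and that you can reach a shell on the compute node (for example from a rescue environment). The function name `vg_name_in_use` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if the volume group name given as $1 appears
# in the list of names read from stdin (one name per line).
vg_name_in_use() {
    local wanted="$1" name
    while read -r name; do          # read trims the padding that vgs adds
        [ "$name" = "$wanted" ] && return 0
    done
    return 1
}

# Usage on the compute node (requires the LVM tools and root privileges):
#   vgs --noheadings -o vg_name | vg_name_in_use ubuntu-vg-1 \
#       && echo "ubuntu-vg-1 already exists on this node"
```

If the name is reported as in use, an existing system on a non-target disk is the likely owner.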
7.3 Exception Handling After Deployment
- The NVIDIA driver version and the nv-fabricmanager version must match. If you manually upgrade the driver, also upgrade nv-fabricmanager to the same version.
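A quick way to check this after an upgrade is to compare the two version strings. The sketch below is illustrative only: `versions_match` is a hypothetical helper, and the commented wiring assumes that `nv-fabricmanager --version` prints a line containing the version number, which may differ between releases.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only when both version strings are non-empty
# and identical.
versions_match() {
    local driver="$1" fabric="$2"
    [ -n "$driver" ] && [ "$driver" = "$fabric" ]
}

# Example wiring on a compute node (assumed output formats, adjust as needed):
#   driver=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
#   fabric=$(nv-fabricmanager --version | grep -oE '[0-9]+(\.[0-9]+)+' | head -n 1)
#   versions_match "$driver" "$fabric" || echo "version mismatch: $driver vs $fabric"
```

An empty driver string is treated as a mismatch, so a node where `nvidia-smi` fails is flagged rather than silently passed.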