Troubleshooting

From SystemImager


This is by no means complete and probably contains some inaccuracies as well. I'd rather like to look at it as a starting point where those who know the right words will change mine as well as add new sections as they come up.

It may make more sense to actually break this page into multiple ones for easier navigation. Finally, I've put in several headers with no text, some because I have nothing to say and some because it'd be nice to see others contribute. --mark

Introduction

The process of generating an image and installing it on one or more other systems consists of a number of steps, any of which can fail or produce unexpected results. This page is intended as a starting point for identifying and resolving those problems. It is not a comprehensive guide but something that will grow over time, so a problem not documented here today may be covered tomorrow.

[I'm going to focus on installations for now. --mark]

Language settings

SystemImager commands support only English-language locales (e.g. LANG=C or LANG=en_US.UTF-8). Be sure to configure your system (typically the image server) to use one of them.
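
A minimal sketch of one way to check the locale before running a SystemImager command. The helper name is_english_locale is hypothetical and not part of SystemImager, and the set of accepted values (C, POSIX and en_*) is an assumption based on the examples above.

```shell
#!/bin/bash
# Hypothetical helper (not part of SystemImager): succeed only for
# C, POSIX or en_* locale names such as en_US.UTF-8.
is_english_locale() {
    case "$1" in
        C|C.*|POSIX|en_*) return 0 ;;
        *) return 1 ;;
    esac
}

# Fall back to the C locale for this shell if LANG is something else.
if ! is_english_locale "${LANG:-C}"; then
    export LANG=C LC_ALL=C
fi
```

Alternatively, a single command can be run with a temporary locale by prefixing it with LANG=C.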

See also:

Preparing a Golden Image

Cloning a SuSE10 golden client with SystemImager

By default, SuSE maps network interface names persistently to the hardware of the golden client. When the image is distributed to other target clients, the network interface names are shifted.

To change this behaviour, modify the file /etc/sysconfig/network/config as follows:

 ... 
 ## Type:        yesno
 ## Default:     yes
 #
 # Forces all interfaces eth* ath* wlan* and ra* to be persistent via udev.
 # See /usr/share/doc/package/sysconfig/README.Persistent_Interface_Names for
 # details.
 #
 FORCE_PERSISTENT_NAMES=no
 ...
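
The same change can be scripted. This is a sketch only: the function name is hypothetical, and it assumes GNU sed (for the -i option) and that the FORCE_PERSISTENT_NAMES line already exists in the file.

```shell
#!/bin/bash
# Set FORCE_PERSISTENT_NAMES=no in a sysconfig network config file.
# $1 = path to the file, on the golden client or inside the image.
disable_persistent_names() {
    sed -i 's/^FORCE_PERSISTENT_NAMES=.*/FORCE_PERSISTENT_NAMES=no/' "$1"
}
```

On the golden client this would be called as disable_persistent_names /etc/sysconfig/network/config before the image is retrieved.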

Retrieving a Golden Image

Could not retrieve image with si_getimage

  • have you started si_prepareclient on your golden client?
  • is your golden client reachable from the image server (check with ping)?
  • check the firewall on the golden client: TCP port 873 must be open to your image server on your management LAN
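
The checks above can be combined into a small pre-flight test run from the image server. This is a sketch under assumptions: the function names are hypothetical, the port probe relies on bash's /dev/tcp pseudo-device and the timeout command from coreutils, and the ping options are those of the Linux iputils ping.

```shell
#!/bin/bash
# Hypothetical pre-flight check before running si_getimage.

# Return 0 if a TCP connection to host $1, port $2 can be opened.
# Uses bash's /dev/tcp pseudo-device and gives up after 3 seconds.
port_open() {
    timeout 3 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

# $1 = golden client host name or IP address.
check_golden_client() {
    local host="$1"
    if ! ping -c 1 -W 2 "$host" >/dev/null 2>&1; then
        echo "unreachable: $host does not answer ping"
        return 1
    fi
    if ! port_open "$host" 873; then
        echo "port 873/tcp closed: is si_prepareclient running, or is a firewall in the way?"
        return 1
    fi
    echo "ok: $host answers ping and accepts connections on port 873"
}
```

For example, check_golden_client 172.16.36.128 would test the golden client used in the retrieval example on this page.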

IO errors encountered?

  • rsync returns the error: "IO error encountered -- skipping file deletion"

This is usually caused by inaccessible directories on the client, such as mount points with no medium present. For example, this is what usually happens on SuSE:

 Retrieving image suse10 from 172.16.36.128
 ------------- suse10 IMAGE RETRIEVAL PROGRESS -------------
 rsync: opendir "/media/floppy" (in root) failed: No medium found (123)
 receiving file list ... done
 IO error encountered -- skipping file deletion

To fix this problem, exclude the faulty directories from the retrieval using the --exclude or --exclude-file options (for more information see man si_getimage).
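
To find out which directories to exclude, the rsync error lines can be filtered out of the retrieval output. The sketch below assumes the error lines have exactly the form shown in the example above; the function name and the log file name are hypothetical.

```shell
#!/bin/bash
# Print the directories that rsync could not open, one per line.
# Reads the image retrieval output on standard input.
failed_dirs() {
    sed -n 's/^rsync: opendir "\(.*\)" (in [^)]*) failed:.*/\1/p'
}
```

If the retrieval output was saved to a file such as getimage.log, then failed_dirs < getimage.log lists the paths to exclude; for the output shown above it prints /media/floppy.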

Installing a Golden Image

The first step in troubleshooting Golden Image installation problems is to understand the many steps this process goes through, determine which step is failing and then address that step. The following is a list of the major steps.

  • boot the client in PXE mode, causing it to request an address via DHCP. You should see the DHCP request [need example]
  • the client receives an address as well as the name of the binary to load (pxelinux.bin for ia32 and x86_64 systems)
  • the binary is loaded and then requests that kernel and initrd.img be downloaded from the /tftpboot directory on the image server
  • the kernel boots into a BusyBox Linux environment, which in turn requests that BOEL be loaded via rsync, bittorrent or flamethrower
  • a hosts file is also loaded via rsync [I think]. Using the name associated with the IP address assigned to the client, an attempt is made to download a script named 'host'.sh via rsync; if it does not exist, a script named 'image'.master is downloaded and started instead
  • the master install script then goes through several steps, the main ones being:
    • copy/execute the pre-install scripts [I didn't see this in my scripts]
    • stop any RAID devices
    • reproduce the partition structure on the local disks as generated by 'si_prepareclient' and stored in /etc/systemimager/autoinstallscript.conf in the image
    • download the image and overrides using rsync, bittorrent or flamethrower
    • install the boot loader using systemconfigurator
    • copy/execute the post-install scripts

Ramdisk too small? Getting errors?

  • if you get errors while uncompressing initrd, like the following:
 RAMDISK: Compres iso_blknum=17, block=-2147483648
 No filesystem could mount root, tried: ext2 iso9660
 Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(1,0)

Try changing the block size of the ramdisk by adding the option ramdisk_blocksize=1024 to the kernel boot parameters. This error has been reported on RHEL5 and CentOS5 using UYOK.

  • if you get errors about accessing beyond the end of the device, like the following:
 RAMDISK: incomplete write (-28 != 32768) 16777216
 VFS: Mounted root (XXX filesystem) readonly.
 Freeing unused kernel memory: 192k freed
 attempt to access beyond end of device

Add the option ramdisk_size=80000 (or greater) to the kernel boot parameters (typically in /etc/systemimager/pxelinux.cfg/syslinux.cfg or /etc/systemimager/pxelinux.cfg/default on your image server). You can add it to only the systemimager label or to all the labels; by default the systemimager label is used.

If you're using UYOK use the value suggested by si_prepareclient (look at the output of the command).
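
Kernel boot parameters such as ramdisk_size can be added by hand or with a small helper like the one below. This is a sketch, not a SystemImager tool: the function name is hypothetical, it assumes GNU sed, and it adds the parameter to every APPEND line in the file rather than to one label only.

```shell
#!/bin/bash
# Append a key=value kernel parameter to every APPEND line that does not
# already carry that key. An existing value for the key is left untouched.
# $1 = pxelinux config file, $2 = parameter such as ramdisk_size=80000
add_boot_param() {
    local cfg="$1" param="$2"
    local key="${param%%=*}"
    sed -i -e '/^[[:space:]]*APPEND[[:space:]]/{' \
           -e "/[[:space:]]${key}=/!s/\$/ ${param}/" \
           -e '}' "$cfg"
}
```

For example: add_boot_param /etc/systemimager/pxelinux.cfg/syslinux.cfg ramdisk_size=80000. Running it twice does not add the parameter a second time.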

FATAL: Kernel too old

This error can happen with UYOK when your golden client has an old 2.4 kernel. The glibc in the standard initrd.img (shipped with the SystemImager binary packages) is built against the linux-2.6 headers, so it may be incompatible with your 2.4 kernel.

To resolve this, build SystemImager from source using your golden client (or a machine with the same kernel and glibc) as the build machine, or use a different boot package (UYOK) that uses a 2.6 kernel.

See also http://www.mail-archive.com/sisuite-users@lists.sourceforge.net/msg04105.html

Don't see DHCP request

  • is the node configured to PXE boot?

You can see the client request an address but it never receives one

  • is a firewall preventing the request from getting through?
  • on the Image Server run dhcpd interactively so you can watch the requests: "dhcpd -d -f". When the client requests an address it announces its MAC address, and you should see a request from that MAC address on the Image Server. If not, dhcpd is not listening on the correct interface and/or the physical network may be misconfigured. If dhcpd reports 'no free leases', check /etc/dhcpd.conf to make sure it is properly configured to hand out an address to that particular client.
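
On a busy network the interactive dhcpd output can be hard to follow. A small filter such as the following (hypothetical name, plain grep underneath) keeps only the lines that mention one client's MAC address, regardless of letter case.

```shell
#!/bin/bash
# Filter dhcpd log lines on standard input for one client's MAC address.
# $1 = MAC address in any letter case, e.g. 00:18:8B:F7:E2:FA.
lines_for_mac() {
    grep -i -F -- "$1"
}
```

Since dhcpd -d writes its log to standard error, a typical use would be: dhcpd -d -f 2>&1 | lines_for_mac 00:18:8b:f7:e2:fa (the MAC address here is only an example).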

The client never starts running pxelinux.bin (x86/x86_64) or elilo.efi (ia64)

  • Is the tftpd server running? Is pxelinux (or elilo) in /tftpboot?
  • is the filename present and correct in /etc/dhcpd.conf? The filename should be given relative to /tftpboot/.
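
Both checks can be automated roughly as follows. This is a sketch with a hypothetical function name; it assumes each filename statement in dhcpd.conf sits on its own line with the name in double quotes.

```shell
#!/bin/bash
# List each 'filename "..."' entry in a dhcpd.conf and report whether the
# file exists below the TFTP root.
# $1 = dhcpd.conf path, $2 = TFTP root (defaults to /tftpboot).
check_boot_filenames() {
    local conf="$1" root="${2:-/tftpboot}" name
    sed -n 's/^[[:space:]]*filename[[:space:]]*"\([^"]*\)".*/\1/p' "$conf" |
    while read -r name; do
        if [ -f "$root/$name" ]; then
            echo "found: $root/$name"
        else
            echo "missing: $root/$name"
        fi
    done
}
```

For example, check_boot_filenames /etc/dhcpd.conf prints one "found" or "missing" line per boot file referenced in the configuration.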

The client can't find kernel and/or initrd.img

  • are they in /tftpboot?
  • does /etc/pxelinux.cfg/default or the appropriate link file in /etc/pxelinux.cfg exist, and does it contain references to them?

Your hardware is not well supported?

If you have problems with particular hardware (disk controllers, network cards, etc.) and some devices are not properly recognized, try the UYOK feature and boot with the kernel you'll use in production.

Problems with BOEL/UYOK kernels

  • does the kernel reboot or panic during the installation?
    • try to run the kernel in "safe mode" adding the following options to the boot parameters (usually in /etc/systemimager/pxelinux.cfg/syslinux.cfg):
 LABEL systemimager
 KERNEL kernel
 APPEND <your_options> ide=nodma apm=off acpi=off noapic edd=off

Problems with UYOK monolithic kernels (without modules support) or self-compiled kernels

  • Be sure the following options are enabled in your kernel:
 CONFIG_PACKET=y
 CONFIG_BLK_DEV_RAM=y
 CONFIG_BLK_DEV_INITRD=y
 CONFIG_BLK_DEV_LOOP=y
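
A quick way to verify these options is to search the kernel's config file. The sketch below uses a hypothetical function name and assumes the config is available as a plain-text file; many distributions install it as /boot/config-<kernel version>, but that location is an assumption.

```shell
#!/bin/bash
# Print every required option that is not set to "y" in a kernel config.
# $1 = kernel config file, e.g. /boot/config-$(uname -r).
missing_kernel_options() {
    local cfg="$1" opt
    for opt in CONFIG_PACKET CONFIG_BLK_DEV_RAM \
               CONFIG_BLK_DEV_INITRD CONFIG_BLK_DEV_LOOP; do
        grep -qx "${opt}=y" "$cfg" || echo "$opt"
    done
}
```

No output means all four options are built in. An option built as a module (=m) is reported as missing, because a monolithic kernel cannot load it.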

You don't see BusyBox start

  • is the kernel/initrd.img valid for this architecture?

Too many I/O errors on /dev/fd0?

  • some servers that do not have a floppy controller may report a lot of I/O errors and slow down the installation when trying to load a local.cfg from floppy. If you're not using a local.cfg you can skip this step by adding SKIP_LOCAL_CFG=y to the kernel boot parameters (usually defined in /etc/systemimager/pxelinux.cfg/syslinux.cfg).

BOEL binaries can't be loaded

  • is a firewall preventing it?
  • is BOEL installed in /usr/share/systemimager/boot/xxx/standard/, where xxx is the appropriate CPU architecture for the client?

Hostname cannot be determined for client

  • is there an entry in the hosts file [directory?] for this host's IP address?

Installation script cannot be found

  • check /var/lib/systemimager/scripts
  • check /etc/systemimager/rsyncd.conf
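
To see which script a given client would pick up, the lookup order described in the list of installation steps ('host'.sh first, otherwise 'image'.master) can be imitated on the image server. This is an illustration only, not the code the installer runs; the function name and the host name node01 are hypothetical.

```shell
#!/bin/bash
# Illustration only: imitate the lookup order of the install scripts.
# $1 = scripts directory, $2 = client host name, $3 = image name.
pick_install_script() {
    local dir="$1" host="$2" image="$3"
    if [ -e "$dir/$host.sh" ]; then
        echo "$dir/$host.sh"
    elif [ -e "$dir/$image.master" ]; then
        echo "$dir/$image.master"
    else
        return 1
    fi
}
```

For example, pick_install_script /var/lib/systemimager/scripts node01 suse10 prints the script that matches, and returns a non-zero status if neither file exists.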

Installation Script Problems

  • Can't stop RAID devices
  • Pre-Install Script(s) didn't execute
  • Partitioning Failed
    • make sure the failing device really exists
    • make sure the partition sizes are valid. Is there a problem with /etc/systemimager/autoinstallscript.conf in the original image?
  • SystemConfigurator Failed
  • Post-Install Script(s) didn't execute

Debugging Tips/Tricks for Installation Script

  • since this is running on top of Linux, you can manually execute any command the script can (but you need to be chroot'd into /a/)
  • you can pipe output to a file and copy that file over to a remote system for further analysis using rsync

System Configurator Problems

System Configurator problems can be tricky to track down.

  • if SystemConfigurator fails or reports an error, check whether you are using a stable release. At the moment the stable release of SystemConfigurator is 2.2.2, available here. Otherwise try the release maintained in OSCAR; the last stable release is available here.

Boot loader problems

  • add the --verbose switch to the systemconfigurator command to get additional information. If you prefer, you can edit this into the installation script.
  • to verify that a boot record has really been written, run the command 'hexdump -C /dev/sda | less', substituting your boot device for 'sda'. You should see the string GRUB somewhere around offset 0x170.
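
The hexdump check can also be done non-interactively. The sketch below (hypothetical function name) reads only the first 512-byte sector, which assumes the GRUB string lies within that sector, as the offset quoted above suggests.

```shell
#!/bin/bash
# Succeed if the first 512-byte sector of a device or file contains "GRUB".
# $1 = boot device, e.g. /dev/sda (reading it usually requires root).
has_grub_signature() {
    dd if="$1" bs=512 count=1 2>/dev/null | grep -a -q 'GRUB'
}
```

For example: has_grub_signature /dev/sda && echo "boot record written", substituting your boot device for sda.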

Problems on PPC64 architectures

There are some issues to be resolved with PPC64 in versions <= 3.7.5. To manually fix those problems look here.

Moreover the zImage is not created automatically. To create it from the kernel+initrd.img generated by si_prepareclient (or si_mkbootpackage) use the command mkzimage:

 # mkzimage
 Usage: /usr/bin/mkzimage <vmlinuz>|no <config>|no <sysmap>|no <initrd>|no <zImage.stub> <output>
 # mkzimage kernel no initrd.img no zImage.stub-2.6.9-42.EL output-zImage

Note: the initrd.img should not be too large; otherwise try to use the option --my-modules with si_prepareclient.

Network Interface Naming

Debian 4.0 (Etch)

Debian 4.0 has an issue which was not evident in older releases. After imaging a client, the network interfaces are not configured correctly, i.e. they do not exist. This is due to the file /etc/udev/rules.d/z25_persistent-net.rules, which contains persistent device names bound to the MAC addresses of the network cards found on the golden client.

e.g.

 # PCI device 0x14e4:0x1659 (tg3)
 SUBSYSTEM=="net", DRIVERS=="?*", ATTRS{address}=="00:18:8b:f7:e2:fa", NAME="eth0"
 
 # PCI device 0x14e4:0x1659 (tg3)
 SUBSYSTEM=="net", DRIVERS=="?*", ATTRS{address}=="00:18:8b:f7:e2:fb", NAME="eth1"

The simplest solution is to delete the existing entries from this file in the image itself, or to edit the file so that it contains the correct MAC addresses and drivers.
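
Deleting the entries can be scripted, for instance as below. This is a sketch with a hypothetical function name; it assumes GNU sed and that every MAC-bound rule matches on ATTRS{address}, as in the example above.

```shell
#!/bin/bash
# Remove every line that mentions ATTRS{address} (the MAC-bound rules) from
# a persistent-net rules file, leaving the other lines in place.
# $1 = rules file inside the image.
strip_mac_rules() {
    sed -i '/ATTRS{address}==/d' "$1"
}
```

The argument would be the copy of z25_persistent-net.rules inside the image directory on the image server, not the one on a running system.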

Ubuntu

The most time-consuming problem was not an SI problem per se, but a problem in recent Ubuntu packages, where the system reorders the ethernet adapter names upon reboot. For instance, on a node that has two ethernet adapters, the adapters show up as eth0 and eth1 during the install, but after the reboot they are eth2 and eth3, and which one is attached to the network (the one with the lowest MAC address) is more or less random.

To fix this problem, remove the entries in /etc/iftab, which contains static eth<->MAC mappings. For example, a bad (for SystemImager) /etc/iftab looks like the following:

 # This file assigns persistent names to network interfaces.
 # See iftab(5) for syntax.
 
 eth0 mac 00:01:6c:e9:66:82 arp 1
 eth1 mac 00:13:ce:8a:55:1c arp 1

A good /etc/iftab is:

 # This file assigns persistent names to network interfaces.
 # See iftab(5) for syntax.

A possible solution to fix network interface naming

The best solution we've found to the network interface naming problem is to bind the interfaces to their PCI bus addresses. Binding to the MAC address is not a practical solution for a big cluster, because it requires a lot of work to keep the MACs up to date across many clients.

Here is an example of a udev config rule:

 # cat /etc/udev/rules.d/60-net.rules
 ACTION=="add", SUBSYSTEM=="net", BUS=="pci", ID=="0000:07:00.0", NAME="eth0"
 ACTION=="add", SUBSYSTEM=="net", BUS=="pci", ID=="0000:07:00.1", NAME="eth1"
 ACTION=="add", SUBSYSTEM=="net", BUS=="pci", ID=="0000:04:04.0", NAME="eth2"
 ACTION=="add", SUBSYSTEM=="net", BUS=="pci", ID=="0000:04:04.1", NAME="eth3"
 SUBSYSTEM=="net", RUN+="/etc/sysconfig/network-scripts/net.hotplug"

With this configuration the network interface installed at PCI address 0000:07:00.0 will always be named eth0, the network card at 0000:07:00.1 will always be named eth1, etc. This is valid for all clients that have the same hardware.

If you have clients with different hardware, you can discover the appropriate PCI bus addresses and manage the different configurations via overrides (for example).

To discover the PCI address for a network interface just use udevinfo. For example:

 # udevinfo -a -p /sys/class/net/eth0
 ...
 looking at parent device '/devices/pci0000:00/0000:00:0a.0/0000:07:00.0':
   ID=="0000:07:00.0" <--- here is the PCI address!!!
   BUS=="pci"
   DRIVER=="e1000"
 ...
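
Where udevinfo is not at hand, the same address can usually be read from sysfs, because /sys/class/net/<interface>/device is a symbolic link to the underlying device. The sketch below (hypothetical function name) prints a rule in the same syntax as the example above; it assumes the interface is a PCI device, and newer udev releases may expect different match keys.

```shell
#!/bin/bash
# Print a udev rule that binds an interface name to its PCI address.
# $1 = interface name, $2 = sysfs root (defaults to /sys).
pci_rule_for_iface() {
    local iface="$1" sysfs="${2:-/sys}" link addr
    link="$sysfs/class/net/$iface/device"
    # Virtual interfaces have no backing device; report failure for them.
    [ -e "$link" ] || return 1
    addr=$(basename "$(readlink -f "$link")")
    printf 'ACTION=="add", SUBSYSTEM=="net", BUS=="pci", ID=="%s", NAME="%s"\n' \
        "$addr" "$iface"
}
```

For example, pci_rule_for_iface eth0 prints a rule line that can be reviewed and then added to the rules file.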