Bug #44158
closed
opensuse_15.1 machine in Sepia does not come back from reboot after installing new distro kernel
Description
With the following teuthology PR I am testing teuthology-suite --kernel distro with os_type: opensuse and os_version: "15.1":
It gets all the way to reboot, but - frustratingly - the teuthology server cannot re-establish an SSH connection with it:
I suspect that the opensuse_15.1 FOG image has some special initrd/GRUB configuration that is getting clobbered by the new distro kernel.
Updated by David Galloway about 4 years ago
- Project changed from teuthology to sepia
- Category set to Infrastructure Service
- Assignee set to David Galloway
So if you take a peek at the console log, the machine does come back up running the correct kernel but networking is broken.
Loading Linux 4.12.14-lp151.28.36-default ...
Loading initial ramdisk ...
[ 1.191172] pstore: lzo_decompress error, ret = -6!
[ 1.197383] pstore: decompression failed: -5
[ 1.202902] pstore: lzo_decompress error, ret = -6!
[ 1.209128] pstore: decompression failed: -5
[ 1.214644] pstore: lzo_decompress error, ret = -6!
[ 1.220872] pstore: decompression failed: -5
[ 1.226396] pstore: lzo_decompress error, ret = -6!
[ 1.232607] pstore: decompression failed: -5
[ 1.238137] pstore: lzo_decompress error, ret = -6!
[ 1.244332] pstore: decompression failed: -5
[ 1.249847] pstore: lzo_decompress error, ret = -6!
[ 1.256050] pstore: decompression failed: -5
+ '[' '!' -f /.cephlab_net_configured ']'
+ set +e
+ attempts=0
+ myips=
+ '[' '' '!=' '' ']'
+ '[' 0 -ge 10 ']'
++ ip -4 addr
++ grep -v '127.0.0.1\|127.0.1.1'
++ grep -oP '(?<=inet\s)\d+(\.\d+){3}'
+ myips=
+ attempts=1
+ sleep 1
+ '[' '' '!=' '' ']'
+ '[' 1 -ge 10 ']'
++ ip -4 addr
++ grep -oP '(?<=inet\s)\d+(\.\d+){3}'
++ grep -v '127.0.0.1\|127.0.1.1'
+ myips=
+ attempts=2
+ sleep 1
+ '[' '' '!=' '' ']'
+ '[' 2 -ge 10 ']'
++ ip -4 addr
++ grep -oP '(?<=inet\s)\d+(\.\d+){3}'
++ grep -v '127.0.0.1\|127.0.1.1'
+ myips=
+ attempts=3
+ sleep 1
+ '[' '' '!=' '' ']'
+ '[' 3 -ge 10 ']'
++ ip -4 addr
++ grep -oP '(?<=inet\s)\d+(\.\d+){3}'
++ grep -v '127.0.0.1\|127.0.1.1'
+ myips=
+ attempts=4
+ sleep 1
+ '[' '' '!=' '' ']'
+ '[' 4 -ge 10 ']'
++ ip -4 addr
++ grep -oP '(?<=inet\s)\d+(\.\d+){3}'
++ grep -v '127.0.0.1\|127.0.1.1'
+ myips=
+ attempts=5
+ sleep 1
+ '[' '' '!=' '' ']'
+ '[' 5 -ge 10 ']'
++ ip -4 addr
++ grep -oP '(?<=inet\s)\d+(\.\d+){3}'
++ grep -v '127.0.0.1\|127.0.1.1'
+ myips=
+ attempts=6
+ sleep 1
+ '[' '' '!=' '' ']'
+ '[' 6 -ge 10 ']'
++ ip -4 addr
++ grep -oP '(?<=inet\s)\d+(\.\d+){3}'
++ grep -v '127.0.0.1\|127.0.1.1'
+ myips=
+ attempts=7
+ sleep 1
+ '[' '' '!=' '' ']'
+ '[' 7 -ge 10 ']'
++ ip -4 addr
++ grep -oP '(?<=inet\s)\d+(\.\d+){3}'
++ grep -v '127.0.0.1\|127.0.1.1'
+ myips=
+ attempts=8
+ sleep 1
+ '[' '' '!=' '' ']'
+ '[' 8 -ge 10 ']'
++ ip -4 addr
++ grep -oP '(?<=inet\s)\d+(\.\d+){3}'
++ grep -v '127.0.0.1\|127.0.1.1'
+ myips=
+ attempts=9
+ sleep 1
+ '[' '' '!=' '' ']'
+ '[' 9 -ge 10 ']'
++ ip -4 addr
++ grep -oP '(?<=inet\s)\d+(\.\d+){3}'
++ grep -v '127.0.0.1\|127.0.1.1'
+ myips=
+ attempts=10
+ sleep 1
+ '[' '' '!=' '' ']'
+ '[' 10 -ge 10 ']'
+ set -e
+ '[' -n '' ']'
+ command -v zypper
+ '[' '!' -f /etc/ssh/ssh_host_rsa_key ']'
+ '[' -e /.cephlab_rc_local ']'
+ exit 0
Welcome to openSUSE Leap 15.1 - Kernel 4.12.14-lp151.28.36-default (ttyS1).
smithi205 login:
I'm not sure why /etc/rc.local is running again though. Here's rc.local output from the first time the machine boots using the FOG image.
Loading Linux 4.12.14-lp151.27-default ...
Loading initial ramdisk ...
[ 1.270671] pstore: lzo_decompress error, ret = -6!
[ 1.276935] pstore: decompression failed: -5
[ 1.282507] pstore: lzo_decompress error, ret = -6!
[ 1.288758] pstore: decompression failed: -5
[ 1.294312] pstore: lzo_decompress error, ret = -6!
[ 1.300587] pstore: decompression failed: -5
[ 1.306158] pstore: lzo_decompress error, ret = -6!
[ 1.312412] pstore: decompression failed: -5
[ 1.317965] pstore: lzo_decompress error, ret = -6!
[ 1.324216] pstore: decompression failed: -5
[ 1.329767] pstore: lzo_decompress error, ret = -6!
[ 1.336019] pstore: decompression failed: -5
+ '[' '!' -f /.cephlab_net_configured ']'
+ udevadm trigger
+ sleep 5
++ ls -1 /sys/class/net
++ grep -v lo
+ nics='eth0 eth1 eth2 eth3'
+ for nic in $nics
+ ifconfig eth0 up
/etc/init.d/boot.local: line 14: ifconfig: command not found
+ ip link set eth0 up
+ sleep 5
[  OK  ] Started wicked managed network interfaces.
[  OK  ] Reached target Network.
         Starting NTP client/server...
         Starting OpenSSH Daemon...
         Starting Permit User Sessions...
         Starting Load kdump kernel and initrd...
[  OK  ] Started Permit User Sessions.
[  OK  ] Started NTP client/server.
[  OK  ] Reached target System Time Synchronized.
[  OK  ] Started Backup of RPM database.
         Starting Postfix Mail Transport Agent...
[  OK  ] Started Balance block groups on a btrfs filesystem.
[  OK  ] Started Daily rotation of log files.
[  OK  ] Started Backup of /etc/sysconfig.
[  OK  ] Started Scrub btrfs filesystem, verify block checksums.
[  OK  ] Started Check if mainboard battery is Ok.
[  OK  ] Started Discard unused blocks once a week.
[  OK  ] Started Timeline of Snapper Snapshots.
[  OK  ] Reached target Timers.
         Starting Backup RPM database...
         Starting Backup /etc/sysconfig directory...
         Starting Check if mainboard battery is Ok...
         Starting Discard unused blocks on filesystems from /etc/fstab...
         Starting Rotate log files...
[  OK  ] Started Check if mainboard battery is Ok.
[  OK  ] Started Discard unused blocks on filesystems from /etc/fstab.
[  OK  ] Started Scrub btrfs filesystem, verify block checksums.
[  OK  ] Started Balance block groups on a btrfs filesystem.
[  OK  ] Started Rotate log files.
[  OK  ] Started Backup /etc/sysconfig directory.
[  OK  ] Started OpenSSH Daemon.
[  OK  ] Started Update cron periods from /etc/sysconfig/btrfsmaintenance.
[  OK  ] Started Postfix Mail Transport Agent.
[  OK  ] Started Command Scheduler.
+ grep -q 'Link detected: yes'
+ ethtool eth0
+ ifconfig eth0 down
/etc/init.d/boot.local: line 53: ifconfig: command not found
+ ip link set eth0 down
+ for nic in $nics
+ ifconfig eth1 up
/etc/init.d/boot.local: line 14: ifconfig: command not found
+ ip link set eth1 up
+ sleep 5
[  OK  ] Started Backup RPM database.
+ ethtool eth1
+ grep -q 'Link detected: yes'
+ ifconfig eth1 down
/etc/init.d/boot.local: line 53: ifconfig: command not found
+ ip link set eth1 down
+ for nic in $nics
+ ifconfig eth2 up
/etc/init.d/boot.local: line 14: ifconfig: command not found
+ ip link set eth2 up
+ sleep 5
+ ethtool eth2
+ grep -q 'Link detected: yes'
+ ifconfig eth2 down
/etc/init.d/boot.local: line 53: ifconfig: command not found
+ ip link set eth2 down
+ for nic in $nics
+ ifconfig eth3 up
/etc/init.d/boot.local: line 14: ifconfig: command not found
+ ip link set eth3 up
+ sleep 5
[  OK  ] Started Load kdump kernel and initrd.
+ ethtool eth3
+ grep -q 'Link detected: yes'
+ command -v zypper
+ echo -e 'DEVICE=eth3\nBOOTPROTO=dhcp\nONBOOT=yes'
+ set +e
+ ifdown eth3
wicked: skipping eth3 interface: device is not configured by wicked yet
wicked: ifdown: no matching interfaces
+ ifup eth3
wicked: /etc/sysconfig/network/routes[1]: Cannot create route - unable to find out address family
eth3 up
+ attempts=0
+ ping -I eth3 -nq -c1 172.21.0.11
PING 172.21.0.11 (172.21.0.11) from 172.21.15.205 eth3: 56(84) bytes of data.
--- 172.21.0.11 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.545/0.545/0.545/0.000 ms
+ '[' 0 == 5 ']'
+ touch /.cephlab_net_configured
+ break
+ set +e
+ attempts=0
+ myips=
+ '[' '' '!=' '' ']'
+ '[' 0 -ge 10 ']'
++ ip -4 addr
++ grep -oP '(?<=inet\s)\d+(\.\d+){3}'
++ grep -v '127.0.0.1\|127.0.1.1'
+ myips=172.21.15.205
+ attempts=1
+ sleep 1
+ '[' 172.21.15.205 '!=' '' ']'
+ set -e
+ '[' -n 172.21.15.205 ']'
+ for ip in $myips
+ timeout 1s ping -I 172.21.15.205 -nq -c1 172.21.0.1
++ dig +short -x 172.21.15.205 @172.21.0.1
++ sed 's/\.com.*/\.com/g'
+ newhostname=smithi205.front.sepia.ceph.com
+ '[' -n smithi205.front.sepia.ceph.com ']'
+ hostname smithi205.front.sepia.ceph.com
++ hostname -d
+ newdomain=front.sepia.ceph.com
++ hostname -s
+ shorthostname=smithi205
+ echo smithi205
+ grep -q front.sepia.ceph.com /etc/hosts
+ sed -i 's/.*front.sepia.ceph.com.*/172.21.15.205 smithi205.front.sepia.ceph.com smithi205/g' /etc/hosts
+ break
+ command -v zypper
+ '[' '!' -f /etc/ssh/ssh_host_rsa_key ']'
+ '[' -e /.cephlab_rc_local ']'
+ exit 0
[  OK  ] Started /etc/init.d/boot.local Compatibility.
[  OK  ] Started Getty on tty1.
[  OK  ] Started Serial Getty on ttyS1.
[  OK  ] Reached target Login Prompts.
[  OK  ] Reached target Multi-User System.
         Starting Update UTMP about System Runlevel Changes...
[  OK  ] Started Update UTMP about System Runlevel Changes.
Welcome to openSUSE Leap 15.1 - Kernel 4.12.14-lp151.27-default (ttyS1).
smithi205 login:
And here's rc.local
#!/bin/bash
# Redirect rc.local output to our console so it's in teuthology console logs
exec 2> /dev/ttyS1
exec 1>&2
set -ex
if [ ! -f /.cephlab_net_configured ]; then
  nics=$(ls -1 /sys/class/net | grep -v lo)
  for nic in $nics; do
    # Bring the NIC up so we can detect if a link is present
    ifconfig $nic up || ip link set $nic up
    # Sleep for a bit to let the NIC come up
    sleep 5
    if ethtool $nic | grep -q "Link detected: yes"; then
      if command -v zypper &>/dev/null; then
        echo -e "DEVICE=$nic\nBOOTPROTO=dhcp\nONBOOT=yes" > /etc/sysconfig/network/ifcfg-$nic
      elif command -v apt-get &>/dev/null; then
        echo -e "auto lo\niface lo inet loopback\n\nauto $nic\niface $nic inet dhcp" > /etc/network/interfaces
      else
        echo -e "DEVICE=$nic\nBOOTPROTO=dhcp\nONBOOT=yes" > /etc/sysconfig/network-scripts/ifcfg-$nic
      fi
      # Don't bail if NIC fails to go down or come up
      set +e
      # Bounce the NIC so it gets a DHCP address
      ifdown $nic
      ifup $nic
      attempts=0
      # Try for 5 seconds to ping our Cobbler host
      while ! ping -I $nic -nq -c1 172.21.0.11 && [ $attempts -lt 5 ]; do
        sleep 1
        attempts=$[$attempts+1]
      done
      if [ $attempts == 5 ]; then
        # If we can't ping our Cobbler host, remove the DHCP config for this NIC.
        # It must either be on a non-routable network or has no reachable DHCP server.
        ifdown $nic
        rm -f /etc/sysconfig/network-scripts/ifcfg-$nic
        sed -i "/$nic/d" /etc/network/interfaces
        # Go back to bailing if anything fails bringing the next NIC up
        set -e
      else
        # We found our routable NIC!
        # Write our lockfile so this only gets run on firstboot
        touch /.cephlab_net_configured
        # Break out of the loop once we've found our routable NIC
        break
      fi
    else
      # Take the NIC back down if it's not connected
      ifconfig $nic down || ip link set $nic down
    fi
  done
fi
# Don't error out if the `ip` command returns rc 1
set +e
attempts=0
myips=""
until [ "$myips" != "" ] || [ $attempts -ge 10 ]; do
  myips=$(ip -4 addr | grep -oP '(?<=inet\s)\d+(\.\d+){3}' | grep -v '127.0.0.1\|127.0.1.1')
  attempts=$[$attempts+1]
  sleep 1
done
set -e
if [ -n "$myips" ]; then
  for ip in $myips; do
    if timeout 1s ping -I $ip -nq -c1 172.21.0.1 2>&1 >/dev/null; then
      newhostname=$(dig +short -x $ip @172.21.0.1 | sed 's/\.com.*/\.com/g')
      if [ -n "$newhostname" ]; then
        hostname $newhostname
        newdomain=$(hostname -d)
        shorthostname=$(hostname -s)
        echo $shorthostname > /etc/hostname
        if grep -q $newdomain /etc/hosts; then
          # Replace
          sed -i "s/.*$newdomain.*/$ip $newhostname $shorthostname/g" /etc/hosts
        else
          # Or add to top of file
          sed -i '1i'$ip' '$newhostname' '$shorthostname'\' /etc/hosts
        fi
      fi
      # Quit after first IP that can ping our nameserver
      # in the extremely unlikely event the testnode has two IPs
      break
    fi
  done
fi
# Regenerate SSH host keys on boot if needed
if command -v zypper &> /dev/null; then
  if [ ! -f /etc/ssh/ssh_host_rsa_key ]; then
    ssh-keygen -f /etc/ssh/ssh_host_rsa_key -N '' -t rsa
    systemctl restart sshd
  fi
elif command -v apt-get &>/dev/null; then
  if [ ! -f /etc/ssh/ssh_host_rsa_key ]; then
    dpkg-reconfigure openssh-server
  fi
fi
# Only run once.
if [ -e /.cephlab_rc_local ]; then
  exit 0
fi
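The address-detection step in the script can be tried on its own, which may help when checking why myips stays empty on the second boot. The following is only a sketch: extract_ipv4 and wait_for_ipv4 are hypothetical names, awk is swapped in for the script's grep -oP pipeline, and every 127.x address is skipped rather than just 127.0.0.1 and 127.0.1.1.

```shell
#!/bin/bash
# Hypothetical helper (not part of rc.local): print the non-loopback IPv4
# addresses found in `ip -4 addr` style output read from stdin.
extract_ipv4() {
    awk '$1 == "inet" {
        split($2, parts, "/")            # drop the /prefix length
        if (parts[1] !~ /^127\./) print parts[1]
    }'
}

# Hypothetical helper: poll for an address up to $1 times (default 10),
# one second apart, similar in spirit to the until-loop in rc.local.
wait_for_ipv4() {
    local max_attempts=${1:-10} attempts=0 ips=""
    while [ -z "$ips" ] && [ "$attempts" -lt "$max_attempts" ]; do
        ips=$(ip -4 addr 2>/dev/null | extract_ipv4)
        attempts=$((attempts + 1))
        [ -n "$ips" ] || sleep 1
    done
    # Succeeds and prints the addresses only if at least one was found
    [ -n "$ips" ] && printf '%s\n' "$ips"
}
```

Feeding extract_ipv4 the `ip a` output of a host with no configured NIC prints nothing, which matches the empty myips= lines in the second-boot console log.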
Updated by David Galloway about 4 years ago
I will lock a smithi manually and try to reproduce.
Updated by David Galloway about 4 years ago
So this has nothing to do with updating the kernel and has something to do with how networking is brought up in OpenSUSE.
rc.local writes a very basic ifcfg config file and it is present on smithi039 after I imaged with FOG, updated the kernel, and rebooted.
smithi039:~ # cat /etc/sysconfig/network/ifcfg-eth2
DEVICE=eth2
BOOTPROTO=dhcp
ONBOOT=yes
smithi039:~ # ifup eth2
wicked: /etc/sysconfig/network/routes[1]: Cannot create route - unable to find out address family
eth2 up
smithi039:~ # ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 0c:c4:7a:6b:fe:c6 brd ff:ff:ff:ff:ff:ff
3: eth1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 0c:c4:7a:6b:fe:c7 brd ff:ff:ff:ff:ff:ff
4: eth2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 0c:c4:7a:bd:13:70 brd ff:ff:ff:ff:ff:ff
    inet 172.21.15.39/20 brd 172.21.15.255 scope global eth2
       valid_lft forever preferred_lft forever
    inet6 fe80::ec4:7aff:febd:1370/64 scope link
       valid_lft forever preferred_lft forever
Is wicked the OpenSUSE equivalent of NetworkManager? How do we get that NIC to come up on boot?
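One thing that can be checked quickly: SUSE-style ifcfg files use a STARTMODE key to control activation at boot, while ONBOOT is the Red Hat network-scripts key, and as far as I know wicked treats an interface without STARTMODE as manual (brought up only by an explicit ifup). The sketch below reports what a given ifcfg file sets; ifcfg_startmode is a hypothetical helper name, and the parsing assumes simple KEY=value lines like the ones rc.local writes.

```shell
#!/bin/bash
# Hypothetical helper: print the STARTMODE value set in an ifcfg file,
# or "unset" when the key is absent. Assumes plain KEY=value lines.
ifcfg_startmode() {
    local file=$1 value
    # Take the last STARTMODE assignment and strip surrounding quotes
    value=$(awk -F= '$1 == "STARTMODE" { print $2 }' "$file" | tail -n 1 | tr -d "\"'")
    printf '%s\n' "${value:-unset}"
}
```

Run against the file shown above, for example `ifcfg_startmode /etc/sysconfig/network/ifcfg-eth2`, this would print unset, since that file carries only DEVICE, BOOTPROTO and ONBOOT.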
Updated by Nathan Cutler about 4 years ago
I'm not sure, but we could try changing this:
if command -v zypper &>/dev/null; then
  echo -e "DEVICE=$nic\nBOOTPROTO=dhcp\nONBOOT=yes" > /etc/sysconfig/network/ifcfg-$nic
to this:
if command -v zypper &>/dev/null; then
  echo -e "BOOTPROTO='dhcp'\nSTARTMODE='auto'" > /etc/sysconfig/network/ifcfg-$nic
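Because the network block in rc.local is guarded by the /.cephlab_net_configured lockfile, the ifcfg file is written only on first boot, so whatever lands there has to be enough for later boots too. A minimal sketch of the proposed write as a standalone function, so it can be tried against a scratch directory before touching a real image (write_suse_ifcfg and its optional directory argument are hypothetical, not part of rc.local):

```shell
#!/bin/bash
# Hypothetical helper: write a minimal wicked-style DHCP config for one NIC.
# The optional second argument exists only so the function can be tried in
# a scratch directory; it defaults to the SUSE config location used above.
write_suse_ifcfg() {
    local nic=$1 dir=${2:-/etc/sysconfig/network}
    printf "BOOTPROTO='dhcp'\nSTARTMODE='auto'\n" > "$dir/ifcfg-$nic"
}
```

For example, `write_suse_ifcfg eth2 /tmp/scratch` (with /tmp/scratch already created) leaves a two-line ifcfg-eth2 there for inspection.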
Updated by Nathan Cutler about 4 years ago
Is wicked the OpenSUSE equivalent of NetworkManager?
Possibly, "nmcli" is to RH as "wicked" is to SUSE.
Updated by David Galloway about 4 years ago
Nathan Cutler wrote:
I'm not sure, but we could try changing this:
[...]
to this:
[...]
I just manually tested this and it works.
Updated by Nathan Cutler about 4 years ago
- Status changed from Resolved to New
I tried it again, and though the machine does come back from reboot now (yay!), the network still appears to be broken:
2020-02-18T09:23:48.545 INFO:teuthology.orchestra.run.smithi022:> sudo shutdown -r now
2020-02-18T09:23:48.547 INFO:teuthology.misc:Re-opening connections...
2020-02-18T09:23:48.548 INFO:teuthology.misc:trying to connect to ubuntu@smithi022.front.sepia.ceph.com
2020-02-18T09:23:48.549 INFO:teuthology.orchestra.remote:Trying to reconnect to host
2020-02-18T09:23:48.550 DEBUG:teuthology.orchestra.connection:{'username': 'ubuntu', 'hostname': 'smithi022.front.sepia.ceph.com', 'timeout': 60}
2020-02-18T09:23:48.648 INFO:teuthology.orchestra.run.smithi022:> true
2020-02-18T09:23:48.725 DEBUG:teuthology.misc:waited 0.177124977112
2020-02-18T09:23:49.726 INFO:teuthology.task.kernel:Checking client mon.a for new kernel version...
2020-02-18T09:23:49.726 INFO:teuthology.task.kernel:Checking kernel version of mon.a, want "4.12.14-lp151.28.36.1.x86_64"...
2020-02-18T09:23:49.726 INFO:teuthology.orchestra.run.smithi022:> uname -r
So far, so good! Except somehow we get no output from the command, and after 17 minutes teuthology gives up:
2020-02-18T09:40:32.102 ERROR:paramiko.transport:Socket exception: No route to host (113)
2020-02-18T09:40:32.102 ERROR:teuthology.task.kernel:Saw exception
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_wip-kernel-distro-opensuse/teuthology/task/kernel.py", line 696, in wait_for_reboot
    assert not need_to_install(ctx, client, need_install[client]), \
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_wip-kernel-distro-opensuse/teuthology/task/kernel.py", line 184, in need_to_install
    stdout=uname_fp,
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_wip-kernel-distro-opensuse/teuthology/orchestra/cluster.py", line 64, in run
    return [remote.run(**kwargs) for remote in remotes]
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_wip-kernel-distro-opensuse/teuthology/orchestra/remote.py", line 198, in run
    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_wip-kernel-distro-opensuse/teuthology/orchestra/run.py", line 428, in run
    r.execute()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_wip-kernel-distro-opensuse/teuthology/orchestra/run.py", line 98, in execute
    self.client.exec_command(self.command)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_wip-kernel-distro-opensuse/virtualenv/local/lib/python2.7/site-packages/paramiko/client.py", line 508, in exec_command
    chan = self._transport.open_session(timeout=timeout)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_wip-kernel-distro-opensuse/virtualenv/local/lib/python2.7/site-packages/paramiko/transport.py", line 879, in open_session
    timeout=timeout,
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_wip-kernel-distro-opensuse/virtualenv/local/lib/python2.7/site-packages/paramiko/transport.py", line 1006, in open_channel
    raise e
error: [Errno 113] No route to host
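One possible reading of the log above is that the reconnect check (`true`, answered after about 0.18 s) reached the machine before the shutdown took effect, so the `uname -r` that followed was sent to a host about to go down. A way to tell "still the old boot" from "came back after reboot" is to compare the kernel's boot ID, which Linux exposes in /proc/sys/kernel/random/boot_id. This is only a sketch of that idea, not how teuthology handles it; wait_for_new_boot is a hypothetical helper, and the command that fetches the ID is passed in by the caller.

```shell
#!/bin/bash
# Hypothetical helper: wait until a command reports a boot ID different from
# the one recorded before the reboot.
#   $1 = boot ID seen before rebooting
#   $2 = maximum number of attempts
#   $3 = seconds to sleep between attempts
#   remaining arguments = command that prints the current boot ID
# Prints the new boot ID and returns 0, or returns 1 if attempts run out.
wait_for_new_boot() {
    local old_id=$1 max_attempts=$2 delay=$3 attempt=0 new_id
    shift 3
    while [ "$attempt" -lt "$max_attempts" ]; do
        # A failing command (host down, no route) simply counts as "not yet"
        if new_id=$("$@" 2>/dev/null) && [ -n "$new_id" ] && [ "$new_id" != "$old_id" ]; then
            printf '%s\n' "$new_id"
            return 0
        fi
        attempt=$((attempt + 1))
        sleep "$delay"
    done
    return 1
}
```

A caller would record the ID before issuing the reboot, for example with `ssh ubuntu@smithi022.front.sepia.ceph.com cat /proc/sys/kernel/random/boot_id`, and then pass the same ssh command to wait_for_new_boot. With the network broken after reboot, as in this bug, the helper would run out of attempts instead of passing on the first try.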
Updated by Nathan Cutler about 4 years ago
- Status changed from New to Resolved
Confirmed that this bug is fixed in the latest opensuse_15.1 FOG image - see https://github.com/ceph/teuthology/pull/1413#issuecomment-587805351 for details.
Thanks @David !