Bug #44158

closed

opensuse_15.1 machine in Sepia does not come back from reboot after installing new distro kernel

Added by Nathan Cutler about 4 years ago. Updated about 4 years ago.

Status:
Resolved
Priority:
Normal
Category:
Infrastructure Service
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature (v1):
Crash signature (v2):

Description

With the following teuthology PR I am testing teuthology-suite --kernel distro with os_type: opensuse and os_version: "15.1":

It gets all the way to reboot, but - frustratingly - the teuthology server cannot re-establish an SSH connection with it:

I suspect that the opensuse_15.1 FOG image has some special initrd/GRUB configuration that is getting clobbered by the new distro kernel.

Actions #1

Updated by David Galloway about 4 years ago

  • Project changed from teuthology to sepia
  • Category set to Infrastructure Service
  • Assignee set to David Galloway

So if you take a peek at the console log, the machine does come back up running the correct kernel but networking is broken.

http://qa-proxy.ceph.com/teuthology/smithfarm-2020-02-14_17:17:42-custom-master-distro-basic-smithi/4764079/console_logs/smithi205.log

Loading Linux 4.12.14-lp151.28.36-default ...

Loading initial ramdisk ...

[    1.191172] pstore: lzo_decompress error, ret = -6!
[    1.197383] pstore: decompression failed: -5
[    1.202902] pstore: lzo_decompress error, ret = -6!
[    1.209128] pstore: decompression failed: -5
[    1.214644] pstore: lzo_decompress error, ret = -6!
[    1.220872] pstore: decompression failed: -5
[    1.226396] pstore: lzo_decompress error, ret = -6!
[    1.232607] pstore: decompression failed: -5
[    1.238137] pstore: lzo_decompress error, ret = -6!
[    1.244332] pstore: decompression failed: -5
[    1.249847] pstore: lzo_decompress error, ret = -6!
[    1.256050] pstore: decompression failed: -5
+ '[' '!' -f /.cephlab_net_configured ']'
+ set +e
+ attempts=0
+ myips=
+ '[' '' '!=' '' ']'
+ '[' 0 -ge 10 ']'
++ ip -4 addr
++ grep -v '127.0.0.1\|127.0.1.1'
++ grep -oP '(?<=inet\s)\d+(\.\d+){3}'
+ myips=
+ attempts=1
+ sleep 1
+ '[' '' '!=' '' ']'
+ '[' 1 -ge 10 ']'
++ ip -4 addr
++ grep -oP '(?<=inet\s)\d+(\.\d+){3}'
++ grep -v '127.0.0.1\|127.0.1.1'
+ myips=
+ attempts=2
+ sleep 1
+ '[' '' '!=' '' ']'
+ '[' 2 -ge 10 ']'
++ ip -4 addr
++ grep -oP '(?<=inet\s)\d+(\.\d+){3}'
++ grep -v '127.0.0.1\|127.0.1.1'
+ myips=
+ attempts=3
+ sleep 1
+ '[' '' '!=' '' ']'
+ '[' 3 -ge 10 ']'
++ ip -4 addr
++ grep -oP '(?<=inet\s)\d+(\.\d+){3}'
++ grep -v '127.0.0.1\|127.0.1.1'
+ myips=
+ attempts=4
+ sleep 1
+ '[' '' '!=' '' ']'
+ '[' 4 -ge 10 ']'
++ ip -4 addr
++ grep -oP '(?<=inet\s)\d+(\.\d+){3}'
++ grep -v '127.0.0.1\|127.0.1.1'
+ myips=
+ attempts=5
+ sleep 1
+ '[' '' '!=' '' ']'
+ '[' 5 -ge 10 ']'
++ ip -4 addr
++ grep -oP '(?<=inet\s)\d+(\.\d+){3}'
++ grep -v '127.0.0.1\|127.0.1.1'
+ myips=
+ attempts=6
+ sleep 1
+ '[' '' '!=' '' ']'
+ '[' 6 -ge 10 ']'
++ ip -4 addr
++ grep -oP '(?<=inet\s)\d+(\.\d+){3}'
++ grep -v '127.0.0.1\|127.0.1.1'
+ myips=
+ attempts=7
+ sleep 1
+ '[' '' '!=' '' ']'
+ '[' 7 -ge 10 ']'
++ ip -4 addr
++ grep -oP '(?<=inet\s)\d+(\.\d+){3}'
++ grep -v '127.0.0.1\|127.0.1.1'
+ myips=
+ attempts=8
+ sleep 1
+ '[' '' '!=' '' ']'
+ '[' 8 -ge 10 ']'
++ ip -4 addr
++ grep -oP '(?<=inet\s)\d+(\.\d+){3}'
++ grep -v '127.0.0.1\|127.0.1.1'
+ myips=
+ attempts=9
+ sleep 1
+ '[' '' '!=' '' ']'
+ '[' 9 -ge 10 ']'
++ ip -4 addr
++ grep -oP '(?<=inet\s)\d+(\.\d+){3}'
++ grep -v '127.0.0.1\|127.0.1.1'
+ myips=
+ attempts=10
+ sleep 1
+ '[' '' '!=' '' ']'
+ '[' 10 -ge 10 ']'
+ set -e
+ '[' -n '' ']'
+ command -v zypper
+ '[' '!' -f /etc/ssh/ssh_host_rsa_key ']'
+ '[' -e /.cephlab_rc_local ']'
+ exit 0

Welcome to openSUSE Leap 15.1 - Kernel 4.12.14-lp151.28.36-default (ttyS1).

smithi205 login: 
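
In the second-boot trace above, the test for /.cephlab_net_configured is followed immediately by "set +e", which suggests the lock file already existed and the NIC-configuration block was skipped. A minimal sketch of that run-once guard, assuming a scratch lock file and a hypothetical configure_once function (neither name comes from rc.local):

```shell
#!/bin/bash
# Sketch of a run-once guard like the one at the top of rc.local.
# ASSUMPTION: the lock file lives in a scratch directory, not at
# /.cephlab_net_configured, so the sketch is safe to run anywhere.
scratch=$(mktemp -d)
lock="$scratch/net_configured"

# Hypothetical stand-in for the NIC-configuration block.
configure_once() {
  if [ ! -f "$lock" ]; then
    echo "configuring"
    touch "$lock"
  else
    echo "skipped"
  fi
}

first=$(configure_once)    # lock file absent: the block runs
second=$(configure_once)   # lock file present: the block is skipped
echo "first boot: $first, second boot: $second"
```

If that reading is right, the second boot depends entirely on the ifcfg file written during the first boot to bring the interface up.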

I'm not sure why /etc/rc.local is running again though. Here's rc.local output from the first time the machine boots using the FOG image.

Loading Linux 4.12.14-lp151.27-default ...

Loading initial ramdisk ...

[    1.270671] pstore: lzo_decompress error, ret = -6!
[    1.276935] pstore: decompression failed: -5
[    1.282507] pstore: lzo_decompress error, ret = -6!
[    1.288758] pstore: decompression failed: -5
[    1.294312] pstore: lzo_decompress error, ret = -6!
[    1.300587] pstore: decompression failed: -5
[    1.306158] pstore: lzo_decompress error, ret = -6!
[    1.312412] pstore: decompression failed: -5
[    1.317965] pstore: lzo_decompress error, ret = -6!
[    1.324216] pstore: decompression failed: -5
[    1.329767] pstore: lzo_decompress error, ret = -6!
[    1.336019] pstore: decompression failed: -5
+ '[' '!' -f /.cephlab_net_configured ']'
+ udevadm trigger
+ sleep 5
++ ls -1 /sys/class/net
++ grep -v lo
+ nics='eth0
eth1
eth2
eth3'
+ for nic in $nics
+ ifconfig eth0 up
/etc/init.d/boot.local: line 14: ifconfig: command not found
+ ip link set eth0 up
+ sleep 5
[  OK  ] Started wicked managed network interfaces.
[  OK  ] Reached target Network.
         Starting NTP client/server...
         Starting OpenSSH Daemon...
         Starting Permit User Sessions...
         Starting Load kdump kernel and initrd...
[  OK  ] Started Permit User Sessions.
[  OK  ] Started NTP client/server.
[  OK  ] Reached target System Time Synchronized.
[  OK  ] Started Backup of RPM database.
         Starting Postfix Mail Transport Agent...
[  OK  ] Started Balance block groups on a btrfs filesystem.
[  OK  ] Started Daily rotation of log files.
[  OK  ] Started Backup of /etc/sysconfig.
[  OK  ] Started Scrub btrfs filesystem, verify block checksums.
[  OK  ] Started Check if mainboard battery is Ok.
[  OK  ] Started Discard unused blocks once a week.
[  OK  ] Started Timeline of Snapper Snapshots.
[  OK  ] Reached target Timers.
         Starting Backup RPM database...
         Starting Backup /etc/sysconfig directory...
         Starting Check if mainboard battery is Ok...
         Starting Discard unused blocks on filesystems from /etc/fstab...
         Starting Rotate log files...
[  OK  ] Started Check if mainboard battery is Ok.
[  OK  ] Started Discard unused blocks on filesystems from /etc/fstab.
[  OK  ] Started Scrub btrfs filesystem, verify block checksums.
[  OK  ] Started Balance block groups on a btrfs filesystem.
[  OK  ] Started Rotate log files.
[  OK  ] Started Backup /etc/sysconfig directory.
[  OK  ] Started OpenSSH Daemon.
[  OK  ] Started Update cron periods from /etc/sysconfig/btrfsmaintenance.
[  OK  ] Started Postfix Mail Transport Agent.
[  OK  ] Started Command Scheduler.
+ grep -q 'Link detected: yes'
+ ethtool eth0
+ ifconfig eth0 down
/etc/init.d/boot.local: line 53: ifconfig: command not found
+ ip link set eth0 down
+ for nic in $nics
+ ifconfig eth1 up
/etc/init.d/boot.local: line 14: ifconfig: command not found
+ ip link set eth1 up
+ sleep 5
[  OK  ] Started Backup RPM database.
+ ethtool eth1
+ grep -q 'Link detected: yes'
+ ifconfig eth1 down
/etc/init.d/boot.local: line 53: ifconfig: command not found
+ ip link set eth1 down
+ for nic in $nics
+ ifconfig eth2 up
/etc/init.d/boot.local: line 14: ifconfig: command not found
+ ip link set eth2 up
+ sleep 5
+ ethtool eth2
+ grep -q 'Link detected: yes'
+ ifconfig eth2 down
/etc/init.d/boot.local: line 53: ifconfig: command not found
+ ip link set eth2 down
+ for nic in $nics
+ ifconfig eth3 up
/etc/init.d/boot.local: line 14: ifconfig: command not found
+ ip link set eth3 up
+ sleep 5
[  OK  ] Started Load kdump kernel and initrd.
+ ethtool eth3
+ grep -q 'Link detected: yes'
+ command -v zypper
+ echo -e 'DEVICE=eth3\nBOOTPROTO=dhcp\nONBOOT=yes'
+ set +e
+ ifdown eth3
wicked: skipping eth3 interface: device is not configured by wicked yet
wicked: ifdown: no matching interfaces
+ ifup eth3
wicked: /etc/sysconfig/network/routes[1]: Cannot create route - unable to find out address family
eth3            up
+ attempts=0
+ ping -I eth3 -nq -c1 172.21.0.11
PING 172.21.0.11 (172.21.0.11) from 172.21.15.205 eth3: 56(84) bytes of data.

--- 172.21.0.11 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.545/0.545/0.545/0.000 ms
+ '[' 0 == 5 ']'
+ touch /.cephlab_net_configured
+ break
+ set +e
+ attempts=0
+ myips=
+ '[' '' '!=' '' ']'
+ '[' 0 -ge 10 ']'
++ ip -4 addr
++ grep -oP '(?<=inet\s)\d+(\.\d+){3}'
++ grep -v '127.0.0.1\|127.0.1.1'
+ myips=172.21.15.205
+ attempts=1
+ sleep 1
+ '[' 172.21.15.205 '!=' '' ']'
+ set -e
+ '[' -n 172.21.15.205 ']'
+ for ip in $myips
+ timeout 1s ping -I 172.21.15.205 -nq -c1 172.21.0.1
++ dig +short -x 172.21.15.205 @172.21.0.1
++ sed 's/\.com.*/\.com/g'
+ newhostname=smithi205.front.sepia.ceph.com
+ '[' -n smithi205.front.sepia.ceph.com ']'
+ hostname smithi205.front.sepia.ceph.com
++ hostname -d
+ newdomain=front.sepia.ceph.com
++ hostname -s
+ shorthostname=smithi205
+ echo smithi205
+ grep -q front.sepia.ceph.com /etc/hosts
+ sed -i 's/.*front.sepia.ceph.com.*/172.21.15.205 smithi205.front.sepia.ceph.com smithi205/g' /etc/hosts
+ break
+ command -v zypper
+ '[' '!' -f /etc/ssh/ssh_host_rsa_key ']'
+ '[' -e /.cephlab_rc_local ']'
+ exit 0
[  OK  ] Started /etc/init.d/boot.local Compatibility.
[  OK  ] Started Getty on tty1.
[  OK  ] Started Serial Getty on ttyS1.
[  OK  ] Reached target Login Prompts.
[  OK  ] Reached target Multi-User System.
         Starting Update UTMP about System Runlevel Changes...
[  OK  ] Started Update UTMP about System Runlevel Changes.

Welcome to openSUSE Leap 15.1 - Kernel 4.12.14-lp151.27-default (ttyS1).

smithi205 login: 

And here's rc.local:

#!/bin/bash
# Redirect rc.local output to our console so it's in teuthology console logs
exec 2> /dev/ttyS1
exec 1>&2
set -ex

if [ ! -f /.cephlab_net_configured ]; then
  nics=$(ls -1 /sys/class/net | grep -v lo)

  for nic in $nics; do
    # Bring the NIC up so we can detect if a link is present
    ifconfig $nic up || ip link set $nic up
    # Sleep for a bit to let the NIC come up
    sleep 5
    if ethtool $nic | grep -q "Link detected: yes"; then
      if command -v zypper &>/dev/null; then
        echo -e "DEVICE=$nic\nBOOTPROTO=dhcp\nONBOOT=yes" > /etc/sysconfig/network/ifcfg-$nic
      elif command -v apt-get &>/dev/null; then
        echo -e "auto lo\niface lo inet loopback\n\nauto $nic\niface $nic inet dhcp" > /etc/network/interfaces
      else
        echo -e "DEVICE=$nic\nBOOTPROTO=dhcp\nONBOOT=yes" > /etc/sysconfig/network-scripts/ifcfg-$nic
      fi
      # Don't bail if NIC fails to go down or come up
      set +e
      # Bounce the NIC so it gets a DHCP address
      ifdown $nic
      ifup $nic
      attempts=0
      # Try for 5 seconds to ping our Cobbler host
      while ! ping -I $nic -nq -c1 172.21.0.11 && [ $attempts -lt 5 ]; do
        sleep 1
        attempts=$[$attempts+1]
      done
      if [ $attempts == 5 ]; then
        # If we can't ping our Cobbler host, remove the DHCP config for this NIC.
        # It must either be on a non-routable network or has no reachable DHCP server.
        ifdown $nic
        rm -f /etc/sysconfig/network-scripts/ifcfg-$nic
        sed -i "/$nic/d" /etc/network/interfaces
        # Go back to bailing if anything fails bringing the next NIC up
        set -e
      else
        # We found our routable NIC!
        # Write our lockfile so this only gets run on firstboot
        touch /.cephlab_net_configured
        # Break out of the loop once we've found our routable NIC
        break
      fi
    else
      # Take the NIC back down if it's not connected
      ifconfig $nic down || ip link set $nic down
    fi
  done
fi

# Don't error out if the `ip` command returns rc 1
set +e

attempts=0
myips="" 
until [ "$myips" != "" ] || [ $attempts -ge 10 ]; do
  myips=$(ip -4 addr | grep -oP '(?<=inet\s)\d+(\.\d+){3}' | grep -v '127.0.0.1\|127.0.1.1')
  attempts=$[$attempts+1]
  sleep 1
done

set -e

if [ -n "$myips" ]; then
  for ip in $myips; do
    if timeout 1s ping -I $ip -nq -c1 172.21.0.1 2>&1 >/dev/null; then
      newhostname=$(dig +short -x $ip @172.21.0.1 | sed 's/\.com.*/\.com/g')
        if [ -n "$newhostname" ]; then
          hostname $newhostname
          newdomain=$(hostname -d)
          shorthostname=$(hostname -s)
          echo $shorthostname > /etc/hostname
          if grep -q $newdomain /etc/hosts; then
            # Replace
            sed -i "s/.*$newdomain.*/$ip $newhostname $shorthostname/g" /etc/hosts
          else
            # Or add to top of file
            sed -i '1i'$ip' '$newhostname' '$shorthostname'\' /etc/hosts
          fi
        fi
    # Quit after first IP that can ping our nameserver
    # in the extremely unlikely event the testnode has two IPs
    break
    fi
  done
fi

# Regenerate SSH host keys on boot if needed
if command -v zypper &> /dev/null; then
  if [ ! -f /etc/ssh/ssh_host_rsa_key ]; then
    ssh-keygen -f /etc/ssh/ssh_host_rsa_key -N '' -t rsa
    systemctl restart sshd
  fi
elif command -v apt-get &>/dev/null; then
  if [ ! -f /etc/ssh/ssh_host_rsa_key ]; then
     dpkg-reconfigure openssh-server
  fi
fi

# Only run once.
if [ -e /.cephlab_rc_local ]; then
    exit 0
fi
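
The address-polling loop in the script above can be tried in isolation. The sketch below runs the same grep pipeline against canned "ip -4 addr"-style text instead of a live interface; the extract_ipv4 function name and both sample inputs are assumptions made for illustration, and GNU grep with -P support is assumed.

```shell
#!/bin/bash
# Hypothetical helper wrapping the pipeline rc.local uses to collect
# non-loopback IPv4 addresses. Reads `ip -4 addr`-style text on stdin.
extract_ipv4() {
  grep -oP '(?<=inet\s)\d+(\.\d+){3}' | grep -v '127.0.0.1\|127.0.1.1'
}

# ASSUMPTION: trimmed sample text standing in for real `ip -4 addr` output.
sample_up='1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536
    inet 127.0.0.1/8 scope host lo
4: eth2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500
    inet 172.21.15.39/20 brd 172.21.15.255 scope global eth2'

sample_down='1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536
    inet 127.0.0.1/8 scope host lo'

# With a configured NIC the pipeline yields its address.
ips_up=$(printf '%s\n' "$sample_up" | extract_ipv4)

# With only loopback configured, the final grep selects nothing and
# exits non-zero, so the result is empty (hence `|| true` here).
ips_down=$(printf '%s\n' "$sample_down" | extract_ipv4 || true)

echo "up: '$ips_up' down: '$ips_down'"
```

The empty result in the loopback-only case corresponds to the repeated "myips=" lines in the second-boot console log, and the non-zero exit is presumably why the script switches to "set +e" around the loop.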

Actions #2

Updated by David Galloway about 4 years ago

I will lock a smithi manually and try to reproduce.

Actions #3

Updated by David Galloway about 4 years ago

So this has nothing to do with updating the kernel; it has to do with how networking is brought up in openSUSE.

rc.local writes a very basic ifcfg config file and it is present on smithi039 after I imaged with FOG, updated the kernel, and rebooted.

smithi039:~ # cat /etc/sysconfig/network/ifcfg-eth2
DEVICE=eth2
BOOTPROTO=dhcp
ONBOOT=yes

smithi039:~ # ifup eth2
wicked: /etc/sysconfig/network/routes[1]: Cannot create route - unable to find out address family
eth2            up

smithi039:~ # ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 0c:c4:7a:6b:fe:c6 brd ff:ff:ff:ff:ff:ff
3: eth1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 0c:c4:7a:6b:fe:c7 brd ff:ff:ff:ff:ff:ff
4: eth2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 0c:c4:7a:bd:13:70 brd ff:ff:ff:ff:ff:ff
    inet 172.21.15.39/20 brd 172.21.15.255 scope global eth2
       valid_lft forever preferred_lft forever
    inet6 fe80::ec4:7aff:febd:1370/64 scope link 
       valid_lft forever preferred_lft forever

Is wicked the OpenSUSE equivalent of NetworkManager? How do we get that NIC to come up on boot?

Actions #4

Updated by Nathan Cutler about 4 years ago

I'm not sure, but we could try changing this:

      if command -v zypper &>/dev/null; then
        echo -e "DEVICE=$nic\nBOOTPROTO=dhcp\nONBOOT=yes" > /etc/sysconfig/network/ifcfg-$nic

to this:

      if command -v zypper &>/dev/null; then
        echo -e "BOOTPROTO='dhcp'\nSTARTMODE='auto'" > /etc/sysconfig/network/ifcfg-$nic
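
A minimal sketch of the proposed change that can be run without touching /etc: it writes the suggested two-line ifcfg file into a scratch directory. The write_suse_ifcfg function name and its directory parameter are assumptions for illustration; the real change would go into the rc.local template and is not reproduced here.

```shell
#!/bin/bash
# Hypothetical helper: write the proposed wicked-style ifcfg file for
# one NIC. The directory is a parameter (defaulting to the openSUSE
# path used in rc.local) so the sketch can target a scratch directory.
write_suse_ifcfg() {
  local nic=$1
  local dir=${2:-/etc/sysconfig/network}
  printf "BOOTPROTO='dhcp'\nSTARTMODE='auto'\n" > "$dir/ifcfg-$nic"
}

# ASSUMPTION: eth2 is just an example NIC name taken from this thread.
scratch=$(mktemp -d)
write_suse_ifcfg eth2 "$scratch"
cat "$scratch/ifcfg-eth2"
```

Compared with the original file, this drops DEVICE and ONBOOT and sets STARTMODE instead, which appears to be the setting wicked consults when deciding whether to bring an interface up at boot.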

Actions #5

Updated by Nathan Cutler about 4 years ago

Is wicked the OpenSUSE equivalent of NetworkManager?

Possibly, "nmcli" is to RH as "wicked" is to SUSE.

Actions #6

Updated by David Galloway about 4 years ago

Nathan Cutler wrote:

I'm not sure, but we could try changing this:

[...]

to this:

[...]

I just manually tested this and it works.

https://github.com/ceph/ceph-cm-ansible/pull/530

Actions #7

Updated by David Galloway about 4 years ago

  • Status changed from New to Resolved
Actions #8

Updated by Nathan Cutler about 4 years ago

  • Status changed from Resolved to New

I tried it again, and though the machine does come back from reboot now (yay!), the network still appears to be broken:

http://qa-proxy.ceph.com/teuthology/smithfarm-2020-02-18_08:54:15-custom-master-distro-basic-smithi/4777255/teuthology.log

2020-02-18T09:23:48.545 INFO:teuthology.orchestra.run.smithi022:> sudo shutdown -r now
2020-02-18T09:23:48.547 INFO:teuthology.misc:Re-opening connections...
2020-02-18T09:23:48.548 INFO:teuthology.misc:trying to connect to ubuntu@smithi022.front.sepia.ceph.com
2020-02-18T09:23:48.549 INFO:teuthology.orchestra.remote:Trying to reconnect to host
2020-02-18T09:23:48.550 DEBUG:teuthology.orchestra.connection:{'username': 'ubuntu', 'hostname': 'smithi022.front.sepia.ceph.com', 'timeout': 60}
2020-02-18T09:23:48.648 INFO:teuthology.orchestra.run.smithi022:> true
2020-02-18T09:23:48.725 DEBUG:teuthology.misc:waited 0.177124977112
2020-02-18T09:23:49.726 INFO:teuthology.task.kernel:Checking client mon.a for new kernel version...
2020-02-18T09:23:49.726 INFO:teuthology.task.kernel:Checking kernel version of mon.a, want "4.12.14-lp151.28.36.1.x86_64"...
2020-02-18T09:23:49.726 INFO:teuthology.orchestra.run.smithi022:> uname -r

So far, so good! Except somehow we get no output from the command, and after 17 minutes teuthology gives up:

2020-02-18T09:40:32.102 ERROR:paramiko.transport:Socket exception: No route to host (113)
2020-02-18T09:40:32.102 ERROR:teuthology.task.kernel:Saw exception
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_wip-kernel-distro-opensuse/teuthology/task/kernel.py", line 696, in wait_for_reboot
    assert not need_to_install(ctx, client, need_install[client]), \
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_wip-kernel-distro-opensuse/teuthology/task/kernel.py", line 184, in need_to_install
    stdout=uname_fp,
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_wip-kernel-distro-opensuse/teuthology/orchestra/cluster.py", line 64, in run
    return [remote.run(**kwargs) for remote in remotes]
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_wip-kernel-distro-opensuse/teuthology/orchestra/remote.py", line 198, in run
    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_wip-kernel-distro-opensuse/teuthology/orchestra/run.py", line 428, in run
    r.execute()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_wip-kernel-distro-opensuse/teuthology/orchestra/run.py", line 98, in execute
    self.client.exec_command(self.command)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_wip-kernel-distro-opensuse/virtualenv/local/lib/python2.7/site-packages/paramiko/client.py", line 508, in exec_command
    chan = self._transport.open_session(timeout=timeout)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_wip-kernel-distro-opensuse/virtualenv/local/lib/python2.7/site-packages/paramiko/transport.py", line 879, in open_session
    timeout=timeout,
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_wip-kernel-distro-opensuse/virtualenv/local/lib/python2.7/site-packages/paramiko/transport.py", line 1006, in open_channel
    raise e
error: [Errno 113] No route to host

Actions #9

Updated by Nathan Cutler about 4 years ago

  • Status changed from New to Resolved

Confirmed that this bug is fixed in the latest opensuse_15.1 FOG image - see https://github.com/ceph/teuthology/pull/1413#issuecomment-587805351 for details.

Thanks, @David!
