Project

General

Profile

Actions

Bug #61159

open

AssertionError: failed to install new kernel version within timeout

Added by Laura Flores 12 months ago. Updated 30 days ago.

Status:
New
Priority:
Normal
Assignee:
Category:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature (v1):
Crash signature (v2):

Description

/a/yuriw-2023-05-11_15:01:38-rados-wip-yuri8-testing-2023-05-10-1402-distro-default-smithi/7271297

2023-05-11T19:35:34.582 DEBUG:teuthology.task.kernel:Distro of this test job: rhel
2023-05-11T19:35:34.582 ERROR:teuthology.task.kernel:Saw exception
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_teuthology_69caab8ec6f44c61c991ae693e8e7b8cd859a896/teuthology/task/kernel.py", line 710, in wait_for_reboot
    assert not need_to_install(ctx, client, need_install[client]), \
AssertionError: failed to install new kernel version within timeout


Related issues 5 (1 open4 closed)

Related to teuthology - Bug #43335: AssertionError: failed to install new kernel version within timeoutResolved

Actions
Related to teuthology - Bug #38802: "failed to install new kernel version within timeout" on mimicResolvedDan Mick03/18/2019

Actions
Related to teuthology - Bug #18701: failed to install new kernel versionNew01/27/2017

Actions
Related to teuthology - Bug #12872: failed to install new kernel version within timeoutResolvedAndrew Schoen08/31/2015

Actions
Related to teuthology - Bug #11957: 'AssertionError: failed to install new kernel version within timeout'ResolvedIlya Dryomov06/10/2015

Actions
Actions #2

Updated by Laura Flores 12 months ago

  • Related to Bug #43335: AssertionError: failed to install new kernel version within timeout added
Actions #3

Updated by Laura Flores 12 months ago

  • Related to Bug #38802: "failed to install new kernel version within timeout" on mimic added
Actions #4

Updated by Laura Flores 12 months ago

  • Related to Bug #18701: failed to install new kernel version added
Actions #5

Updated by Laura Flores 12 months ago

  • Related to Bug #12872: failed to install new kernel version within timeout added
Actions #6

Updated by Laura Flores 12 months ago

  • Related to Bug #11957: 'AssertionError: failed to install new kernel version within timeout' added
Actions #7

Updated by Laura Flores 12 months ago

Laura Flores wrote:

/a/yuriw-2023-05-11_15:01:38-rados-wip-yuri8-testing-2023-05-10-1402-distro-default-smithi/7271297
[...]

Meant to include more of the log:

2023-05-11T19:35:34.563 INFO:teuthology.orchestra.remote:Successfully reconnected to host 'ubuntu@smithi103.front.sepia.ceph.com'
2023-05-11T19:35:34.564 INFO:teuthology.task.kernel:Checking kernel version of mon.b, want "4.18.0-372.32.1.el8_6.x86_64"...
2023-05-11T19:35:34.564 DEBUG:teuthology.orchestra.run.smithi103:> uname -r
2023-05-11T19:35:34.581 INFO:teuthology.orchestra.run.smithi103.stdout:4.18.0-372.9.1.el8.x86_64
2023-05-11T19:35:34.582 DEBUG:teuthology.task.kernel:current kernel version is 4.18.0-372.9.1.el8.x86_64 vs 4.18.0-372.32.1.el8_6.x86_64
2023-05-11T19:35:34.582 DEBUG:teuthology.task.kernel:Distro of this test job: rhel
2023-05-11T19:35:34.582 ERROR:teuthology.task.kernel:Saw exception
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_teuthology_69caab8ec6f44c61c991ae693e8e7b8cd859a896/teuthology/task/kernel.py", line 710, in wait_for_reboot
    assert not need_to_install(ctx, client, need_install[client]), \
AssertionError: failed to install new kernel version within timeout

Actions #8

Updated by Ilya Dryomov 12 months ago

http://qa-proxy.ceph.com/teuthology/yuriw-2023-05-11_15:01:38-rados-wip-yuri8-testing-2023-05-10-1402-distro-default-smithi/7271319/teuthology.log

2023-05-11T19:38:58.798 DEBUG:teuthology.task.kernel:get_latest_image_version_rpm: 4.18.0-372.32.1.el8_6.x86_64
2023-05-11T19:38:58.798 INFO:teuthology.task.kernel:Not newest distro kernel. Current: 4.18.0-372.9.1.el8.x86_64 Expected: 4.18.0-372.32.1.el8_6.x86_64
2023-05-11T19:38:58.798 INFO:teuthology.task.kernel:Skipping firmware on distro kernel
2023-05-11T19:38:58.798 DEBUG:teuthology.task.kernel:src is distro, skipping download
2023-05-11T19:38:58.798 INFO:teuthology.task.kernel:Installing distro kernel on mon.a...
2023-05-11T19:38:58.799 DEBUG:teuthology.task.kernel:install_kernel(remote=ubuntu@smithi189.front.sepia.ceph.com, path=None, version=distro)
2023-05-11T19:38:58.799 DEBUG:teuthology.orchestra.run.smithi189:> rpm -q kernel --last | head -n 1
2023-05-11T19:38:58.936 INFO:teuthology.orchestra.run.smithi189.stdout:kernel-4.18.0-372.32.1.el8_6.x86_64           Thu 11 May 2023 07:38:43 PM UTC
2023-05-11T19:38:58.938 DEBUG:teuthology.task.kernel:get_latest_image_version_rpm: 4.18.0-372.32.1.el8_6.x86_64
2023-05-11T19:38:58.938 DEBUG:teuthology.orchestra.run.smithi189:> sudo rpm -qi grub2-tools
2023-05-11T19:38:58.970 INFO:teuthology.orchestra.run.smithi189.stdout:Name        : grub2-tools
2023-05-11T19:38:58.970 INFO:teuthology.orchestra.run.smithi189.stdout:Epoch       : 1
2023-05-11T19:38:58.971 INFO:teuthology.orchestra.run.smithi189.stdout:Version     : 2.02
2023-05-11T19:38:58.971 INFO:teuthology.orchestra.run.smithi189.stdout:Release     : 123.el8
2023-05-11T19:38:58.971 INFO:teuthology.orchestra.run.smithi189.stdout:Architecture: x86_64
2023-05-11T19:38:58.971 INFO:teuthology.orchestra.run.smithi189.stdout:Install Date: Thu 14 Jul 2022 06:22:19 PM UTC
2023-05-11T19:38:58.971 INFO:teuthology.orchestra.run.smithi189.stdout:Group       : System Environment/Base
2023-05-11T19:38:58.971 INFO:teuthology.orchestra.run.smithi189.stdout:Size        : 9265450
2023-05-11T19:38:58.971 INFO:teuthology.orchestra.run.smithi189.stdout:License     : GPLv3+
2023-05-11T19:38:58.972 INFO:teuthology.orchestra.run.smithi189.stdout:Signature   : RSA/SHA256, Thu 31 Mar 2022 07:24:35 PM UTC, Key ID 199e2f91fd431d51
2023-05-11T19:38:58.972 INFO:teuthology.orchestra.run.smithi189.stdout:Source RPM  : grub2-2.02-123.el8.src.rpm
2023-05-11T19:38:58.972 INFO:teuthology.orchestra.run.smithi189.stdout:Build Date  : Mon 28 Mar 2022 02:12:09 PM UTC
2023-05-11T19:38:58.972 INFO:teuthology.orchestra.run.smithi189.stdout:Build Host  : x86-vm-09.build.eng.bos.redhat.com
2023-05-11T19:38:58.972 INFO:teuthology.orchestra.run.smithi189.stdout:Relocations : (not relocatable)
2023-05-11T19:38:58.972 INFO:teuthology.orchestra.run.smithi189.stdout:Packager    : Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>
2023-05-11T19:38:58.972 INFO:teuthology.orchestra.run.smithi189.stdout:Vendor      : Red Hat, Inc.
2023-05-11T19:38:58.973 INFO:teuthology.orchestra.run.smithi189.stdout:URL         : http://www.gnu.org/software/grub/
2023-05-11T19:38:58.973 INFO:teuthology.orchestra.run.smithi189.stdout:Summary     : Support tools for GRUB.
2023-05-11T19:38:58.973 INFO:teuthology.orchestra.run.smithi189.stdout:Description :
2023-05-11T19:38:58.973 INFO:teuthology.orchestra.run.smithi189.stdout:
2023-05-11T19:38:58.973 INFO:teuthology.orchestra.run.smithi189.stdout:The GRand Unified Bootloader (GRUB) is a highly configurable and
2023-05-11T19:38:58.973 INFO:teuthology.orchestra.run.smithi189.stdout:customizable bootloader with modular architecture.  It supports a rich
2023-05-11T19:38:58.973 INFO:teuthology.orchestra.run.smithi189.stdout:variety of kernel formats, file systems, computer architectures and
2023-05-11T19:38:58.974 INFO:teuthology.orchestra.run.smithi189.stdout:hardware devices.
2023-05-11T19:38:58.974 INFO:teuthology.orchestra.run.smithi189.stdout:
2023-05-11T19:38:58.974 INFO:teuthology.orchestra.run.smithi189.stdout:This subpackage provides tools for support of all platforms.
2023-05-11T19:38:58.975 INFO:teuthology.task.kernel:Updating Grub Version: grub2
2023-05-11T19:38:58.975 INFO:teuthology.task.kernel:Updating grub on smithi189 to boot 4.18.0-372.32.1.el8_6.x86_64
2023-05-11T19:38:58.975 DEBUG:teuthology.orchestra.run.smithi189:> sudo grub2-mkconfig -o /boot/grub2/grub.cfg
2023-05-11T19:39:00.548 INFO:teuthology.orchestra.run.smithi189.stderr:Generating grub configuration file ...
2023-05-11T19:39:02.380 INFO:teuthology.orchestra.run.smithi189.stderr:done
2023-05-11T19:39:02.381 DEBUG:teuthology.orchestra.run.smithi189:> mktemp
2023-05-11T19:39:02.398 INFO:teuthology.orchestra.run.smithi189.stdout:/tmp/tmp.7krGLTVuFj
2023-05-11T19:39:02.399 DEBUG:teuthology.orchestra.run.smithi189:> sudo cp /boot/grub2/grub.cfg /tmp/tmp.7krGLTVuFj
2023-05-11T19:39:02.481 DEBUG:teuthology.orchestra.run.smithi189:> sudo chmod 0666 /tmp/tmp.7krGLTVuFj
2023-05-11T19:39:02.641 DEBUG:teuthology.orchestra.remote:smithi189:/tmp/tmp.7krGLTVuFj is 6KB
2023-05-11T19:39:02.689 DEBUG:teuthology.orchestra.run.smithi189:> rm -fr /tmp/tmp.7krGLTVuFj
2023-05-11T19:39:02.705 DEBUG:teuthology.orchestra.run.smithi189:> sudo /bin/ls /boot/loader/entries || true
2023-05-11T19:39:02.794 INFO:teuthology.orchestra.run.smithi189.stdout:5a84457b7b52416e95d47d02e2e4a60f-0-rescue.conf
2023-05-11T19:39:02.795 INFO:teuthology.orchestra.run.smithi189.stdout:5a84457b7b52416e95d47d02e2e4a60f-4.18.0-372.9.1.el8.x86_64.conf
2023-05-11T19:39:02.796 WARNING:teuthology.task.kernel:Unable to update grub2 order
2023-05-11T19:41:08.578 INFO:teuthology.task.kernel:Checking kernel version of mon.a, want "4.18.0-372.32.1.el8_6.x86_64"...
2023-05-11T19:41:08.578 DEBUG:teuthology.orchestra.run.smithi189:> uname -r
2023-05-11T19:41:08.596 INFO:teuthology.orchestra.run.smithi189.stdout:4.18.0-372.9.1.el8.x86_64
2023-05-11T19:41:08.596 DEBUG:teuthology.task.kernel:current kernel version is 4.18.0-372.9.1.el8.x86_64 vs 4.18.0-372.32.1.el8_6.x86_64

I think the "Unable to update grub2 order" error is the culprit. The node gets rebooted into the same kernel (one it was running before the reboot) and the new kernel version check just does its job...

Actions #9

Updated by Laura Flores 12 months ago

Thanks Ilya! Looks like the teuthology task is failing around here:
https://github.com/ceph/teuthology/blob/69caab8ec6f44c61c991ae693e8e7b8cd859a896/teuthology/task/kernel.py#L942-L961

    if '\nmenuentry ' not in grub2conf:
        # okay, do the newer (el8) grub2 thing
        grub2conf = remote.sh('sudo /bin/ls /boot/loader/entries || true')
        entry = None
        for line in grub2conf.split(null):
            if line.endswith('.conf') and newversion in line:
                entry = line[:-5]  # drop .conf suffix
                break
    else:
        # do old menuitem counting thing
        for line in grub2conf.split(null):
            if line.startswith('menuentry '):
                if newversion in line:
                    break
                entry_num += 1
        entry = str(entry_num)
    if entry is None:
        log.warning('Unable to update grub2 order')
    else:
        remote.run(args=['sudo', grubset, entry])

Actions #10

Updated by Dan Mick 12 months ago

Looks like "unable to update" refers to the case where the grub2 menu update code can't find the requested kernel to set as the default kernel (as though the kernel install failed)

Actions #11

Updated by Dan Mick 12 months ago

Turn over a rock...

The fundamental problem is that, while the kernel package was installed, its postinstall script that updates /boot/loader/entries, which is what builds the grub menu, was silently failing. It turns out it was silently failing because it demands to have a machine-id to use for the filename in /boot/loader/entries (for some reason; I don't know how there could be any confusion about which machine you're booting, but, there you go), and code added in January to ceph-cm-ansible by https://github.com/ceph/ceph-cm-ansible/pull/717 was deleting and not restoring /etc/machine-id. So fix 1 is recreating /etc/machine-id (and /var/lib/dbus/machine-id) so that the kernel package install works.

Fix 2 is that when more than one kernel package exists, the "rpm -q --last kernel" command is used to identify the newest kernel (based on installation time), but since the packages may well have just been reinstalled by "yum reinstall kernel", there's no guarantee which package was installed first. What we want is the latest version kernel, which we can get with rpm -q kernel | sort -V.

However, a newly-provisioned RHEL machine cannot install kernel packages because it hasn't yet subscribed to a package server (like our local Satellite instance). So the ansible roles have to run on the host before they can successfully install a new package. However, the ansible roles aren't run before the kernel task executes.

So far, I haven't been able to untangle what ought to be happening to resolve the ordering above.

Actions #12

Updated by Dan Mick 12 months ago

  • Assignee set to Dan Mick
Actions #13

Updated by Dan Mick 12 months ago

A bit more info:

The situation only occurs when 1) a RHEL machine is able to get at its subscription-manager repo, and 2) there is a later kernel package available. This isn't normally the case; it was occurring because of some kind of confusion in the provisioning process that allowed a host to get past the lock operation without actually reimaging. In one case the job was requesting ubuntu, but it got a host running RHEL 8.6 from a past job.

We're looking at making the lock operation validate the requested distro/version, but this is a "shouldn't happen" path. There's still a thought that there's something not quite right with our reimaging paths; failures are happening and it's not yet clear why, and this was fallout from a particularly-specific sort of failure.

Actions #14

Updated by Dan Mick 11 months ago

https://github.com/ceph/teuthology/pull/1843 addresses the "wrong sorting of rpm -q results" aspect.

Actions #15

Updated by Dan Mick 11 months ago

https://github.com/ceph/teuthology/pull/1851 handles the "verify installed OS version". We've had one known failure so far.

Actions #16

Updated by Laura Flores 30 days ago

/a/teuthology-2024-03-29_02:08:11-upgrade-squid-distro-default-smithi/7629101

Actions

Also available in: Atom PDF