Bug #62050

High volume of reimage failures in teuthology suites

Added by Laura Flores 9 months ago. Updated 8 months ago.

Status:
New
Priority:
High
Assignee:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature (v1):
Crash signature (v2):

Description

Description of problem

There is a high volume of reimaging failures in teuthology suites. For instance: "Error reimaging machines: Expected smithi172's OS to be ubuntu 20.04 but found centos 8" and "Error reimaging machines: Cannot connect to remote host smithi097".

How to reproduce

Run some jobs in a teuthology suite (rados, rbd, etc.).

Time noticed / Event Window (in UTC)

First reported by Yuri on July 14th, 7:21 AM UTC. The failures have been happening in intermittent time windows since.

In this panel, from July 14th to now, you can see spikes in reimage failures:
https://grafana-route-grafana.apps.os.sepia.ceph.com/d/teuthology/teuthology?orgId=1&from=1689292800000&to=now&refresh=1m&viewPanel=31

Actual results

High volume of reimage failures.

Expected results

Little to no reimage failures.

Additional analysis

I'm seeing a correlation between reimage failures and load spikes (see the attached graphs).
Actions #1

Updated by Laura Flores 9 months ago

  • Copied from Bug #62049: High volume of reimage failures in teuthology suites added
Actions #2

Updated by Laura Flores 9 months ago

  • Copied from deleted (Bug #62049: High volume of reimage failures in teuthology suites)
Actions #3

Updated by Laura Flores 9 months ago

Linking a recent rados baseline as an example:
https://pulpito.ceph.com/yuriw-2023-07-14_23:46:34-rados-reef-distro-default-smithi/

63/307 jobs (20%) were affected by reimage failures.

Actions #5

Updated by Zack Cerza 9 months ago

More reimaging failures today. In this job, IPMI appeared to fail completely for one of the nodes:
http://qa-proxy.ceph.com/teuthology/yuriw-2023-07-27_03:42:08-fs-wip-yuri10-testing-2023-07-26-0858-pacific-distro-default-smithi/7354256/supervisor.7354256.log

2023-07-27T12:08:19.228 DEBUG:teuthology.orchestra.console:pexpect command: ipmitool -H smithi169.ipmi.sepia.ceph.com -I lanplus -U inktank -P ApGNXcA7 power status
2023-07-27T12:08:20.337 DEBUG:teuthology.orchestra.console:check power output: 
2023-07-27T12:08:20.338 WARNING:teuthology.contextutil:'wait for power on' reached maximum tries (16) after waiting for 60.0 seconds

It would be great if we could get to the bottom of the IPMI issues.
Edit: To be clear, I mean BMCs and/or switches. I don't believe it's e.g. a bug in ipmitool.

Actions #6

Updated by Zack Cerza 9 months ago

Not sure if anyone's looking at these, but this job had skipped reimages, failed ipmi, and "stale jobs detected":
http://qa-proxy.ceph.com/teuthology/yuriw-2023-07-28_14:25:29-rados-wip-yuri7-testing-2023-07-27-1336-quincy-distro-default-smithi/7355454/

Actions #7

Updated by Zack Cerza 9 months ago

Another "wrong OS" job: https://pulpito-ng.ceph.com/runs/zack-2023-08-02_17:13:43-powercycle-reef-distro-default-smithi/jobs/7358284

The supervisor log: http://qa-proxy.ceph.com/teuthology/zack-2023-08-02_17:13:43-powercycle-reef-distro-default-smithi/7358284/supervisor.7358284.log
The relevant bits:

2023-08-02T17:17:18.577 DEBUG:teuthology.provision.fog.smithi018:FOG request: GET /host {"name": "smithi018"}
2023-08-02T17:17:18.652 DEBUG:teuthology.provision.fog.smithi018:FOG request: GET /image {"name": "smithi_ubuntu_20.04"}
2023-08-02T17:17:18.694 DEBUG:teuthology.provision.fog.smithi018:Requesting image smithi_ubuntu_20.04 (ID 2)
2023-08-02T17:17:18.695 DEBUG:teuthology.provision.fog.smithi018:FOG request: PUT /host/124 {"imageID": 2}
2023-08-02T17:17:18.755 INFO:teuthology.provision.fog.smithi018:Scheduling deploy of ubuntu 20.04
2023-08-02T17:17:18.755 DEBUG:teuthology.provision.fog.smithi018:FOG request: GET /task/active None
2023-08-02T17:17:18.867 DEBUG:teuthology.provision.fog.smithi018:FOG request: GET /tasktype {"name": "deploy"}
2023-08-02T17:17:18.899 DEBUG:teuthology.provision.fog.smithi018:FOG request: POST /host/124/task {"taskTypeID": 1}

Note that the three other nodes in that job were reimaged correctly, and of course followed the exact same process.

The node's console log during the reimage process:
http://qa-proxy.ceph.com/teuthology/zack-2023-08-02_17:13:43-powercycle-reef-distro-default-smithi/7358284/console_logs/smithi018_reimage.log
The relevant bits:

note: Storage Location 172.21.15.200:/images/, Image name smithi_rhel_8.6

Fog's reimaging log: http://fog.front.sepia.ceph.com/fog/management/index.php?node=report&sub=file&f=aW1hZ2luZyBsb2c%3D
The relevant bits:

fog    smithi018    2023-08-02 07:44:18    2023-08-02 07:46:16    1 minute 58 seconds    smithi_ubuntu_20.04    Deploy
fog    smithi018    2023-08-02 17:19:03    2023-08-02 17:20:01    58 seconds    smithi_rhel_8.6    Deploy

smithi018's page: http://fog.front.sepia.ceph.com/fog/management/index.php?node=host&sub=edit&id=124
The image is set to "smithi_rhel_8.6 - (5)"

So we are requesting the correct image, but FOG simply ignores the request for some reason.
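One cheap mitigation would be to read the host record back after the PUT and refuse to schedule the deploy if the imageID didn't stick. A hypothetical sketch (the function name and the injectable `call(method, path, data)` parameter are mine, not teuthology's; it only mirrors the PUT/GET/POST sequence visible in the supervisor log above):

```python
def schedule_deploy_verified(call, host_id, image_id, deploy_task_type=1):
    """Set a host's image, re-read it to confirm, then schedule a deploy.

    If FOG silently kept the old imageID (the smithi_rhel_8.6-instead-of-
    ubuntu case above), fail fast instead of deploying the wrong OS.
    `call(method, path, data)` performs one FOG API request and returns
    the decoded JSON body (or None).
    """
    # Mirrors: PUT /host/124 {"imageID": 2}
    call("PUT", f"/host/{host_id}", {"imageID": image_id})
    # Extra read-back step that the current flow lacks.
    host = call("GET", f"/host/{host_id}", None)
    got = int(host.get("imageID", -1))
    if got != image_id:
        raise RuntimeError(
            f"FOG ignored image update: asked for {image_id}, host has {got}")
    # Mirrors: POST /host/124/task {"taskTypeID": 1}
    call("POST", f"/host/{host_id}/task", {"taskTypeID": deploy_task_type})
    return True
```

This wouldn't explain why FOG drops the update, but it would turn a wrong-OS job failure into an immediate, attributable provisioning error.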

Actions #8

Updated by Laura Flores 9 months ago

Since it's long been suspected that this is a network-related issue, we discussed in the last Infrastructure meeting the idea of adding additional monitoring (e.g. ICMP probing via blackbox_exporter: https://github.com/prometheus/blackbox_exporter/blob/master/CONFIGURATION.md#icmp_probe) to help us prove whether the network is really at play here.
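For reference, a minimal blackbox_exporter module for ICMP probing might look like the following (a sketch based on the linked CONFIGURATION.md; the timeout value is an assumption, and Prometheus would still need a scrape config listing the smithi hosts as probe targets):

```yaml
modules:
  icmp:
    prober: icmp
    timeout: 5s
    icmp:
      preferred_ip_protocol: "ip4"
```

The exporter's probe_success and probe_duration_seconds metrics per target could then be overlaid on the reimage-failure panel to confirm or rule out network blips.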

Plan for this TBD at the next Infrastructure meeting.

Actions #9

Updated by Zack Cerza 8 months ago

Saw a new (to me) problem yesterday. During a period of high failure rates, several jobs had been "in line" for a reimage for over an hour. That's a FOG-ism where the machines have netbooted into FOG's reimage client, but FOG is too busy to serve them at the moment. I also found some "not enough children"-type messages in a PHP log, of all places, so I tracked down and increased those settings.

Still, even during periods where there were no active reimages, these machines sat around and waited, and eventually their jobs failed after over five hours of waiting. An example:

http://qa-proxy.ceph.com/teuthology/yuriw-2023-08-15_16:56:50-fs-wip-yuri7-testing-2023-08-15-0741-pacific-distro-default-smithi/7368819/console_logs/smithi181_reimage.log
The last lines from the console log:

 * Attempting to check in............................Failed
 * No Active Task found for Host: smithi181 (0c:c4:7a:88:7b:9d) (In line for 5:11:25)
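To spot these queued nodes before a job burns five-plus hours, one could scan console logs for the "In line for" marker. A hypothetical helper (not part of teuthology; the threshold is an assumption):

```python
import re
from datetime import timedelta

# Matches the "(In line for 5:11:25)" marker in FOG console logs.
IN_LINE_RE = re.compile(r"In line for (\d+):(\d{2}):(\d{2})")

def stuck_in_line(log_text, threshold=timedelta(hours=1)):
    """Return (longest_wait, over_threshold) for a console log's text."""
    waits = [timedelta(hours=int(h), minutes=int(m), seconds=int(s))
             for h, m, s in IN_LINE_RE.findall(log_text)]
    longest = max(waits, default=timedelta(0))
    return longest, longest >= threshold
```

Run periodically against active jobs' console logs, this would let a supervisor abort (or alert on) a node stuck in FOG's queue after an hour instead of waiting out the full job timeout.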
