Bug #62050
Status: open
High volume of reimage failures in teuthology suites
% Done: 0%
Description
Description of problem¶
There is a high volume of reimaging failures in teuthology suites. For instance: "Error reimaging machines: Expected smithi172's OS to be ubuntu 20.04 but found centos 8" and "Error reimaging machines: Cannot connect to remote host smithi097".
How to reproduce¶
Run some jobs in a teuthology suite (e.g. rados, rbd).
Time noticed / Event Window (in UTC)¶
First reported by Yuri on July 14th, 7:21 AM UTC. The failures have recurred in intermittent windows since then.
In this panel, from July 14th to now, you can see spikes in reimage failures:
https://grafana-route-grafana.apps.os.sepia.ceph.com/d/teuthology/teuthology?orgId=1&from=1689292800000&to=now&refresh=1m&viewPanel=31
Actual results¶
High volume of reimage failures.
Expected results¶
Little to no reimage failures.
Additional analysis¶
I'm seeing a correlation between reimage failures and system load, visible across these panels:
- Teuthology status reimage results panel: https://grafana-route-grafana.apps.os.sepia.ceph.com/d/teuthology/teuthology?orgId=1&from=1689523724268&to=1689610124268&refresh=1m&viewPanel=31
- Fog system load panel: https://grafana-route-grafana.apps.os.sepia.ceph.com/d/node-exporter/node-exporter?orgId=1&var-origin_prometheus=&var-job=node-exporter&var-hostname=fog.front.sepia.ceph.com&var-node=fog.front.sepia.ceph.com:9100&var-device=All&var-interval=2m&var-maxmount=%2F&var-show_hostname=fog.front.sepia.ceph.com&var-total=6&var-DS_PROMETHEUS=Prometheus&from=1689523787001&to=1689610187001&viewPanel=13
- Teuthology system load panel: https://grafana-route-grafana.apps.os.sepia.ceph.com/d/node-exporter/node-exporter?orgId=1&from=1689523901560&to=1689610301560&var-origin_prometheus=&var-job=node-exporter&var-hostname=teuthology&var-node=teuthology.front.sepia.ceph.com:9100&var-device=All&var-interval=2m&var-maxmount=%2F&var-show_hostname=teuthology&var-total=6&var-DS_PROMETHEUS=Prometheus&viewPanel=13
Updated by Laura Flores 10 months ago
- Copied from Bug #62049: High volume of reimage failures in teuthology suites added
Updated by Laura Flores 10 months ago
- Copied from deleted (Bug #62049: High volume of reimage failures in teuthology suites)
Updated by Laura Flores 10 months ago
Linking a recent rados baseline as an example:
https://pulpito.ceph.com/yuriw-2023-07-14_23:46:34-rados-reef-distro-default-smithi/
63/307 jobs (20%) were affected by reimage failures.
Updated by Zack Cerza 10 months ago
A somewhat special example: https://pulpito-ng.ceph.com/akupczyk-2023-07-17_10:29:23-rados-wip-aclamk-pere-log-runway-distro-default-smithi/7341232/
RuntimeError: Expected smithi050's OS to be ubuntu 20.04 but found rhel 8.6
The machine's console log shows it wasn't reimaged at all: http://qa-proxy.ceph.com/teuthology/akupczyk-2023-07-17_10:29:23-rados-wip-aclamk-pere-log-runway-distro-default-smithi/7341232/console_logs/smithi050_reimage.log
Updated by Zack Cerza 10 months ago
More reimaging failures today. In this job, IPMI appeared to fail completely for one of the nodes:
http://qa-proxy.ceph.com/teuthology/yuriw-2023-07-27_03:42:08-fs-wip-yuri10-testing-2023-07-26-0858-pacific-distro-default-smithi/7354256/supervisor.7354256.log
2023-07-27T12:08:19.228 DEBUG:teuthology.orchestra.console:pexpect command: ipmitool -H smithi169.ipmi.sepia.ceph.com -I lanplus -U inktank -P ApGNXcA7 power status
2023-07-27T12:08:20.337 DEBUG:teuthology.orchestra.console:check power output:
2023-07-27T12:08:20.338 WARNING:teuthology.contextutil:'wait for power on' reached maximum tries (16) after waiting for 60.0 seconds
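For context, the "reached maximum tries (16) after waiting for 60.0 seconds" warning is the shape of a bounded retry loop giving up on the power-status check. A minimal Python sketch of that pattern follows; the per-try interval is inferred (60 s / 16 tries = 3.75 s) and this is illustrative, not teuthology's actual implementation:

```python
import time

def wait_for(check, tries=16, sleep=3.75, desc="wait for power on"):
    """Retry `check` until it returns True, or give up after `tries`
    attempts, producing a message like the warning in the log above.
    Illustrative sketch only; intervals are an assumption."""
    for attempt in range(1, tries + 1):
        if check():
            return True
        time.sleep(sleep)
    raise RuntimeError(
        f"'{desc}' reached maximum tries ({tries}) "
        f"after waiting for {tries * sleep} seconds"
    )
```

With IPMI unresponsive, `check` never succeeds, so the supervisor hits the cap and the job fails rather than hanging forever.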
It would be great if we could get to the bottom of the IPMI issues.
Edit: To be clear, I mean BMCs and/or switches. I don't believe it's e.g. a bug in ipmitool.
Updated by Zack Cerza 10 months ago
Not sure if anyone's looking at these, but this job had skipped reimages, failed ipmi, and "stale jobs detected":
http://qa-proxy.ceph.com/teuthology/yuriw-2023-07-28_14:25:29-rados-wip-yuri7-testing-2023-07-27-1336-quincy-distro-default-smithi/7355454/
Updated by Zack Cerza 10 months ago
Another "wrong OS" job: https://pulpito-ng.ceph.com/runs/zack-2023-08-02_17:13:43-powercycle-reef-distro-default-smithi/jobs/7358284
The supervisor log: http://qa-proxy.ceph.com/teuthology/zack-2023-08-02_17:13:43-powercycle-reef-distro-default-smithi/7358284/supervisor.7358284.log
The relevant bits:
2023-08-02T17:17:18.577 DEBUG:teuthology.provision.fog.smithi018:FOG request: GET /host {"name": "smithi018"}
2023-08-02T17:17:18.652 DEBUG:teuthology.provision.fog.smithi018:FOG request: GET /image {"name": "smithi_ubuntu_20.04"}
2023-08-02T17:17:18.694 DEBUG:teuthology.provision.fog.smithi018:Requesting image smithi_ubuntu_20.04 (ID 2)
2023-08-02T17:17:18.695 DEBUG:teuthology.provision.fog.smithi018:FOG request: PUT /host/124 {"imageID": 2}
2023-08-02T17:17:18.755 INFO:teuthology.provision.fog.smithi018:Scheduling deploy of ubuntu 20.04
2023-08-02T17:17:18.755 DEBUG:teuthology.provision.fog.smithi018:FOG request: GET /task/active None
2023-08-02T17:17:18.867 DEBUG:teuthology.provision.fog.smithi018:FOG request: GET /tasktype {"name": "deploy"}
2023-08-02T17:17:18.899 DEBUG:teuthology.provision.fog.smithi018:FOG request: POST /host/124/task {"taskTypeID": 1}
Note that the three other nodes in that job were reimaged correctly, and of course followed the exact same process.
The node's console log during the reimage process:
http://qa-proxy.ceph.com/teuthology/zack-2023-08-02_17:13:43-powercycle-reef-distro-default-smithi/7358284/console_logs/smithi018_reimage.log
The relevant bits:
note: Storage Location 172.21.15.200:/images/, Image name smithi_rhel_8.6
Fog's reimaging log: http://fog.front.sepia.ceph.com/fog/management/index.php?node=report&sub=file&f=aW1hZ2luZyBsb2c%3D
The relevant bits:
fog smithi018 2023-08-02 07:44:18 2023-08-02 07:46:16 1 minute 58 seconds smithi_ubuntu_20.04 Deploy
fog smithi018 2023-08-02 17:19:03 2023-08-02 17:20:01 58 seconds smithi_rhel_8.6 Deploy
smithi018's page: http://fog.front.sepia.ceph.com/fog/management/index.php?node=host&sub=edit&id=124
The image is set to "smithi_rhel_8.6 - (5)"
So, we are requesting the correct image, but FOG simply ignores that for some reason.
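To summarize the supervisor log above as data, here is a sketch of the request sequence teuthology appears to issue against the FOG REST API. The paths, payloads, and IDs (host 124, image 2, taskTypeID 1) come straight from the log; the helper function itself is hypothetical, not teuthology's actual code:

```python
def fog_deploy_requests(host_name, image_name, host_id, image_id,
                        deploy_task_type_id=1):
    """Return the (method, path, body) sequence reconstructed from the
    supervisor log. Illustrative only; not teuthology's real client."""
    return [
        ("GET", "/host", {"name": host_name}),              # resolve host ID
        ("GET", "/image", {"name": image_name}),            # resolve image ID
        ("PUT", f"/host/{host_id}", {"imageID": image_id}), # set host's target image
        ("GET", "/task/active", None),                      # check active tasks
        ("GET", "/tasktype", {"name": "deploy"}),           # resolve "deploy" task type
        ("POST", f"/host/{host_id}/task",                   # schedule the deploy
         {"taskTypeID": deploy_task_type_id}),
    ]

for method, path, body in fog_deploy_requests(
        "smithi018", "smithi_ubuntu_20.04", 124, 2):
    print(method, path, body)
```

The PUT setting `imageID: 2` is what makes the later `smithi_rhel_8.6` deploy so strange: the image assignment either never took effect or was overwritten on the FOG side.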
Updated by Laura Flores 10 months ago
Since it has long been suspected that this is a network-related issue, we discussed in the last Infrastructure meeting the idea of adding monitoring (e.g. ICMP probes via blackbox_exporter: https://github.com/prometheus/blackbox_exporter/blob/master/CONFIGURATION.md#icmp_probe) to help prove whether the network is really at play here.
Plan for this TBD at the next Infrastructure meeting.
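For reference, a minimal blackbox_exporter ICMP setup would look roughly like this. The probe targets are hosts from this ticket; where the exporter runs and its port (9115, the default) are assumptions to be settled at the meeting:

```yaml
# blackbox.yml — blackbox_exporter module definition (standard ICMP prober)
modules:
  icmp:
    prober: icmp
    timeout: 5s
    icmp:
      preferred_ip_protocol: ip4
---
# prometheus.yml fragment — scrape the exporter's /probe endpoint per target
scrape_configs:
  - job_name: blackbox-icmp
    metrics_path: /probe
    params:
      module: [icmp]
    static_configs:
      - targets:
          - fog.front.sepia.ceph.com
          - teuthology.front.sepia.ceph.com
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target     # target goes in the ?target= param
      - source_labels: [__param_target]
        target_label: instance           # keep the probed host as the instance label
      - target_label: __address__
        replacement: localhost:9115      # blackbox_exporter address (assumption)
```

The resulting `probe_success` and `probe_duration_seconds` metrics could then be graphed alongside the reimage-failure panel to confirm or rule out network loss during the failure windows.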
Updated by Zack Cerza 9 months ago
Saw a new (to me) problem yesterday. During a period of high failure rates, I saw several jobs had been "in line" for a reimage for over an hour. That's a FOG-ism where the machines have netbooted into FOG's reimage client, but FOG is too busy to serve them at the moment. I also found some "not enough children"-type messages in a PHP log, of all places, so I found and increased those settings.
Still, even during periods where there were no active reimages, these machines sat around and waited, and eventually their jobs failed after over five hours of waiting. An example:
http://qa-proxy.ceph.com/teuthology/yuriw-2023-08-15_16:56:50-fs-wip-yuri7-testing-2023-08-15-0741-pacific-distro-default-smithi/7368819/console_logs/smithi181_reimage.log
The last lines from the console log:
 * Attempting to check in............................Failed
 * No Active Task found for Host: smithi181 (0c:c4:7a:88:7b:9d) (In line for 5:11:25)