Bug #62050
Status: open
High volume of reimage failures in teuthology suites
% Done: 0%
Description
Description of problem¶
There is a high volume of reimaging failures in teuthology suites. For instance: "Error reimaging machines: Expected smithi172's OS to be ubuntu 20.04 but found centos 8" and "Error reimaging machines: Cannot connect to remote host smithi097".
How to reproduce¶
Run some jobs in a teuthology suite (e.g. rados, rbd).
Time noticed / Event Window (in UTC)¶
First reported by Yuri on July 14th, 7:21 AM UTC. The failures have recurred in intermittent windows since then.
In this panel, from July 14th to now, you can see spikes in reimage failures:
https://grafana-route-grafana.apps.os.sepia.ceph.com/d/teuthology/teuthology?orgId=1&from=1689292800000&to=now&refresh=1m&viewPanel=31
Actual results¶
High volume of reimage failures.
Expected results¶
Little to no reimage failures.
Additional analysis¶
I'm seeing a correlation between reimage failures and system load, visible across these panels:
- Teuthology status reimage results panel: https://grafana-route-grafana.apps.os.sepia.ceph.com/d/teuthology/teuthology?orgId=1&from=1689523724268&to=1689610124268&refresh=1m&viewPanel=31
- Fog system load panel: https://grafana-route-grafana.apps.os.sepia.ceph.com/d/node-exporter/node-exporter?orgId=1&var-origin_prometheus=&var-job=node-exporter&var-hostname=fog.front.sepia.ceph.com&var-node=fog.front.sepia.ceph.com:9100&var-device=All&var-interval=2m&var-maxmount=%2F&var-show_hostname=fog.front.sepia.ceph.com&var-total=6&var-DS_PROMETHEUS=Prometheus&from=1689523787001&to=1689610187001&viewPanel=13
- Teuthology system load panel: https://grafana-route-grafana.apps.os.sepia.ceph.com/d/node-exporter/node-exporter?orgId=1&from=1689523901560&to=1689610301560&var-origin_prometheus=&var-job=node-exporter&var-hostname=teuthology&var-node=teuthology.front.sepia.ceph.com:9100&var-device=All&var-interval=2m&var-maxmount=%2F&var-show_hostname=teuthology&var-total=6&var-DS_PROMETHEUS=Prometheus&viewPanel=13
Updated by Laura Flores 10 months ago
- Copied from Bug #62049: High volume of reimage failures in teuthology suites added
Updated by Laura Flores 10 months ago
- Copied from deleted (Bug #62049: High volume of reimage failures in teuthology suites)
Updated by Laura Flores 10 months ago
Linking a recent rados baseline as an example:
https://pulpito.ceph.com/yuriw-2023-07-14_23:46:34-rados-reef-distro-default-smithi/
63/307 jobs (20%) were affected by reimage failures.
Updated by Zack Cerza 10 months ago
A somewhat special example: https://pulpito-ng.ceph.com/akupczyk-2023-07-17_10:29:23-rados-wip-aclamk-pere-log-runway-distro-default-smithi/7341232/
RuntimeError: Expected smithi050's OS to be ubuntu 20.04 but found rhel 8.6
The machine's console log shows it wasn't reimaged at all: http://qa-proxy.ceph.com/teuthology/akupczyk-2023-07-17_10:29:23-rados-wip-aclamk-pere-log-runway-distro-default-smithi/7341232/console_logs/smithi050_reimage.log
Updated by Zack Cerza 10 months ago
More reimaging failures today. In this job, IPMI appeared to fail completely for one of the nodes:
http://qa-proxy.ceph.com/teuthology/yuriw-2023-07-27_03:42:08-fs-wip-yuri10-testing-2023-07-26-0858-pacific-distro-default-smithi/7354256/supervisor.7354256.log
2023-07-27T12:08:19.228 DEBUG:teuthology.orchestra.console:pexpect command: ipmitool -H smithi169.ipmi.sepia.ceph.com -I lanplus -U inktank -P ApGNXcA7 power status
2023-07-27T12:08:20.337 DEBUG:teuthology.orchestra.console:check power output:
2023-07-27T12:08:20.338 WARNING:teuthology.contextutil:'wait for power on' reached maximum tries (16) after waiting for 60.0 seconds
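For context, the "reached maximum tries (16) after waiting for 60.0 seconds" warning is the shape of a bounded retry loop giving up on the power-status check. A minimal Python sketch of that pattern follows; the per-try interval is inferred (60 s / 16 tries = 3.75 s) and this is illustrative, not teuthology's actual implementation:

```python
import time

def wait_for(check, tries=16, sleep=3.75, desc="wait for power on"):
    """Retry `check` until it returns True, or give up after `tries`
    attempts, producing a message like the warning in the log above.
    Illustrative sketch only; intervals are an assumption."""
    for attempt in range(1, tries + 1):
        if check():
            return True
        time.sleep(sleep)
    raise RuntimeError(
        f"'{desc}' reached maximum tries ({tries}) "
        f"after waiting for {tries * sleep} seconds"
    )
```

With IPMI unresponsive, `check` never succeeds, so the supervisor hits the cap and the job fails rather than hanging forever.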
It would be great if we could get to the bottom of the IPMI issues.
Edit: To be clear, I mean BMCs and/or switches. I don't believe it's e.g. a bug in ipmitool.
Updated by Zack Cerza 10 months ago
Not sure if anyone's looking at these, but this job had skipped reimages, failed ipmi, and "stale jobs detected":
http://qa-proxy.ceph.com/teuthology/yuriw-2023-07-28_14:25:29-rados-wip-yuri7-testing-2023-07-27-1336-quincy-distro-default-smithi/7355454/
Updated by Zack Cerza 10 months ago
Another "wrong OS" job: https://pulpito-ng.ceph.com/runs/zack-2023-08-02_17:13:43-powercycle-reef-distro-default-smithi/jobs/7358284
The supervisor log: http://qa-proxy.ceph.com/teuthology/zack-2023-08-02_17:13:43-powercycle-reef-distro-default-smithi/7358284/supervisor.7358284.log
The relevant bits:
2023-08-02T17:17:18.577 DEBUG:teuthology.provision.fog.smithi018:FOG request: GET /host {"name": "smithi018"}
2023-08-02T17:17:18.652 DEBUG:teuthology.provision.fog.smithi018:FOG request: GET /image {"name": "smithi_ubuntu_20.04"}
2023-08-02T17:17:18.694 DEBUG:teuthology.provision.fog.smithi018:Requesting image smithi_ubuntu_20.04 (ID 2)
2023-08-02T17:17:18.695 DEBUG:teuthology.provision.fog.smithi018:FOG request: PUT /host/124 {"imageID": 2}
2023-08-02T17:17:18.755 INFO:teuthology.provision.fog.smithi018:Scheduling deploy of ubuntu 20.04
2023-08-02T17:17:18.755 DEBUG:teuthology.provision.fog.smithi018:FOG request: GET /task/active None
2023-08-02T17:17:18.867 DEBUG:teuthology.provision.fog.smithi018:FOG request: GET /tasktype {"name": "deploy"}
2023-08-02T17:17:18.899 DEBUG:teuthology.provision.fog.smithi018:FOG request: POST /host/124/task {"taskTypeID": 1}
Note that the three other nodes in that job were reimaged correctly, and of course followed the exact same process.
The node's console log during the reimage process:
http://qa-proxy.ceph.com/teuthology/zack-2023-08-02_17:13:43-powercycle-reef-distro-default-smithi/7358284/console_logs/smithi018_reimage.log
The relevant bits:
note: Storage Location 172.21.15.200:/images/, Image name smithi_rhel_8.6
Fog's reimaging log: http://fog.front.sepia.ceph.com/fog/management/index.php?node=report&sub=file&f=aW1hZ2luZyBsb2c%3D
The relevant bits:
fog smithi018 2023-08-02 07:44:18 2023-08-02 07:46:16 1 minute 58 seconds smithi_ubuntu_20.04 Deploy
fog smithi018 2023-08-02 17:19:03 2023-08-02 17:20:01 58 seconds smithi_rhel_8.6 Deploy
smithi018's page: http://fog.front.sepia.ceph.com/fog/management/index.php?node=host&sub=edit&id=124
The image is set to "smithi_rhel_8.6 - (5)"
So, we are requesting the correct image, but FOG simply ignores that for some reason.
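To summarize the supervisor log above as data, here is a sketch of the request sequence teuthology appears to issue against the FOG REST API. The paths, payloads, and IDs (host 124, image 2, taskTypeID 1) come straight from the log; the helper function itself is hypothetical, not teuthology's actual code:

```python
def fog_deploy_requests(host_name, image_name, host_id, image_id,
                        deploy_task_type_id=1):
    """Return the (method, path, body) sequence reconstructed from the
    supervisor log. Illustrative only; not teuthology's real client."""
    return [
        ("GET", "/host", {"name": host_name}),              # resolve host ID
        ("GET", "/image", {"name": image_name}),            # resolve image ID
        ("PUT", f"/host/{host_id}", {"imageID": image_id}), # set host's target image
        ("GET", "/task/active", None),                      # check active tasks
        ("GET", "/tasktype", {"name": "deploy"}),           # resolve "deploy" task type
        ("POST", f"/host/{host_id}/task",                   # schedule the deploy
         {"taskTypeID": deploy_task_type_id}),
    ]

for method, path, body in fog_deploy_requests(
        "smithi018", "smithi_ubuntu_20.04", 124, 2):
    print(method, path, body)
```

The PUT setting `imageID: 2` is what makes the later `smithi_rhel_8.6` deploy so strange: the image assignment either never took effect or was overwritten on the FOG side.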
Updated by Laura Flores 10 months ago
Since it has long been suspected that this is a network-related issue, we discussed in the last Infrastructure meeting the idea of adding monitoring (e.g. ICMP probes via blackbox_exporter: https://github.com/prometheus/blackbox_exporter/blob/master/CONFIGURATION.md#icmp_probe) to help prove whether the network is really at play here.
Plan for this TBD at the next Infrastructure meeting.
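For reference, a minimal blackbox_exporter ICMP setup would look roughly like this. The probe targets are hosts from this ticket; where the exporter runs and its port (9115, the default) are assumptions to be settled at the meeting:

```yaml
# blackbox.yml — blackbox_exporter module definition (standard ICMP prober)
modules:
  icmp:
    prober: icmp
    timeout: 5s
    icmp:
      preferred_ip_protocol: ip4
---
# prometheus.yml fragment — scrape the exporter's /probe endpoint per target
scrape_configs:
  - job_name: blackbox-icmp
    metrics_path: /probe
    params:
      module: [icmp]
    static_configs:
      - targets:
          - fog.front.sepia.ceph.com
          - teuthology.front.sepia.ceph.com
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target     # target goes in the ?target= param
      - source_labels: [__param_target]
        target_label: instance           # keep the probed host as the instance label
      - target_label: __address__
        replacement: localhost:9115      # blackbox_exporter address (assumption)
```

The resulting `probe_success` and `probe_duration_seconds` metrics could then be graphed alongside the reimage-failure panel to confirm or rule out network loss during the failure windows.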
Updated by Zack Cerza 9 months ago
Saw a new (to me) problem yesterday. During a period of high failure rates, I saw several jobs had been "in line" for a reimage for over an hour. That's a FOG-ism where the machines have netbooted into FOG's reimage client, but FOG is too busy to serve them at the moment. I also found some "not enough children"-type messages in a PHP log, of all places, so I found and increased those settings.
Still, even during periods where there were no active reimages, these machines sat around and waited, and eventually their jobs failed after over five hours of waiting. An example:
http://qa-proxy.ceph.com/teuthology/yuriw-2023-08-15_16:56:50-fs-wip-yuri7-testing-2023-08-15-0741-pacific-distro-default-smithi/7368819/console_logs/smithi181_reimage.log
The last lines from the console log:
 * Attempting to check in............................Failed
 * No Active Task found for Host: smithi181 (0c:c4:7a:88:7b:9d) (In line for 5:11:25)