Bug #58098 (closed)

qa/workunits/rados/test_crash.sh: crashes are never posted

Added by Laura Flores over 1 year ago. Updated over 1 year ago.

Status: Resolved
Priority: High
Assignee:
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: Yes
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

/a/yuriw-2022-11-23_15:09:06-rados-wip-yuri10-testing-2022-11-22-1711-distro-default-smithi/7087281

2022-11-23T18:09:31.261 INFO:tasks.workunit.client.0.smithi149.stderr:+ sudo systemctl restart ceph-crash
2022-11-23T18:09:31.261 INFO:tasks.workunit.client.0.smithi149.stderr:+ sleep 30
2022-11-23T18:09:31.994 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.osd.0 is failed for ~5s
2022-11-23T18:09:31.995 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.osd.1 is failed for ~5s
2022-11-23T18:09:31.995 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.osd.2 is failed for ~5s
2022-11-23T18:09:37.197 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.osd.0 is failed for ~10s
2022-11-23T18:09:37.198 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.osd.1 is failed for ~10s
2022-11-23T18:09:37.198 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.osd.2 is failed for ~10s
2022-11-23T18:09:42.400 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.osd.0 is failed for ~16s
2022-11-23T18:09:42.400 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.osd.1 is failed for ~16s
2022-11-23T18:09:42.401 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.osd.2 is failed for ~16s
2022-11-23T18:09:47.603 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.osd.0 is failed for ~21s
2022-11-23T18:09:47.603 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.osd.1 is failed for ~21s
2022-11-23T18:09:47.603 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.osd.2 is failed for ~21s
2022-11-23T18:09:52.805 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.osd.0 is failed for ~26s
2022-11-23T18:09:52.805 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.osd.1 is failed for ~26s
2022-11-23T18:09:52.805 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.osd.2 is failed for ~26s
2022-11-23T18:09:58.008 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.osd.0 is failed for ~31s
2022-11-23T18:09:58.008 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.osd.1 is failed for ~31s
2022-11-23T18:09:58.008 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.osd.2 is failed for ~31s
2022-11-23T18:10:01.262 INFO:tasks.workunit.client.0.smithi149.stderr:++ ceph crash ls
2022-11-23T18:10:01.263 INFO:tasks.workunit.client.0.smithi149.stderr:++ wc -l
2022-11-23T18:10:01.699 DEBUG:teuthology.orchestra.run:got remote process result: 1
2022-11-23T18:10:01.700 INFO:tasks.workunit.client.0.smithi149.stderr:+ '[' 0 = 4 ']'
2022-11-23T18:10:01.700 INFO:tasks.workunit.client.0.smithi149.stderr:+ exit 1

The issue here seems to be that we check for crashes too early. After inducing the crashes, we restart ceph-crash and sleep for 30 seconds, but it sometimes takes longer than that for the OSDs to come back and for the crashes to be posted. In this case, OSDs 0, 1, and 2 took ~31 seconds to restart, so the check ran before any crash reports existed.

A possible fix is to sleep for longer, or to check for crashes in a retry loop with a timeout, as sketched below.
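
For illustration only, a minimal sketch of the retry-loop idea. The 4-line expectation (header plus 3 crashes) comes from the failed check in the log above; the 5-second polling interval and 120-second cap are arbitrary assumptions, not values from test_crash.sh:

    sudo systemctl restart ceph-crash
    timeout=120   # assumed overall cap, not from the original script
    interval=5    # assumed polling interval
    waited=0
    # header line + 3 induced crashes = 4 lines, matching the '[ ... = 4 ]' check in the log
    while [ "$(ceph crash ls | wc -l)" -lt 4 ]; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "crashes were not posted within ${timeout}s"
            exit 1
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done

This keeps the fast path fast (it exits as soon as the crashes appear) while tolerating slow OSD restarts up to the timeout.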


Files

journalctl-b0.gz (97.5 KB) - Laura Flores, 12/06/2022 03:42 PM