Bug #45902

thrashosds hits watchdog_daemon_timeout during powercycle

Added by Brad Hubbard almost 4 years ago. Updated almost 4 years ago.

Status: New
Priority: Normal
Assignee:
Category: -
Target version: -
% Done: 0%
Source: Q/A
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

/a/teuthology-2020-06-02_11:15:03-powercycle-master-testing-basic-smithi/5111856

The run fails with "teuthology.exceptions.CommandFailedError: Command failed (workunit test suites/ffsb.sh) on smithi116 with status 1", but earlier in the log we see the following:

2020-06-04T02:30:30.708 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.osd.1 is failed for ~303s
2020-06-04T02:30:30.709 INFO:tasks.daemonwatchdog.daemon_watchdog:BARK! unmounting mounts and killing all daemons

The last entry in osd.1's log shows it being told to shut down at 2020-06-04T02:25:24:

2020-06-04T02:25:24.909+0000 7f5f38726700 -1 osd.1 31 *** Got signal Terminated ***
2020-06-04T02:25:24.909+0000 7f5f38726700 -1 osd.1 31 *** Immediate shutdown (osd_fast_shutdown=true) ***

It looks like it was shut down because of a power cycle on smithi036:

2020-06-04T02:25:20.803 INFO:tasks.ceph.ceph_manager.ceph:kill_osd on osd.1 doing powercycle of ubuntu@smithi036.front.sepia.ceph.com

Power on completed at 02:29:02.

2020-06-04T02:29:02.878 INFO:teuthology.orchestra.console:Power on for smithi036 completed

But it's not until 02:30:43 that we reconnect to the host and try to start osd.1, and by then the watchdog has already timed out (it barked at 02:30:30).

2020-06-04T02:30:43.072 DEBUG:teuthology.orchestra.console:expect after: b'smithi036 login: '
2020-06-04T02:30:43.226 INFO:teuthology.misc:Re-opening connections...
2020-06-04T02:30:43.226 INFO:teuthology.misc:trying to connect to ubuntu@smithi036.front.sepia.ceph.com
2020-06-04T02:30:43.227 INFO:teuthology.orchestra.remote:Trying to reconnect to host
2020-06-04T02:30:43.228 DEBUG:teuthology.orchestra.connection:{'hostname': 'smithi036.front.sepia.ceph.com', 'username': 'ubuntu', 'timeout': 60}
2020-06-04T02:30:43.676 INFO:teuthology.orchestra.run.smithi036:> true
2020-06-04T02:30:44.245 DEBUG:teuthology.misc:waited 1.0186738967895508
2020-06-04T02:30:45.247 DEBUG:tasks.ceph_manager:Mounting data for osd.1 on ubuntu@smithi036.front.sepia.ceph.com

I think we should increase watchdog_daemon_timeout for the powercycle tests.
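
To make the timing argument easier to check, here is a rough sketch that computes the gaps between the log lines quoted above. The timestamps are copied from those excerpts; the event names and the 300 s timeout are my own assumptions (the "~303s" watchdog message suggests a value near 300, but I have not confirmed the configured setting for this run).

```python
from datetime import datetime

FMT = "%Y-%m-%dT%H:%M:%S.%f"

# Timestamps copied from the teuthology log excerpts in this report.
# The dictionary keys are illustrative labels, not teuthology identifiers.
EVENTS = {
    "powercycle_start": "2020-06-04T02:25:20.803",  # kill_osd doing powercycle
    "power_on_done": "2020-06-04T02:29:02.878",     # Power on completed
    "watchdog_bark": "2020-06-04T02:30:30.709",     # BARK!
    "login_prompt": "2020-06-04T02:30:43.072",      # console sees login prompt
    "osd_mount": "2020-06-04T02:30:45.247",         # Mounting data for osd.1
}

# Assumption: inferred from the "~303s" message, not read from the job config.
ASSUMED_TIMEOUT_S = 300


def elapsed(start: str, end: str) -> float:
    """Seconds between two named events."""
    t0 = datetime.strptime(EVENTS[start], FMT)
    t1 = datetime.strptime(EVENTS[end], FMT)
    return (t1 - t0).total_seconds()


if __name__ == "__main__":
    for name in ("power_on_done", "watchdog_bark", "login_prompt", "osd_mount"):
        print(f"powercycle_start -> {name}: {elapsed('powercycle_start', name):.3f}s")
    outage = elapsed("powercycle_start", "osd_mount")
    print(f"outage exceeds assumed timeout by {outage - ASSUMED_TIMEOUT_S:.3f}s")
```

By this arithmetic the power cycle itself takes about 222 s, and roughly another 100 s pass before the login prompt appears, so the OSD cannot be restarted until about 324 s after the powercycle began. Under the assumed 300 s timeout the watchdog fires first, about 12 s before the login prompt is seen.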


Related issues

Related to Ceph - Bug #45900: "ERROR: (22) Invalid argument" in powercycle New
Related to Ceph - Bug #47743: Error ENXIO: problem getting command descriptions from mon Duplicate

History

#1 Updated by Brad Hubbard almost 4 years ago

  • Related to Bug #45900: "ERROR: (22) Invalid argument" in powercycle added

#2 Updated by Brad Hubbard almost 4 years ago

@Neha, I'll take care of this one.

#3 Updated by Neha Ojha almost 4 years ago

Brad Hubbard wrote:

@Neha, I'll take care of this one.

Thanks Brad!

#4 Updated by Deepika Upadhyay over 3 years ago

  • Related to Bug #47743: Error ENXIO: problem getting command descriptions from mon added
