Bug #45318

open

"Health check failed: 2/6 mons down, quorum b,a,c,e (MON_DOWN)" in cluster log running tasks/mon_clock_no_skews.yaml

Added by Brad Hubbard about 4 years ago. Updated over 1 year ago.

Status: New
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: Q/A
Tags: medium-hanging-fruit
Backport: pacific,octopus,quincy
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

/a/teuthology-2020-04-26_02:30:03-rados-octopus-distro-basic-smithi/4984906

The mon.d log shows that the monitor came back up around 09:41:11.

2020-04-26T09:41:09.547+0000 7f3e7a667540  4 rocksdb: [db/db_impl.cc:390] Shutdown: canceling all background work
2020-04-26T09:41:09.551+0000 7f3e7a667540  4 rocksdb: [db/db_impl.cc:563] Shutdown complete
2020-04-26T09:41:09.551+0000 7f3e7a667540  0 ceph-mon: created monfs at /var/lib/ceph/mon/ceph-d for mon.d
2020-04-26T09:41:11.139+0000 7f2561752540  0 ceph version 15.2.1-136-ga8c125c7d7 (a8c125c7d78f5cd973863993d258cd717ade4c99) octopus (stable), process ceph-mon, pid 12193

Around that time, the teuthology log shows:

2020-04-26T09:41:11.031 INFO:tasks.ceph.mon.d:Restarting daemon
2020-04-26T09:41:11.031 INFO:teuthology.orchestra.run.smithi101:> true
2020-04-26T09:41:11.036 INFO:teuthology.orchestra.run.smithi101:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage daemon-helper kill ceph-mon -f --cluster ceph -i d
2020-04-26T09:41:11.082 INFO:tasks.ceph.mon.d:Started

So mon.d was restarted around 09:41:11, but the warning was issued at 09:41:26. The monitor log shows that quorum wasn't achieved until 09:41:27.

2020-04-26T09:41:27.343+0000 7f254c186700  7 mon.d@2(peon).log v7 update_from_paxos applying incremental log 7 2020-04-26T09:41:26.428693+0000 mon.b (mon.0) 27 : cluster [INF] mon.b is new leader, mons b,a,d,c,f,e in quorum (ranks 0,1,2,3,4,5)
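To make the timeline concrete, here is a minimal sketch that computes the gap between the restart and the return to full quorum. The timestamps are copied from the log excerpts above; the variable and function names are only illustrative and are not part of teuthology or Ceph.

```python
from datetime import datetime

# Timestamps copied from the log excerpts in this report (same day, +0000).
FMT = "%Y-%m-%dT%H:%M:%S.%f"

# teuthology log: "INFO:tasks.ceph.mon.d:Restarting daemon"
restart = datetime.strptime("2020-04-26T09:41:11.031", FMT)
# cluster log entry: "mon.b is new leader, mons b,a,d,c,f,e in quorum"
leader_elected = datetime.strptime("2020-04-26T09:41:26.428693", FMT)
# mon.d log: time at which mon.d applied that cluster log entry
applied_on_mon_d = datetime.strptime("2020-04-26T09:41:27.343", FMT)


def seconds_between(start, end):
    """Return the elapsed time from start to end in seconds."""
    return (end - start).total_seconds()


election_gap = seconds_between(restart, leader_elected)
apply_gap = seconds_between(restart, applied_on_mon_d)

print(f"restart -> full quorum reported: {election_gap:.3f}s")
print(f"restart -> entry applied on mon.d: {apply_gap:.3f}s")
```

Under these numbers, roughly 15 to 16 seconds pass between the restart and the full quorum being reported, which is the window in which the MON_DOWN warning was raised.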

I don't think we can whitelist this message if we want to catch actual failures in the mons during the test.
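A small sketch of why a blanket ignore pattern is risky: a regex keyed on the health check code cannot tell the transient warning seen here from a genuine monitor failure. The pattern and the second message are hypothetical and used only for illustration; they are not taken from the qa suite or from any log.

```python
import re

# Hypothetical ignore pattern keyed on the health check code (an assumption,
# not copied from any existing suite configuration).
IGNORE_PATTERN = re.compile(r"\(MON_DOWN\)")

# Message from this report: raised while the monitors were rejoining quorum.
transient = "Health check failed: 2/6 mons down, quorum b,a,c,e (MON_DOWN)"
# Hypothetical message for a monitor that genuinely failed mid-test.
genuine = "Health check failed: 1/6 mons down, quorum b,a,c,e,f (MON_DOWN)"


def is_ignored(line):
    """Return True if the cluster log line would be suppressed by the pattern."""
    return IGNORE_PATTERN.search(line) is not None


for label, line in (("transient", transient), ("genuine", genuine)):
    print(f"{label}: ignored={is_ignored(line)}")
```

Both lines match, so the pattern would hide the real failure along with the benign one.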


Related issues 1 (1 open, 0 closed)

Related to RADOS - Bug #57900: mon/crush_ops.sh: mons out of quorum (In Progress, Laura Flores)
