Bug #22846

closed

"Health check failed: 1/3 mons down, quorum a,c (MON_DOWN)" in cluster log with msgr-failures/fastclose.yaml

Added by Kefu Chai about 6 years ago. Updated about 5 years ago.

Status: Closed
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: Development
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

/a/kchai-2018-01-31_01:48:16-rados-wip-kefu-testing-2018-01-31-0034-distro-basic-mira/2130028

I am not sure this is caused by the fastclose.yaml setting. mon.b failed to respond to mon.a's paxos(begin) message in a timely manner, and it was also unable to rejoin the quorum within 15 seconds. It kept trying to start an election and did not respond to mon.a's probe message. mon.a was the leader before the election started.


2018-01-31 10:11:24.574 7f894510d700 10 mon.a@0(leader).paxos(paxos updating c 1..164)  sending begin to mon.1
2018-01-31 10:11:24.574 7f894510d700 10 mon.a@0(leader).paxos(paxos updating c 1..164)  sending begin to mon.2
...
2018-01-31 10:11:24.578 7f8942908700  1 -- 172.21.6.138:6789/0 <== mon.2 172.21.6.138:6790/0 489 ==== paxos(accept lc 164 fc 0 pn 300 opn 0) v4 ==== 84+0+0 (4169294484 0 0) 0x556113b51c00 con 0x556113faf500
...
2018-01-31 10:11:33.826 7f8942908700  1 -- 172.21.6.138:6789/0 <== mon.1 172.21.7.104:6789/0 648 ==== paxos(accept lc 164 fc 0 pn 300 opn 0) v4 ==== 84+0+0 (3515085151 0 0) 0x55611429f900 con 0x556113faee00
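
To make the delay concrete, the following is a minimal sketch that computes how long each peer took to answer the paxos begin, using only the timestamps from the log excerpt above. It assumes that both accept messages answer the begin sent at 10:11:24.574 and that rank 1 (mon.1) is mon.b; the names `BEGIN_SENT`, `ACCEPTS`, and `accept_latency` are hypothetical and not part of Ceph.

```python
from datetime import datetime

# Timestamps copied from the mon.a log excerpt above.
# Assumption: both accepts answer the begin sent at 10:11:24.574.
BEGIN_SENT = "2018-01-31 10:11:24.574"
ACCEPTS = {
    "mon.2": "2018-01-31 10:11:24.578",
    "mon.1": "2018-01-31 10:11:33.826",  # assumed to be mon.b (rank 1)
}

FMT = "%Y-%m-%d %H:%M:%S.%f"


def accept_latency(begin: str, accept: str) -> float:
    """Return the seconds elapsed between sending begin and receiving accept."""
    delta = datetime.strptime(accept, FMT) - datetime.strptime(begin, FMT)
    return delta.total_seconds()


latencies = {peer: accept_latency(BEGIN_SENT, ts) for peer, ts in ACCEPTS.items()}
for peer, secs in latencies.items():
    print(f"{peer}: {secs:.3f} s")
```

Under these assumptions, mon.2 answered within a few milliseconds while mon.1 took roughly nine seconds, which is consistent with a slow peer rather than a dropped connection.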

Related issues 1 (1 open, 0 closed)

Related to sepia - Bug #22926: mira090 1 of 8 drives need replacing (New, David Galloway, 02/06/2018)

Actions #1

Updated by Kefu Chai about 6 years ago

  • Description updated (diff)
Actions #2

Updated by Kefu Chai about 6 years ago

/a/kchai-2018-02-05_11:48:09-rados-wip-kefu-testing-2018-02-05-1650-distro-basic-mira/2155015

Interestingly, it is again mon.b that failed to reply to mon.a's proposal in time, and again mon.b was deployed on mira090.

Actions #3

Updated by Kefu Chai about 6 years ago

On mira090, the kernel log shows:

[  391.562081] sd 0:0:0:7: [sdh] tag#9 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[  391.562089] sd 0:0:0:7: [sdh] tag#9 Sense Key : Illegal Request [current]
[  391.562094] sd 0:0:0:7: [sdh] tag#9 Add. Sense: Invalid command operation code
[  391.562098] sd 0:0:0:7: [sdh] tag#9 CDB: Write same(16) 93 08 00 00 00 00 04 7f ff f7 00 7f ff ff 00 00
[  391.562101] blk_update_request: critical target error, dev sdh, sector 75497463
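
When checking other test nodes for the same symptom, a small script can pull the affected device and sector out of kernel log lines like the ones above. This is only a sketch: the regular expression is fitted to the single `blk_update_request` line shown here and may not match other kernel versions, and `BLK_ERROR`, `SAMPLE`, and `failing_devices` are hypothetical names.

```python
import re

# Pattern fitted to the one blk_update_request line quoted above;
# other kernels may word this message differently.
BLK_ERROR = re.compile(
    r"blk_update_request: (?P<kind>.+? error), dev (?P<dev>\w+), sector (?P<sector>\d+)"
)

SAMPLE = """\
[  391.562081] sd 0:0:0:7: [sdh] tag#9 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[  391.562101] blk_update_request: critical target error, dev sdh, sector 75497463
"""


def failing_devices(dmesg_text: str) -> dict:
    """Map each device name to the list of sectors reported in block errors."""
    hits = {}
    for line in dmesg_text.splitlines():
        match = BLK_ERROR.search(line)
        if match:
            hits.setdefault(match.group("dev"), []).append(int(match.group("sector")))
    return hits


print(failing_devices(SAMPLE))
```

A hit from this helper only flags a device for closer inspection; it does not by itself prove the disk is bad.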
Actions #4

Updated by Kefu Chai about 6 years ago

  • Related to Bug #22926: mira090 1 of 8 drives need replacing added
Actions #5

Updated by Josh Durgin about 6 years ago

  • Status changed from New to Closed

Looks like just a bad disk.

Actions #6

Updated by Greg Farnum about 5 years ago

  • Project changed from RADOS to Messengers
  • Category deleted (Correctness/Safety)