Bug #45441 (closed)

rados: "Health check failed: 1/3 mons down, quorum a,c (MON_DOWN)" in cluster log

Added by Brad Hubbard almost 4 years ago. Updated about 3 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: -
Target version: -
% Done: 0%
Source: Q/A
Tags:
Backport: pacific,octopus
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

/a/yuriw-2020-05-05_15:20:13-rados-wip-yuri8-testing-2020-05-04-2117-octopus-distro-basic-smithi/5024888

mon.0 = a
mon.1 = b
mon.2 = c

monmap e1: 3 mons at {a=[v2:172.21.15.133:3300/0,v1:172.21.15.133:6789/0],b=[v2:172.21.15.133:3301/0,v1:172.21.15.133:6790/0],c=[v2:172.21.15.133:3302/0,v1:172.21.15.133:6791/0]}

From the log of 'a'

Look for the warning:

2020-05-06T02:34:28.022+0000 7fe456d4d700  0 log_channel(cluster) log [WRN] : Health check failed: 1/3 mons down, quorum a,c (MON_DOWN)

Look for the previous reference to mon.1 (b):

2020-05-06T02:34:23.019+0000 7fe456d4d700  5 mon.a@0(electing).elector(7) handle_propose from mon.1
2020-05-06T02:34:23.019+0000 7fe456d4d700 10 mon.a@0(electing).elector(7) handle_propose required features 2449958747315978244 mon_feature_t([kraken,luminous,mimic,osdmap-prune,nautilus,octopus]), peer features 4540138292836696063 mon_feature_t([kraken,luminous,mimic,osdmap-prune,nautilus,octopus])
2020-05-06T02:34:23.019+0000 7fe45ad55700  1 -- [v2:172.21.15.133:3300/0,v1:172.21.15.133:6789/0] >> [v2:172.21.15.133:3301/0,v1:172.21.15.133:6790/0] conn(0x5605838fcc00 msgr2=0x560582736880 secure :-1 s=STATE_CONNECTION_ESTABLISHED l=0).read_bulk peer close file descriptor 34
2020-05-06T02:34:23.019+0000 7fe45ad55700  1 -- [v2:172.21.15.133:3300/0,v1:172.21.15.133:6789/0] >> [v2:172.21.15.133:3301/0,v1:172.21.15.133:6790/0] conn(0x5605838fcc00 msgr2=0x560582736880 secure :-1 s=STATE_CONNECTION_ESTABLISHED l=0).read_until read failed
2020-05-06T02:34:23.019+0000 7fe45ad55700  1 --2- [v2:172.21.15.133:3300/0,v1:172.21.15.133:6789/0] >> [v2:172.21.15.133:3301/0,v1:172.21.15.133:6790/0] conn(0x5605838fcc00 0x560582736880 secure :-1 s=READY pgs=4 cs=0 l=0 rx=0x5605834560f0 tx=0x56058343d920).handle_read_frame_preamble_main read frame length and tag failed r=-1 ((1) Operation not permitted)
2020-05-06T02:34:23.019+0000 7fe45ad55700  1 --2- [v2:172.21.15.133:3300/0,v1:172.21.15.133:6789/0] >> [v2:172.21.15.133:3301/0,v1:172.21.15.133:6790/0] conn(0x5605838fcc00 0x560582736880 unknown :-1 s=READY pgs=4 cs=0 l=0 rx=0 tx=0)._fault with nothing to send, going to standby
2020-05-06T02:34:23.020+0000 7fe455d4b700  1 --2- [v2:172.21.15.133:3300/0,v1:172.21.15.133:6789/0] >>  conn(0x5605835e2c00 0x560583472d80 unknown :-1 s=NONE pgs=0 cs=0 l=0 rx=0 tx=0).accept
2020-05-06T02:34:23.020+0000 7fe45ad55700  1 --2- [v2:172.21.15.133:3300/0,v1:172.21.15.133:6789/0] >>  conn(0x5605835e2c00 0x560583472d80 unknown :-1 s=BANNER_ACCEPTING pgs=0 cs=0 l=0 rx=0 tx=0)._handle_peer_banner_payload supported=0 required=0
2020-05-06T02:34:23.020+0000 7fe45ad55700  0 -- [v2:172.21.15.133:3300/0,v1:172.21.15.133:6789/0] >>  conn(0x5605835e2c00 msgr2=0x560583472d80 unknown :-1 s=STATE_CONNECTION_ESTABLISHED l=0).read_until injecting socket failure
2020-05-06T02:34:23.020+0000 7fe45ad55700  1 -- [v2:172.21.15.133:3300/0,v1:172.21.15.133:6789/0] >>  conn(0x5605835e2c00 msgr2=0x560583472d80 unknown :-1 s=STATE_CONNECTION_ESTABLISHED l=0).read_bulk peer close file descriptor 34
2020-05-06T02:34:23.020+0000 7fe45ad55700  1 -- [v2:172.21.15.133:3300/0,v1:172.21.15.133:6789/0] >>  conn(0x5605835e2c00 msgr2=0x560583472d80 unknown :-1 s=STATE_CONNECTION_ESTABLISHED l=0).read_until read failed
2020-05-06T02:34:23.020+0000 7fe45ad55700  1 --2- [v2:172.21.15.133:3300/0,v1:172.21.15.133:6789/0] >>  conn(0x5605835e2c00 0x560583472d80 unknown :-1 s=AUTH_ACCEPTING pgs=0 cs=0 l=0 rx=0 tx=0).handle_read_frame_preamble_main read frame length and tag failed r=-1 ((1) Operation not permitted)
2020-05-06T02:34:23.020+0000 7fe45ad55700  1 --2- [v2:172.21.15.133:3300/0,v1:172.21.15.133:6789/0] >>  conn(0x5605835e2c00 0x560583472d80 unknown :-1 s=AUTH_ACCEPTING pgs=0 cs=0 l=0 rx=0 tx=0).stop
2020-05-06T02:34:23.020+0000 7fe456d4d700 10 mon.a@0(electing) e1 ms_handle_reset 0x5605835e2c00 -

From the log of 'b'

2020-05-06T02:34:23.018+0000 7fc09da60700 10 paxos.1).electionLogic(6) bump_epoch6 to 7
2020-05-06T02:34:23.018+0000 7fc0a2269700  1 --2- [v2:172.21.15.133:3301/0,v1:172.21.15.133:6790/0] >> [v2:172.21.15.133:3300/0,v1:172.21.15.133:6789/0] conn(0x5567c7ff2c00 0x5567c6e2c880 unknown :-1 s=THROTTLE_DONE pgs=4 cs=0 l=0 rx=0x5567c7b4c300 tx=0x5567c7b33920).handle_read_frame_epilogue_main read frame epilogue bytes=32
2020-05-06T02:34:23.019+0000 7fc09da60700 10 mon.b@1(electing) e1 join_election
2020-05-06T02:34:23.019+0000 7fc09da60700 10 mon.b@1(electing) e1 _reset
2020-05-06T02:34:23.019+0000 7fc09da60700 10 mon.b@1(electing).auth v2 _set_mon_num_rank num 0 rank 0
2020-05-06T02:34:23.019+0000 7fc09da60700 10 mon.b@1(electing) e1 cancel_probe_timeout (none scheduled)
2020-05-06T02:34:23.019+0000 7fc09da60700 10 mon.b@1(electing) e1 timecheck_finish
2020-05-06T02:34:23.019+0000 7fc09da60700 15 mon.b@1(electing) e1 health_tick_stop
2020-05-06T02:34:23.019+0000 7fc09da60700 15 mon.b@1(electing) e1 health_interval_stop
2020-05-06T02:34:23.019+0000 7fc09da60700 10 mon.b@1(electing) e1 scrub_event_cancel
2020-05-06T02:34:23.019+0000 7fc09da60700 10 mon.b@1(electing) e1 scrub_reset
2020-05-06T02:34:23.019+0000 7fc09da60700 10 mon.b@1(electing).paxos(paxos recovering c 1..8) restart -- canceling timeouts
2020-05-06T02:34:23.019+0000 7fc09da60700 10 mon.b@1(electing).paxosservice(mdsmap 1..1) restart
2020-05-06T02:34:23.019+0000 7fc09da60700 10 mon.b@1(electing).paxosservice(osdmap 1..2) restart
2020-05-06T02:34:23.019+0000 7fc09da60700 10 mon.b@1(electing).paxosservice(logm 1..3) restart
2020-05-06T02:34:23.019+0000 7fc09da60700 10 mon.b@1(electing).paxosservice(monmap 1..1) restart
2020-05-06T02:34:23.019+0000 7fc09da60700 10 mon.b@1(electing).paxosservice(auth 1..2) restart
2020-05-06T02:34:23.019+0000 7fc09da60700 10 mon.b@1(electing).paxosservice(mgr 1..1) restart
2020-05-06T02:34:23.019+0000 7fc09da60700 10 mon.b@1(electing).paxosservice(mgrstat 1..1) restart
2020-05-06T02:34:23.019+0000 7fc09da60700 10 mon.b@1(electing).paxosservice(health 1..1) restart
2020-05-06T02:34:23.019+0000 7fc09da60700 10 mon.b@1(electing).paxosservice(config 1..2) restart
2020-05-06T02:34:23.019+0000 7fc09da60700  1 -- [v2:172.21.15.133:3301/0,v1:172.21.15.133:6790/0] send_to--> mon [v2:172.21.15.133:3300/0,v1:172.21.15.133:6789/0] -- election(a2feb1eb-2f05-4b4f-bae6-96248acb7c3c propose rel 15 e7) v8 -- ?+0 0x5567c6dcdb00
2020-05-06T02:34:23.019+0000 7fc09da60700  1 -- [v2:172.21.15.133:3301/0,v1:172.21.15.133:6790/0] --> [v2:172.21.15.133:3300/0,v1:172.21.15.133:6789/0] -- election(a2feb1eb-2f05-4b4f-bae6-96248acb7c3c propose rel 15 e7) v8 -- 0x5567c6dcdb00 con 0x5567c7ff2c00
2020-05-06T02:34:23.019+0000 7fc09da60700  1 -- [v2:172.21.15.133:3301/0,v1:172.21.15.133:6790/0] send_to--> mon [v2:172.21.15.133:3302/0,v1:172.21.15.133:6791/0] -- election(a2feb1eb-2f05-4b4f-bae6-96248acb7c3c propose rel 15 e7) v8 -- ?+0 0x5567c7c2e6c0
2020-05-06T02:34:23.019+0000 7fc09da60700  1 -- [v2:172.21.15.133:3301/0,v1:172.21.15.133:6790/0] --> [v2:172.21.15.133:3302/0,v1:172.21.15.133:6791/0] -- election(a2feb1eb-2f05-4b4f-bae6-96248acb7c3c propose rel 15 e7) v8 -- 0x5567c7c2e6c0 con 0x5567c7ff2000
2020-05-06T02:34:23.019+0000 7fc09ca5e700  1 --2- [v2:172.21.15.133:3301/0,v1:172.21.15.133:6790/0] >> [v2:172.21.15.133:3302/0,v1:172.21.15.133:6791/0] conn(0x5567c7ff2000 0x5567c6e2cd00 secure :-1 s=READY pgs=6 cs=0 l=0 rx=0x5567c7b4c0f0 tx=0x5567c7b92000).handle_read_frame_epilogue_main read frame epilogue bytes=32
2020-05-06T02:34:23.019+0000 7fc0a2269700  1 --2- [v2:172.21.15.133:3301/0,v1:172.21.15.133:6790/0] >> [v2:172.21.15.133:3300/0,v1:172.21.15.133:6789/0] conn(0x5567c7ff2c00 0x5567c6e2c880 unknown :-1 s=READY pgs=4 cs=0 l=0 rx=0x5567c7b4c300 tx=0x5567c7b33920).handle_read_frame_epilogue_main read frame epilogue bytes=32
2020-05-06T02:34:23.019+0000 7fc0a2269700  0 -- [v2:172.21.15.133:3301/0,v1:172.21.15.133:6790/0] >> [v2:172.21.15.133:3300/0,v1:172.21.15.133:6789/0] conn(0x5567c7ff2c00 msgr2=0x5567c6e2c880 unknown :-1 s=STATE_CONNECTION_ESTABLISHED l=0).read_until injecting socket failure
2020-05-06T02:34:23.019+0000 7fc0a2269700  1 -- [v2:172.21.15.133:3301/0,v1:172.21.15.133:6790/0] >> [v2:172.21.15.133:3300/0,v1:172.21.15.133:6789/0] conn(0x5567c7ff2c00 msgr2=0x5567c6e2c880 unknown :-1 s=STATE_CONNECTION_ESTABLISHED l=0).read_bulk peer close file descriptor 31
2020-05-06T02:34:23.019+0000 7fc0a2269700  1 -- [v2:172.21.15.133:3301/0,v1:172.21.15.133:6790/0] >> [v2:172.21.15.133:3300/0,v1:172.21.15.133:6789/0] conn(0x5567c7ff2c00 msgr2=0x5567c6e2c880 unknown :-1 s=STATE_CONNECTION_ESTABLISHED l=0).read_until read failed
2020-05-06T02:34:23.019+0000 7fc0a2269700  1 --2- [v2:172.21.15.133:3301/0,v1:172.21.15.133:6790/0] >> [v2:172.21.15.133:3300/0,v1:172.21.15.133:6789/0] conn(0x5567c7ff2c00 0x5567c6e2c880 unknown :-1 s=READY pgs=4 cs=0 l=0 rx=0x5567c7b4c300 tx=0x5567c7b33920).handle_read_frame_preamble_main read frame length and tag failed r=-1 ((1) Operation not permitted)
2020-05-06T02:34:23.019+0000 7fc0a2269700  1 --2- [v2:172.21.15.133:3301/0,v1:172.21.15.133:6790/0] >> [v2:172.21.15.133:3300/0,v1:172.21.15.133:6789/0] conn(0x5567c7ff2c00 0x5567c6e2c880 unknown :-1 s=READY pgs=4 cs=0 l=0 rx=0 tx=0)._fault with nothing to send, going to standby
2020-05-06T02:34:23.019+0000 7fc09da60700  1 -- [v2:172.21.15.133:3301/0,v1:172.21.15.133:6790/0] <== mon.1 v2:172.21.15.133:3301/0 0 ==== log(1 entries from seq 4 at 2020-05-06T02:34:23.018657+0000) v1 ==== 0+0+0 (unknown 0 0 0) 0x5567c7bf8c40 con 0x5567c6ea5400
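
The "injecting socket failure" lines in both logs are produced by the messenger failure-injection knob that the rados msgr-failures suite fragments turn on. As a rough sketch only (the exact fragment and value used by this particular job are assumptions, not read from its yaml), that injection is enabled with an override along these lines:

overrides:
  ceph:
    conf:
      global:
        # inject a socket failure roughly once every N messenger operations
        ms inject socket failures: 5000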

This seems to be related to the failure injection; we probably need to whitelist 'MON_DOWN' warnings in the cases where we are injecting failures.
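
A minimal sketch of what such a whitelist entry would look like in a suite yaml, assuming the branch still uses the log-whitelist key (later branches rename it to log-ignorelist); which fragment should carry the entry is not decided here:

overrides:
  ceph:
    log-whitelist:
      # ignore the expected MON_DOWN health warning while failures are injected
      - \(MON_DOWN\)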


Related issues 3 (0 open, 3 closed)

Has duplicate RADOS - Bug #44694: MON_DOWN during cluster setup (Duplicate)

Copied to RADOS - Backport #49401: pacific: rados: "Health check failed: 1/3 mons down, quorum a,c (MON_DOWN)" in cluster log (Resolved, Neha Ojha)
Copied to RADOS - Backport #49402: octopus: rados: "Health check failed: 1/3 mons down, quorum a,c (MON_DOWN)" in cluster log (Resolved, Neha Ojha)
