Actions
Bug #45441
closedrados: Health check failed: 1/3 mons down, quorum a,c (MON_DOWN)" in cluster log'
% Done:
0%
Source:
Q/A
Tags:
Backport:
pacific,octopus
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
/a/yuriw-2020-05-05_15:20:13-rados-wip-yuri8-testing-2020-05-04-2117-octopus-distro-basic-smithi/5024888
mon.0 = a mon.1 = b mon.2 = c monmap e1: 3 mons at {a=[v2:172.21.15.133:3300/0,v1:172.21.15.133:6789/0],b=[v2:172.21.15.133:3301/0,v1:172.21.15.133:6790/0],c=[v2:172.21.15.133:3302/0,v1:172.21.15.133:6791/0]} From the log of 'a' Look for warning. 2020-05-06T02:34:28.022+0000 7fe456d4d700 0 log_channel(cluster) log [WRN] : Health check failed: 1/3 mons down, quorum a,c (MON_DOWN) Look for previous reference to mon.1 (b) 2020-05-06T02:34:23.019+0000 7fe456d4d700 5 mon.a@0(electing).elector(7) handle_propose from mon.1 2020-05-06T02:34:23.019+0000 7fe456d4d700 10 mon.a@0(electing).elector(7) handle_propose required features 2449958747315978244 mon_feature_t([kraken,luminous,mimic,osdmap-prune,nautilus,octopus]), peer features 4540138292836696063 mon_feature_t([kraken,luminous,mimic,osdmap-prune,nautilus,octopus]) 2020-05-06T02:34:23.019+0000 7fe45ad55700 1 -- [v2:172.21.15.133:3300/0,v1:172.21.15.133:6789/0] >> [v2:172.21.15.133:3301/0,v1:172.21.15.133:6790/0] conn(0x5605838fcc00 msgr2=0x560582736880 secure :-1 s=STATE_CONNECTION_ESTABLISHED l=0).read_bulk peer close file descriptor 34 2020-05-06T02:34:23.019+0000 7fe45ad55700 1 -- [v2:172.21.15.133:3300/0,v1:172.21.15.133:6789/0] >> [v2:172.21.15.133:3301/0,v1:172.21.15.133:6790/0] conn(0x5605838fcc00 msgr2=0x560582736880 secure :-1 s=STATE_CONNECTION_ESTABLISHED l=0).read_until read failed 2020-05-06T02:34:23.019+0000 7fe45ad55700 1 --2- [v2:172.21.15.133:3300/0,v1:172.21.15.133:6789/0] >> [v2:172.21.15.133:3301/0,v1:172.21.15.133:6790/0] conn(0x5605838fcc00 0x560582736880 secure :-1 s=READY pgs=4 cs=0 l=0 rx=0x5605834560f0 tx=0x56058343d920).handle_read_frame_preamble_main read frame length and tag failed r=-1 ((1) Operation not permitted) 2020-05-06T02:34:23.019+0000 7fe45ad55700 1 --2- [v2:172.21.15.133:3300/0,v1:172.21.15.133:6789/0] >> [v2:172.21.15.133:3301/0,v1:172.21.15.133:6790/0] conn(0x5605838fcc00 0x560582736880 unknown :-1 s=READY pgs=4 cs=0 l=0 rx=0 tx=0)._fault with nothing to send, going to standby 2020-05-06T02:34:23.020+0000 7fe455d4b700 1 --2- [v2:172.21.15.133:3300/0,v1:172.21.15.133:6789/0] >> conn(0x5605835e2c00 0x560583472d80 unknown :-1 s=NONE pgs=0 cs=0 l=0 rx=0 tx=0).accept 2020-05-06T02:34:23.020+0000 7fe45ad55700 1 --2- [v2:172.21.15.133:3300/0,v1:172.21.15.133:6789/0] >> conn(0x5605835e2c00 0x560583472d80 unknown :-1 s=BANNER_ACCEPTING pgs=0 cs=0 l=0 rx=0 tx=0)._handle_peer_banner_payload supported=0 required=0 2020-05-06T02:34:23.020+0000 7fe45ad55700 0 -- [v2:172.21.15.133:3300/0,v1:172.21.15.133:6789/0] >> conn(0x5605835e2c00 msgr2=0x560583472d80 unknown :-1 s=STATE_CONNECTION_ESTABLISHED l=0).read_until injecting socket failure 2020-05-06T02:34:23.020+0000 7fe45ad55700 1 -- [v2:172.21.15.133:3300/0,v1:172.21.15.133:6789/0] >> conn(0x5605835e2c00 msgr2=0x560583472d80 unknown :-1 s=STATE_CONNECTION_ESTABLISHED l=0).read_bulk peer close file descriptor 34 2020-05-06T02:34:23.020+0000 7fe45ad55700 1 -- [v2:172.21.15.133:3300/0,v1:172.21.15.133:6789/0] >> conn(0x5605835e2c00 msgr2=0x560583472d80 unknown :-1 s=STATE_CONNECTION_ESTABLISHED l=0).read_until read failed 2020-05-06T02:34:23.020+0000 7fe45ad55700 1 --2- [v2:172.21.15.133:3300/0,v1:172.21.15.133:6789/0] >> conn(0x5605835e2c00 0x560583472d80 unknown :-1 s=AUTH_ACCEPTING pgs=0 cs=0 l=0 rx=0 tx=0).handle_read_frame_preamble_main read frame length and tag failed r=-1 ((1) Operation not permitted) 2020-05-06T02:34:23.020+0000 7fe45ad55700 1 --2- [v2:172.21.15.133:3300/0,v1:172.21.15.133:6789/0] >> conn(0x5605835e2c00 0x560583472d80 unknown :-1 s=AUTH_ACCEPTING pgs=0 cs=0 l=0 rx=0 tx=0).stop 2020-05-06T02:34:23.020+0000 7fe456d4d700 10 mon.a@0(electing) e1 ms_handle_reset 0x5605835e2c00 - From the log of 'b' 2020-05-06T02:34:23.018+0000 7fc09da60700 10 paxos.1).electionLogic(6) bump_epoch6 to 7 2020-05-06T02:34:23.018+0000 7fc0a2269700 1 --2- [v2:172.21.15.133:3301/0,v1:172.21.15.133:6790/0] >> [v2:172.21.15.133:3300/0,v1:172.21.15.133:6789/0] conn(0x5567c7ff2c00 0x5567c6e2c880 unknown :-1 s=THROTTLE_DONE pgs=4 cs=0 l=0 rx=0x5567c7b4c300 tx=0x5567c7b33920).handle_read_frame_epilogue_main read frame epilogue bytes=32 2020-05-06T02:34:23.019+0000 7fc09da60700 10 mon.b@1(electing) e1 join_election 2020-05-06T02:34:23.019+0000 7fc09da60700 10 mon.b@1(electing) e1 _reset 2020-05-06T02:34:23.019+0000 7fc09da60700 10 mon.b@1(electing).auth v2 _set_mon_num_rank num 0 rank 0 2020-05-06T02:34:23.019+0000 7fc09da60700 10 mon.b@1(electing) e1 cancel_probe_timeout (none scheduled) 2020-05-06T02:34:23.019+0000 7fc09da60700 10 mon.b@1(electing) e1 timecheck_finish 2020-05-06T02:34:23.019+0000 7fc09da60700 15 mon.b@1(electing) e1 health_tick_stop 2020-05-06T02:34:23.019+0000 7fc09da60700 15 mon.b@1(electing) e1 health_interval_stop 2020-05-06T02:34:23.019+0000 7fc09da60700 10 mon.b@1(electing) e1 scrub_event_cancel 2020-05-06T02:34:23.019+0000 7fc09da60700 10 mon.b@1(electing) e1 scrub_reset 2020-05-06T02:34:23.019+0000 7fc09da60700 10 mon.b@1(electing).paxos(paxos recovering c 1..8) restart -- canceling timeouts 2020-05-06T02:34:23.019+0000 7fc09da60700 10 mon.b@1(electing).paxosservice(mdsmap 1..1) restart 2020-05-06T02:34:23.019+0000 7fc09da60700 10 mon.b@1(electing).paxosservice(osdmap 1..2) restart 2020-05-06T02:34:23.019+0000 7fc09da60700 10 mon.b@1(electing).paxosservice(logm 1..3) restart 2020-05-06T02:34:23.019+0000 7fc09da60700 10 mon.b@1(electing).paxosservice(monmap 1..1) restart 2020-05-06T02:34:23.019+0000 7fc09da60700 10 mon.b@1(electing).paxosservice(auth 1..2) restart 2020-05-06T02:34:23.019+0000 7fc09da60700 10 mon.b@1(electing).paxosservice(mgr 1..1) restart 2020-05-06T02:34:23.019+0000 7fc09da60700 10 mon.b@1(electing).paxosservice(mgrstat 1..1) restart 2020-05-06T02:34:23.019+0000 7fc09da60700 10 mon.b@1(electing).paxosservice(health 1..1) restart 2020-05-06T02:34:23.019+0000 7fc09da60700 10 mon.b@1(electing).paxosservice(config 1..2) restart 2020-05-06T02:34:23.019+0000 7fc09da60700 1 -- [v2:172.21.15.133:3301/0,v1:172.21.15.133:6790/0] send_to--> mon [v2:172.21.15.133:3300/0,v1:172.21.15.133:6789/0] -- election(a2feb1eb-2f05-4b4f-bae6-96248acb7c3c propose rel 15 e7) v8 -- ?+0 0x5567c6dcdb00 2020-05-06T02:34:23.019+0000 7fc09da60700 1 -- [v2:172.21.15.133:3301/0,v1:172.21.15.133:6790/0] --> [v2:172.21.15.133:3300/0,v1:172.21.15.133:6789/0] -- election(a2feb1eb-2f05-4b4f-bae6-96248acb7c3c propose rel 15 e7) v8 -- 0x5567c6dcdb00 con 0x5567c7ff2c00 2020-05-06T02:34:23.019+0000 7fc09da60700 1 -- [v2:172.21.15.133:3301/0,v1:172.21.15.133:6790/0] send_to--> mon [v2:172.21.15.133:3302/0,v1:172.21.15.133:6791/0] -- election(a2feb1eb-2f05-4b4f-bae6-96248acb7c3c propose rel 15 e7) v8 -- ?+0 0x5567c7c2e6c0 2020-05-06T02:34:23.019+0000 7fc09da60700 1 -- [v2:172.21.15.133:3301/0,v1:172.21.15.133:6790/0] --> [v2:172.21.15.133:3302/0,v1:172.21.15.133:6791/0] -- election(a2feb1eb-2f05-4b4f-bae6-96248acb7c3c propose rel 15 e7) v8 -- 0x5567c7c2e6c0 con 0x5567c7ff2000 2020-05-06T02:34:23.019+0000 7fc09ca5e700 1 --2- [v2:172.21.15.133:3301/0,v1:172.21.15.133:6790/0] >> [v2:172.21.15.133:3302/0,v1:172.21.15.133:6791/0] conn(0x5567c7ff2000 0x5567c6e2cd00 secure :-1 s=READY pgs=6 cs=0 l=0 rx=0x5567c7b4c0f0 tx=0x5567c7b92000).handle_read_frame_epilogue_main read frame epilogue bytes=32 2020-05-06T02:34:23.019+0000 7fc0a2269700 1 --2- [v2:172.21.15.133:3301/0,v1:172.21.15.133:6790/0] >> [v2:172.21.15.133:3300/0,v1:172.21.15.133:6789/0] conn(0x5567c7ff2c00 0x5567c6e2c880 unknown :-1 s=READY pgs=4 cs=0 l=0 rx=0x5567c7b4c300 tx=0x5567c7b33920).handle_read_frame_epilogue_main read frame epilogue bytes=32 2020-05-06T02:34:23.019+0000 7fc0a2269700 0 -- [v2:172.21.15.133:3301/0,v1:172.21.15.133:6790/0] >> [v2:172.21.15.133:3300/0,v1:172.21.15.133:6789/0] conn(0x5567c7ff2c00 msgr2=0x5567c6e2c880 unknown :-1 s=STATE_CONNECTION_ESTABLISHED l=0).read_until injecting socket failure 2020-05-06T02:34:23.019+0000 7fc0a2269700 1 -- [v2:172.21.15.133:3301/0,v1:172.21.15.133:6790/0] >> [v2:172.21.15.133:3300/0,v1:172.21.15.133:6789/0] conn(0x5567c7ff2c00 msgr2=0x5567c6e2c880 unknown :-1 s=STATE_CONNECTION_ESTABLISHED l=0).read_bulk peer close file descriptor 31 2020-05-06T02:34:23.019+0000 7fc0a2269700 1 -- [v2:172.21.15.133:3301/0,v1:172.21.15.133:6790/0] >> [v2:172.21.15.133:3300/0,v1:172.21.15.133:6789/0] conn(0x5567c7ff2c00 msgr2=0x5567c6e2c880 unknown :-1 s=STATE_CONNECTION_ESTABLISHED l=0).read_until read failed 2020-05-06T02:34:23.019+0000 7fc0a2269700 1 --2- [v2:172.21.15.133:3301/0,v1:172.21.15.133:6790/0] >> [v2:172.21.15.133:3300/0,v1:172.21.15.133:6789/0] conn(0x5567c7ff2c00 0x5567c6e2c880 unknown :-1 s=READY pgs=4 cs=0 l=0 rx=0x5567c7b4c300 tx=0x5567c7b33920).handle_read_frame_preamble_main read frame length and tag failed r=-1 ((1) Operation not permitted) 2020-05-06T02:34:23.019+0000 7fc0a2269700 1 --2- [v2:172.21.15.133:3301/0,v1:172.21.15.133:6790/0] >> [v2:172.21.15.133:3300/0,v1:172.21.15.133:6789/0] conn(0x5567c7ff2c00 0x5567c6e2c880 unknown :-1 s=READY pgs=4 cs=0 l=0 rx=0 tx=0)._fault with nothing to send, going to standby 2020-05-06T02:34:23.019+0000 7fc09da60700 1 -- [v2:172.21.15.133:3301/0,v1:172.21.15.133:6790/0] <== mon.1 v2:172.21.15.133:3301/0 0 ==== log(1 entries from seq 4 at 2020-05-06T02:34:23.018657+0000) v1 ==== 0+0+0 (unknown 0 0 0) 0x5567c7bf8c40 con 0x5567c6ea5400
This seems to be related to the failure injection and we probably need to whitelist 'MON_DOWN' warnings in the cases where we are injecting failures.
Actions