Bug #45441
Status: Closed
rados: "Health check failed: 1/3 mons down, quorum a,c (MON_DOWN)" in cluster log
Description
/a/yuriw-2020-05-05_15:20:13-rados-wip-yuri8-testing-2020-05-04-2117-octopus-distro-basic-smithi/5024888
mon.0 = a  mon.1 = b  mon.2 = c
monmap e1: 3 mons at {a=[v2:172.21.15.133:3300/0,v1:172.21.15.133:6789/0],b=[v2:172.21.15.133:3301/0,v1:172.21.15.133:6790/0],c=[v2:172.21.15.133:3302/0,v1:172.21.15.133:6791/0]}

From the log of 'a', look for the warning:

2020-05-06T02:34:28.022+0000 7fe456d4d700 0 log_channel(cluster) log [WRN] : Health check failed: 1/3 mons down, quorum a,c (MON_DOWN)

Look for the previous reference to mon.1 (b):

2020-05-06T02:34:23.019+0000 7fe456d4d700 5 mon.a@0(electing).elector(7) handle_propose from mon.1
2020-05-06T02:34:23.019+0000 7fe456d4d700 10 mon.a@0(electing).elector(7) handle_propose required features 2449958747315978244 mon_feature_t([kraken,luminous,mimic,osdmap-prune,nautilus,octopus]), peer features 4540138292836696063 mon_feature_t([kraken,luminous,mimic,osdmap-prune,nautilus,octopus])
2020-05-06T02:34:23.019+0000 7fe45ad55700 1 -- [v2:172.21.15.133:3300/0,v1:172.21.15.133:6789/0] >> [v2:172.21.15.133:3301/0,v1:172.21.15.133:6790/0] conn(0x5605838fcc00 msgr2=0x560582736880 secure :-1 s=STATE_CONNECTION_ESTABLISHED l=0).read_bulk peer close file descriptor 34
2020-05-06T02:34:23.019+0000 7fe45ad55700 1 -- [v2:172.21.15.133:3300/0,v1:172.21.15.133:6789/0] >> [v2:172.21.15.133:3301/0,v1:172.21.15.133:6790/0] conn(0x5605838fcc00 msgr2=0x560582736880 secure :-1 s=STATE_CONNECTION_ESTABLISHED l=0).read_until read failed
2020-05-06T02:34:23.019+0000 7fe45ad55700 1 --2- [v2:172.21.15.133:3300/0,v1:172.21.15.133:6789/0] >> [v2:172.21.15.133:3301/0,v1:172.21.15.133:6790/0] conn(0x5605838fcc00 0x560582736880 secure :-1 s=READY pgs=4 cs=0 l=0 rx=0x5605834560f0 tx=0x56058343d920).handle_read_frame_preamble_main read frame length and tag failed r=-1 ((1) Operation not permitted)
2020-05-06T02:34:23.019+0000 7fe45ad55700 1 --2- [v2:172.21.15.133:3300/0,v1:172.21.15.133:6789/0] >> [v2:172.21.15.133:3301/0,v1:172.21.15.133:6790/0] conn(0x5605838fcc00 0x560582736880 unknown :-1 s=READY pgs=4 cs=0 l=0 rx=0 tx=0)._fault with nothing to send, going to standby
2020-05-06T02:34:23.020+0000 7fe455d4b700 1 --2- [v2:172.21.15.133:3300/0,v1:172.21.15.133:6789/0] >> conn(0x5605835e2c00 0x560583472d80 unknown :-1 s=NONE pgs=0 cs=0 l=0 rx=0 tx=0).accept
2020-05-06T02:34:23.020+0000 7fe45ad55700 1 --2- [v2:172.21.15.133:3300/0,v1:172.21.15.133:6789/0] >> conn(0x5605835e2c00 0x560583472d80 unknown :-1 s=BANNER_ACCEPTING pgs=0 cs=0 l=0 rx=0 tx=0)._handle_peer_banner_payload supported=0 required=0
2020-05-06T02:34:23.020+0000 7fe45ad55700 0 -- [v2:172.21.15.133:3300/0,v1:172.21.15.133:6789/0] >> conn(0x5605835e2c00 msgr2=0x560583472d80 unknown :-1 s=STATE_CONNECTION_ESTABLISHED l=0).read_until injecting socket failure
2020-05-06T02:34:23.020+0000 7fe45ad55700 1 -- [v2:172.21.15.133:3300/0,v1:172.21.15.133:6789/0] >> conn(0x5605835e2c00 msgr2=0x560583472d80 unknown :-1 s=STATE_CONNECTION_ESTABLISHED l=0).read_bulk peer close file descriptor 34
2020-05-06T02:34:23.020+0000 7fe45ad55700 1 -- [v2:172.21.15.133:3300/0,v1:172.21.15.133:6789/0] >> conn(0x5605835e2c00 msgr2=0x560583472d80 unknown :-1 s=STATE_CONNECTION_ESTABLISHED l=0).read_until read failed
2020-05-06T02:34:23.020+0000 7fe45ad55700 1 --2- [v2:172.21.15.133:3300/0,v1:172.21.15.133:6789/0] >> conn(0x5605835e2c00 0x560583472d80 unknown :-1 s=AUTH_ACCEPTING pgs=0 cs=0 l=0 rx=0 tx=0).handle_read_frame_preamble_main read frame length and tag failed r=-1 ((1) Operation not permitted)
2020-05-06T02:34:23.020+0000 7fe45ad55700 1 --2- [v2:172.21.15.133:3300/0,v1:172.21.15.133:6789/0] >> conn(0x5605835e2c00 0x560583472d80 unknown :-1 s=AUTH_ACCEPTING pgs=0 cs=0 l=0 rx=0 tx=0).stop
2020-05-06T02:34:23.020+0000 7fe456d4d700 10 mon.a@0(electing) e1 ms_handle_reset 0x5605835e2c00 -

From the log of 'b':

2020-05-06T02:34:23.018+0000 7fc09da60700 10 paxos.1).electionLogic(6) bump_epoch6 to 7
2020-05-06T02:34:23.018+0000 7fc0a2269700 1 --2- [v2:172.21.15.133:3301/0,v1:172.21.15.133:6790/0] >> [v2:172.21.15.133:3300/0,v1:172.21.15.133:6789/0] conn(0x5567c7ff2c00 0x5567c6e2c880 unknown :-1 s=THROTTLE_DONE pgs=4 cs=0 l=0 rx=0x5567c7b4c300 tx=0x5567c7b33920).handle_read_frame_epilogue_main read frame epilogue bytes=32
2020-05-06T02:34:23.019+0000 7fc09da60700 10 mon.b@1(electing) e1 join_election
2020-05-06T02:34:23.019+0000 7fc09da60700 10 mon.b@1(electing) e1 _reset
2020-05-06T02:34:23.019+0000 7fc09da60700 10 mon.b@1(electing).auth v2 _set_mon_num_rank num 0 rank 0
2020-05-06T02:34:23.019+0000 7fc09da60700 10 mon.b@1(electing) e1 cancel_probe_timeout (none scheduled)
2020-05-06T02:34:23.019+0000 7fc09da60700 10 mon.b@1(electing) e1 timecheck_finish
2020-05-06T02:34:23.019+0000 7fc09da60700 15 mon.b@1(electing) e1 health_tick_stop
2020-05-06T02:34:23.019+0000 7fc09da60700 15 mon.b@1(electing) e1 health_interval_stop
2020-05-06T02:34:23.019+0000 7fc09da60700 10 mon.b@1(electing) e1 scrub_event_cancel
2020-05-06T02:34:23.019+0000 7fc09da60700 10 mon.b@1(electing) e1 scrub_reset
2020-05-06T02:34:23.019+0000 7fc09da60700 10 mon.b@1(electing).paxos(paxos recovering c 1..8) restart -- canceling timeouts
2020-05-06T02:34:23.019+0000 7fc09da60700 10 mon.b@1(electing).paxosservice(mdsmap 1..1) restart
2020-05-06T02:34:23.019+0000 7fc09da60700 10 mon.b@1(electing).paxosservice(osdmap 1..2) restart
2020-05-06T02:34:23.019+0000 7fc09da60700 10 mon.b@1(electing).paxosservice(logm 1..3) restart
2020-05-06T02:34:23.019+0000 7fc09da60700 10 mon.b@1(electing).paxosservice(monmap 1..1) restart
2020-05-06T02:34:23.019+0000 7fc09da60700 10 mon.b@1(electing).paxosservice(auth 1..2) restart
2020-05-06T02:34:23.019+0000 7fc09da60700 10 mon.b@1(electing).paxosservice(mgr 1..1) restart
2020-05-06T02:34:23.019+0000 7fc09da60700 10 mon.b@1(electing).paxosservice(mgrstat 1..1) restart
2020-05-06T02:34:23.019+0000 7fc09da60700 10 mon.b@1(electing).paxosservice(health 1..1) restart
2020-05-06T02:34:23.019+0000 7fc09da60700 10 mon.b@1(electing).paxosservice(config 1..2) restart
2020-05-06T02:34:23.019+0000 7fc09da60700 1 -- [v2:172.21.15.133:3301/0,v1:172.21.15.133:6790/0] send_to--> mon [v2:172.21.15.133:3300/0,v1:172.21.15.133:6789/0] -- election(a2feb1eb-2f05-4b4f-bae6-96248acb7c3c propose rel 15 e7) v8 -- ?+0 0x5567c6dcdb00
2020-05-06T02:34:23.019+0000 7fc09da60700 1 -- [v2:172.21.15.133:3301/0,v1:172.21.15.133:6790/0] --> [v2:172.21.15.133:3300/0,v1:172.21.15.133:6789/0] -- election(a2feb1eb-2f05-4b4f-bae6-96248acb7c3c propose rel 15 e7) v8 -- 0x5567c6dcdb00 con 0x5567c7ff2c00
2020-05-06T02:34:23.019+0000 7fc09da60700 1 -- [v2:172.21.15.133:3301/0,v1:172.21.15.133:6790/0] send_to--> mon [v2:172.21.15.133:3302/0,v1:172.21.15.133:6791/0] -- election(a2feb1eb-2f05-4b4f-bae6-96248acb7c3c propose rel 15 e7) v8 -- ?+0 0x5567c7c2e6c0
2020-05-06T02:34:23.019+0000 7fc09da60700 1 -- [v2:172.21.15.133:3301/0,v1:172.21.15.133:6790/0] --> [v2:172.21.15.133:3302/0,v1:172.21.15.133:6791/0] -- election(a2feb1eb-2f05-4b4f-bae6-96248acb7c3c propose rel 15 e7) v8 -- 0x5567c7c2e6c0 con 0x5567c7ff2000
2020-05-06T02:34:23.019+0000 7fc09ca5e700 1 --2- [v2:172.21.15.133:3301/0,v1:172.21.15.133:6790/0] >> [v2:172.21.15.133:3302/0,v1:172.21.15.133:6791/0] conn(0x5567c7ff2000 0x5567c6e2cd00 secure :-1 s=READY pgs=6 cs=0 l=0 rx=0x5567c7b4c0f0 tx=0x5567c7b92000).handle_read_frame_epilogue_main read frame epilogue bytes=32
2020-05-06T02:34:23.019+0000 7fc0a2269700 1 --2- [v2:172.21.15.133:3301/0,v1:172.21.15.133:6790/0] >> [v2:172.21.15.133:3300/0,v1:172.21.15.133:6789/0] conn(0x5567c7ff2c00 0x5567c6e2c880 unknown :-1 s=READY pgs=4 cs=0 l=0 rx=0x5567c7b4c300 tx=0x5567c7b33920).handle_read_frame_epilogue_main read frame epilogue bytes=32
2020-05-06T02:34:23.019+0000 7fc0a2269700 0 -- [v2:172.21.15.133:3301/0,v1:172.21.15.133:6790/0] >> [v2:172.21.15.133:3300/0,v1:172.21.15.133:6789/0] conn(0x5567c7ff2c00 msgr2=0x5567c6e2c880 unknown :-1 s=STATE_CONNECTION_ESTABLISHED l=0).read_until injecting socket failure
2020-05-06T02:34:23.019+0000 7fc0a2269700 1 -- [v2:172.21.15.133:3301/0,v1:172.21.15.133:6790/0] >> [v2:172.21.15.133:3300/0,v1:172.21.15.133:6789/0] conn(0x5567c7ff2c00 msgr2=0x5567c6e2c880 unknown :-1 s=STATE_CONNECTION_ESTABLISHED l=0).read_bulk peer close file descriptor 31
2020-05-06T02:34:23.019+0000 7fc0a2269700 1 -- [v2:172.21.15.133:3301/0,v1:172.21.15.133:6790/0] >> [v2:172.21.15.133:3300/0,v1:172.21.15.133:6789/0] conn(0x5567c7ff2c00 msgr2=0x5567c6e2c880 unknown :-1 s=STATE_CONNECTION_ESTABLISHED l=0).read_until read failed
2020-05-06T02:34:23.019+0000 7fc0a2269700 1 --2- [v2:172.21.15.133:3301/0,v1:172.21.15.133:6790/0] >> [v2:172.21.15.133:3300/0,v1:172.21.15.133:6789/0] conn(0x5567c7ff2c00 0x5567c6e2c880 unknown :-1 s=READY pgs=4 cs=0 l=0 rx=0x5567c7b4c300 tx=0x5567c7b33920).handle_read_frame_preamble_main read frame length and tag failed r=-1 ((1) Operation not permitted)
2020-05-06T02:34:23.019+0000 7fc0a2269700 1 --2- [v2:172.21.15.133:3301/0,v1:172.21.15.133:6790/0] >> [v2:172.21.15.133:3300/0,v1:172.21.15.133:6789/0] conn(0x5567c7ff2c00 0x5567c6e2c880 unknown :-1 s=READY pgs=4 cs=0 l=0 rx=0 tx=0)._fault with nothing to send, going to standby
2020-05-06T02:34:23.019+0000 7fc09da60700 1 -- [v2:172.21.15.133:3301/0,v1:172.21.15.133:6790/0] <== mon.1 v2:172.21.15.133:3301/0 0 ==== log(1 entries from seq 4 at 2020-05-06T02:34:23.018657+0000) v1 ==== 0+0+0 (unknown 0 0 0) 0x5567c7bf8c40 con 0x5567c6ea5400
This seems to be related to the failure injection and we probably need to whitelist 'MON_DOWN' warnings in the cases where we are injecting failures.
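For reference, such an exemption is normally expressed as a teuthology suite override on the ceph task. A minimal sketch of what the proposed change might look like (the fragment name and placement are illustrative, not taken from this tracker):

```yaml
# Hypothetical override fragment for suites that inject failures.
# The ignorelist entry tells teuthology not to fail the run when a
# transient MON_DOWN health warning appears in the cluster log.
overrides:
  ceph:
    log-ignorelist:
      - \(MON_DOWN\)
```

On branches predating the ignorelist rename, the same list is spelled `log-whitelist`.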
Updated by Sridhar Seshasayee almost 4 years ago
Observing the issue during this run:
/a/sseshasa-2020-06-24_17:46:09-rados-wip-sseshasa-testing-2020-06-24-1858-distro-basic-smithi/5176239
Failure Reason:
"2020-06-24T20:49:52.611951+0000 mon.a (mon.0) 40 : cluster [WRN] Health check failed: 1/3 mons down, quorum a,b (MON_DOWN)" in cluster log
job ID: 5176239
Updated by Brad Hubbard over 3 years ago
/a/yuriw-2020-08-06_00:31:28-rados-wip-yuri8-testing-octopus-distro-basic-smithi/5290942
Updated by Deepika Upadhyay over 3 years ago
yuriw-2020-08-20_19:48:15-rados-wip-yuri-testing-2020-08-17-1723-octopus-distro-basic-smithi/5362945/
Updated by Neha Ojha over 3 years ago
- Backport set to octopus
/a/teuthology-2020-09-22_07:01:02-rados-master-distro-basic-smithi/5458722
Updated by Neha Ojha over 3 years ago
Neha Ojha wrote:
/a/teuthology-2020-09-22_07:01:02-rados-master-distro-basic-smithi/5458722
rados/thrash/{0-size-min-size-overrides/3-size-2-min-size 1-pg-log-overrides/short_pg_log 2-recovery-overrides/{more-async-recovery} backoff/peering_and_degraded ceph clusters/{fixed-2 openstack} crc-failures/default d-balancer/on mon_election/connectivity msgr-failures/osd-delay msgr/async objectstore/bluestore-comp-snappy rados supported-random-distro$/{rhel_8} thrashers/none thrashosds-health workloads/cache-pool-snaps}
Updated by Neha Ojha over 3 years ago
rados/singleton/{all/osd-recovery-incomplete mon_election/connectivity msgr-failures/many msgr/async-v1only objectstore/bluestore-comp-lz4 rados supported-random-distro$/{rhel_8}}
/a/teuthology-2020-09-22_07:01:02-rados-master-distro-basic-smithi/5458825
Updated by Neha Ojha over 3 years ago
rados/singleton/{all/osd-recovery-incomplete mon_election/connectivity msgr-failures/many msgr/async-v2only objectstore/bluestore-comp-zlib rados supported-random-distro$/{rhel_8}}
/a/teuthology-2020-09-30_07:01:02-rados-master-distro-basic-smithi/5483697
Updated by Neha Ojha over 3 years ago
rados/thrash-erasure-code-overwrites/{bluestore-bitmap ceph clusters/{fixed-2 openstack} fast/fast mon_election/connectivity msgr-failures/osd-delay rados recovery-overrides/{more-async-recovery} supported-random-distro$/{rhel_8} thrashers/morepggrow thrashosds-health workloads/ec-pool-snaps-few-objects-overwrites}
/a/teuthology-2020-10-21_07:01:02-rados-master-distro-basic-smithi/5545056
Updated by Neha Ojha over 3 years ago
rados/singleton/{all/mon-config mon_election/connectivity msgr-failures/many msgr/async objectstore/bluestore-comp-lz4 rados supported-random-distro$/{centos_8}}
/a/teuthology-2020-10-28_07:01:02-rados-master-distro-basic-smithi/5567279
Updated by Neha Ojha over 3 years ago
- Has duplicate Bug #44694: MON_DOWN during cluster setup added
Updated by Brad Hubbard over 3 years ago
/a/teuthology-2020-10-28_07:01:02-rados-master-distro-basic-smithi/5567279 shows the following at the time the failure was reported.
2020-10-28T08:59:57.904+0000 7fc52cda0700 10 mon.c@2(peon).elector(6) handle_ping mon_ping(ping_reply stamp 2020-10-28T08:59:57.904334+0000) v1
2020-10-28T08:59:57.945+0000 7fc52bd9e700 1 --2- [v2:172.21.15.188:3302/0,v1:172.21.15.188:6791/0] >> conn(0x558c45bb2400 0x558c45bb7900 unknown :-1 s=NONE pgs=0 cs=0 l=0 rev1=0 rx=0 tx=0).accept
2020-10-28T08:59:57.945+0000 7fc530da8700 1 --2- [v2:172.21.15.188:3302/0,v1:172.21.15.188:6791/0] >> conn(0x558c45bb2400 0x558c45bb7900 unknown :-1 s=BANNER_ACCEPTING pgs=0 cs=0 l=0 rev1=0 rx=0 tx=0)._handle_peer_banner_payload supported=1 required=0
2020-10-28T08:59:57.945+0000 7fc530da8700 0 -- [v2:172.21.15.188:3302/0,v1:172.21.15.188:6791/0] >> conn(0x558c45bb2400 msgr2=0x558c45bb7900 unknown :-1 s=STATE_CONNECTION_ESTABLISHED l=0).read_until injecting socket failure
2020-10-28T08:59:57.945+0000 7fc530da8700 1 -- [v2:172.21.15.188:3302/0,v1:172.21.15.188:6791/0] >> conn(0x558c45bb2400 msgr2=0x558c45bb7900 unknown :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_bulk peer close file descriptor 37
2020-10-28T08:59:57.945+0000 7fc530da8700 1 -- [v2:172.21.15.188:3302/0,v1:172.21.15.188:6791/0] >> conn(0x558c45bb2400 msgr2=0x558c45bb7900 unknown :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_until read failed
2020-10-28T08:59:57.945+0000 7fc530da8700 1 --2- [v2:172.21.15.188:3302/0,v1:172.21.15.188:6791/0] >> conn(0x558c45bb2400 0x558c45bb7900 unknown :-1 s=AUTH_ACCEPTING pgs=0 cs=0 l=1 rev1=1 rx=0 tx=0).handle_read_frame_preamble_main read frame preamble failed r=-1 ((1) Operation not permitted)
2020-10-28T08:59:57.945+0000 7fc530da8700 1 --2- [v2:172.21.15.188:3302/0,v1:172.21.15.188:6791/0] >> conn(0x558c45bb2400 0x558c45bb7900 unknown :-1 s=AUTH_ACCEPTING pgs=0 cs=0 l=1 rev1=1 rx=0 tx=0).stop
2020-10-28T08:59:57.945+0000 7fc52cda0700 10 mon.c@2(peon) e1 ms_handle_reset 0x558c45bb2400 -
2020-10-28T08:59:58.346+0000 7fc52bd9e700 1 --2- [v2:172.21.15.188:3302/0,v1:172.21.15.188:6791/0] >> conn(0x558c45bb2000 0x558c45bb7400 unknown :-1 s=NONE pgs=0 cs=0 l=0 rev1=0 rx=0 tx=0).accept
2020-10-28T08:59:58.346+0000 7fc530da8700 1 --2- [v2:172.21.15.188:3302/0,v1:172.21.15.188:6791/0] >> conn(0x558c45bb2000 0x558c45bb7400 unknown :-1 s=BANNER_ACCEPTING pgs=0 cs=0 l=0 rev1=0 rx=0 tx=0)._handle_peer_banner_payload supported=1 required=0
I guess we could try something like https://github.com/ceph/ceph/pull/30714/commits/ded58ef91d6c8a68de49fa2c6b6e01636515c59b with 'MON_DOWN'? The question is whether this is going to mask real issues, and why the 'log-ignorelist' entries haven't been required all along.
Updated by Neha Ojha over 3 years ago
rados/singleton/{all/lost-unfound-delete mon_election/connectivity msgr-failures/many msgr/async-v1only objectstore/bluestore-comp-zstd rados supported-random-distro$/{ubuntu_latest}}
/a/teuthology-2020-12-10_07:01:01-rados-master-distro-basic-smithi/5697910
Updated by Neha Ojha over 3 years ago
- Assignee changed from Brad Hubbard to Neha Ojha
Updated by Neha Ojha over 3 years ago
rados/singleton-nomsgr/{all/recovery-unfound-found mon_election/connectivity rados supported-random-distro$/{centos_8}}
/a/teuthology-2021-01-11_07:01:02-rados-master-distro-basic-smithi/5777338/
Updated by Neha Ojha over 3 years ago
The rados/singleton MON_DOWN failures with msgr-failures/many should be safe to ignore.
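For context, the msgr-failures fragments referenced throughout these runs work by raising the messenger's socket-failure injection rate, which is what produces the "injecting socket failure" lines in the logs above. A sketch of what such a fragment sets (the numeric value here is illustrative, not taken from the actual suite):

```yaml
# Hypothetical msgr-failures fragment: randomly drop established
# sockets roughly once per N messenger operations, forcing frequent
# reconnects and mon elections like the ones seen in this bug.
overrides:
  ceph:
    conf:
      global:
        ms inject socket failures: 5000   # illustrative rate
```

With injection this aggressive, a monitor can miss enough election traffic to be briefly marked down, which is why these MON_DOWN warnings are considered ignorable here.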
/a/nojha-2021-01-20_23:46:47-rados-wip-32027-distro-basic-smithi/5806233/
/a/nojha-2021-01-20_23:46:47-rados-wip-32027-distro-basic-smithi/5806155/
Updated by Deepika Upadhyay about 3 years ago
description: rados/singleton/{all/osd-recovery-incomplete msgr-failures/many msgr/async-v1only objectstore/bluestore-comp-zlib rados supported-random-distro$/{centos_latest}}
Updated by Sage Weil about 3 years ago
- Status changed from New to Fix Under Review
- Pull request ID set to 39586
Updated by Sage Weil about 3 years ago
- Backport changed from octopus to pacific,octopus
Updated by Kefu Chai about 3 years ago
- Status changed from Fix Under Review to Pending Backport
Updated by Backport Bot about 3 years ago
- Copied to Backport #49401: pacific: rados: "Health check failed: 1/3 mons down, quorum a,c (MON_DOWN)" in cluster log added
Updated by Backport Bot about 3 years ago
- Copied to Backport #49402: octopus: rados: "Health check failed: 1/3 mons down, quorum a,c (MON_DOWN)" in cluster log added
Updated by Loïc Dachary about 3 years ago
- Status changed from Pending Backport to Resolved
While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".