Project

General

Profile

Bug #45441

Updated by Brad Hubbard almost 4 years ago

 
 /a/yuriw-2020-05-05_15:20:13-rados-wip-yuri8-testing-2020-05-04-2117-octopus-distro-basic-smithi/5024888 

 <pre> 
 mon.0 = a 
 mon.1 = b 
 mon.2 = c 

 monmap e1: 3 mons at {a=[v2:172.21.15.133:3300/0,v1:172.21.15.133:6789/0],b=[v2:172.21.15.133:3301/0,v1:172.21.15.133:6790/0],c=[v2:172.21.15.133:3302/0,v1:172.21.15.133:6791/0]} 

 From the log of 'a' 

 Look for warning. 

 2020-05-06T02:34:28.022+0000 7fe456d4d700    0 log_channel(cluster) log [WRN] : Health check failed: 1/3 mons down, quorum a,c (MON_DOWN) 

 Look for previous reference to mon.1 (b) 

 2020-05-06T02:34:23.019+0000 7fe456d4d700    5 mon.a@0(electing).elector(7) handle_propose from mon.1 
 2020-05-06T02:34:23.019+0000 7fe456d4d700 10 mon.a@0(electing).elector(7) handle_propose required features 2449958747315978244 mon_feature_t([kraken,luminous,mimic,osdmap-prune,nautilus,octopus]), peer features 4540138292836696063 mon_feature_t([kraken,luminous,mimic,osdmap-prune,nautilus,octopus]) 
 2020-05-06T02:34:23.019+0000 7fe45ad55700    1 -- [v2:172.21.15.133:3300/0,v1:172.21.15.133:6789/0] >> [v2:172.21.15.133:3301/0,v1:172.21.15.133:6790/0] conn(0x5605838fcc00 msgr2=0x560582736880 secure :-1 s=STATE_CONNECTION_ESTABLISHED l=0).read_bulk peer close file descriptor 34 
 2020-05-06T02:34:23.019+0000 7fe45ad55700    1 -- [v2:172.21.15.133:3300/0,v1:172.21.15.133:6789/0] >> [v2:172.21.15.133:3301/0,v1:172.21.15.133:6790/0] conn(0x5605838fcc00 msgr2=0x560582736880 secure :-1 s=STATE_CONNECTION_ESTABLISHED l=0).read_until read failed 
 2020-05-06T02:34:23.019+0000 7fe45ad55700    1 --2- [v2:172.21.15.133:3300/0,v1:172.21.15.133:6789/0] >> [v2:172.21.15.133:3301/0,v1:172.21.15.133:6790/0] conn(0x5605838fcc00 0x560582736880 secure :-1 s=READY pgs=4 cs=0 l=0 rx=0x5605834560f0 tx=0x56058343d920).handle_read_frame_preamble_main read frame length and tag failed r=-1 ((1) Operation not permitted) 
 2020-05-06T02:34:23.019+0000 7fe45ad55700    1 --2- [v2:172.21.15.133:3300/0,v1:172.21.15.133:6789/0] >> [v2:172.21.15.133:3301/0,v1:172.21.15.133:6790/0] conn(0x5605838fcc00 0x560582736880 unknown :-1 s=READY pgs=4 cs=0 l=0 rx=0 tx=0)._fault with nothing to send, going to standby 
 2020-05-06T02:34:23.020+0000 7fe455d4b700    1 --2- [v2:172.21.15.133:3300/0,v1:172.21.15.133:6789/0] >>    conn(0x5605835e2c00 0x560583472d80 unknown :-1 s=NONE pgs=0 cs=0 l=0 rx=0 tx=0).accept 
 2020-05-06T02:34:23.020+0000 7fe45ad55700    1 --2- [v2:172.21.15.133:3300/0,v1:172.21.15.133:6789/0] >>    conn(0x5605835e2c00 0x560583472d80 unknown :-1 s=BANNER_ACCEPTING pgs=0 cs=0 l=0 rx=0 tx=0)._handle_peer_banner_payload supported=0 required=0 
 2020-05-06T02:34:23.020+0000 7fe45ad55700    0 -- [v2:172.21.15.133:3300/0,v1:172.21.15.133:6789/0] >>    conn(0x5605835e2c00 msgr2=0x560583472d80 unknown :-1 s=STATE_CONNECTION_ESTABLISHED l=0).read_until injecting socket failure 
 2020-05-06T02:34:23.020+0000 7fe45ad55700    1 -- [v2:172.21.15.133:3300/0,v1:172.21.15.133:6789/0] >>    conn(0x5605835e2c00 msgr2=0x560583472d80 unknown :-1 s=STATE_CONNECTION_ESTABLISHED l=0).read_bulk peer close file descriptor 34 
 2020-05-06T02:34:23.020+0000 7fe45ad55700    1 -- [v2:172.21.15.133:3300/0,v1:172.21.15.133:6789/0] >>    conn(0x5605835e2c00 msgr2=0x560583472d80 unknown :-1 s=STATE_CONNECTION_ESTABLISHED l=0).read_until read failed 
 2020-05-06T02:34:23.020+0000 7fe45ad55700    1 --2- [v2:172.21.15.133:3300/0,v1:172.21.15.133:6789/0] >>    conn(0x5605835e2c00 0x560583472d80 unknown :-1 s=AUTH_ACCEPTING pgs=0 cs=0 l=0 rx=0 tx=0).handle_read_frame_preamble_main read frame length and tag failed r=-1 ((1) Operation not permitted) 
 2020-05-06T02:34:23.020+0000 7fe45ad55700    1 --2- [v2:172.21.15.133:3300/0,v1:172.21.15.133:6789/0] >>    conn(0x5605835e2c00 0x560583472d80 unknown :-1 s=AUTH_ACCEPTING pgs=0 cs=0 l=0 rx=0 tx=0).stop 
 2020-05-06T02:34:23.020+0000 7fe456d4d700 10 mon.a@0(electing) e1 ms_handle_reset 0x5605835e2c00 - 

 From the log of 'b' 

 2020-05-06T02:34:23.018+0000 7fc09da60700 10 paxos.1).electionLogic(6) bump_epoch6 to 7 
 2020-05-06T02:34:23.018+0000 7fc0a2269700    1 --2- [v2:172.21.15.133:3301/0,v1:172.21.15.133:6790/0] >> [v2:172.21.15.133:3300/0,v1:172.21.15.133:6789/0] conn(0x5567c7ff2c00 0x5567c6e2c880 unknown :-1 s=THROTTLE_DONE pgs=4 cs=0 l=0 rx=0x5567c7b4c300 tx=0x5567c7b33920).handle_read_frame_epilogue_main read frame epilogue bytes=32 
 2020-05-06T02:34:23.019+0000 7fc09da60700 10 mon.b@1(electing) e1 join_election 
 2020-05-06T02:34:23.019+0000 7fc09da60700 10 mon.b@1(electing) e1 _reset 
 2020-05-06T02:34:23.019+0000 7fc09da60700 10 mon.b@1(electing).auth v2 _set_mon_num_rank num 0 rank 0 
 2020-05-06T02:34:23.019+0000 7fc09da60700 10 mon.b@1(electing) e1 cancel_probe_timeout (none scheduled) 
 2020-05-06T02:34:23.019+0000 7fc09da60700 10 mon.b@1(electing) e1 timecheck_finish 
 2020-05-06T02:34:23.019+0000 7fc09da60700 15 mon.b@1(electing) e1 health_tick_stop 
 2020-05-06T02:34:23.019+0000 7fc09da60700 15 mon.b@1(electing) e1 health_interval_stop 
 2020-05-06T02:34:23.019+0000 7fc09da60700 10 mon.b@1(electing) e1 scrub_event_cancel 
 2020-05-06T02:34:23.019+0000 7fc09da60700 10 mon.b@1(electing) e1 scrub_reset 
 2020-05-06T02:34:23.019+0000 7fc09da60700 10 mon.b@1(electing).paxos(paxos recovering c 1..8) restart -- canceling timeouts 
 2020-05-06T02:34:23.019+0000 7fc09da60700 10 mon.b@1(electing).paxosservice(mdsmap 1..1) restart 
 2020-05-06T02:34:23.019+0000 7fc09da60700 10 mon.b@1(electing).paxosservice(osdmap 1..2) restart 
 2020-05-06T02:34:23.019+0000 7fc09da60700 10 mon.b@1(electing).paxosservice(logm 1..3) restart 
 2020-05-06T02:34:23.019+0000 7fc09da60700 10 mon.b@1(electing).paxosservice(monmap 1..1) restart 
 2020-05-06T02:34:23.019+0000 7fc09da60700 10 mon.b@1(electing).paxosservice(auth 1..2) restart 
 2020-05-06T02:34:23.019+0000 7fc09da60700 10 mon.b@1(electing).paxosservice(mgr 1..1) restart 
 2020-05-06T02:34:23.019+0000 7fc09da60700 10 mon.b@1(electing).paxosservice(mgrstat 1..1) restart 
 2020-05-06T02:34:23.019+0000 7fc09da60700 10 mon.b@1(electing).paxosservice(health 1..1) restart 
 2020-05-06T02:34:23.019+0000 7fc09da60700 10 mon.b@1(electing).paxosservice(config 1..2) restart 
 2020-05-06T02:34:23.019+0000 7fc09da60700    1 -- [v2:172.21.15.133:3301/0,v1:172.21.15.133:6790/0] send_to--> mon [v2:172.21.15.133:3300/0,v1:172.21.15.133:6789/0] -- election(a2feb1eb-2f05-4b4f-bae6-96248acb7c3c propose rel 15 e7) v8 -- ?+0 0x5567c6dcdb00 
 2020-05-06T02:34:23.019+0000 7fc09da60700    1 -- [v2:172.21.15.133:3301/0,v1:172.21.15.133:6790/0] --> [v2:172.21.15.133:3300/0,v1:172.21.15.133:6789/0] -- election(a2feb1eb-2f05-4b4f-bae6-96248acb7c3c propose rel 15 e7) v8 -- 0x5567c6dcdb00 con 0x5567c7ff2c00 
 2020-05-06T02:34:23.019+0000 7fc09da60700    1 -- [v2:172.21.15.133:3301/0,v1:172.21.15.133:6790/0] send_to--> mon [v2:172.21.15.133:3302/0,v1:172.21.15.133:6791/0] -- election(a2feb1eb-2f05-4b4f-bae6-96248acb7c3c propose rel 15 e7) v8 -- ?+0 0x5567c7c2e6c0 
 2020-05-06T02:34:23.019+0000 7fc09da60700    1 -- [v2:172.21.15.133:3301/0,v1:172.21.15.133:6790/0] --> [v2:172.21.15.133:3302/0,v1:172.21.15.133:6791/0] -- election(a2feb1eb-2f05-4b4f-bae6-96248acb7c3c propose rel 15 e7) v8 -- 0x5567c7c2e6c0 con 0x5567c7ff2000 
 2020-05-06T02:34:23.019+0000 7fc09ca5e700    1 --2- [v2:172.21.15.133:3301/0,v1:172.21.15.133:6790/0] >> [v2:172.21.15.133:3302/0,v1:172.21.15.133:6791/0] conn(0x5567c7ff2000 0x5567c6e2cd00 secure :-1 s=READY pgs=6 cs=0 l=0 rx=0x5567c7b4c0f0 tx=0x5567c7b92000).handle_read_frame_epilogue_main read frame epilogue bytes=32 
 2020-05-06T02:34:23.019+0000 7fc0a2269700    1 --2- [v2:172.21.15.133:3301/0,v1:172.21.15.133:6790/0] >> [v2:172.21.15.133:3300/0,v1:172.21.15.133:6789/0] conn(0x5567c7ff2c00 0x5567c6e2c880 unknown :-1 s=READY pgs=4 cs=0 l=0 rx=0x5567c7b4c300 tx=0x5567c7b33920).handle_read_frame_epilogue_main read frame epilogue bytes=32 
 2020-05-06T02:34:23.019+0000 7fc0a2269700    0 -- [v2:172.21.15.133:3301/0,v1:172.21.15.133:6790/0] >> [v2:172.21.15.133:3300/0,v1:172.21.15.133:6789/0] conn(0x5567c7ff2c00 msgr2=0x5567c6e2c880 unknown :-1 s=STATE_CONNECTION_ESTABLISHED l=0).read_until injecting socket failure 
 2020-05-06T02:34:23.019+0000 7fc0a2269700    1 -- [v2:172.21.15.133:3301/0,v1:172.21.15.133:6790/0] >> [v2:172.21.15.133:3300/0,v1:172.21.15.133:6789/0] conn(0x5567c7ff2c00 msgr2=0x5567c6e2c880 unknown :-1 s=STATE_CONNECTION_ESTABLISHED l=0).read_bulk peer close file descriptor 31 
 2020-05-06T02:34:23.019+0000 7fc0a2269700    1 -- [v2:172.21.15.133:3301/0,v1:172.21.15.133:6790/0] >> [v2:172.21.15.133:3300/0,v1:172.21.15.133:6789/0] conn(0x5567c7ff2c00 msgr2=0x5567c6e2c880 unknown :-1 s=STATE_CONNECTION_ESTABLISHED l=0).read_until read failed 
 2020-05-06T02:34:23.019+0000 7fc0a2269700    1 --2- [v2:172.21.15.133:3301/0,v1:172.21.15.133:6790/0] >> [v2:172.21.15.133:3300/0,v1:172.21.15.133:6789/0] conn(0x5567c7ff2c00 0x5567c6e2c880 unknown :-1 s=READY pgs=4 cs=0 l=0 rx=0x5567c7b4c300 tx=0x5567c7b33920).handle_read_frame_preamble_main read frame length and tag failed r=-1 ((1) Operation not permitted) 
 2020-05-06T02:34:23.019+0000 7fc0a2269700    1 --2- [v2:172.21.15.133:3301/0,v1:172.21.15.133:6790/0] >> [v2:172.21.15.133:3300/0,v1:172.21.15.133:6789/0] conn(0x5567c7ff2c00 0x5567c6e2c880 unknown :-1 s=READY pgs=4 cs=0 l=0 rx=0 tx=0)._fault with nothing to send, going to standby 
 2020-05-06T02:34:23.019+0000 7fc09da60700    1 -- [v2:172.21.15.133:3301/0,v1:172.21.15.133:6790/0] <== mon.1 v2:172.21.15.133:3301/0 0 ==== log(1 entries from seq 4 at 2020-05-06T02:34:23.018657+0000) v1 ==== 0+0+0 (unknown 0 0 0) 0x5567c7bf8c40 con 0x5567c6ea5400 
 </pre> 

 This seems to be related to the failure injection and we probably need to whitelist 'MON_DOWN' warnings in the cases where we are injecting failures.

Back