Bug #20910
closedspurious MON_DOWN, apparently slow/laggy mon
0%
Description
mon shows very slow progress for ~10 seconds, failing to send lease renewals etc, and triggering an election
2017-08-04 09:12:32.645368 7f009b9cd700 10 mon.b@0(leader).paxos(paxos updating c 1..181) handle_accept paxos(accept lc 181 fc 0 pn 300 opn 0) v4 2017-08-04 09:12:32.645371 7f009b9cd700 10 mon.b@0(leader).paxos(paxos updating c 1..181) now 0,1,2 have accepted 2017-08-04 09:12:32.645373 7f009b9cd700 10 mon.b@0(leader).paxos(paxos updating c 1..181) got majority, committing, done with update 2017-08-04 09:12:32.645375 7f009b9cd700 10 mon.b@0(leader).paxos(paxos updating c 1..181) commit_start 182 2017-08-04 09:12:32.699610 7f00989c7700 1 -- 172.21.15.184:6789/0 >> - conn(0x559a31f37000 :6789 s=STATE_ACCEPTING pgs=0 cs=0 l=0)._process_connection sd=40 - 2017-08-04 09:12:32.699765 7f00989c7700 10 mon.b@0(leader) e1 ms_verify_authorizer 172.21.15.194:0/270168725 client protocol 0 2017-08-04 09:12:32.699775 7f00989c7700 10 In get_auth_session_handler for protocol 0 2017-08-04 09:12:32.699912 7f009b9cd700 1 -- 172.21.15.184:6789/0 <== client.? 172.21.15.194:0/270168725 1 ==== auth(proto 0 30 bytes epoch 0) v1 ==== 60+0+0 (673663173 0 0) 0x559a31c2ec80 con 0x559a31f37000 2017-08-04 09:12:32.699932 7f009b9cd700 10 mon.b@0(leader) e1 _ms_dispatch new session 0x559a32837d40 MonSession(client.? 172.21.15.194:0/270168725 is open ) features 0x1ffddff8eea4fffb 2017-08-04 09:12:32.699937 7f009b9cd700 20 mon.b@0(leader) e1 caps 2017-08-04 09:12:32.699939 7f009b9cd700 10 mon.b@0(leader).paxosservice(auth 1..7) dispatch 0x559a31c2ec80 auth(proto 0 30 bytes epoch 0) v1 from client.? 172.21.15.194:0/270168725 con 0x559a31f37000 2017-08-04 09:12:32.699945 7f009b9cd700 5 mon.b@0(leader).paxos(paxos writing c 1..181) is_readable = 0 - now=2017-08-04 09:12:32.699946 lease_expire=2017-08-04 09:12:32.350979 has v0 lc 181 2017-08-04 09:12:32.699965 7f009b9cd700 10 mon.b@0(leader).paxosservice(auth 1..7) waiting for paxos -> readable (v0) 2017-08-04 09:12:32.699972 7f009b9cd700 5 mon.b@0(leader).paxos(paxos writing c 1..181) is_readable = 0 - now=2017-08-04 09:12:32.699972 lease_expire=2017-08-04 09:12:32.350979 has v0 lc 181 2017-08-04 09:12:34.824279 7f00971c4700 20 mon.b@0(leader).paxos(paxos writing c 1..181) commit_finish 182 2017-08-04 09:12:34.824323 7f00971c4700 10 mon.b@0(leader).paxos(paxos writing c 1..182) sending commit to mon.1 2017-08-04 09:12:34.824335 7f00971c4700 1 -- 172.21.15.184:6789/0 _send_message--> mon.1 172.21.15.194:6789/0 -- paxos(commit lc 182 fc 0 pn 300 opn 0) v4 -- ?+0 0x559a32864080 2017-08-04 09:12:34.824347 7f00971c4700 1 -- 172.21.15.184:6789/0 --> 172.21.15.194:6789/0 -- paxos(commit lc 182 fc 0 pn 300 opn 0) v4 -- 0x559a32864080 con 0 2017-08-04 09:12:34.824361 7f00971c4700 10 mon.b@0(leader).paxos(paxos writing c 1..182) sending commit to mon.2 2017-08-04 09:12:34.824365 7f00971c4700 1 -- 172.21.15.184:6789/0 _send_message--> mon.2 172.21.15.194:6790/0 -- paxos(commit lc 182 fc 0 pn 300 opn 0) v4 -- ?+0 0x559a32864700 2017-08-04 09:12:34.824369 7f00971c4700 1 -- 172.21.15.184:6789/0 --> 172.21.15.194:6790/0 -- paxos(commit lc 182 fc 0 pn 300 opn 0) v4 -- 0x559a32864700 con 0 2017-08-04 09:12:34.824379 7f00971c4700 10 mon.b@0(leader) e1 refresh_from_paxos 2017-08-04 09:12:34.824398 7f00971c4700 10 mon.b@0(leader).paxosservice(pgmap 1..59) refresh 2017-08-04 09:12:34.824406 7f00979c5700 10 _calc_signature seq 630 front_crc_ = 3022017222 middle_crc = 0 data_crc = 0 sig = 10246470789152800045 2017-08-04 09:12:34.824406 7f00981c6700 10 _calc_signature seq 876 front_crc_ = 1529781274 middle_crc = 0 data_crc = 0 sig = 87697745521067598 2017-08-04 09:12:34.824412 7f00981c6700 20 Putting signature in client message(seq # 876): sig = 87697745521067598 2017-08-04 09:12:34.824417 7f00979c5700 20 Putting signature in client message(seq # 630): sig = 10246470789152800045 2017-08-04 09:12:34.824427 7f00971c4700 10 mon.b@0(leader).paxosservice(mdsmap 1..1) refresh 2017-08-04 09:12:34.824441 7f00971c4700 10 mon.b@0(leader).paxosservice(osdmap 1..39) refresh 2017-08-04 09:12:34.824443 7f00971c4700 15 mon.b@0(leader).osd e38 update_from_paxos paxos e 39, my e 38 2017-08-04 09:12:34.824465 7f00971c4700 7 mon.b@0(leader).osd e38 update_from_paxos loading creating_pgs last_scan_epoch 38 with 0 pgs 2017-08-04 09:12:34.824467 7f00971c4700 10 mon.b@0(leader).osd e38 update_from_paxos pgservice is pg 2017-08-04 09:12:34.824477 7f00971c4700 7 mon.b@0(leader).osd e38 update_from_paxos applying incremental 39 2017-08-04 09:12:34.824538 7f00971c4700 1 mon.b@0(leader).osd e39 e39: 6 total, 6 up, 5 in 2017-08-04 09:12:34.824540 7f00971c4700 10 mon.b@0(leader).osd e39 update_from_paxos pgservice is pg 2017-08-04 09:12:35.482412 7f009b9cd700 1 -- 172.21.15.184:6789/0 <== osd.2 172.21.15.194:6808/19339 50 ==== pg_stats(0 pgs tid 21 v 0) v1 ==== 152+0+0 (2273931122 0 0) 0x559a329e3000 con 0x559a31e6d800 2017-08-04 09:12:35.984346 7f009a9cb700 1 -- 172.21.15.184:0/48971 --> 172.21.15.194:6800/19186 -- mgrreport(unknown.b +0-0 packed 1734) v4 -- 0x559a329ff500 con 0 2017-08-04 09:12:35.984451 7f00989c7700 10 _calc_signature seq 23 front_crc_ = 363387025 middle_crc = 0 data_crc = 0 sig = 1319237725359407097 2017-08-04 09:12:35.984461 7f00989c7700 20 Putting signature in client message(seq # 23): sig = 1319237725359407097 2017-08-04 09:12:36.889334 7f00981c6700 1 -- 172.21.15.184:6789/0 >> 172.21.15.194:0/4037222019 conn(0x559a32899800 :6789 s=STATE_OPEN pgs=383 cs=1 l=1).tick idle(6471609) more than 5000000 us, mark self fault. 2017-08-04 09:12:36.889354 7f00981c6700 1 -- 172.21.15.184:6789/0 >> 172.21.15.194:0/4037222019 conn(0x559a32899800 :6789 s=STATE_OPEN pgs=383 cs=1 l=1).fault on lossy channel, failing 2017-08-04 09:12:36.889985 7f00981c6700 1 -- 172.21.15.184:6789/0 >> - conn(0x559a32052800 :6789 s=STATE_ACCEPTING pgs=0 cs=0 l=0)._process_connection sd=33 - 2017-08-04 09:12:36.890277 7f00981c6700 10 mon.b@0(leader) e1 ms_verify_authorizer 172.21.15.194:0/4037222019 client protocol 0 2017-08-04 09:12:36.890293 7f00981c6700 10 In get_auth_session_handler for protocol 0 2017-08-04 09:12:37.390102 7f00979c5700 1 -- 172.21.15.184:6789/0 >> 172.21.15.194:0/387627644 conn(0x559a3283f000 :6789 s=STATE_OPEN pgs=1 cs=1 l=1).read_bulk peer close file descriptor 39 2017-08-04 09:12:37.390117 7f00979c5700 1 -- 172.21.15.184:6789/0 >> 172.21.15.194:0/387627644 conn(0x559a3283f000 :6789 s=STATE_OPEN pgs=1 cs=1 l=1).read_until read failed 2017-08-04 09:12:37.390120 7f00979c5700 1 -- 172.21.15.184:6789/0 >> 172.21.15.194:0/387627644 conn(0x559a3283f000 :6789 s=STATE_OPEN pgs=1 cs=1 l=1).process read tag failed 2017-08-04 09:12:37.390122 7f00979c5700 1 -- 172.21.15.184:6789/0 >> 172.21.15.194:0/387627644 conn(0x559a3283f000 :6789 s=STATE_OPEN pgs=1 cs=1 l=1).fault on lossy channel, failing 2017-08-04 09:12:37.680954 7f00981c6700 10 _calc_signature seq 950 front_crc_ = 4210009488 middle_crc = 0 data_crc = 0 sig = 3498222513807791856 2017-08-04 09:12:37.681138 7f00979c5700 10 _calc_signature seq 459 front_crc_ = 117879346 middle_crc = 0 data_crc = 0 sig = 13361952649768186819 2017-08-04 09:12:37.701330 7f00989c7700 1 -- 172.21.15.184:6789/0 >> 172.21.15.194:0/270168725 conn(0x559a31f37000 :6789 s=STATE_OPEN pgs=1 cs=1 l=1).tick idle(5003707) more than 5000000 us, mark self fault. 2017-08-04 09:12:37.701351 7f00989c7700 1 -- 172.21.15.184:6789/0 >> 172.21.15.194:0/270168725 conn(0x559a31f37000 :6789 s=STATE_OPEN pgs=1 cs=1 l=1).fault on lossy channel, failing 2017-08-04 09:12:37.711773 7f00981c6700 10 _calc_signature seq 951 front_crc_ = 4148781320 middle_crc = 0 data_crc = 0 sig = 11699419807093599941 2017-08-04 09:12:37.712053 7f00979c5700 10 _calc_signature seq 460 front_crc_ = 4148781320 middle_crc = 0 data_crc = 0 sig = 13382070068259793084 2017-08-04 09:12:40.322037 7f00981c6700 1 -- 172.21.15.184:6789/0 >> 172.21.15.194:6801/19337 conn(0x559a32898000 :6789 s=STATE_OPEN pgs=45 cs=1 l=1).read_bulk peer close file descriptor 38 2017-08-04 09:12:40.322055 7f00981c6700 1 -- 172.21.15.184:6789/0 >> 172.21.15.194:6801/19337 conn(0x559a32898000 :6789 s=STATE_OPEN pgs=45 cs=1 l=1).read_until read failed 2017-08-04 09:12:40.322057 7f00981c6700 1 -- 172.21.15.184:6789/0 >> 172.21.15.194:6801/19337 conn(0x559a32898000 :6789 s=STATE_OPEN pgs=45 cs=1 l=1).process read tag failed 2017-08-04 09:12:40.322060 7f00981c6700 1 -- 172.21.15.184:6789/0 >> 172.21.15.194:6801/19337 conn(0x559a32898000 :6789 s=STATE_OPEN pgs=45 cs=1 l=1).fault on lossy channel, failing 2017-08-04 09:12:40.348760 7f00971c4700 5 mon.b@0(leader).paxos(paxos refresh c 1..182) is_readable = 0 - now=2017-08-04 09:12:40.348766 lease_expire=0.000000 has v0 lc 182 2017-08-04 09:12:40.348783 7f00971c4700 10 mon.b@0(leader).pg v59 check_osd_map -- osdmap not readable, waiting 2017-08-04 09:12:40.348786 7f00971c4700 10 mon.b@0(leader).osd e39 check_osdmap_subs
this leads to
"2017-08-04 09:12:47.719241 mon.a mon.1 172.21.15.194:6789/0 255 : cluster [WRN] Health check failed: 1/3 mons down, quorum a,c (MON_DOWN)" in cluster log
/a/sage-2017-08-04_05:26:53-rados-wip-sage-testing2-20170803b-distro-basic-smithi/1482337
similar instance:
/a/pdonnell-2017-08-03_09:07:41-multimds-wip-pdonnell-testing-20170802b-testing-basic-smithi/1478633
2017-08-03 09:59:16.314260 7feaf65d3700 10 mon.b@0(leader).health tick 2017-08-03 09:59:16.314262 7feaf65d3700 20 mon.b@0(leader).health check_member_health 2017-08-03 09:59:16.314364 7feaf65d3700 10 mon.b@0(leader).health check_member_health avail 93% total 916 GB, used 14124 MB, avail 856 GB 2017-08-03 09:59:16.314368 7feaf65d3700 20 mon.b@0(leader).health check_leader_health 2017-08-03 09:59:16.314376 7feaf65d3700 20 mon.b@0(leader) e1 sync_trim_providers 2017-08-03 09:59:17.326964 7feaf3dce700 1 -- 172.21.15.44:6789/0 <== client.4098 172.21.15.157:0/215587146 310 ==== mgrbeacon mgr.y(011bc71d-db3f-45ab-926e-bbf337d32ef3,4098, 172.21.15.157:6800/15683, 1) v5 ==== 621+0+0 (2036550175 0 0) 0x2a51ee7b80 con 0x2a51bc9000 2017-08-03 09:59:17.327008 7feaf3dce700 20 mon.b@0(leader) e1 _ms_dispatch existing session 0x2a5112e780 for client.4098 172.21.15.157:0/215587146 2017-08-03 09:59:17.327017 7feaf3dce700 20 mon.b@0(leader) e1 caps allow * 2017-08-03 09:59:17.327021 7feaf3dce700 10 mon.b@0(leader).paxosservice(mgr 1..3) dispatch 0x2a51ee7b80 mgrbeacon mgr.y(011bc71d-db3f-45ab-926e-bbf337d32ef3,4098, 172.21.15.157:6800/15683, 1) v5 from client.4098 172.21.15.157:0/215587146 con 0x2a51bc9000 2017-08-03 09:59:17.327032 7feaf3dce700 5 mon.b@0(leader).paxos(paxos active c 1..173) is_readable = 1 - now=2017-08-03 09:59:17.327033 lease_expire=2017-08-03 09:59:20.488499 has v0 lc 173 2017-08-03 09:59:17.327043 7feaf3dce700 4 mon.b@0(leader).mgr e3 beacon from 4098 2017-08-03 09:59:17.327047 7feaf3dce700 20 is_capable service=mgr command= exec on cap allow * 2017-08-03 09:59:17.327049 7feaf3dce700 20 allow so far , doing grant allow * 2017-08-03 09:59:17.327051 7feaf3dce700 20 allow all 2017-08-03 09:59:17.327053 7feaf3dce700 4 mon.b@0(leader).mgr e3 beacon from 4098 2017-08-03 09:59:17.327058 7feaf3dce700 10 mon.b@0(leader).mgr e3 no change 2017-08-03 09:59:17.327074 7feaf3dce700 1 -- 172.21.15.44:6789/0 <== client.4098 172.21.15.157:0/215587146 311 ==== monmgrreport(0 checks) v1 ==== 32+0+2996 (4242266613 0 4169711827) 0x2a51e1ca80 con 0x2a51bc9000 2017-08-03 09:59:17.327085 7feaf3dce700 20 mon.b@0(leader) e1 _ms_dispatch existing session 0x2a5112e780 for client.4098 172.21.15.157:0/215587146 2017-08-03 09:59:17.327090 7feaf3dce700 20 mon.b@0(leader) e1 caps allow * 2017-08-03 09:59:17.327093 7feaf3dce700 10 mon.b@0(leader).paxosservice(mgrstat 1..78) dispatch 0x2a51e1ca80 monmgrreport(0 checks) v1 from client.4098 172.21.15.157:0/215587146 con 0x2a51bc9000 2017-08-03 09:59:17.327099 7feaf3dce700 5 mon.b@0(leader).paxos(paxos active c 1..173) is_readable = 1 - now=2017-08-03 09:59:17.327099 lease_expire=2017-08-03 09:59:20.488499 has v0 lc 173 2017-08-03 09:59:17.327132 7feaf3dce700 10 mon.b@0(leader).mgrstat prepare_report 20 pgs: 20 active+clean; 260 MB data, 1935 MB used, 556 GB / 558 GB avail; 1640 kB/s wr, 208 op/s, 0 health checks 2017-08-03 09:59:17.327152 7feaf3dce700 10 mon.b@0(leader).paxosservice(mgrstat 1..78) setting proposal_timer 0x2a515b5da0 with delay of 0.05 2017-08-03 09:59:17.377276 7feaf65d3700 10 mon.b@0(leader).paxosservice(mgrstat 1..78) propose_pending 2017-08-03 09:59:17.377335 7feaf65d3700 10 mon.b@0(leader).mgrstat 79 2017-08-03 09:59:17.377369 7feaf65d3700 10 mon.b@0(leader) e1 log_health updated 0 previous 0 2017-08-03 09:59:17.377374 7feaf65d3700 5 mon.b@0(leader).paxos(paxos active c 1..173) queue_pending_finisher 0x2a510e6bf0 2017-08-03 09:59:17.377379 7feaf65d3700 10 mon.b@0(leader).paxos(paxos active c 1..173) trigger_propose active, proposing now 2017-08-03 09:59:17.377387 7feaf65d3700 10 mon.b@0(leader).paxos(paxos active c 1..173) propose_pending 174 3174 bytes 2017-08-03 09:59:17.377390 7feaf65d3700 10 mon.b@0(leader).paxos(paxos updating c 1..173) begin for 174 3174 bytes 2017-08-03 09:59:17.501171 7feaf3dce700 1 -- 172.21.15.44:6789/0 <== client.4104 172.21.15.44:0/3702801273 100 ==== mgrbeacon mgr.x(011bc71d-db3f-45ab-926e-bbf337d32ef3,4104, -, 0) v5 ==== 633+0+0 (3938939177 0 0) 0x2a513e9440 con 0x2a5117b800 2017-08-03 09:59:17.766788 7feaf2dcc700 1 -- 172.21.15.44:0/15872 --> 172.21.15.157:6800/15683 -- mgrreport(unknown.b +0-0 packed 1734) v4 -- 0x2a513e8100 con 0 2017-08-03 09:59:17.766934 7feaf05c7700 10 _calc_signature seq 39 front_crc_ = 25535130 middle_crc = 0 data_crc = 0 sig = 13990729107905770019 2017-08-03 09:59:17.766951 7feaf05c7700 20 Putting signature in client message(seq # 39): sig = 13990729107905770019 2017-08-03 09:59:20.075136 7feaefdc6700 10 _calc_signature seq 592 front_crc_ = 2125601142 middle_crc = 0 data_crc = 0 sig = 12043567740073041252 2017-08-03 09:59:20.095056 7feaf0dc8700 10 _calc_signature seq 600 front_crc_ = 3392420486 middle_crc = 0 data_crc = 0 sig = 5179430677477228157 2017-08-03 09:59:20.098064 7feaefdc6700 10 _calc_signature seq 593 front_crc_ = 680283230 middle_crc = 0 data_crc = 0 sig = 9515446695652356164 2017-08-03 09:59:20.102817 7feaefdc6700 10 _calc_signature seq 594 front_crc_ = 3236840030 middle_crc = 0 data_crc = 0 sig = 4269289248682593417 2017-08-03 09:59:20.104921 7feaf0dc8700 10 _calc_signature seq 601 front_crc_ = 4094339008 middle_crc = 0 data_crc = 0 sig = 16585562122699640275 2017-08-03 09:59:20.105831 7feaf0dc8700 10 _calc_signature seq 602 front_crc_ = 576293066 middle_crc = 0 data_crc = 0 sig = 13557405708396481067 2017-08-03 09:59:20.107283 7feaf0dc8700 10 _calc_signature seq 603 front_crc_ = 1154793487 middle_crc = 0 data_crc = 0 sig = 4662769214294276591 2017-08-03 09:59:20.112378 7feaefdc6700 10 _calc_signature seq 595 front_crc_ = 3730037593 middle_crc = 0 data_crc = 0 sig = 2099837470117633096 2017-08-03 09:59:22.767044 7feaf2dcc700 1 -- 172.21.15.44:0/15872 --> 172.21.15.157:6800/15683 -- mgrreport(unknown.b +0-0 packed 1734) v4 -- 0x2a513e78c0 con 0 2017-08-03 09:59:22.767128 7feaf05c7700 10 _calc_signature seq 40 front_crc_ = 2268122717 middle_crc = 0 data_crc = 0 sig = 2521156108740750434 2017-08-03 09:59:22.767137 7feaf05c7700 20 Putting signature in client message(seq # 40): sig = 2521156108740750434 2017-08-03 09:59:25.535179 7feaefdc6700 10 _calc_signature seq 596 front_crc_ = 913895036 middle_crc = 0 data_crc = 0 sig = 12951353490257451475 2017-08-03 09:59:25.535213 7feaf0dc8700 10 _calc_signature seq 604 front_crc_ = 3415341534 middle_crc = 0 data_crc = 0 sig = 14321213010753324442 2017-08-03 09:59:27.076294 7feaefdc6700 10 _calc_signature seq 597 front_crc_ = 21526225 middle_crc = 0 data_crc = 0 sig = 17499439560101081472 2017-08-03 09:59:27.076294 7feaf0dc8700 10 _calc_signature seq 605 front_crc_ = 21526225 middle_crc = 0 data_crc = 0 sig = 2950429098881522129 2017-08-03 09:59:27.767251 7feaf2dcc700 1 -- 172.21.15.44:0/15872 --> 172.21.15.157:6800/15683 -- mgrreport(unknown.b +0-0 packed 1734) v4 -- 0x2a513e6b00 con 0 2017-08-03 09:59:27.767363 7feaf05c7700 10 _calc_signature seq 41 front_crc_ = 1062427806 middle_crc = 0 data_crc = 0 sig = 6942056563840595773 2017-08-03 09:59:27.767381 7feaf05c7700 20 Putting signature in client message(seq # 41): sig = 6942056563840595773 2017-08-03 09:59:32.767657 7feaf2dcc700 1 -- 172.21.15.44:0/15872 --> 172.21.15.157:6800/15683 -- mgrreport(unknown.b +0-0 packed 1734) v4 -- 0x2a513e99c0 con 0 2017-08-03 09:59:32.767824 7feaf05c7700 10 _calc_signature seq 42 front_crc_ = 3708880217 middle_crc = 0 data_crc = 0 sig = 5546604638538954251 2017-08-03 09:59:32.767848 7feaf05c7700 20 Putting signature in client message(seq # 42): sig = 5546604638538954251 2017-08-03 09:59:32.894138 7feaf65d3700 10 mon.b@0(leader).paxos(paxos updating c 1..173) sending begin to mon.1 2017-08-03 09:59:32.894171 7feaf65d3700 1 -- 172.21.15.44:6789/0 _send_message--> mon.1 172.21.15.157:6789/0 -- paxos(begin lc 173 fc 0 pn 500 opn 0) v4 -- ?+0 0x2a51fb2d00 2017-08-03 09:59:32.894187 7feaf65d3700 1 -- 172.21.15.44:6789/0 --> 172.21.15.157:6789/0 -- paxos(begin lc 173 fc 0 pn 500 opn 0) v4 -- 0x2a51fb2d00 con 0 2017-08-03 09:59:32.894221 7feaf65d3700 10 mon.b@0(leader).paxos(paxos updating c 1..173) sending begin to mon.2 2017-08-03 09:59:32.894227 7feaf65d3700 1 -- 172.21.15.44:6789/0 _send_message--> mon.2 172.21.15.157:6790/0 -- paxos(begin lc 173 fc 0 pn 500 opn 0) v4 -- ?+0 0x2a51c93c00 2017-08-03 09:59:32.894234 7feaf65d3700 1 -- 172.21.15.44:6789/0 --> 172.21.15.157:6790/0 -- paxos(begin lc 173 fc 0 pn 500 opn 0) v4 -- 0x2a51c93c00 con 0 2017-08-03 09:59:32.894335 7feaf0dc8700 10 _calc_signature seq 747 front_crc_ = 715930277 middle_crc = 0 data_crc = 0 sig = 15386536781051375182 2017-08-03 09:59:32.894336 7feaefdc6700 10 _calc_signature seq 731 front_crc_ = 3439589549 middle_crc = 0 data_crc = 0 sig = 9333911642960427553 2017-08-03 09:59:32.894326 7feaf3dce700 20 mon.b@0(leader) e1 _ms_dispatch existing session 0x2a5112e480 for client.? 172.21.15.44:0/3702801273 2017-08-03 09:59:32.894345 7feaf0dc8700 20 Putting signature in client message(seq # 747): sig = 15386536781051375182 2017-08-03 09:59:32.894346 7feaefdc6700 20 Putting signature in client message(seq # 731): sig = 9333911642960427553 2017-08-03 09:59:32.894347 7feaf3dce700 20 mon.b@0(leader) e1 caps allow * 2017-08-03 09:59:32.894353 7feaf3dce700 10 mon.b@0(leader).paxosservice(mgr 1..3) dispatch 0x2a513e9440 mgrbeacon mgr.x(011bc71d-db3f-45ab-926e-bbf337d32ef3,4104, -, 0) v5 from client.4104 172.21.15.44:0/3702801273 con 0x2a5117b800 2017-08-03 09:59:32.894367 7feaf3dce700 5 mon.b@0(leader).paxos(paxos updating c 1..173) is_readable = 0 - now=2017-08-03 09:59:32.894367 lease_expire=2017-08-03 09:59:20.488499 has v0 lc 173 2017-08-03 09:59:32.894376 7feaf3dce700 10 mon.b@0(leader).paxosservice(mgr 1..3) waiting for paxos -> readable (v0) 2017-08-03 09:59:32.894382 7feaf3dce700 5 mon.b@0(leader).paxos(paxos updating c 1..173) is_readable = 0 - now=2017-08-03 09:59:32.894382 lease_expire=2017-08-03 09:59:20.488499 has v0 lc 173 2017-08-03 09:59:32.894410 7feaf65d3700 11 mon.b@0(leader) e1 tick
Updated by Sage Weil over 6 years ago
- Priority changed from High to Urgent
/a/sage-2017-08-06_13:59:55-rados-wip-sage-testing-20170805a-distro-basic-smithi/1490103
seeing a lot of these.
Updated by Sage Weil over 6 years ago
/a/sage-2017-08-11_17:22:37-rados-wip-sage-testing-20170811a-distro-basic-smithi/1511996
Updated by Sage Weil over 6 years ago
/a/sage-2017-08-11_21:54:20-rados-luminous-distro-basic-smithi/1512264
I'm going to whitelist this on luminous branch for now just to hide the noise, but leave it there on master.
Updated by Sage Weil over 6 years ago
- Status changed from 12 to Resolved
the problem is that bluestore logs so freaking much at debug bluestore = 30 that the mon gets all laggy.
Updated by Sage Weil over 6 years ago
- Status changed from Resolved to Pending Backport
- Backport set to luminous
Updated by Nathan Cutler over 6 years ago
master PR for backport: https://github.com/ceph/ceph/pull/17505
Updated by Nathan Cutler over 6 years ago
- Copied to Backport #21283: luminous: spurious MON_DOWN, apparently slow/laggy mon added
Updated by Sage Weil over 6 years ago
- Status changed from Pending Backport to In Progress
Ok, this is still happening.. and it correlated with (1) bluestore and (2) bluestore fsck on mount, which spews an ungodly amount of log output. For example,
/a/sage-2017-09-07_15:28:50-rados-wip-sage-testing2-2017-09-07-0919-distro-basic-smithi/1605258
triggers MON_DOWN at
2017-09-07 16:01:01.628035 mon.a mon.1 172.21.15.195:6789/0 131 : cluster [WRN] Health check failed: 1/3 mons down, quorum a,c (MON_DOWN)
on smithi101, right after
2017-09-07T16:00:51.620 INFO:teuthology.orchestra.run.smithi101:Running: 'sudo adjust-ulimits ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-6 --journal-path /var/lib/ceph/osd/ceph-6/journal --log-file=/var/log/ceph/objectstore_tool.\\$pid.log --op import --file /home/ubuntu/cephtest/ceph.data/exp.4.5.6'
Updated by Nathan Cutler over 6 years ago
- Status changed from In Progress to Resolved
Updated by Sage Weil over 6 years ago
- Status changed from In Progress to Resolved
Updated by Greg Farnum over 6 years ago
Moved the /var/log/ceph into NVMe and haven't seen it since.