Bug #20910

spurious MON_DOWN, apparently slow/laggy mon

Added by Sage Weil over 6 years ago. Updated over 6 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
luminous
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

The mon shows very slow progress for ~10 seconds, failing to send lease renewals and so on, which triggers an election:

2017-08-04 09:12:32.645368 7f009b9cd700 10 mon.b@0(leader).paxos(paxos updating c 1..181) handle_accept paxos(accept lc 181 fc 0 pn 300 opn 0) v4
2017-08-04 09:12:32.645371 7f009b9cd700 10 mon.b@0(leader).paxos(paxos updating c 1..181)  now 0,1,2 have accepted
2017-08-04 09:12:32.645373 7f009b9cd700 10 mon.b@0(leader).paxos(paxos updating c 1..181)  got majority, committing, done with update
2017-08-04 09:12:32.645375 7f009b9cd700 10 mon.b@0(leader).paxos(paxos updating c 1..181) commit_start 182
2017-08-04 09:12:32.699610 7f00989c7700  1 -- 172.21.15.184:6789/0 >> - conn(0x559a31f37000 :6789 s=STATE_ACCEPTING pgs=0 cs=0 l=0)._process_connection sd=40 -
2017-08-04 09:12:32.699765 7f00989c7700 10 mon.b@0(leader) e1 ms_verify_authorizer 172.21.15.194:0/270168725 client protocol 0
2017-08-04 09:12:32.699775 7f00989c7700 10 In get_auth_session_handler for protocol 0
2017-08-04 09:12:32.699912 7f009b9cd700  1 -- 172.21.15.184:6789/0 <== client.? 172.21.15.194:0/270168725 1 ==== auth(proto 0 30 bytes epoch 0) v1 ==== 60+0+0 (673663173 0 0) 0x559a31c2ec80 con 0x559a31f37000
2017-08-04 09:12:32.699932 7f009b9cd700 10 mon.b@0(leader) e1 _ms_dispatch new session 0x559a32837d40 MonSession(client.? 172.21.15.194:0/270168725 is open ) features 0x1ffddff8eea4fffb
2017-08-04 09:12:32.699937 7f009b9cd700 20 mon.b@0(leader) e1  caps 
2017-08-04 09:12:32.699939 7f009b9cd700 10 mon.b@0(leader).paxosservice(auth 1..7) dispatch 0x559a31c2ec80 auth(proto 0 30 bytes epoch 0) v1 from client.? 172.21.15.194:0/270168725 con 0x559a31f37000
2017-08-04 09:12:32.699945 7f009b9cd700  5 mon.b@0(leader).paxos(paxos writing c 1..181) is_readable = 0 - now=2017-08-04 09:12:32.699946 lease_expire=2017-08-04 09:12:32.350979 has v0 lc 181
2017-08-04 09:12:32.699965 7f009b9cd700 10 mon.b@0(leader).paxosservice(auth 1..7)  waiting for paxos -> readable (v0)
2017-08-04 09:12:32.699972 7f009b9cd700  5 mon.b@0(leader).paxos(paxos writing c 1..181) is_readable = 0 - now=2017-08-04 09:12:32.699972 lease_expire=2017-08-04 09:12:32.350979 has v0 lc 181
2017-08-04 09:12:34.824279 7f00971c4700 20 mon.b@0(leader).paxos(paxos writing c 1..181) commit_finish 182
2017-08-04 09:12:34.824323 7f00971c4700 10 mon.b@0(leader).paxos(paxos writing c 1..182)  sending commit to mon.1
2017-08-04 09:12:34.824335 7f00971c4700  1 -- 172.21.15.184:6789/0 _send_message--> mon.1 172.21.15.194:6789/0 -- paxos(commit lc 182 fc 0 pn 300 opn 0) v4 -- ?+0 0x559a32864080
2017-08-04 09:12:34.824347 7f00971c4700  1 -- 172.21.15.184:6789/0 --> 172.21.15.194:6789/0 -- paxos(commit lc 182 fc 0 pn 300 opn 0) v4 -- 0x559a32864080 con 0
2017-08-04 09:12:34.824361 7f00971c4700 10 mon.b@0(leader).paxos(paxos writing c 1..182)  sending commit to mon.2
2017-08-04 09:12:34.824365 7f00971c4700  1 -- 172.21.15.184:6789/0 _send_message--> mon.2 172.21.15.194:6790/0 -- paxos(commit lc 182 fc 0 pn 300 opn 0) v4 -- ?+0 0x559a32864700
2017-08-04 09:12:34.824369 7f00971c4700  1 -- 172.21.15.184:6789/0 --> 172.21.15.194:6790/0 -- paxos(commit lc 182 fc 0 pn 300 opn 0) v4 -- 0x559a32864700 con 0
2017-08-04 09:12:34.824379 7f00971c4700 10 mon.b@0(leader) e1 refresh_from_paxos
2017-08-04 09:12:34.824398 7f00971c4700 10 mon.b@0(leader).paxosservice(pgmap 1..59) refresh
2017-08-04 09:12:34.824406 7f00979c5700 10 _calc_signature seq 630 front_crc_ = 3022017222 middle_crc = 0 data_crc = 0 sig = 10246470789152800045
2017-08-04 09:12:34.824406 7f00981c6700 10 _calc_signature seq 876 front_crc_ = 1529781274 middle_crc = 0 data_crc = 0 sig = 87697745521067598
2017-08-04 09:12:34.824412 7f00981c6700 20 Putting signature in client message(seq # 876): sig = 87697745521067598
2017-08-04 09:12:34.824417 7f00979c5700 20 Putting signature in client message(seq # 630): sig = 10246470789152800045
2017-08-04 09:12:34.824427 7f00971c4700 10 mon.b@0(leader).paxosservice(mdsmap 1..1) refresh
2017-08-04 09:12:34.824441 7f00971c4700 10 mon.b@0(leader).paxosservice(osdmap 1..39) refresh
2017-08-04 09:12:34.824443 7f00971c4700 15 mon.b@0(leader).osd e38 update_from_paxos paxos e 39, my e 38
2017-08-04 09:12:34.824465 7f00971c4700  7 mon.b@0(leader).osd e38 update_from_paxos loading creating_pgs last_scan_epoch 38 with 0 pgs
2017-08-04 09:12:34.824467 7f00971c4700 10 mon.b@0(leader).osd e38 update_from_paxos pgservice is pg
2017-08-04 09:12:34.824477 7f00971c4700  7 mon.b@0(leader).osd e38 update_from_paxos  applying incremental 39
2017-08-04 09:12:34.824538 7f00971c4700  1 mon.b@0(leader).osd e39 e39: 6 total, 6 up, 5 in
2017-08-04 09:12:34.824540 7f00971c4700 10 mon.b@0(leader).osd e39 update_from_paxos pgservice is pg
2017-08-04 09:12:35.482412 7f009b9cd700  1 -- 172.21.15.184:6789/0 <== osd.2 172.21.15.194:6808/19339 50 ==== pg_stats(0 pgs tid 21 v 0) v1 ==== 152+0+0 (2273931122 0 0) 0x559a329e3000 con 0x559a31e6d800
2017-08-04 09:12:35.984346 7f009a9cb700  1 -- 172.21.15.184:0/48971 --> 172.21.15.194:6800/19186 -- mgrreport(unknown.b +0-0 packed 1734) v4 -- 0x559a329ff500 con 0
2017-08-04 09:12:35.984451 7f00989c7700 10 _calc_signature seq 23 front_crc_ = 363387025 middle_crc = 0 data_crc = 0 sig = 1319237725359407097
2017-08-04 09:12:35.984461 7f00989c7700 20 Putting signature in client message(seq # 23): sig = 1319237725359407097
2017-08-04 09:12:36.889334 7f00981c6700  1 -- 172.21.15.184:6789/0 >> 172.21.15.194:0/4037222019 conn(0x559a32899800 :6789 s=STATE_OPEN pgs=383 cs=1 l=1).tick idle(6471609) more than 5000000 us, mark self fault.
2017-08-04 09:12:36.889354 7f00981c6700  1 -- 172.21.15.184:6789/0 >> 172.21.15.194:0/4037222019 conn(0x559a32899800 :6789 s=STATE_OPEN pgs=383 cs=1 l=1).fault on lossy channel, failing
2017-08-04 09:12:36.889985 7f00981c6700  1 -- 172.21.15.184:6789/0 >> - conn(0x559a32052800 :6789 s=STATE_ACCEPTING pgs=0 cs=0 l=0)._process_connection sd=33 -
2017-08-04 09:12:36.890277 7f00981c6700 10 mon.b@0(leader) e1 ms_verify_authorizer 172.21.15.194:0/4037222019 client protocol 0
2017-08-04 09:12:36.890293 7f00981c6700 10 In get_auth_session_handler for protocol 0
2017-08-04 09:12:37.390102 7f00979c5700  1 -- 172.21.15.184:6789/0 >> 172.21.15.194:0/387627644 conn(0x559a3283f000 :6789 s=STATE_OPEN pgs=1 cs=1 l=1).read_bulk peer close file descriptor 39
2017-08-04 09:12:37.390117 7f00979c5700  1 -- 172.21.15.184:6789/0 >> 172.21.15.194:0/387627644 conn(0x559a3283f000 :6789 s=STATE_OPEN pgs=1 cs=1 l=1).read_until read failed
2017-08-04 09:12:37.390120 7f00979c5700  1 -- 172.21.15.184:6789/0 >> 172.21.15.194:0/387627644 conn(0x559a3283f000 :6789 s=STATE_OPEN pgs=1 cs=1 l=1).process read tag failed
2017-08-04 09:12:37.390122 7f00979c5700  1 -- 172.21.15.184:6789/0 >> 172.21.15.194:0/387627644 conn(0x559a3283f000 :6789 s=STATE_OPEN pgs=1 cs=1 l=1).fault on lossy channel, failing
2017-08-04 09:12:37.680954 7f00981c6700 10 _calc_signature seq 950 front_crc_ = 4210009488 middle_crc = 0 data_crc = 0 sig = 3498222513807791856
2017-08-04 09:12:37.681138 7f00979c5700 10 _calc_signature seq 459 front_crc_ = 117879346 middle_crc = 0 data_crc = 0 sig = 13361952649768186819
2017-08-04 09:12:37.701330 7f00989c7700  1 -- 172.21.15.184:6789/0 >> 172.21.15.194:0/270168725 conn(0x559a31f37000 :6789 s=STATE_OPEN pgs=1 cs=1 l=1).tick idle(5003707) more than 5000000 us, mark self fault.
2017-08-04 09:12:37.701351 7f00989c7700  1 -- 172.21.15.184:6789/0 >> 172.21.15.194:0/270168725 conn(0x559a31f37000 :6789 s=STATE_OPEN pgs=1 cs=1 l=1).fault on lossy channel, failing
2017-08-04 09:12:37.711773 7f00981c6700 10 _calc_signature seq 951 front_crc_ = 4148781320 middle_crc = 0 data_crc = 0 sig = 11699419807093599941
2017-08-04 09:12:37.712053 7f00979c5700 10 _calc_signature seq 460 front_crc_ = 4148781320 middle_crc = 0 data_crc = 0 sig = 13382070068259793084
2017-08-04 09:12:40.322037 7f00981c6700  1 -- 172.21.15.184:6789/0 >> 172.21.15.194:6801/19337 conn(0x559a32898000 :6789 s=STATE_OPEN pgs=45 cs=1 l=1).read_bulk peer close file descriptor 38
2017-08-04 09:12:40.322055 7f00981c6700  1 -- 172.21.15.184:6789/0 >> 172.21.15.194:6801/19337 conn(0x559a32898000 :6789 s=STATE_OPEN pgs=45 cs=1 l=1).read_until read failed
2017-08-04 09:12:40.322057 7f00981c6700  1 -- 172.21.15.184:6789/0 >> 172.21.15.194:6801/19337 conn(0x559a32898000 :6789 s=STATE_OPEN pgs=45 cs=1 l=1).process read tag failed
2017-08-04 09:12:40.322060 7f00981c6700  1 -- 172.21.15.184:6789/0 >> 172.21.15.194:6801/19337 conn(0x559a32898000 :6789 s=STATE_OPEN pgs=45 cs=1 l=1).fault on lossy channel, failing
2017-08-04 09:12:40.348760 7f00971c4700  5 mon.b@0(leader).paxos(paxos refresh c 1..182) is_readable = 0 - now=2017-08-04 09:12:40.348766 lease_expire=0.000000 has v0 lc 182
2017-08-04 09:12:40.348783 7f00971c4700 10 mon.b@0(leader).pg v59 check_osd_map -- osdmap not readable, waiting
2017-08-04 09:12:40.348786 7f00971c4700 10 mon.b@0(leader).osd e39 check_osdmap_subs

This leads to the following in the cluster log:

2017-08-04 09:12:47.719241 mon.a mon.1 172.21.15.194:6789/0 255 : cluster [WRN] Health check failed: 1/3 mons down, quorum a,c (MON_DOWN)

/a/sage-2017-08-04_05:26:53-rados-wip-sage-testing2-20170803b-distro-basic-smithi/1482337

A similar instance:
/a/pdonnell-2017-08-03_09:07:41-multimds-wip-pdonnell-testing-20170802b-testing-basic-smithi/1478633

2017-08-03 09:59:16.314260 7feaf65d3700 10 mon.b@0(leader).health tick
2017-08-03 09:59:16.314262 7feaf65d3700 20 mon.b@0(leader).health check_member_health
2017-08-03 09:59:16.314364 7feaf65d3700 10 mon.b@0(leader).health check_member_health avail 93% total 916 GB, used 14124 MB, avail 856 GB
2017-08-03 09:59:16.314368 7feaf65d3700 20 mon.b@0(leader).health check_leader_health
2017-08-03 09:59:16.314376 7feaf65d3700 20 mon.b@0(leader) e1 sync_trim_providers
2017-08-03 09:59:17.326964 7feaf3dce700  1 -- 172.21.15.44:6789/0 <== client.4098 172.21.15.157:0/215587146 310 ==== mgrbeacon mgr.y(011bc71d-db3f-45ab-926e-bbf337d32ef3,4098, 172.21.15.157:6800/15683, 1) v5 ==== 621+0+0 (2036550175 0 0) 0x2a51ee7b80 con 0x2a51bc9000
2017-08-03 09:59:17.327008 7feaf3dce700 20 mon.b@0(leader) e1 _ms_dispatch existing session 0x2a5112e780 for client.4098 172.21.15.157:0/215587146
2017-08-03 09:59:17.327017 7feaf3dce700 20 mon.b@0(leader) e1  caps allow *
2017-08-03 09:59:17.327021 7feaf3dce700 10 mon.b@0(leader).paxosservice(mgr 1..3) dispatch 0x2a51ee7b80 mgrbeacon mgr.y(011bc71d-db3f-45ab-926e-bbf337d32ef3,4098, 172.21.15.157:6800/15683, 1) v5 from client.4098 172.21.15.157:0/215587146 con 0x2a51bc9000
2017-08-03 09:59:17.327032 7feaf3dce700  5 mon.b@0(leader).paxos(paxos active c 1..173) is_readable = 1 - now=2017-08-03 09:59:17.327033 lease_expire=2017-08-03 09:59:20.488499 has v0 lc 173
2017-08-03 09:59:17.327043 7feaf3dce700  4 mon.b@0(leader).mgr e3 beacon from 4098
2017-08-03 09:59:17.327047 7feaf3dce700 20 is_capable service=mgr command= exec on cap allow *
2017-08-03 09:59:17.327049 7feaf3dce700 20  allow so far , doing grant allow *
2017-08-03 09:59:17.327051 7feaf3dce700 20  allow all
2017-08-03 09:59:17.327053 7feaf3dce700  4 mon.b@0(leader).mgr e3 beacon from 4098
2017-08-03 09:59:17.327058 7feaf3dce700 10 mon.b@0(leader).mgr e3 no change
2017-08-03 09:59:17.327074 7feaf3dce700  1 -- 172.21.15.44:6789/0 <== client.4098 172.21.15.157:0/215587146 311 ==== monmgrreport(0 checks) v1 ==== 32+0+2996 (4242266613 0 4169711827) 0x2a51e1ca80 con 0x2a51bc9000
2017-08-03 09:59:17.327085 7feaf3dce700 20 mon.b@0(leader) e1 _ms_dispatch existing session 0x2a5112e780 for client.4098 172.21.15.157:0/215587146
2017-08-03 09:59:17.327090 7feaf3dce700 20 mon.b@0(leader) e1  caps allow *
2017-08-03 09:59:17.327093 7feaf3dce700 10 mon.b@0(leader).paxosservice(mgrstat 1..78) dispatch 0x2a51e1ca80 monmgrreport(0 checks) v1 from client.4098 172.21.15.157:0/215587146 con 0x2a51bc9000
2017-08-03 09:59:17.327099 7feaf3dce700  5 mon.b@0(leader).paxos(paxos active c 1..173) is_readable = 1 - now=2017-08-03 09:59:17.327099 lease_expire=2017-08-03 09:59:20.488499 has v0 lc 173
2017-08-03 09:59:17.327132 7feaf3dce700 10 mon.b@0(leader).mgrstat prepare_report 20 pgs: 20 active+clean; 260 MB data, 1935 MB used, 556 GB / 558 GB avail; 1640 kB/s wr, 208 op/s, 0 health checks
2017-08-03 09:59:17.327152 7feaf3dce700 10 mon.b@0(leader).paxosservice(mgrstat 1..78)  setting proposal_timer 0x2a515b5da0 with delay of 0.05
2017-08-03 09:59:17.377276 7feaf65d3700 10 mon.b@0(leader).paxosservice(mgrstat 1..78) propose_pending
2017-08-03 09:59:17.377335 7feaf65d3700 10 mon.b@0(leader).mgrstat  79
2017-08-03 09:59:17.377369 7feaf65d3700 10 mon.b@0(leader) e1 log_health updated 0 previous 0
2017-08-03 09:59:17.377374 7feaf65d3700  5 mon.b@0(leader).paxos(paxos active c 1..173) queue_pending_finisher 0x2a510e6bf0
2017-08-03 09:59:17.377379 7feaf65d3700 10 mon.b@0(leader).paxos(paxos active c 1..173) trigger_propose active, proposing now
2017-08-03 09:59:17.377387 7feaf65d3700 10 mon.b@0(leader).paxos(paxos active c 1..173) propose_pending 174 3174 bytes
2017-08-03 09:59:17.377390 7feaf65d3700 10 mon.b@0(leader).paxos(paxos updating c 1..173) begin for 174 3174 bytes
2017-08-03 09:59:17.501171 7feaf3dce700  1 -- 172.21.15.44:6789/0 <== client.4104 172.21.15.44:0/3702801273 100 ==== mgrbeacon mgr.x(011bc71d-db3f-45ab-926e-bbf337d32ef3,4104, -, 0) v5 ==== 633+0+0 (3938939177 0 0) 0x2a513e9440 con 0x2a5117b800
2017-08-03 09:59:17.766788 7feaf2dcc700  1 -- 172.21.15.44:0/15872 --> 172.21.15.157:6800/15683 -- mgrreport(unknown.b +0-0 packed 1734) v4 -- 0x2a513e8100 con 0
2017-08-03 09:59:17.766934 7feaf05c7700 10 _calc_signature seq 39 front_crc_ = 25535130 middle_crc = 0 data_crc = 0 sig = 13990729107905770019
2017-08-03 09:59:17.766951 7feaf05c7700 20 Putting signature in client message(seq # 39): sig = 13990729107905770019
2017-08-03 09:59:20.075136 7feaefdc6700 10 _calc_signature seq 592 front_crc_ = 2125601142 middle_crc = 0 data_crc = 0 sig = 12043567740073041252
2017-08-03 09:59:20.095056 7feaf0dc8700 10 _calc_signature seq 600 front_crc_ = 3392420486 middle_crc = 0 data_crc = 0 sig = 5179430677477228157
2017-08-03 09:59:20.098064 7feaefdc6700 10 _calc_signature seq 593 front_crc_ = 680283230 middle_crc = 0 data_crc = 0 sig = 9515446695652356164
2017-08-03 09:59:20.102817 7feaefdc6700 10 _calc_signature seq 594 front_crc_ = 3236840030 middle_crc = 0 data_crc = 0 sig = 4269289248682593417
2017-08-03 09:59:20.104921 7feaf0dc8700 10 _calc_signature seq 601 front_crc_ = 4094339008 middle_crc = 0 data_crc = 0 sig = 16585562122699640275
2017-08-03 09:59:20.105831 7feaf0dc8700 10 _calc_signature seq 602 front_crc_ = 576293066 middle_crc = 0 data_crc = 0 sig = 13557405708396481067
2017-08-03 09:59:20.107283 7feaf0dc8700 10 _calc_signature seq 603 front_crc_ = 1154793487 middle_crc = 0 data_crc = 0 sig = 4662769214294276591
2017-08-03 09:59:20.112378 7feaefdc6700 10 _calc_signature seq 595 front_crc_ = 3730037593 middle_crc = 0 data_crc = 0 sig = 2099837470117633096
2017-08-03 09:59:22.767044 7feaf2dcc700  1 -- 172.21.15.44:0/15872 --> 172.21.15.157:6800/15683 -- mgrreport(unknown.b +0-0 packed 1734) v4 -- 0x2a513e78c0 con 0
2017-08-03 09:59:22.767128 7feaf05c7700 10 _calc_signature seq 40 front_crc_ = 2268122717 middle_crc = 0 data_crc = 0 sig = 2521156108740750434
2017-08-03 09:59:22.767137 7feaf05c7700 20 Putting signature in client message(seq # 40): sig = 2521156108740750434
2017-08-03 09:59:25.535179 7feaefdc6700 10 _calc_signature seq 596 front_crc_ = 913895036 middle_crc = 0 data_crc = 0 sig = 12951353490257451475
2017-08-03 09:59:25.535213 7feaf0dc8700 10 _calc_signature seq 604 front_crc_ = 3415341534 middle_crc = 0 data_crc = 0 sig = 14321213010753324442
2017-08-03 09:59:27.076294 7feaefdc6700 10 _calc_signature seq 597 front_crc_ = 21526225 middle_crc = 0 data_crc = 0 sig = 17499439560101081472
2017-08-03 09:59:27.076294 7feaf0dc8700 10 _calc_signature seq 605 front_crc_ = 21526225 middle_crc = 0 data_crc = 0 sig = 2950429098881522129
2017-08-03 09:59:27.767251 7feaf2dcc700  1 -- 172.21.15.44:0/15872 --> 172.21.15.157:6800/15683 -- mgrreport(unknown.b +0-0 packed 1734) v4 -- 0x2a513e6b00 con 0
2017-08-03 09:59:27.767363 7feaf05c7700 10 _calc_signature seq 41 front_crc_ = 1062427806 middle_crc = 0 data_crc = 0 sig = 6942056563840595773
2017-08-03 09:59:27.767381 7feaf05c7700 20 Putting signature in client message(seq # 41): sig = 6942056563840595773
2017-08-03 09:59:32.767657 7feaf2dcc700  1 -- 172.21.15.44:0/15872 --> 172.21.15.157:6800/15683 -- mgrreport(unknown.b +0-0 packed 1734) v4 -- 0x2a513e99c0 con 0
2017-08-03 09:59:32.767824 7feaf05c7700 10 _calc_signature seq 42 front_crc_ = 3708880217 middle_crc = 0 data_crc = 0 sig = 5546604638538954251
2017-08-03 09:59:32.767848 7feaf05c7700 20 Putting signature in client message(seq # 42): sig = 5546604638538954251
2017-08-03 09:59:32.894138 7feaf65d3700 10 mon.b@0(leader).paxos(paxos updating c 1..173)  sending begin to mon.1
2017-08-03 09:59:32.894171 7feaf65d3700  1 -- 172.21.15.44:6789/0 _send_message--> mon.1 172.21.15.157:6789/0 -- paxos(begin lc 173 fc 0 pn 500 opn 0) v4 -- ?+0 0x2a51fb2d00
2017-08-03 09:59:32.894187 7feaf65d3700  1 -- 172.21.15.44:6789/0 --> 172.21.15.157:6789/0 -- paxos(begin lc 173 fc 0 pn 500 opn 0) v4 -- 0x2a51fb2d00 con 0
2017-08-03 09:59:32.894221 7feaf65d3700 10 mon.b@0(leader).paxos(paxos updating c 1..173)  sending begin to mon.2
2017-08-03 09:59:32.894227 7feaf65d3700  1 -- 172.21.15.44:6789/0 _send_message--> mon.2 172.21.15.157:6790/0 -- paxos(begin lc 173 fc 0 pn 500 opn 0) v4 -- ?+0 0x2a51c93c00
2017-08-03 09:59:32.894234 7feaf65d3700  1 -- 172.21.15.44:6789/0 --> 172.21.15.157:6790/0 -- paxos(begin lc 173 fc 0 pn 500 opn 0) v4 -- 0x2a51c93c00 con 0
2017-08-03 09:59:32.894335 7feaf0dc8700 10 _calc_signature seq 747 front_crc_ = 715930277 middle_crc = 0 data_crc = 0 sig = 15386536781051375182
2017-08-03 09:59:32.894336 7feaefdc6700 10 _calc_signature seq 731 front_crc_ = 3439589549 middle_crc = 0 data_crc = 0 sig = 9333911642960427553
2017-08-03 09:59:32.894326 7feaf3dce700 20 mon.b@0(leader) e1 _ms_dispatch existing session 0x2a5112e480 for client.? 172.21.15.44:0/3702801273
2017-08-03 09:59:32.894345 7feaf0dc8700 20 Putting signature in client message(seq # 747): sig = 15386536781051375182
2017-08-03 09:59:32.894346 7feaefdc6700 20 Putting signature in client message(seq # 731): sig = 9333911642960427553
2017-08-03 09:59:32.894347 7feaf3dce700 20 mon.b@0(leader) e1  caps allow *
2017-08-03 09:59:32.894353 7feaf3dce700 10 mon.b@0(leader).paxosservice(mgr 1..3) dispatch 0x2a513e9440 mgrbeacon mgr.x(011bc71d-db3f-45ab-926e-bbf337d32ef3,4104, -, 0) v5 from client.4104 172.21.15.44:0/3702801273 con 0x2a5117b800
2017-08-03 09:59:32.894367 7feaf3dce700  5 mon.b@0(leader).paxos(paxos updating c 1..173) is_readable = 0 - now=2017-08-03 09:59:32.894367 lease_expire=2017-08-03 09:59:20.488499 has v0 lc 173
2017-08-03 09:59:32.894376 7feaf3dce700 10 mon.b@0(leader).paxosservice(mgr 1..3)  waiting for paxos -> readable (v0)
2017-08-03 09:59:32.894382 7feaf3dce700  5 mon.b@0(leader).paxos(paxos updating c 1..173) is_readable = 0 - now=2017-08-03 09:59:32.894382 lease_expire=2017-08-03 09:59:20.488499 has v0 lc 173
2017-08-03 09:59:32.894410 7feaf65d3700 11 mon.b@0(leader) e1 tick
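In both excerpts the stall is visible as a multi-second gap between consecutive mon debug log timestamps (e.g. the ~2.1 s jump from 09:12:32.699972 to 09:12:34.824279 in the first log). A rough way to locate such gaps in a mon log — an illustrative script, not part of the ceph tree — is:

```python
# Illustrative helper (hypothetical, not from the ceph source): scan a mon
# debug log for multi-second stalls between consecutive log lines.
from datetime import datetime

def find_stalls(lines, threshold=1.0):
    """Yield (gap_seconds, line) for lines whose timestamp trails the
    previous timestamped line's by more than `threshold` seconds."""
    fmt = "%Y-%m-%d %H:%M:%S.%f"
    prev = None
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue
        try:
            ts = datetime.strptime(parts[0] + " " + parts[1], fmt)
        except ValueError:
            continue  # not a timestamped log line
        if prev is not None:
            gap = (ts - prev).total_seconds()
            if gap > threshold:
                yield gap, line
        prev = ts

# Two lines from the excerpt above, showing the ~2.1 s stall:
log = [
    "2017-08-04 09:12:32.699972 7f009b9cd700  5 mon.b is_readable = 0",
    "2017-08-04 09:12:34.824279 7f00971c4700 20 mon.b commit_finish 182",
]
for gap, line in find_stalls(log):
    print("%.3fs stall before: %s" % (gap, line))
```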


Related issues: 1 (0 open, 1 closed)

Copied to RADOS - Backport #21283: luminous: spurious MON_DOWN, apparently slow/laggy mon (Resolved; assignee: Abhishek Lekshmanan)
Actions #1

Updated by Sage Weil over 6 years ago

  • Priority changed from High to Urgent

/a/sage-2017-08-06_13:59:55-rados-wip-sage-testing-20170805a-distro-basic-smithi/1490103

seeing a lot of these.

Actions #2

Updated by Sage Weil over 6 years ago

/a/sage-2017-08-11_17:22:37-rados-wip-sage-testing-20170811a-distro-basic-smithi/1511996

Actions #3

Updated by Sage Weil over 6 years ago

/a/sage-2017-08-11_21:54:20-rados-luminous-distro-basic-smithi/1512264

I'm going to whitelist this on the luminous branch for now just to hide the noise, but leave it unwhitelisted on master.

https://github.com/ceph/ceph/pull/17003
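For context, whitelisting a health warning in a teuthology suite amounts to adding it to the ceph task's log-whitelist; a rough sketch of what such a yaml override looks like (the exact file and entries in the PR above may differ):

```yaml
# Illustrative teuthology override fragment (not copied from the PR):
# entries are regexes matched against cluster log lines.
overrides:
  ceph:
    log-whitelist:
      - \(MON_DOWN\)
```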

Actions #4

Updated by Sage Weil over 6 years ago

  • Status changed from 12 to Resolved

The problem is that bluestore logs so heavily at `debug bluestore = 30` that the mon gets laggy.
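If that diagnosis holds, one mitigation in a test environment is to cap bluestore's debug level in ceph.conf. The values below are illustrative only, not taken from this ticket's fix:

```ini
# Illustrative ceph.conf fragment: lower bluestore's log verbosity
# (log-level/memory-level) so logging doesn't starve a co-located mon.
[osd]
    debug bluestore = 1/5
```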

Actions #5

Updated by Sage Weil over 6 years ago

  • Status changed from Resolved to Pending Backport
  • Backport set to luminous
Actions #6

Updated by Nathan Cutler over 6 years ago

Actions #7

Updated by Nathan Cutler over 6 years ago

  • Copied to Backport #21283: luminous: spurious MON_DOWN, apparently slow/laggy mon added
Actions #8

Updated by Sage Weil over 6 years ago

  • Status changed from Pending Backport to In Progress

OK, this is still happening, and it correlates with (1) bluestore and (2) bluestore fsck-on-mount, which spews an enormous amount of log output. For example,

/a/sage-2017-09-07_15:28:50-rados-wip-sage-testing2-2017-09-07-0919-distro-basic-smithi/1605258

triggers MON_DOWN at

2017-09-07 16:01:01.628035 mon.a mon.1 172.21.15.195:6789/0 131 : cluster [WRN] Health check failed: 1/3 mons down, quorum a,c (MON_DOWN)

on smithi101, right after

2017-09-07T16:00:51.620 INFO:teuthology.orchestra.run.smithi101:Running: 'sudo adjust-ulimits ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-6 --journal-path /var/lib/ceph/osd/ceph-6/journal --log-file=/var/log/ceph/objectstore_tool.\\$pid.log --op import --file /home/ubuntu/cephtest/ceph.data/exp.4.5.6'

Actions #9

Updated by Nathan Cutler over 6 years ago

  • Status changed from In Progress to Resolved
Actions #10

Updated by Sage Weil over 6 years ago

  • Status changed from Resolved to In Progress

not resolved yet!

Actions #11

Updated by Sage Weil over 6 years ago

  • Status changed from In Progress to Resolved
Actions #12

Updated by Greg Farnum over 6 years ago

Moved /var/log/ceph onto NVMe and haven't seen it since.
