Project

General

Profile

Actions

Bug #20371

closed

mgr: occasional fails to send beacons (monc reconnect backoff too aggressive?)

Added by Sage Weil almost 7 years ago. Updated almost 7 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

for a while,

2017-06-21 05:14:24.083186 7f21f4323700  1 mgr send_beacon active
2017-06-21 05:14:24.083189 7f21f4323700 10 mgr send_beacon sending beacon as gid 4098
2017-06-21 05:14:24.083203 7f21f4323700  1 -- 172.21.15.80:0/2657294848 --> 172.21.15.80:6789/0 -- mgrbeacon mgr.x(21c823a4-dba0-4be6-940a-e295fbe30f86,4098, 172.21.15.80:6800/237445, 1) v2 -- ?+0 0x55d9d0440a00 con 0x55d9cfbf4640
2017-06-21 05:14:24.083221 7f21f4323700  0 mgr tick 
2017-06-21 05:14:24.083223 7f21f4323700 10 mgr update_delta_stats  v16
2017-06-21 05:14:24.083305 7f21f4323700 10 mgr.server operator() 12 pgs: 12 active+clean; 0 bytes data, 3183 MB used, 266 GB / 270 GB avail
2017-06-21 05:14:24.083331 7f21f4323700  1 -- 172.21.15.80:0/2657294848 --> 172.21.15.80:6789/0 -- monmgrreport v1 -- ?+1544 0x55d9cfc0bd40 con 0x55d9cfbf4640
2017-06-21 05:14:26.083459 7f21f4323700  0 mgr tick tick
2017-06-21 05:14:26.083470 7f21f4323700  1 mgr send_beacon active
2017-06-21 05:14:26.083472 7f21f4323700 10 mgr send_beacon sending beacon as gid 4098
2017-06-21 05:14:26.083486 7f21f4323700  1 -- 172.21.15.80:0/2657294848 --> 172.21.15.80:6789/0 -- mgrbeacon mgr.x(21c823a4-dba0-4be6-940a-e295fbe30f86,4098, 172.21.15.80:6800/237445, 1) v2 -- ?+0 0x55d9d0440c80 con 0x55d9cfbf4640

but then
2017-06-21 05:14:28.083689 7f21f4323700  1 mgr send_beacon active
2017-06-21 05:14:28.083692 7f21f4323700 10 mgr send_beacon sending beacon as gid 4098
2017-06-21 05:14:28.083697 7f21f4323700  0 mgr tick 
2017-06-21 05:14:28.083699 7f21f4323700 10 mgr update_delta_stats  v18
2017-06-21 05:14:28.083799 7f21f4323700 10 mgr.server operator() 12 pgs: 12 active+clean; 0 bytes data, 3183 MB used, 266 GB / 270 GB avail
2017-06-21 05:14:28.110377 7f21f231f700 10 _calc_signature seq 6 front_crc_ = 145679454 middle_crc = 0 data_crc = 0 sig = 14427909843967850292
2017-06-21 05:14:28.110426 7f21f131d700  1 -- 172.21.15.80:6800/237445 <== mon.0 172.21.15.80:0/237444 6 ==== mgrreport(+0-0 packed 1734) v2 ==== 1751+0+0 (145679454 0 0) 0x55d9d064a2c0 con 0x55d9d0492800
2017-06-21 05:14:28.110456 7f21f131d700  4 mgr.server handle_report from 0x55d9d0492800 name mon.a
2017-06-21 05:14:28.110460 7f21f131d700 20 mgr.server handle_report updating existing DaemonState for a
2017-06-21 05:14:28.110462 7f21f131d700 20 mgr update loading 0 new types, 0 old types, had 590 types, got 1734 bytes of data
2017-06-21 05:14:30.083915 7f21f4323700  0 mgr tick tick
2017-06-21 05:14:30.083927 7f21f4323700  1 mgr send_beacon active
2017-06-21 05:14:30.083929 7f21f4323700 10 mgr send_beacon sending beacon as gid 4098
2017-06-21 05:14:30.083935 7f21f4323700  0 mgr tick 
2017-06-21 05:14:30.083937 7f21f4323700 10 mgr update_delta_stats  v19
2017-06-21 05:14:30.083969 7f21f4323700 10 mgr.server operator() 12 pgs: 12 active+clean; 0 bytes data, 3183 MB used, 266 GB / 270 GB avail
2017-06-21 05:14:31.876473 7f21f2b20700  1 -- 172.21.15.80:6800/237445 >> - conn(0x55d9d0477000 :6800 s=STATE_ACCEPTING pgs=0 cs=0 l=0)._process_connection sd=8 -
2017-06-21 05:14:31.876683 7f21f2b20700 10 cephx: verify_authorizer decrypted service mgr secret_id=2
2017-06-21 05:14:31.876769 7f21f2b20700 10 cephx: verify_authorizer global_id=4111
2017-06-21 05:14:31.876799 7f21f2b20700 10 cephx: verify_authorizer ok nonce 4b588f54542289ec reply_bl.length()=36
2017-06-21 05:14:31.876845 7f21f2b20700 10 mgr.server ms_verify_authorizer  session 0x55d9cfbfc060 client.admin has caps allow * 'allow *'
2017-06-21 05:14:31.876857 7f21f2b20700 10 In get_auth_session_handler for protocol 2
2017-06-21 05:14:31.894450 7f21f1b1e700 10 _calc_signature seq 12 front_crc_ = 4255501585 middle_crc = 0 data_crc = 0 sig = 14575324513775164604
2017-06-21 05:14:31.894490 7f21f131d700  1 -- 172.21.15.80:6800/237445 <== osd.1 172.21.15.80:6809/237544 12 ==== pg_stats(3 pgs tid 0 v 0) v1 ==== 1862+0+0 (4255501585 0 0) 0x55d9d04e6300 con 0x55d9d0491000
2017-06-21 05:14:31.896013 7f21f2b20700  1 -- 172.21.15.80:6800/237445 >> 172.21.15.80:0/2685737263 conn(0x55d9d0477000 :6800 s=STATE_OPEN pgs=3 cs=1 l=1).read_bulk peer close file descriptor 8
2017-06-21 05:14:31.896037 7f21f2b20700  1 -- 172.21.15.80:6800/237445 >> 172.21.15.80:0/2685737263 conn(0x55d9d0477000 :6800 s=STATE_OPEN pgs=3 cs=1 l=1).read_until read failed
2017-06-21 05:14:31.896043 7f21f2b20700  1 -- 172.21.15.80:6800/237445 >> 172.21.15.80:0/2685737263 conn(0x55d9d0477000 :6800 s=STATE_OPEN pgs=3 cs=1 l=1).process read tag failed
2017-06-21 05:14:31.896048 7f21f2b20700  1 -- 172.21.15.80:6800/237445 >> 172.21.15.80:0/2685737263 conn(0x55d9d0477000 :6800 s=STATE_OPEN pgs=3 cs=1 l=1).fault on lossy channel, failing
2017-06-21 05:14:32.018762 7f21f2b20700  1 -- 172.21.15.80:6800/237445 >> - conn(0x55d9d0644000 :6800 s=STATE_ACCEPTING pgs=0 cs=0 l=0)._process_connection sd=8 -
2017-06-21 05:14:32.018912 7f21f2b20700 10 cephx: verify_authorizer decrypted service mgr secret_id=2
2017-06-21 05:14:32.018976 7f21f2b20700 10 cephx: verify_authorizer global_id=4112
2017-06-21 05:14:32.018997 7f21f2b20700 10 cephx: verify_authorizer ok nonce 436c6125628c895d reply_bl.length()=36
2017-06-21 05:14:32.019043 7f21f2b20700 10 mgr.server ms_verify_authorizer  session 0x55d9cfbfc220 client.admin has caps allow * 'allow *'
2017-06-21 05:14:32.019058 7f21f2b20700 10 In get_auth_session_handler for protocol 2
2017-06-21 05:14:32.030584 7f21f1b1e700 10 _calc_signature seq 12 front_crc_ = 1980360948 middle_crc = 0 data_crc = 0 sig = 12748525025159258443
2017-06-21 05:14:32.030633 7f21f131d700  1 -- 172.21.15.80:6800/237445 <== osd.2 172.21.15.80:6805/237543 12 ==== pg_stats(3 pgs tid 0 v 0) v1 ==== 1862+0+0 (1980360948 0 0) 0x55d9d04e6900 con 0x55d9d0475800
2017-06-21 05:14:32.031801 7f21f2b20700  1 -- 172.21.15.80:6800/237445 >> 172.21.15.80:0/2058161578 conn(0x55d9d0644000 :6800 s=STATE_OPEN pgs=2 cs=1 l=1).read_bulk peer close file descriptor 8
2017-06-21 05:14:32.031827 7f21f2b20700  1 -- 172.21.15.80:6800/237445 >> 172.21.15.80:0/2058161578 conn(0x55d9d0644000 :6800 s=STATE_OPEN pgs=2 cs=1 l=1).read_until read failed
2017-06-21 05:14:32.031836 7f21f2b20700  1 -- 172.21.15.80:6800/237445 >> 172.21.15.80:0/2058161578 conn(0x55d9d0644000 :6800 s=STATE_OPEN pgs=2 cs=1 l=1).process read tag failed
2017-06-21 05:14:32.031845 7f21f2b20700  1 -- 172.21.15.80:6800/237445 >> 172.21.15.80:0/2058161578 conn(0x55d9d0644000 :6800 s=STATE_OPEN pgs=2 cs=1 l=1).fault on lossy channel, failing
2017-06-21 05:14:32.031883 7f21f2b20700  1 -- 172.21.15.80:6800/237445 reap_dead start

/a/sage-2017-06-21_02:01:04-rados-wip-sage-testing2-distro-basic-smithi/1308454

eventually the mgr times out and fails over, and meanwhile the pg stat flush times out with

Exception: timed out waiting for mon to be updated with osd.1: 21474836492
 < 21474836497

Related issues 2 (0 open2 closed)

Has duplicate RADOS - Bug #20507: "[WRN] Manager daemon x is unresponsive. No standby daemons available." in cluster logDuplicate07/05/2017

Actions
Has duplicate RADOS - Bug #20624: cluster [WRN] Health check failed: no active mgr (MGR_DOWN)" in cluster logDuplicate07/14/2017

Actions
Actions

Also available in: Atom PDF