Bug #20624

closed

"cluster [WRN] Health check failed: no active mgr (MGR_DOWN)" in cluster log

Added by Kefu Chai almost 7 years ago. Updated over 6 years ago.

Status: Duplicate
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

mgr.x

2017-07-13 18:33:19.097142 7f7539f55700 10 mgr tick tick
2017-07-13 18:33:19.097154 7f7539f55700  1 mgr send_beacon active
2017-07-13 18:33:19.097228 7f7539f55700 10 mgr send_beacon sending beacon as gid 4104 modules dashboard,restful,status,zabbix
2017-07-13 18:33:19.097248 7f7539f55700  1 -- 172.21.15.24:0/981164730 --> 172.21.15.59:6791/0 -- mgrbeacon mgr.x(a33ec6ea-dfe0-400c-93d5-8d56f813c9c0,4104, 172.21.15.24:6800/195140, 1) v3 -- ?+0 0x7f7550983400 con 0x7f755078d200
...
2017-07-13 18:33:19.300288 7f7545762700  0 -- 172.21.15.24:0/981164730 >> 172.21.15.59:6791/0 pipe(0x7f7550414000 sd=8 :60336 s=2 pgs=74 cs=1 l=1 c=0x7f755078d200).injecting socket failure
2017-07-13 18:33:19.302524 7f753e260700  1 -- 172.21.15.24:0/981164730 mark_down 0x7f755078d200 -- pipe dne
2017-07-13 18:33:19.302672 7f753e260700  1 -- 172.21.15.24:0/981164730 --> 172.21.15.24:6790/0 -- auth(proto 0 26 bytes epoch 1) v1 -- ?+0 0x7f7550985c00 con 0x7f7550686200
2017-07-13 18:33:19.302685 7f753e260700  1 -- 172.21.15.24:0/981164730 --> 172.21.15.24:6792/0 -- auth(proto 0 26 bytes epoch 1) v1 -- ?+0 0x7f7550985980 con 0x7f754edd4a00
2017-07-13 18:33:19.302708 7f753e260700  0 client.0 ms_handle_reset on 172.21.15.59:6791/0
2017-07-13 18:33:19.302712 7f753e260700  0 client.0 ms_handle_reset on 172.21.15.59:6791/0
2017-07-13 18:33:19.305209 7f753ba5b700  1 -- 172.21.15.24:0/981164730 >> 172.21.15.24:6790/0 pipe(0x7f75508fa000 sd=32 :54300 s=2 pgs=205 cs=1 l=1 c=0x7f7550686200).setting up a delay queue on Pipe 0x7f75508fa000
2017-07-13 18:33:19.309341 7f7545762700  1 -- 172.21.15.24:0/981164730 >> 172.21.15.24:6792/0 pipe(0x7f75508fc800 sd=8 :57656 s=2 pgs=158 cs=1 l=1 c=0x7f754edd4a00).setting up a delay queue on Pipe 0x7f75508fc800
...
2017-07-13 18:33:21.097582 7f7539f55700 10 mgr tick tick
2017-07-13 18:33:21.097593 7f7539f55700  1 mgr send_beacon active
2017-07-13 18:33:21.097671 7f7539f55700 10 mgr send_beacon sending beacon as gid 4104 modules dashboard,restful,status,zabbix
2017-07-13 18:33:21.097682 7f7539f55700 10 mgr tick
...
2017-07-13 18:33:21.885585 7f753ca5d700  1 -- 172.21.15.24:0/981164730 mark_down 0x7f754edd4a00 -- 0x7f75508fc800
2017-07-13 18:33:21.885634 7f753ca5d700  1 -- 172.21.15.24:0/981164730 mark_down 0x7f7550686200 -- 0x7f75508fa000
2017-07-13 18:33:21.885910 7f753ca5d700  1 -- 172.21.15.24:0/981164730 --> 172.21.15.24:6789/0 -- auth(proto 0 26 bytes epoch 1) v1 -- ?+0 0x7f7550983b80 con 0x7f75506dd000
2017-07-13 18:33:21.885926 7f753ca5d700  1 -- 172.21.15.24:0/981164730 --> 172.21.15.59:6793/0 -- auth(proto 0 26 bytes epoch 1) v1 -- ?+0 0x7f7550983900 con 0x7f75506dc200
2017-07-13 18:33:21.887931 7f753b95a700  0 -- 172.21.15.24:0/981164730 >> 172.21.15.24:6789/0 pipe(0x7f7550834800 sd=32 :0 s=1 pgs=0 cs=0 l=0 c=0x7f75506dd000).fault
2017-07-13 18:33:21.888159 7f7545762700  0 -- 172.21.15.24:0/981164730 >> 172.21.15.59:6793/0 pipe(0x7f7550414000 sd=8 :0 s=1 pgs=0 cs=0 l=0 c=0x7f75506dc200).fault

mon.g

2017-07-13 18:25:39.311329 7f167ee2c700  4 mon.g@2(peon).mgr e2 active server: -(4104)
..
2017-07-13 18:30:33.435350 7f167ee2c700 10 mon.g@2(peon) e1 ms_handle_reset 0x7f169327a900 172.21.15.24:0/981164730
2017-07-13 18:30:33.435362 7f167ee2c700 10 mon.g@2(peon) e1 reset/close on session client.4104 172.21.15.24:0/981164730
2017-07-13 18:30:33.435370 7f167ee2c700 10 mon.g@2(peon) e1 remove_session 0x7f16925d8480 client.4104 172.21.15.24:0/981164730 features 0xffddff8eea4fffb

2017-07-13 18:30:33.369089 7f167ee2c700 10 mon.g@2(peon).paxosservice(mgr 1..4)  discarding message from disconnected client client.4104 172.21.15.24:0/981164730 mgrbeacon mgr.x(a33ec6ea-dfe0-400c-93d5-8d56f813c9c0,4104, 172.21.15.24:6800/195140, 1) v3

2017-07-13 18:34:01.942682 7f1681631700  4 mon.g@2(leader).mgr e4 Dropping active0
2017-07-13 18:34:01.942685 7f1681631700  4 mon.g@2(leader).mgr e4 Active is laggy but have no standbys to replace it
2017-07-13 18:34:01.942687 7f1681631700 10 mon.g@2(leader).mgr e4  exceeded mon_mgr_mkfs_grace 60 seconds
2017-07-13 18:34:01.942689 7f1681631700 10 mon.g@2(leader).paxosservice(mgr 1..4) propose_pending

2017-07-13 18:34:01.942757 7f1681631700  0 log_channel(cluster) log [WRN] : Health check failed: no active mgr (MGR_DOWN)

2017-07-13 18:34:22.040297 7f167ee2c700  1 -- 172.21.15.24:6790/0 <== mon.7 172.21.15.59:6792/0 304 ==== forward(mgrbeacon mgr.x(a33ec6ea-dfe0-400c-93d5-8d56f813c9c0,94138, -, 0) v3 caps allow * tid 33 con_features 1152323339925389307) v3 ==== 307+0+0 (1330286810 0 0) 0x7f1692dc8000 con 0x7f16921b2f00
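
The beacons quoted in the mon.g log differ in gid, address, and the final flag (`4104, 172.21.15.24:6800/195140, 1` versus `94138, -, 0`). A minimal Python sketch for pulling those fields out of a log line when comparing beacons. The field names (`fsid`, `gid`, `addr`, `available`) are my reading of the printed form, not something the log states, and `parse_beacon` is a hypothetical helper, not part of Ceph.

```python
import re
from typing import NamedTuple, Optional


class Beacon(NamedTuple):
    name: str        # daemon name after "mgr."
    fsid: str        # assumed: cluster fsid
    gid: int         # assumed: global id of the mgr instance
    addr: str        # assumed: server address, "-" when none is given
    available: bool  # assumed: trailing 0/1 flag


# Matches "mgrbeacon mgr.<name>(<fsid>,<gid>, <addr>, <flag>)" as printed in the logs.
BEACON_RE = re.compile(
    r"mgrbeacon mgr\.(?P<name>\S+?)\("
    r"(?P<fsid>[0-9a-f-]+),\s*(?P<gid>\d+),\s*(?P<addr>\S+),\s*(?P<flag>[01])\)"
)


def parse_beacon(line: str) -> Optional[Beacon]:
    """Return the beacon fields found in a log line, or None if there is no beacon."""
    m = BEACON_RE.search(line)
    if m is None:
        return None
    return Beacon(m["name"], m["fsid"], int(m["gid"]), m["addr"], m["flag"] == "1")


old = parse_beacon(
    "mgrbeacon mgr.x(a33ec6ea-dfe0-400c-93d5-8d56f813c9c0,4104, 172.21.15.24:6800/195140, 1) v3"
)
new = parse_beacon(
    "forward(mgrbeacon mgr.x(a33ec6ea-dfe0-400c-93d5-8d56f813c9c0,94138, -, 0) v3 caps allow *"
)
print(old.gid, old.addr, old.available)
print(new.gid, new.addr, new.available)
```

Comparing `old.gid` and `new.gid` makes it easy to separate beacons from the original mgr instance from those sent with the later gid.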

mon.f

2017-07-13 18:30:35.062253 7f39cf22e700  1 -- 172.21.15.24:6789/0 <== mon.1 172.21.15.59:6789/0 711233028 ==== forward(mgrbeacon mgr.x(a33ec6ea-dfe0-400c-93d5-8d56f813c9c0,4104, 172.21.15.24:6800/195140, 1) v3 caps allow * tid 23 con_features 1152323339925389307) v3 ==== 295+0+0 (3561408359 0 0) 0x7f39e375ea80 con 0x7f39e34e5000
..
2017-07-13 18:30:35.062315 7f39cf22e700  4 mon.f@0(leader).mgr e4 beacon from 4104
..
2017-07-13 18:33:11.095775 7f39cf22e700 10 mon.f@0(leader).paxosservice(mgr 1..4) dispatch 0x7f39e3dc3e00 mgrbeacon mgr.x(a33ec6ea-dfe0-400c-93d5-8d56f813c9c0,4104, 172.21.15.24:6800/195140, 1) v3 from client.4104 172.21.15.24:0/981164730 con 0x7f39e3711000
..
2017-07-13 18:33:11.542530 7f39cb226700 10 _calc_signature seq 1973526481 front_crc_ = 2008202662 middle_crc = 0 data_crc = 0 sig = 412644305609923577
------ restarted.
2017-07-13 18:35:03.596725 7ff828b2be40  0 ceph version 12.1.0-956-g5f5ec76 (5f5ec7631ad0dfc4e378730e75572c0a1a065661) luminous (rc), process (unknown), pid 197457

I think the mgr somehow failed to reconnect to a monitor after being disconnected; the monitor it had been connected to was stopped by the mon thrash job. That is why the mon believed the mgr was laggy and dropped it.
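
To put rough numbers on this, the timestamps quoted above can be subtracted directly. A small sketch, assuming the log clocks are comparable (the quoted addresses put mgr.x, mon.f and mon.g all on 172.21.15.24). `elapsed_seconds` is a hypothetical helper, and the events picked are only the ones visible in the excerpts; the logs are elided, so later beacons may exist that are not shown here.

```python
from datetime import datetime

# Timestamp layout used by the log lines quoted in this report.
FMT = "%Y-%m-%d %H:%M:%S.%f"


def elapsed_seconds(start: str, end: str) -> float:
    """Seconds between two log timestamps given in the FMT layout."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds()


# mgr.x: two consecutive "send_beacon active" ticks.
beacon_interval = elapsed_seconds(
    "2017-07-13 18:33:19.097142", "2017-07-13 18:33:21.097582"
)

# mon.f: last beacon dispatch shown in the excerpt; mon.g: "Dropping active".
silence_before_drop = elapsed_seconds(
    "2017-07-13 18:33:11.095775", "2017-07-13 18:34:01.942682"
)

print(f"beacon interval: {beacon_interval:.3f} s")
print(f"last shown dispatch to drop: {silence_before_drop:.3f} s")
```

On these excerpts the mgr ticks about every 2 seconds, while roughly 50.8 seconds pass between the last dispatch shown in mon.f and the drop logged by mon.g.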

        ms inject delay max: 1
        ms inject delay probability: 0.005
        ms inject delay type: mon
        ms inject internal delays: 0.002
        ms inject socket failures: 2500

It seems that these ms settings delayed the transmission of messages in a destructive way.
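
For a feel of how often `ms inject socket failures: 2500` would fire, here is a rough simulation. It rests on an assumption that is mine and not stated in this report: that a value of N means each messenger socket operation independently has about a 1-in-N chance of an injected failure. `count_injected_failures` is a hypothetical helper, and the operation count is arbitrary.

```python
import random


def count_injected_failures(ops: int, one_in: int, rng: random.Random) -> int:
    """Count operations hit by an injected failure.

    Assumption: every operation fails independently with probability 1/one_in.
    """
    return sum(1 for _ in range(ops) if rng.randrange(one_in) == 0)


rng = random.Random(0)  # fixed seed so the run is repeatable
ops = 250_000           # arbitrary number of simulated socket operations
failures = count_injected_failures(ops, 2500, rng)

print(f"{failures} injected failures in {ops} operations")
print(f"expected on average: {ops / 2500:.0f}")
```

Under that assumption a busy connection sees resets fairly regularly, which is consistent with the `injecting socket failure` line in the mgr.x log, though it does not by itself explain the failed reconnect.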

/a/kchai-2017-07-13_18:13:10-rados-wip-kefu-testing-distro-basic-smithi/1396188


Related issues 1 (0 open, 1 closed)

Is duplicate of RADOS - Bug #20371: mgr: occasional fails to send beacons (monc reconnect backoff too aggressive?) - Resolved - Sage Weil - 06/21/2017

#1

Updated by Kefu Chai almost 7 years ago

  • Description updated (diff)
#2

Updated by Kefu Chai almost 7 years ago

  • Description updated (diff)
#3

Updated by Joao Eduardo Luis over 6 years ago

  • Related to Bug #20371: mgr: occasional fails to send beacons (monc reconnect backoff too aggressive?) added
#4

Updated by Joao Eduardo Luis over 6 years ago

  • Related to deleted (Bug #20371: mgr: occasional fails to send beacons (monc reconnect backoff too aggressive?))
#5

Updated by Joao Eduardo Luis over 6 years ago

  • Is duplicate of Bug #20371: mgr: occasional fails to send beacons (monc reconnect backoff too aggressive?) added
#6

Updated by Joao Eduardo Luis over 6 years ago

  • Status changed from New to Duplicate