Bug #20624
"cluster [WRN] Health check failed: no active mgr (MGR_DOWN)" in cluster log
Status: Duplicate
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
mgr.x
2017-07-13 18:33:19.097142 7f7539f55700 10 mgr tick tick
2017-07-13 18:33:19.097154 7f7539f55700 1 mgr send_beacon active
2017-07-13 18:33:19.097228 7f7539f55700 10 mgr send_beacon sending beacon as gid 4104 modules dashboard,restful,status,zabbix
2017-07-13 18:33:19.097248 7f7539f55700 1 -- 172.21.15.24:0/981164730 --> 172.21.15.59:6791/0 -- mgrbeacon mgr.x(a33ec6ea-dfe0-400c-93d5-8d56f813c9c0,4104, 172.21.15.24:6800/195140, 1) v3 -- ?+0 0x7f7550983400 con 0x7f755078d200
...
2017-07-13 18:33:19.300288 7f7545762700 0 -- 172.21.15.24:0/981164730 >> 172.21.15.59:6791/0 pipe(0x7f7550414000 sd=8 :60336 s=2 pgs=74 cs=1 l=1 c=0x7f755078d200).injecting socket failure
2017-07-13 18:33:19.302524 7f753e260700 1 -- 172.21.15.24:0/981164730 mark_down 0x7f755078d200 -- pipe dne
2017-07-13 18:33:19.302672 7f753e260700 1 -- 172.21.15.24:0/981164730 --> 172.21.15.24:6790/0 -- auth(proto 0 26 bytes epoch 1) v1 -- ?+0 0x7f7550985c00 con 0x7f7550686200
2017-07-13 18:33:19.302685 7f753e260700 1 -- 172.21.15.24:0/981164730 --> 172.21.15.24:6792/0 -- auth(proto 0 26 bytes epoch 1) v1 -- ?+0 0x7f7550985980 con 0x7f754edd4a00
2017-07-13 18:33:19.302708 7f753e260700 0 client.0 ms_handle_reset on 172.21.15.59:6791/0
2017-07-13 18:33:19.302712 7f753e260700 0 client.0 ms_handle_reset on 172.21.15.59:6791/0
2017-07-13 18:33:19.305209 7f753ba5b700 1 -- 172.21.15.24:0/981164730 >> 172.21.15.24:6790/0 pipe(0x7f75508fa000 sd=32 :54300 s=2 pgs=205 cs=1 l=1 c=0x7f7550686200).setting up a delay queue on Pipe 0x7f75508fa000
2017-07-13 18:33:19.309341 7f7545762700 1 -- 172.21.15.24:0/981164730 >> 172.21.15.24:6792/0 pipe(0x7f75508fc800 sd=8 :57656 s=2 pgs=158 cs=1 l=1 c=0x7f754edd4a00).setting up a delay queue on Pipe 0x7f75508fc800
...
2017-07-13 18:33:21.097582 7f7539f55700 10 mgr tick tick
2017-07-13 18:33:21.097593 7f7539f55700 1 mgr send_beacon active
2017-07-13 18:33:21.097671 7f7539f55700 10 mgr send_beacon sending beacon as gid 4104 modules dashboard,restful,status,zabbix
2017-07-13 18:33:21.097682 7f7539f55700 10 mgr tick
...
2017-07-13 18:33:21.885585 7f753ca5d700 1 -- 172.21.15.24:0/981164730 mark_down 0x7f754edd4a00 -- 0x7f75508fc800
2017-07-13 18:33:21.885634 7f753ca5d700 1 -- 172.21.15.24:0/981164730 mark_down 0x7f7550686200 -- 0x7f75508fa000
2017-07-13 18:33:21.885910 7f753ca5d700 1 -- 172.21.15.24:0/981164730 --> 172.21.15.24:6789/0 -- auth(proto 0 26 bytes epoch 1) v1 -- ?+0 0x7f7550983b80 con 0x7f75506dd000
2017-07-13 18:33:21.885926 7f753ca5d700 1 -- 172.21.15.24:0/981164730 --> 172.21.15.59:6793/0 -- auth(proto 0 26 bytes epoch 1) v1 -- ?+0 0x7f7550983900 con 0x7f75506dc200
2017-07-13 18:33:21.887931 7f753b95a700 0 -- 172.21.15.24:0/981164730 >> 172.21.15.24:6789/0 pipe(0x7f7550834800 sd=32 :0 s=1 pgs=0 cs=0 l=0 c=0x7f75506dd000).fault
2017-07-13 18:33:21.888159 7f7545762700 0 -- 172.21.15.24:0/981164730 >> 172.21.15.59:6793/0 pipe(0x7f7550414000 sd=8 :0 s=1 pgs=0 cs=0 l=0 c=0x7f75506dc200).fault
mon.g
2017-07-13 18:25:39.311329 7f167ee2c700 4 mon.g@2(peon).mgr e2 active server: -(4104)
..
2017-07-13 18:30:33.435350 7f167ee2c700 10 mon.g@2(peon) e1 ms_handle_reset 0x7f169327a900 172.21.15.24:0/981164730
2017-07-13 18:30:33.435362 7f167ee2c700 10 mon.g@2(peon) e1 reset/close on session client.4104 172.21.15.24:0/981164730
2017-07-13 18:30:33.435370 7f167ee2c700 10 mon.g@2(peon) e1 remove_session 0x7f16925d8480 client.4104 172.21.15.24:0/981164730 features 0xffddff8eea4fffb
2017-07-13 18:30:33.369089 7f167ee2c700 10 mon.g@2(peon).paxosservice(mgr 1..4) discarding message from disconnected client client.4104 172.21.15.24:0/981164730 mgrbeacon mgr.x(a33ec6ea-dfe0-400c-93d5-8d56f813c9c0,4104, 172.21.15.24:6800/195140, 1) v3
2017-07-13 18:34:01.942682 7f1681631700 4 mon.g@2(leader).mgr e4 Dropping active0
2017-07-13 18:34:01.942685 7f1681631700 4 mon.g@2(leader).mgr e4 Active is laggy but have no standbys to replace it
2017-07-13 18:34:01.942687 7f1681631700 10 mon.g@2(leader).mgr e4 exceeded mon_mgr_mkfs_grace 60 seconds
2017-07-13 18:34:01.942689 7f1681631700 10 mon.g@2(leader).paxosservice(mgr 1..4) propose_pending
2017-07-13 18:34:01.942757 7f1681631700 0 log_channel(cluster) log [WRN] : Health check failed: no active mgr (MGR_DOWN)
2017-07-13 18:34:22.040297 7f167ee2c700 1 -- 172.21.15.24:6790/0 <== mon.7 172.21.15.59:6792/0 304 ==== forward(mgrbeacon mgr.x(a33ec6ea-dfe0-400c-93d5-8d56f813c9c0,94138, -, 0) v3 caps allow * tid 33 con_features 1152323339925389307) v3 ==== 307+0+0 (1330286810 0 0) 0x7f1692dc8000 con 0x7f16921b2f00
mon.f
2017-07-13 18:30:35.062253 7f39cf22e700 1 -- 172.21.15.24:6789/0 <== mon.1 172.21.15.59:6789/0 711233028 ==== forward(mgrbeacon mgr.x(a33ec6ea-dfe0-400c-93d5-8d56f813c9c0,4104, 172.21.15.24:6800/195140, 1) v3 caps allow * tid 23 con_features 1152323339925389307) v3 ==== 295+0+0 (3561408359 0 0) 0x7f39e375ea80 con 0x7f39e34e5000
..
2017-07-13 18:30:35.062315 7f39cf22e700 4 mon.f@0(leader).mgr e4 beacon from 4104
..
2017-07-13 18:33:11.095775 7f39cf22e700 10 mon.f@0(leader).paxosservice(mgr 1..4) dispatch 0x7f39e3dc3e00 mgrbeacon mgr.x(a33ec6ea-dfe0-400c-93d5-8d56f813c9c0,4104, 172.21.15.24:6800/195140, 1) v3 from client.4104 172.21.15.24:0/981164730 con 0x7f39e3711000
..
2017-07-13 18:33:11.542530 7f39cb226700 10 _calc_signature seq 1973526481 front_crc_ = 2008202662 middle_crc = 0 data_crc = 0 sig = 412644305609923577
------ restarted.
2017-07-13 18:35:03.596725 7ff828b2be40 0 ceph version 12.1.0-956-g5f5ec76 (5f5ec7631ad0dfc4e378730e75572c0a1a065661) luminous (rc), process (unknown), pid 197457
I think the mgr somehow failed to reconnect to a monitor after being disconnected. The monitor it had been connected to was stopped by the mon thrash job, so its beacons stopped reaching the leader. That's why the mon believed the mgr was laggy and dropped it.
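To make the suspected sequence concrete, here is a toy model of beacon-based liveness tracking. It is only a sketch, not Ceph's MgrMonitor code: the class `BeaconTracker`, its methods, and the 30-second grace value used in the usage note are all hypothetical names and numbers chosen for illustration. It also ignores whether a standby's own beacons are fresh.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class BeaconTracker:
    """Toy model (hypothetical, not Ceph code) of a monitor tracking mgr beacons."""
    grace: float                                  # illustrative grace period, in seconds
    active_gid: Optional[int] = None
    standbys: List[int] = field(default_factory=list)
    last_beacon: Dict[int, float] = field(default_factory=dict)

    def on_beacon(self, gid: int, now: float) -> None:
        # Record the arrival time; the first mgr seen becomes active,
        # later ones are queued as standbys.
        self.last_beacon[gid] = now
        if self.active_gid is None:
            self.active_gid = gid
        elif gid != self.active_gid and gid not in self.standbys:
            self.standbys.append(gid)

    def tick(self, now: float) -> str:
        # Periodic check: is the active mgr's last beacon older than the grace period?
        if self.active_gid is None:
            return "MGR_DOWN"
        age = now - self.last_beacon[self.active_gid]
        if age <= self.grace:
            return "OK"
        # The active mgr is considered laggy: drop it and promote a standby if any.
        del self.last_beacon[self.active_gid]
        if self.standbys:
            self.active_gid = self.standbys.pop(0)
            return "FAILOVER"
        self.active_gid = None
        return "MGR_DOWN"
```

With a single mgr and no standbys, as in this run, a beacon at t=0 followed by a `tick` after the grace period has elapsed yields `MGR_DOWN` even though the mgr process itself is still alive and still trying to send beacons.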
ms inject delay max: 1
ms inject delay probability: 0.005
ms inject delay type: mon
ms inject internal delays: 0.002
ms inject socket failures: 2500
It seems that these ms settings delayed the transmission of messages in a destructive way.
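As a rough back-of-the-envelope check on how often such injection fires, the sketch below assumes that `ms inject socket failures: N` gives each message an independent 1-in-N chance of triggering an injected failure. That reading is an assumption made for illustration, and the function name is hypothetical.

```python
def prob_at_least_one_failure(one_in_n: int, messages: int) -> float:
    """Probability of at least one injected failure over `messages` messages.

    Assumes (for illustration only) that each message independently
    triggers a failure with probability 1 / one_in_n.
    """
    if one_in_n <= 0:
        raise ValueError("one_in_n must be positive")
    if messages < 0:
        raise ValueError("messages must be non-negative")
    p_single = 1.0 / one_in_n
    # Complement of "no failure on any of the messages".
    return 1.0 - (1.0 - p_single) ** messages


if __name__ == "__main__":
    for count in (100, 2500, 10000):
        print(count, round(prob_at_least_one_failure(2500, count), 3))
```

Under that assumption, a connection that carries a few thousand messages is more likely than not to see at least one injected failure, so reconnect handling gets exercised frequently in this kind of run.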
/a/kchai-2017-07-13_18:13:10-rados-wip-kefu-testing-distro-basic-smithi/1396188
Updated by Joao Eduardo Luis over 6 years ago
- Related to Bug #20371: mgr: occasional fails to send beacons (monc reconnect backoff too aggressive?) added
Updated by Joao Eduardo Luis over 6 years ago
- Related to deleted (Bug #20371: mgr: occasional fails to send beacons (monc reconnect backoff too aggressive?))
Updated by Joao Eduardo Luis over 6 years ago
- Is duplicate of Bug #20371: mgr: occasional fails to send beacons (monc reconnect backoff too aggressive?) added
Updated by Joao Eduardo Luis over 6 years ago
- Status changed from New to Duplicate