Bug #3275

closed

Monitors unable to recover after network line card replacement

Added by JuanJose Galvez over 11 years ago. Updated about 11 years ago.

Status: Can't reproduce
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: Support

Description

Around the time that several line cards were replaced, the Ceph monitors stopped working and were unable to recover from the network event on their own. The monitors returned to normal after being restarted. One was OOM-killed shortly afterward and had to be started once more.

2012-10-05 04:28:25.573948 mon.0 [2607:f298:4:2243::5752]:6789/0 2787133 : [INF] mon.peon5752 calling new monitor election
2012-10-05 04:28:31.257698 mon.0 [2607:f298:4:2243::5752]:6789/0 2787134 : [INF] mon.peon5752@0 won leader election with quorum 0,2
2012-10-05 04:28:32.148894 mon.0 [2607:f298:4:2243::5752]:6789/0 2787135 : [INF] mdsmap e1: 0/0/1 up
2012-10-05 04:28:32.192594 mon.0 [2607:f298:4:2243::5752]:6789/0 2787136 : [INF] osdmap e230741: 890 osds: 887 up, 887 in
2012-10-05 04:28:32.504294 mon.0 [2607:f298:4:2243::5752]:6789/0 2787137 : [INF] monmap e9: 3 mons at {peon5752=[2607:f298:4:2243::5752]:6789/0,peon5753=[2607:f298:4:2243::5753]:6789/0,peon5754=[2607:f298:4:2243::5754]:6789/0}
2012-10-05 04:27:13.426198 mon.2 [2607:f298:4:2243::5754]:6789/0 157 : [INF] mon.peon5754 calling new monitor election
2012-10-05 04:27:23.809158 mon.2 [2607:f298:4:2243::5754]:6789/0 158 : [INF] mon.peon5754 calling new monitor election
2012-10-05 04:28:07.688839 mon.2 [2607:f298:4:2243::5754]:6789/0 159 : [INF] mon.peon5754 calling new monitor election
2012-10-05 04:28:35.863162 mon.0 [2607:f298:4:2243::5752]:6789/0 2787138 : [INF] pgmap v10558653: 133128 pgs: 133128 active+clean; 21779 GB data, 77777 GB used, 2120 TB / 2196 TB avail
2012-10-05 04:28:44.458902 mon.0 [2607:f298:4:2243::5752]:6789/0 2787139 : [INF] pgmap v10558654: 133128 pgs: 133128 active+clean; 21779 GB data, 77777 GB used, 2120 TB / 2196 TB avail

[4849886.685087] Out of memory: Kill process 8743 (ceph-mon) score 818 or sacrifice child
[4849886.685150] Killed process 8743 (ceph-mon) total-vm:10904880kB, anon-rss:2678320kB, file-rss:0kB

Several graphs are attached. I also saved a copy of the cluster log in case it is useful.
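
For anyone going through the saved cluster log, a minimal Python sketch like the one below could tally how often each monitor called a new election. It assumes every entry follows the same layout as the lines quoted above (timestamp, source, address, sequence number, level, message). The names `LINE_RE`, `parse_line`, and `election_calls` are made up for this illustration and are not part of Ceph.

```python
import re
from collections import Counter

# Assumed layout, based only on the excerpt in this report:
# "<date> <time> mon.N <addr> <seq> : [LEVEL] <message>"
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) "
    r"(?P<source>mon\.\d+) (?P<addr>\S+) (?P<seq>\d+) : "
    r"\[(?P<level>\w+)\] (?P<msg>.*)$"
)


def parse_line(line):
    """Return the fields of one cluster log line as a dict, or None if it does not match."""
    match = LINE_RE.match(line.strip())
    return match.groupdict() if match else None


def election_calls(lines):
    """Count 'calling new monitor election' messages per source monitor."""
    counts = Counter()
    for line in lines:
        entry = parse_line(line)
        if entry and "calling new monitor election" in entry["msg"]:
            counts[entry["source"]] += 1
    return counts


# Lines copied from the excerpt in this report.
SAMPLE = [
    "2012-10-05 04:28:25.573948 mon.0 [2607:f298:4:2243::5752]:6789/0 2787133 : [INF] mon.peon5752 calling new monitor election",
    "2012-10-05 04:28:31.257698 mon.0 [2607:f298:4:2243::5752]:6789/0 2787134 : [INF] mon.peon5752@0 won leader election with quorum 0,2",
    "2012-10-05 04:27:13.426198 mon.2 [2607:f298:4:2243::5754]:6789/0 157 : [INF] mon.peon5754 calling new monitor election",
    "2012-10-05 04:27:23.809158 mon.2 [2607:f298:4:2243::5754]:6789/0 158 : [INF] mon.peon5754 calling new monitor election",
    "2012-10-05 04:28:07.688839 mon.2 [2607:f298:4:2243::5754]:6789/0 159 : [INF] mon.peon5754 calling new monitor election",
]

if __name__ == "__main__":
    for source, count in sorted(election_calls(SAMPLE).items()):
        print(source, count)
```

To run it over the full log, replace `SAMPLE` with the lines of the saved file. Lines that do not match the assumed layout are skipped rather than raising an error.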


Files

event-net.png (66.3 KB) JuanJose Galvez, 10/08/2012 04:59 PM
event-mem.png (62.6 KB) JuanJose Galvez, 10/08/2012 04:59 PM
event-cpu.png (68.9 KB) JuanJose Galvez, 10/08/2012 04:59 PM
event-proc.png (30.2 KB) JuanJose Galvez, 10/08/2012 04:59 PM