Bug #3906 (closed): ceph-mon leaks memory during peering

Added by Faidon Liambotis over 11 years ago. Updated about 11 years ago.

Status: Won't Fix
Priority: Urgent
Assignee:
Category: Monitor
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I've done multiple OSD swaps with both 0.55 & 0.56/0.56.1 on a cluster with > 16k PGs. During those, I've noticed multiple times that ceph-mon tends to leak a significant amount of memory during the peering process (but is otherwise stable). I've seen a few OSD additions ramp up its memory usage in graphs pretty quickly (up to tens of gigabytes), only for ceph-mon to be killed by the OOM killer soon after.

I'm afraid I don't have more data on this, as the OOM killer has been faster than I have, but I'd guess it's easily reproducible.

Additionally, I've seen long-running "ceph -w" instances also leak memory (up to 15GB!), but it seems minor enough that I don't think it warrants a separate bug report (though I'd be happy to file one).
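
Since the OOM killer tends to win the race, a small watcher that logs RSS over time could capture the ramp-up described above before the process is killed. This is a minimal sketch, not part of the original report; it assumes a Linux /proc filesystem and takes the process name to watch (e.g. ceph-mon, or ceph for a long-running "ceph -w") as an argument:

```python
#!/usr/bin/env python
# Sketch: periodically print the resident set size of every process with a
# given name, so a memory ramp during peering is recorded before an OOM kill.
import os
import sys
import time


def rss_kb(pid):
    """Return VmRSS in kB for a pid, or None if the process has exited."""
    try:
        with open("/proc/%d/status" % pid) as f:
            for line in f:
                if line.startswith("VmRSS:"):
                    return int(line.split()[1])
    except IOError:
        return None
    return None


def pids_by_name(name):
    """Find pids whose /proc/<pid>/comm matches name (e.g. 'ceph-mon')."""
    pids = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open("/proc/%s/comm" % entry) as f:
                if f.read().strip() == name:
                    pids.append(int(entry))
        except IOError:
            pass
    return pids


if __name__ == "__main__":
    name = sys.argv[1] if len(sys.argv) > 1 else "ceph-mon"
    while True:
        for pid in pids_by_name(name):
            kb = rss_kb(pid)
            if kb is not None:
                print("%s pid=%d rss=%.1f MB"
                      % (time.strftime("%H:%M:%S"), pid, kb / 1024.0))
        time.sleep(5)
```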

Actions #1

Updated by Sage Weil over 11 years ago

  • Priority changed from Normal to High

Actions #2

Updated by Joao Eduardo Luis over 11 years ago

I believe this to be related to #3609.

Actions #3

Updated by Sage Weil over 11 years ago

  • Status changed from New to 12
  • Assignee set to Sage Weil
  • Priority changed from High to Urgent

We need to reproduce this on a large internal cluster, with many OSDs and even more PGs.
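
Below is a rough sketch, not from the thread, of one way to push a test cluster toward the reporter's scale of PGs before adding OSDs; the pool names and pg_num values are illustrative assumptions:

```python
#!/usr/bin/env python
# Sketch: create a few pools with a large pg_num so the cluster ends up with
# roughly 16k PGs, then add OSDs while watching ceph-mon's memory usage.
import subprocess

POOLS = ["leaktest-%d" % i for i in range(4)]  # hypothetical pool names
PG_NUM = 4096  # 4 pools x 4096 PGs ~= 16k PGs, matching the reporter's scale

for pool in POOLS:
    # 'ceph osd pool create <name> <pg_num>' is the standard pool-creation command.
    subprocess.check_call(["ceph", "osd", "pool", "create", pool, str(PG_NUM)])

print("Created %d pools with %d PGs each; now add OSDs and watch the mon's RSS."
      % (len(POOLS), PG_NUM))
```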

Actions #4

Updated by Sage Weil over 11 years ago

The logs indicate this may be related to failed auth connection attempts spamming the monitor.

Actions #5

Updated by Sage Weil about 11 years ago

  • Status changed from 12 to Won't Fix

This isn't something that's worth dealing with on the monitor side right now.

Actions #6

Updated by Faidon Liambotis about 11 years ago

So, today I upgraded my whole cluster to 0.56.2, then added a bunch more OSDs (from 84 to 144). At peering time, monitor traffic spiked to 1 Gbps and stayed there for ~5 minutes. At the same time, memory spiked from 1.4 GB to 13 GB.

There were re-elections at the time, and the second monitor also shows similar numbers for network traffic and RAM usage, so the load probably fell back to it, possibly because the first one was unresponsive due to load or packet loss (the box had a single Gbps link and there are signs it got completely saturated).

Memory didn't free up even after the cluster settled down. There was nothing unusual in the logs, but they weren't in debug mode.

I think your assumption was that it was the failed auth connection attempts, and I think those are supposed to be fixed in 0.56.2, so maybe one of the two assumptions is wrong...
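
Since memory was not freed even after the cluster settled, one way to tell whether the retained memory is live allocations or pages the allocator is still holding is the tcmalloc-backed heap commands. This is a minimal sketch, assuming ceph-mon is built with tcmalloc and that the "heap" tell commands are available in the running version; the mon id "a" is a placeholder:

```python
#!/usr/bin/env python
# Sketch: dump tcmalloc heap stats for a monitor, ask it to release free
# pages back to the OS, then dump stats again to compare.
import subprocess

MON_ID = "a"  # placeholder monitor id; substitute the real one


def heap_cmd(cmd):
    """Run 'ceph tell mon.<id> heap <cmd>' and return its output as text."""
    out = subprocess.check_output(["ceph", "tell", "mon.%s" % MON_ID, "heap", cmd])
    return out.decode("utf-8", "replace")


print(heap_cmd("stats"))   # tcmalloc breakdown: bytes in use vs. on freelists
heap_cmd("release")        # ask tcmalloc to hand free pages back to the OS
print(heap_cmd("stats"))   # compare with the first output: did usage drop?
```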
