Bug #3906 (closed): ceph-mon leaks memory during peering

Added by Faidon Liambotis over 11 years ago. Updated about 11 years ago.

Status: Won't Fix
Priority: Urgent
Assignee:
Category: Monitor
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I've done multiple OSD swaps with both 0.55 & 0.56/0.56.1 on a cluster with > 16k PGs. During those, I've noticed multiple times that ceph-mon tends to leak a significant amount of memory during the peering process (but is otherwise stable). I've seen a few OSD additions ramp up its memory usage in graphs pretty quickly (up to tens of gigabytes), only for ceph-mon to be killed by the OOM killer soon after.

I'm afraid I don't have more data on this, as the OOM killer has been faster than I have, but I'd guess it's easily reproducible.

Additionally, I've seen long-running "ceph -w" instances also leak memory (up to 15GB!), but it seems minor enough that I don't think it warrants a separate bug report (though I'd be happy to file one).
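
Since the OOM killer tends to win the race, a small watcher that logs RSS over time could capture the ramp-up described above before the process is killed. This is a minimal sketch, not part of the original report; it assumes a Linux /proc filesystem and takes the process name to watch (e.g. ceph-mon, or ceph for a long-running "ceph -w") as an argument:

```python
#!/usr/bin/env python
# Sketch: periodically print the resident set size of every process with a
# given name, so a memory ramp during peering is recorded before an OOM kill.
import os
import sys
import time


def rss_kb(pid):
    """Return VmRSS in kB for a pid, or None if the process has exited."""
    try:
        with open("/proc/%d/status" % pid) as f:
            for line in f:
                if line.startswith("VmRSS:"):
                    return int(line.split()[1])
    except IOError:
        return None
    return None


def pids_by_name(name):
    """Find pids whose /proc/<pid>/comm matches name (e.g. 'ceph-mon')."""
    pids = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open("/proc/%s/comm" % entry) as f:
                if f.read().strip() == name:
                    pids.append(int(entry))
        except IOError:
            pass
    return pids


if __name__ == "__main__":
    name = sys.argv[1] if len(sys.argv) > 1 else "ceph-mon"
    while True:
        for pid in pids_by_name(name):
            kb = rss_kb(pid)
            if kb is not None:
                print("%s pid=%d rss=%.1f MB"
                      % (time.strftime("%H:%M:%S"), pid, kb / 1024.0))
        time.sleep(5)
```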

Actions #1

Updated by Sage Weil over 11 years ago

  • Priority changed from Normal to High

Actions #2

Updated by Joao Eduardo Luis over 11 years ago

I believe this to be related to #3609.

Actions #3

Updated by Sage Weil over 11 years ago

  • Status changed from New to 12
  • Assignee set to Sage Weil
  • Priority changed from High to Urgent

We need to reproduce this on a large internal cluster, with many OSDs and even more PGs.
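
Below is a rough sketch, not from the thread, of one way to push a test cluster toward the reporter's scale of PGs before adding OSDs; the pool names and pg_num values are illustrative assumptions:

```python
#!/usr/bin/env python
# Sketch: create a few pools with a large pg_num so the cluster ends up with
# roughly 16k PGs, then add OSDs while watching ceph-mon's memory usage.
import subprocess

POOLS = ["leaktest-%d" % i for i in range(4)]  # hypothetical pool names
PG_NUM = 4096  # 4 pools x 4096 PGs ~= 16k PGs, matching the reporter's scale

for pool in POOLS:
    # 'ceph osd pool create <name> <pg_num>' is the standard pool-creation command.
    subprocess.check_call(["ceph", "osd", "pool", "create", pool, str(PG_NUM)])

print("Created %d pools with %d PGs each; now add OSDs and watch the mon's RSS."
      % (len(POOLS), PG_NUM))
```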

Actions #4

Updated by Sage Weil over 11 years ago

The logs indicate this may be related to failed auth connection attempts spamming the monitor.

Actions #5

Updated by Sage Weil about 11 years ago

  • Status changed from 12 to Won't Fix

This isn't something that's worth dealing with on the monitor side right now.

Actions #6

Updated by Faidon Liambotis about 11 years ago

So, today I upgraded my whole cluster to 0.56.2, then added a bunch more OSDs (from 84 to 144). At peering time, monitor traffic spiked to 1 Gbps and stayed there for ~5 minutes. At the same time, memory spiked from 1.4 GB to 13 GB.

There were re-elections at the time, and the second monitor also shows similar numbers for network traffic and RAM usage, so the load probably fell back to it, possibly because the first one was unresponsive due to load or packet loss (the box had a single Gbps link and there are signs it got completely saturated).

Memory didn't free up even after the cluster settled down. There was nothing unusual in the logs, but they weren't in debug mode.

I think your assumption was that it was the failed auth connection attempts, and I think those are supposed to be fixed in 0.56.2, so maybe one of the two assumptions is wrong...
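
Since memory was not freed even after the cluster settled, one way to tell whether the retained memory is live allocations or pages the allocator is still holding is the tcmalloc-backed heap commands. This is a minimal sketch, assuming ceph-mon is built with tcmalloc and that the "heap" tell commands are available in the running version; the mon id "a" is a placeholder:

```python
#!/usr/bin/env python
# Sketch: dump tcmalloc heap stats for a monitor, ask it to release free
# pages back to the OS, then dump stats again to compare.
import subprocess

MON_ID = "a"  # placeholder monitor id; substitute the real one


def heap_cmd(cmd):
    """Run 'ceph tell mon.<id> heap <cmd>' and return its output as text."""
    out = subprocess.check_output(["ceph", "tell", "mon.%s" % MON_ID, "heap", cmd])
    return out.decode("utf-8", "replace")


print(heap_cmd("stats"))   # tcmalloc breakdown: bytes in use vs. on freelists
heap_cmd("release")        # ask tcmalloc to hand free pages back to the OS
print(heap_cmd("stats"))   # compare with the first output: did usage drop?
```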
