Bug #3906
closed
ceph-mon leaks memory during peering
Added by Faidon Liambotis over 11 years ago.
Updated over 11 years ago.
Description
I've done multiple OSD swaps with both 0.55 and 0.56/0.56.1 on a cluster with > 16k PGs. During those, I've noticed several times that ceph-mon tends to leak a significant amount of memory during the peering process (but is otherwise stable). I've seen a few OSD additions ramp up the memory in graphs pretty quickly (up to tens of gigabytes), only for ceph-mon to be killed by the OOM killer soon after.
I'm afraid I don't have more data on this as the OOM has been faster than I have been, but I'd guess it's easily reproducible.
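Since the OOM killer tends to win the race, one way to collect data for a report like this is to log the monitor's resident memory to disk at a short interval while peering runs. The following is only a minimal sketch, assuming a Linux host where /proc/<pid>/status is readable; the function names, the interval and the output path are hypothetical and not part of Ceph.

```python
import re
import time


def parse_vmrss_kib(status_text):
    """Extract the VmRSS value (in KiB) from /proc/<pid>/status text.

    Returns None if the field is missing (e.g. for kernel threads).
    """
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def read_rss_kib(pid):
    """Read the current resident set size of a process in KiB."""
    with open(f"/proc/{pid}/status") as status_file:
        return parse_vmrss_kib(status_file.read())


def log_rss(pid, out_path, interval_s=5.0):
    """Append 'unix_timestamp rss_kib' lines until the process exits.

    Each line is flushed immediately, so the samples taken before an
    OOM kill are still on disk afterwards.
    """
    with open(out_path, "a") as out:
        while True:
            try:
                rss = read_rss_kib(pid)
            except (FileNotFoundError, ProcessLookupError):
                break  # the process is gone (possibly OOM-killed)
            if rss is not None:
                out.write(f"{int(time.time())} {rss}\n")
                out.flush()
            time.sleep(interval_s)
```

Calling log_rss with the ceph-mon PID and an output file before adding OSDs would give a timeline of memory growth that can be lined up against the peering window.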
Additionally, I've seen long-running "ceph -w" instances also leak memory (up to 15GB!), but it's so minor I don't think it warrants a separate bug report (though I'd be happy to file one).
- Priority changed from Normal to High
I believe this to be related to #3609
- Status changed from New to 12
- Assignee set to Sage Weil
- Priority changed from High to Urgent
We need to reproduce this on a large internal cluster, with many OSDs and even more PGs.
The logs indicate this may be related to failed auth connection attempts spamming the monitor.
- Status changed from 12 to Won't Fix
This isn't something that's worth dealing with on the monitor side right now.
So, today I upgraded my whole cluster to 0.56.2, then added a bunch more OSDs (from 84 -> 144). At peering time monitor traffic spiked to 1Gbps and stayed there for ~5mins. At the same time, memory spiked from 1.4G to 13G.
There were re-elections at the time, and the second monitor shows similar numbers in network traffic and RAM usage, so the load probably fell back to it, possibly because the first one became unresponsive due to load or packet loss (the box had a single 1 Gbps link and there are signs it was completely saturated).
Memory didn't free up even after the cluster settled down. Nothing unusual in the logs but they weren't in debugging mode.
I think your assumption was that it was the failed auth connection attempts and I think those are supposed to be fixed in 0.56.2, so maybe one of the two assumptions is wrong...