Bug #1627

closed

ceph-mon memleak if ceph-osd cluster ip is not reachable, but public ip works

Added by Anonymous over 12 years ago. Updated about 12 years ago.

Status: Can't reproduce
Priority: Normal
Assignee: -
Category: Monitor
Target version: -
% Done: 0%


Description

acaos: there's a really bad one in the monitors too
acaos: (our monitors hit >20g in size and went belly up)
Tv_: sounds like a bug alright ;)
acaos: I suspect it may be related to the fact that one of our OSDs fell half-off the network (the cluster address did, but not the public)
Tv_: acaos: ooh interesting
acaos: however, we don't have the client_messenger/cluster_messenger fix from last week in
Tv_: we haven't tested failures other than fail-stop that much
Tv_: perhaps that left the osd in a half-alive state, and it still got messages queued for it
acaos: it was still able to communicate with the mon, but not the other osds
acaos: it was spam-killing the other osds
greglap: acaos: are you using cephx?
acaos: no, we are not
bchrisman: also that can screw up other nodes, as there's no throttling of repeering traffic
greglap: and yes, I could see a split death doing horrible things to memory on other nodes
acaos: the memory leak was before that split death
acaos: at least, the OSD one
acaos: the monitor one was after
greglap: yeah
greglap: the OSD one you're worried about is probably 2f04acb3ccc198076e37e4751cb71ea4fc6e6949
acaos: basically, it was doing stuff like this over and over: mon0 10.0.8.128:6789/0 28065 : [INF] osd166 10.0.10.11:6406/0 failed (by osd255 10.0.10.16:6415/0)
acaos: 10.0.10.16 is the one with the half-dead network
greglap: although actually 8c5cb598357ea452a07704554db27bb674efe21a might be relevant too
acaos: let me glance at those really quickly
acaos: would that pg leak fix in 2f04... happen in a no-failure case?
greglap: acaos: hmm, I don't actually remember
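
The repeated "failed (by ...)" message quoted above suggests one way to spot an OSD in this half-dead state: tally which OSD is filing the failure reports in the monitor's log. The following is a minimal sketch, not a Ceph tool. The regular expression is an assumption derived only from the single log line quoted in the chat, so the format may differ across Ceph versions, and the function name count_failure_reports is hypothetical.

```python
import re
from collections import Counter

# Assumption: this pattern is inferred from the one log line quoted in the
# chat above; real monitor logs may be formatted differently.
FAILED_RE = re.compile(
    r"\[INF\]\s+(?P<target>osd\d+)\s+(?P<target_addr>\S+)\s+failed\s+"
    r"\(by\s+(?P<reporter>osd\d+)\s+(?P<reporter_addr>[^)\s]+)\)"
)


def count_failure_reports(lines):
    """Count failure reports per (reporting OSD, reporter host IP)."""
    reporters = Counter()
    for line in lines:
        match = FAILED_RE.search(line)
        if match is None:
            continue  # not a failure-report line
        # Keep only the IP part of an address such as 10.0.10.16:6415/0.
        host = match.group("reporter_addr").split(":", 1)[0]
        reporters[(match.group("reporter"), host)] += 1
    return reporters


# The sample is the line quoted in the chat log.
sample = [
    "mon0 10.0.8.128:6789/0 28065 : [INF] osd166 10.0.10.11:6406/0 "
    "failed (by osd255 10.0.10.16:6415/0)",
]

for (osd, host), count in count_failure_reports(sample).most_common():
    print(osd, host, count)
```

A reporter that dominates the tally while its targets are otherwise healthy would be consistent with the situation described here, where the host at 10.0.10.16 could still reach the monitor but not the other OSDs.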
#1

Updated by Sage Weil about 12 years ago

  • Status changed from New to Need More Info
#2

Updated by Sage Weil about 12 years ago

  • Status changed from Need More Info to Can't reproduce