Bug #1627
ceph-mon memleak if ceph-osd cluster ip is not reachable, but public ip works
Status: Can't reproduce
Priority: Normal
Assignee: -
Category: Monitor
Target version: -
% Done: 0%
Description
acaos: there's a really bad one in the monitors too
acaos: (our monitors hit >20g in size and went belly up)
Tv_: sounds like a bug alright ;)
acaos: I suspect it may be related to the fact that one of our OSDs fell half-off the network (the cluster address did, but not the public)
Tv_: acaos: ooh interesting
acaos: however, we don't have the client_messenger/cluster_messenger fix from last week in
Tv_: we haven't tested failures other than fail-stop that much
Tv_: perhaps that left it the osd in a half-alive state, and it still got messages queued for it
acaos: it was still able to communicate with the mon, but not the other osds
acaos: it was spam-killing the other osds
greglap: acaos: are you using cephx?
acaos: no, we are not
bchrisman: also that can screw up other nodes, as there's no throttling of repeering traffic
greglap: and yes, I could see a split death doing horrible things to memory on other nodes
acaos: the memory leak was before that split death
acaos: at least, the OSD one
acaos: the monitor one was after
greglap: yeah
greglap: the OSD one you're worried about is probably 2f04acb3ccc198076e37e4751cb71ea4fc6e6949
acaos: basically, it was doing stuff like this over and over:
mon0 10.0.8.128:6789/0 28065 : [INF] osd166 10.0.10.11:6406/0 failed (by osd255 10.0.10.16:6415/0)
acaos: 10.0.10.16 is the one with the half-dead network
greglap: although actually 8c5cb598357ea452a07704554db27bb674efe21a might be relevant too
acaos: let me glance at those really quickly
acaos: would that pg leak fix in 2f04... happen in a no-failure case?
greglap: acaos: hmm, I don't actually remember
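The repeated "failed (by ...)" monitor log entry quoted in the chat suggests one way to spot a half-dead host: tally failure reports by the address of the reporting OSD. The sketch below is only an illustration. The regular expression is inferred from the single sample line in this report, and the helper name `count_reports_by_host` is hypothetical; real monitor logs may use other layouts.

```python
import re
from collections import Counter

# Pattern for the failure-report line quoted in the chat log. The field
# layout is an assumption inferred from that one sample line.
FAILURE_RE = re.compile(
    r"\[INF\] osd(?P<target>\d+) (?P<target_addr>\S+) failed "
    r"\(by osd(?P<reporter>\d+) (?P<reporter_addr>\S+)\)"
)


def count_reports_by_host(lines):
    """Count failure reports per reporting host IP (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        match = FAILURE_RE.search(line)
        if match:
            # Addresses look like "10.0.10.16:6415/0"; keep only the IP part.
            host = match.group("reporter_addr").split(":")[0]
            counts[host] += 1
    return counts


sample = [
    "mon0 10.0.8.128:6789/0 28065 : [INF] osd166 10.0.10.11:6406/0 "
    "failed (by osd255 10.0.10.16:6415/0)",
]

# The host that files the most failure reports comes first.
print(count_reports_by_host(sample).most_common(1))
```

A host that dominates this tally while still reaching the monitor would match the symptom acaos describes for 10.0.10.16, though the count alone does not prove its cluster network is down.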
Updated by Sage Weil about 12 years ago
- Status changed from New to Need More Info
Updated by Sage Weil about 12 years ago
- Status changed from Need More Info to Can't reproduce