Bug #2200

closed

mon: not accepting new connections

Added by Yehuda Sadeh about 12 years ago. Updated about 12 years ago.

Status: Can't reproduce
Priority: Normal
Assignee:
Category: -
Target version: -
% Done: 0%
Source: Development
Tags:
Backport:
Regression:
Severity:
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Following a network outage and a restart of the monitors (as described in #2199), and after the recovery process, all active monitors stopped accepting connections.

I couldn't telnet to port 6789; I was getting connection refused or something similar (as if the mons weren't listening on their ports).
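
For anyone retrying the telnet check, here is a minimal sketch that separates "connection refused" from a timeout, since the report above is unsure which one was seen. The function name `probe_tcp` and the example hostname are made up for illustration; only port 6789 comes from the report.

```python
import errno
import socket


def probe_tcp(host: str, port: int, timeout: float = 3.0) -> str:
    """Classify one TCP connection attempt.

    Returns 'open', 'refused', 'timeout', or 'error: <errno name>'.
    """
    try:
        # create_connection performs the TCP handshake, like telnet does.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The peer (or something on the path) actively rejected the attempt.
        return "refused"
    except socket.timeout:
        # No answer within the timeout.
        return "timeout"
    except OSError as exc:
        # Anything else, e.g. host or network unreachable.
        return f"error: {errno.errorcode.get(exc.errno, exc.errno)}"


# Example call (hypothetical hostname, not from the report):
#   print(probe_tcp("mon-a.example", 6789))
```

A "refused" result usually means the host answered but nothing accepted the connection on that port, while a "timeout" points more toward packets being dropped somewhere along the way.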

At the moment the logs for both monitors are on the machines at /var/log/ceph/2199.

Actions #1

Updated by Greg Farnum about 12 years ago

  • Assignee set to Greg Farnum

Actions #2

Updated by Greg Farnum about 12 years ago

There's not a lot I can do to diagnose this with just logs; the Monitors don't refuse connections like that on their own.

My best guess is that enough clients were connected that the Monitor ran out of sockets, so the OS stopped allowing incoming connections. I'm running some greps to check that theory.
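
If a monitor is still in the bad state, the out-of-sockets theory could also be checked against the live process rather than the logs. The sketch below is an assumption, not what was run for this bug: it is Linux-only, reads `/proc/<pid>/limits` and `/proc/<pid>/fd`, and the helper names are invented. The pid of the monitor process has to be supplied by whoever runs it.

```python
import os
from typing import Optional


def parse_max_open_files(limits_text: str) -> Optional[int]:
    """Return the soft 'Max open files' limit from /proc/<pid>/limits text.

    Returns None if the limit is 'unlimited' or the row is missing.
    """
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            fields = line[len("Max open files"):].split()
            if not fields or fields[0] == "unlimited":
                return None
            return int(fields[0])  # first column after the name is the soft limit
    return None


def count_open_fds(pid: int) -> int:
    """Count open descriptors via /proc/<pid>/fd (needs read permission)."""
    return len(os.listdir(f"/proc/{pid}/fd"))


def fd_headroom(pid: int) -> Optional[int]:
    """Soft limit minus descriptors in use; None if the limit is unlimited."""
    with open(f"/proc/{pid}/limits") as fh:
        limit = parse_max_open_files(fh.read())
    return None if limit is None else limit - count_open_fds(pid)
```

A headroom near zero would support the theory; plenty of headroom would argue against it. Sockets count toward this limit along with every other open file.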

Actions #3

Updated by Greg Farnum about 12 years ago

Okay, that doesn't appear to be it: the counts of established and terminated connections match for clients and are off by only 9 overall, out of more than 90k.
I can't come up with much else that would cause this; if the monitors themselves ever noticed a resource limit they would have asserted out, and I can't think of any way that they self-limit the number of connections right now.
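
The established-versus-terminated comparison can be sketched as a per-peer tally. The actual greps and log lines are not shown in this thread, so both regular expressions below are hypothetical placeholders and would need to be replaced with patterns that match the real monitor log format.

```python
import re
from collections import Counter
from typing import Dict, Iterable, Pattern

# HYPOTHETICAL patterns: placeholders only, not the real Ceph log format.
OPENED = re.compile(r"accepted connection from (?P<peer>\S+)")
CLOSED = re.compile(r"closed connection from (?P<peer>\S+)")


def connection_imbalance(
    lines: Iterable[str],
    opened: Pattern[str] = OPENED,
    closed: Pattern[str] = CLOSED,
) -> Dict[str, int]:
    """Return {peer: opens - closes} for peers whose counts do not match."""
    balance: Counter = Counter()
    for line in lines:
        match = opened.search(line)
        if match:
            balance[match.group("peer")] += 1
            continue
        match = closed.search(line)
        if match:
            balance[match.group("peer")] -= 1
    # Keep only peers with unmatched opens (positive) or closes (negative).
    return {peer: n for peer, n in balance.items() if n != 0}
```

Summing the returned values gives the overall difference; a small total next to tens of thousands of connections is what argues against a leak.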

Actions #4

Updated by Greg Farnum about 12 years ago

  • Status changed from New to Can't reproduce

Yehuda's indicated that this might be tied in to networking issues that were ongoing at the time. Given the symptoms he's described here and in person, I think that's the most likely explanation. We don't have any messenger logging from the time period so we certainly can't diagnose it now.
