Bug #1626 (closed)

ceph-mon HA not working right; all must be up

Added by Anonymous over 12 years ago. Updated about 12 years ago.

Status: Can't reproduce
Priority: Normal
Assignee: -
Category: Monitor
Target version: -
% Done: 0%


Description

If mon.gamma is down, "ceph -s" hangs trying to connect to all three ceph-mon daemons. The Paxos majority-rule system does not seem to apply here; instead, "ceph -s" and similar commands can be seen explicitly opening a connection to each of the monitors, and hanging until every one of those connections succeeds.
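A hypothetical workaround sketch, not from the original report: the ceph CLI accepts -m to pin a specific monitor address, so pointing it at a mon that is known to be up should avoid dialing the dead one. Addresses as in the vstart transcript below:

# pin the client to a known-up monitor (mon.a in vstart) so it never
# dials the dead mon; -m overrides the monitor list from ceph.conf
./ceph -m 127.0.0.1:6789 -s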

Actions #1

Updated by Sage Weil over 12 years ago

Where did you see this?

Actions #2

Updated by Anonymous over 12 years ago

Carl saw it originally. Easy to repro with vstart:

tv@dreamer:~/src/ceph.git/src$ ps uww $(cat out/mon.c.pid )
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
tv       25111  0.2  0.2  93256  8304 ?        Ssl  15:02   0:00 ./ceph-mon -i c -c ceph.conf
tv@dreamer:~/src/ceph.git/src$ ./ceph -s
2011-10-18 15:03:31.428359    pg v9: 18 pgs: 18 active+clean+degraded; 62 KB data, 82145 MB used, 54591 MB / 140 GB avail; 53/106 degraded (50.000%)
2011-10-18 15:03:31.428520   mds e11: 3/3/3 up {0=b=up:active,1=a=up:active,2=c=up:active}
2011-10-18 15:03:31.428707   osd e3: 1 osds: 1 up, 1 in
2011-10-18 15:03:31.428759   log 2011-10-18 15:03:24.374188 osd.0 127.0.0.1:6800/25175 4 : [INF] 0.3 scrub ok
2011-10-18 15:03:31.428793   mon e1: 3 mons at {a=127.0.0.1:6789/0,b=127.0.0.1:6790/0,c=127.0.0.1:6791/0}
tv@dreamer:~/src/ceph.git/src$ kill $(cat out/mon.c.pid )
tv@dreamer:~/src/ceph.git/src$ ./ceph -s
2011-10-18 15:03:36.881920   mds e11: 3/3/3 up {0=b=up:active,1=a=up:active,2=c=up:active}
2011-10-18 15:03:36.882077   osd e3: 1 osds: 1 up, 1 in
2011-10-18 15:03:36.882203   log 2011-10-18 15:03:24.374188 osd.0 127.0.0.1:6800/25175 4 : [INF] 0.3 scrub ok
2011-10-18 15:03:36.882252   mon e1: 3 mons at {a=127.0.0.1:6789/0,b=127.0.0.1:6790/0,c=127.0.0.1:6791/0}
(hangs)
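A sketch for bounding the hang in the meantime, assuming coreutils timeout(1) is available (this was not part of the original repro):

# fail fast instead of wedging; 10s roughly matches the mon vote-out
# window mentioned in the next comment
timeout 10 ./ceph -s || echo "ceph -s timed out or failed"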
Actions #3

Updated by Anonymous over 12 years ago

Oh, sorry: what I see with vstart is a 10-second timeout until the mons vote mon.c out. That is not what Carl reported; he says he needs to keep all three mons healthy or ceph commands start hanging.
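If that 10-second window matters, a hypothetical ceph.conf tuning sketch; mon lease is the knob that plausibly governs how quickly a dead mon is detected and voted out, but the option name and default should be verified for the version in use:

[mon]
; assumption: a shorter lease means faster detection/vote-out of a
; down monitor, at the cost of more lease traffic between mons
mon lease = 2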

Actions #4

Updated by Anonymous over 12 years ago

Sorry to dribble this in: with one mon down and voted out, "ceph -s" takes <1 sec about 66% of the time and ~3 sec about 33% of the time, which presumably matches the client picking one of the three monitors at random and dialing the dead one first a third of the time. Perhaps the 3 seconds is enough to make something like nagios barf?
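A minimal timing harness to confirm that split, assuming GNU date with %N support (nothing like this was attached to the ticket):

# run "ceph -s" repeatedly against the degraded cluster and print the
# wall time of each call in milliseconds
for i in $(seq 1 30); do
    t0=$(date +%s%N)
    ./ceph -s >/dev/null 2>&1
    t1=$(date +%s%N)
    echo "run $i: $(( (t1 - t0) / 1000000 )) ms"
done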

Actions #5

Updated by Sage Weil about 12 years ago

  • Status changed from New to Can't reproduce