Bug #1626
ceph-mon HA not working right; all must be up (closed)
Status:
Can't reproduce
Priority:
Normal
Assignee:
-
Category:
Monitor
Target version:
-
% Done:
0%
Description
If mon.gamma is down, "ceph -s" hangs trying to connect to all three ceph-mon daemons. The Paxos majority rule does not seem to apply here. Instead, "ceph -s" and other similar commands can be seen explicitly opening a connection to each of the monitors and hanging until every connection succeeds.
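For reference, the majority rule expected here can be sketched as below. This is a minimal illustration of a strict-majority check, not Ceph's actual monitor or client code; the function name has_quorum is hypothetical.

```python
def has_quorum(total_mons: int, up_mons: int) -> bool:
    """Return True when a strict majority of the monitors is up.

    Hypothetical helper for illustration only; it is not part of Ceph.
    """
    if not 0 <= up_mons <= total_mons:
        raise ValueError("up_mons must be between 0 and total_mons")
    # Strict majority: more than half of all configured monitors.
    return up_mons > total_mons // 2


# Three monitors, as in this report, with one of them down.
print(has_quorum(3, 2))
# The same cluster with two monitors down.
print(has_quorum(3, 1))
```

Under this rule a three-monitor cluster with one monitor down still has a majority, which is why a hang in that state looks wrong.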
Updated by Anonymous over 12 years ago
Carl saw it originally. Easy to repro with vstart:
tv@dreamer:~/src/ceph.git/src$ ps uww $(cat out/mon.c.pid )
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
tv 25111 0.2 0.2 93256 8304 ? Ssl 15:02 0:00 ./ceph-mon -i c -c ceph.conf
tv@dreamer:~/src/ceph.git/src$ ./ceph -s
2011-10-18 15:03:31.428359 pg v9: 18 pgs: 18 active+clean+degraded; 62 KB data, 82145 MB used, 54591 MB / 140 GB avail; 53/106 degraded (50.000%)
2011-10-18 15:03:31.428520 mds e11: 3/3/3 up {0=b=up:active,1=a=up:active,2=c=up:active}
2011-10-18 15:03:31.428707 osd e3: 1 osds: 1 up, 1 in
2011-10-18 15:03:31.428759 log 2011-10-18 15:03:24.374188 osd.0 127.0.0.1:6800/25175 4 : [INF] 0.3 scrub ok
2011-10-18 15:03:31.428793 mon e1: 3 mons at {a=127.0.0.1:6789/0,b=127.0.0.1:6790/0,c=127.0.0.1:6791/0}
tv@dreamer:~/src/ceph.git/src$ kill $(cat out/mon.c.pid )
tv@dreamer:~/src/ceph.git/src$ ./ceph -s
2011-10-18 15:03:36.881920 mds e11: 3/3/3 up {0=b=up:active,1=a=up:active,2=c=up:active}
2011-10-18 15:03:36.882077 osd e3: 1 osds: 1 up, 1 in
2011-10-18 15:03:36.882203 log 2011-10-18 15:03:24.374188 osd.0 127.0.0.1:6800/25175 4 : [INF] 0.3 scrub ok
2011-10-18 15:03:36.882252 mon e1: 3 mons at {a=127.0.0.1:6789/0,b=127.0.0.1:6790/0,c=127.0.0.1:6791/0}
(hangs)
Updated by Anonymous over 12 years ago
Oh sorry, what I see with vstart is a 10-second timeout until the mons vote mon.c out. This is not what Carl reported. He says he needs to keep all three mons healthy or ceph commands start hanging.
Updated by Anonymous over 12 years ago
Sorry to dribble this in: it seems that with one mon down and voted out, "ceph -s" takes <1 sec 66% of the time and ~3 sec 33% of the time. Perhaps the 3 seconds is enough to make something like Nagios barf?
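A rough way to collect timings like these is sketched below. It is only an assumption about how one might measure this, not the method used for the numbers above. The helper names time_command and slow_fraction, the 1-second threshold, and the 30-second timeout are all hypothetical choices.

```python
import subprocess
import time


def time_command(cmd, runs=10, timeout=30.0):
    """Run cmd `runs` times; return each run's wall-clock duration in seconds."""
    durations = []
    for _ in range(runs):
        start = time.monotonic()
        try:
            subprocess.run(
                cmd,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            # A hung run is killed and recorded as lasting roughly `timeout`.
            pass
        durations.append(time.monotonic() - start)
    return durations


def slow_fraction(durations, threshold=1.0):
    """Return the fraction of runs that took at least `threshold` seconds."""
    if not durations:
        return 0.0
    slow = sum(1 for d in durations if d >= threshold)
    return slow / len(durations)
```

With one mon killed as in the vstart session above, something like slow_fraction(time_command(["./ceph", "-s"], runs=30)) run from the src directory would give the share of slow invocations.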
Updated by Sage Weil about 12 years ago
- Status changed from New to Can't reproduce