Bug #1626
closed
ceph-mon HA not working right; all must be up
Added by Anonymous over 12 years ago.
Updated about 12 years ago.
Description
If mon.gamma is down, "ceph -s" hangs trying to connect to all three ceph-mon daemons. The Paxos majority rule does not seem to apply here; instead, "ceph -s" and other similar commands can be seen explicitly opening a connection to each of the monitors and hanging until that connection succeeds.
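For reference, the expectation behind "majority rule" can be written down as a tiny sketch. This is only the generic strict-majority arithmetic, not Ceph's monitor code; the function names `quorum_size` and `has_quorum` are made up for illustration.

```python
def quorum_size(n_mons):
    """Smallest number of monitors that forms a strict majority."""
    return n_mons // 2 + 1


def has_quorum(n_mons, n_up):
    """True if the monitors that are up can still form a majority."""
    return n_up >= quorum_size(n_mons)


# With three monitors, losing one should still leave a majority of two.
print(quorum_size(3), has_quorum(3, 2), has_quorum(3, 1))
```

Under that arithmetic, a three-monitor cluster with one monitor down should keep answering, which is why the hang described above looks like a bug.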
Carl saw it originally. Easy to repro with vstart:
tv@dreamer:~/src/ceph.git/src$ ps uww $(cat out/mon.c.pid )
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
tv 25111 0.2 0.2 93256 8304 ? Ssl 15:02 0:00 ./ceph-mon -i c -c ceph.conf
tv@dreamer:~/src/ceph.git/src$ ./ceph -s
2011-10-18 15:03:31.428359 pg v9: 18 pgs: 18 active+clean+degraded; 62 KB data, 82145 MB used, 54591 MB / 140 GB avail; 53/106 degraded (50.000%)
2011-10-18 15:03:31.428520 mds e11: 3/3/3 up {0=b=up:active,1=a=up:active,2=c=up:active}
2011-10-18 15:03:31.428707 osd e3: 1 osds: 1 up, 1 in
2011-10-18 15:03:31.428759 log 2011-10-18 15:03:24.374188 osd.0 127.0.0.1:6800/25175 4 : [INF] 0.3 scrub ok
2011-10-18 15:03:31.428793 mon e1: 3 mons at {a=127.0.0.1:6789/0,b=127.0.0.1:6790/0,c=127.0.0.1:6791/0}
tv@dreamer:~/src/ceph.git/src$ kill $(cat out/mon.c.pid )
tv@dreamer:~/src/ceph.git/src$ ./ceph -s
2011-10-18 15:03:36.881920 mds e11: 3/3/3 up {0=b=up:active,1=a=up:active,2=c=up:active}
2011-10-18 15:03:36.882077 osd e3: 1 osds: 1 up, 1 in
2011-10-18 15:03:36.882203 log 2011-10-18 15:03:24.374188 osd.0 127.0.0.1:6800/25175 4 : [INF] 0.3 scrub ok
2011-10-18 15:03:36.882252 mon e1: 3 mons at {a=127.0.0.1:6789/0,b=127.0.0.1:6790/0,c=127.0.0.1:6791/0}
(hangs)
Oh sorry, what I see with vstart is a 10-second timeout until the mons vote mon.c out. This is not what Carl reported. He says he needs to keep all three mons healthy or ceph commands start hanging.
Sorry to dribble this in: it seems that with one mon down and voted out, "ceph -s" takes under 1 second about two-thirds of the time and about 3 seconds the remaining third. Perhaps the 3 seconds is enough to make something like nagios barf?
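A rough way to put numbers on the fast/slow split is to time the command repeatedly. The sketch below is an assumption-laden helper, not part of Ceph: `time_command` and `summarize` are hypothetical names, and the 1-second "slow" threshold and 30-second timeout are arbitrary choices.

```python
import subprocess
import time


def time_command(argv, runs=10, timeout=30.0):
    """Run argv `runs` times and return wall-clock durations in seconds.

    A run that exceeds `timeout` is recorded as float("inf").
    """
    durations = []
    for _ in range(runs):
        start = time.monotonic()
        try:
            subprocess.run(
                argv,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
                timeout=timeout,
                check=False,
            )
            durations.append(time.monotonic() - start)
        except subprocess.TimeoutExpired:
            durations.append(float("inf"))
    return durations


def summarize(durations, slow_threshold=1.0):
    """Count runs at or above slow_threshold (seconds); threshold is arbitrary."""
    slow = sum(1 for d in durations if d >= slow_threshold)
    return {
        "runs": len(durations),
        "slow": slow,
        "slow_fraction": slow / len(durations),
        "max": max(durations),
    }
```

From the vstart directory, something like `summarize(time_command(["./ceph", "-s"], runs=30))` would show how often a run crosses the threshold, which could then be compared against whatever timeout the monitoring check uses.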
- Status changed from New to Can't reproduce