Actions
Bug #17308
closedMDSMonitor should tolerate paxos delays without failing daemons (Was: Unexplained delay forwarding message between mons)
% Done:
0%
Source:
other
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
MDSMonitor
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
We have had a couple of mysterious new issues crop up in CephFS that I now think might be the result of some weirdness happening in the mon.
http://tracker.ceph.com/issues/17253
http://tracker.ceph.com/issues/17236
For 17253 I looked closely at the failure here:
http://qa-proxy.ceph.com/teuthology/jspray-2016-09-12_14:31:30-fs-wip-jcsp-testing-20160912-distro-basic-mira/411825/teuthology.log
...and found that there was a 15 second delay happening in forwarding a message between mons when tracing an mdsbeacon message:
remote/mira032/log/ceph-mds.a-s.log.gz:2016-09-12 22:40:33.392905 7fba36e86700 1 -- 172.21.4.112:6808/103879 >> 172.21.4.112:6790/0 conn(0x7fba4a9eb800 sd=18 :-1 s=STATE_OPEN pgs=11 cs=1 l=1). == tx == 0x7fba4aa82340 mdsbeacon(4115/a-s up:active seq 286 v7) v7 remote/mira032/log/ceph-mon.c.log.gz:2016-09-12 22:40:33.393058 7fbc737df700 1 -- 172.21.4.112:6790/0 >> 172.21.4.112:6808/103879 conn(0x7fbc854b8800 sd=33 :6790 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=1 cs=1 l=1). == rx == mds.0 seq 323 0x7fbc86e0f6c0 mdsbeacon(4115/a-s up:active seq 286 v7) v7 remote/mira032/log/ceph-mon.c.log.gz:2016-09-12 22:40:33.393187 7fbc727dd700 1 -- 172.21.4.112:6790/0 <== mds.0 172.21.4.112:6808/103879 323 ==== mdsbeacon(4115/a-s up:active seq 286 v7) v7 ==== 367+0+0 (2027675394 0 0) 0x7fbc86e0f6c0 con 0x7fbc854b8800 remote/mira032/log/ceph-mon.c.log.gz:2016-09-12 22:40:33.393246 7fbc727dd700 10 mon.c@2(peon).paxosservice(mdsmap 1..7) dispatch 0x7fbc86e0f6c0 mdsbeacon(4115/a-s up:active seq 286 v7) v7 from mds.0 172.21.4.112:6808/103879 con 0x7fbc854b8800 remote/mira032/log/ceph-mon.c.log.gz:2016-09-12 22:40:48.544674 7fbc727dd700 10 mon.c@2(peon).paxosservice(mdsmap 1..8) dispatch 0x7fbc86e0f6c0 mdsbeacon(4115/a-s up:active seq 286 v7) v7 from mds.0 172.21.4.112:6808/103879 con 0x7fbc854b8800 remote/mira032/log/ceph-mon.c.log.gz:2016-09-12 22:40:48.544686 7fbc727dd700 10 mon.c@2(peon).mds e8 preprocess_query mdsbeacon(4115/a-s up:active seq 286 v7) v7 from mds.0 172.21.4.112:6808/103879 remote/mira032/log/ceph-mon.c.log.gz:2016-09-12 22:40:48.544696 7fbc727dd700 12 mon.c@2(peon).mds e8 preprocess_beacon mdsbeacon(4115/a-s up:active seq 286 v7) v7 from mds.0 172.21.4.112:6808/103879 compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=file layout v2} remote/mira032/log/ceph-mon.c.log.gz:2016-09-12 22:40:48.544710 7fbc727dd700 10 mon.c@2(peon) e1 forward_request 765 request mdsbeacon(4115/a-s up:active seq 286 v7) v7 features 1152921504336314367 remote/mira032/log/ceph-mon.c.log.gz:2016-09-12 22:40:48.544714 7fbc727dd700 1 -- 172.21.4.112:6790/0 _send_message--> mon.0 172.21.4.112:6789/0 -- forward(mdsbeacon(4115/a-s up:active seq 286 v7) v7 caps allow * tid 765 con_features 1152921504336314367) to leader v3 -- ?+0 0x7fbc86f1ab00 remote/mira032/log/ceph-mon.c.log.gz:2016-09-12 22:40:48.544721 7fbc727dd700 1 -- 172.21.4.112:6790/0 >> 172.21.4.112:6789/0 conn(0x7fbc854ad000 sd=21 :-1 s=STATE_OPEN pgs=5 cs=1 l=0). == tx == 0x7fbc86f1ab00 forward(mdsbeacon(4115/a-s up:active seq 286 v7) v7 caps allow * tid 765 con_features 1152921504336314367) to leader v3 remote/mira032/log/ceph-mon.b.log.gz:2016-09-12 22:40:48.544817 7fe340fe8700 1 -- 172.21.4.112:6789/0 >> 172.21.4.112:6790/0 conn(0x7fe353203800 sd=30 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=3 cs=1 l=0). == rx == mon.2 seq 2510 0x7fe354495600 forward(mdsbeacon(4115/a-s up:active seq 286 v7) v7 caps allow * tid 765 con_features 1152921504336314367) to leader v3 remote/mira032/log/ceph-mon.b.log.gz:2016-09-12 22:40:48.545342 7fe33ffe6700 1 -- 172.21.4.112:6789/0 <== mon.2 172.21.4.112:6790/0 2510 ==== forward(mdsbeacon(4115/a-s up:active seq 286 v7) v7 caps allow * tid 765 con_features 1152921504336314367) to leader v3 ==== 634+0+0 (1224176700 0 0) 0x7fe354495600 con 0x7fe353203800 remote/mira032/log/ceph-mon.b.log.gz:2016-09-12 22:40:48.545391 7fe33ffe6700 10 mon.b@0(leader).paxosservice(mdsmap 1..8) dispatch 0x7fe354892340 mdsbeacon(4115/a-s up:active seq 286 v7) v7 from mds.0 172.21.4.112:6808/103879 con 0x7fe35359b100 remote/mira032/log/ceph-mon.b.log.gz:2016-09-12 22:40:48.545401 7fe33ffe6700 10 mon.b@0(leader).mds e8 preprocess_query mdsbeacon(4115/a-s up:active seq 286 v7) v7 from mds.0 172.21.4.112:6808/103879 remote/mira032/log/ceph-mon.b.log.gz:2016-09-12 22:40:48.545411 7fe33ffe6700 12 mon.b@0(leader).mds e8 preprocess_beacon mdsbeacon(4115/a-s up:active seq 286 v7) v7 from mds.0 172.21.4.112:6808/103879 compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=file layout v2} remote/mira032/log/ceph-mon.b.log.gz:2016-09-12 22:40:48.545416 7fe33ffe6700 7 mon.b@0(leader).mds e8 mds_beacon mdsbeacon(4115/a-s up:active seq 286 v7) v7 is not in fsmap (state up:active) remote/mira032/log/ceph-mon.b.log.gz:2016-09-12 22:40:48.545431 7fe33ffe6700 15 mon.b@0(leader) e1 send_reply routing reply to 172.21.4.112:6808/103879 via 172.21.4.112:6790/0 for request mdsbeacon(4115/a-s up:active seq 286 v7) v7
Actions