Project

General

Profile

Actions

Bug #17308

closed

MDSMonitor should tolerate paxos delays without failing daemons (Was: Unexplained delay forwarding message between mons)

Added by John Spray over 7 years ago. Updated over 7 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Correctness/Safety
Target version:
-
% Done:

0%

Source:
other
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
MDSMonitor
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We have had a couple of mysterious new issues crop up in CephFS that I now think might be the result of some weirdness happening in the mon.
http://tracker.ceph.com/issues/17253
http://tracker.ceph.com/issues/17236

For 17253 I looked closely at the failure here:
http://qa-proxy.ceph.com/teuthology/jspray-2016-09-12_14:31:30-fs-wip-jcsp-testing-20160912-distro-basic-mira/411825/teuthology.log
...and found that there was a 15 second delay happening in forwarding a message between mons when tracing an mdsbeacon message:

remote/mira032/log/ceph-mds.a-s.log.gz:2016-09-12 22:40:33.392905 7fba36e86700  1 -- 172.21.4.112:6808/103879 >> 172.21.4.112:6790/0 conn(0x7fba4a9eb800 sd=18 :-1 s=STATE_OPEN pgs=11 cs=1 l=1). == tx == 0x7fba4aa82340 mdsbeacon(4115/a-s up:active seq 286 v7) v7

remote/mira032/log/ceph-mon.c.log.gz:2016-09-12 22:40:33.393058 7fbc737df700  1 -- 172.21.4.112:6790/0 >> 172.21.4.112:6808/103879 conn(0x7fbc854b8800 sd=33 :6790 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=1 cs=1 l=1). == rx == mds.0 seq 323 0x7fbc86e0f6c0 mdsbeacon(4115/a-s up:active seq 286 v7) v7
remote/mira032/log/ceph-mon.c.log.gz:2016-09-12 22:40:33.393187 7fbc727dd700  1 -- 172.21.4.112:6790/0 <== mds.0 172.21.4.112:6808/103879 323 ==== mdsbeacon(4115/a-s up:active seq 286 v7) v7 ==== 367+0+0 (2027675394 0 0) 0x7fbc86e0f6c0 con 0x7fbc854b8800
remote/mira032/log/ceph-mon.c.log.gz:2016-09-12 22:40:33.393246 7fbc727dd700 10 mon.c@2(peon).paxosservice(mdsmap 1..7) dispatch 0x7fbc86e0f6c0 mdsbeacon(4115/a-s up:active seq 286 v7) v7 from mds.0 172.21.4.112:6808/103879 con 0x7fbc854b8800
remote/mira032/log/ceph-mon.c.log.gz:2016-09-12 22:40:48.544674 7fbc727dd700 10 mon.c@2(peon).paxosservice(mdsmap 1..8) dispatch 0x7fbc86e0f6c0 mdsbeacon(4115/a-s up:active seq 286 v7) v7 from mds.0 172.21.4.112:6808/103879 con 0x7fbc854b8800
remote/mira032/log/ceph-mon.c.log.gz:2016-09-12 22:40:48.544686 7fbc727dd700 10 mon.c@2(peon).mds e8 preprocess_query mdsbeacon(4115/a-s up:active seq 286 v7) v7 from mds.0 172.21.4.112:6808/103879
remote/mira032/log/ceph-mon.c.log.gz:2016-09-12 22:40:48.544696 7fbc727dd700 12 mon.c@2(peon).mds e8 preprocess_beacon mdsbeacon(4115/a-s up:active seq 286 v7) v7 from mds.0 172.21.4.112:6808/103879 compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=file layout v2}
remote/mira032/log/ceph-mon.c.log.gz:2016-09-12 22:40:48.544710 7fbc727dd700 10 mon.c@2(peon) e1 forward_request 765 request mdsbeacon(4115/a-s up:active seq 286 v7) v7 features 1152921504336314367
remote/mira032/log/ceph-mon.c.log.gz:2016-09-12 22:40:48.544714 7fbc727dd700  1 -- 172.21.4.112:6790/0 _send_message--> mon.0 172.21.4.112:6789/0 -- forward(mdsbeacon(4115/a-s up:active seq 286 v7) v7 caps allow * tid 765 con_features 1152921504336314367) to leader v3 -- ?+0 0x7fbc86f1ab00
remote/mira032/log/ceph-mon.c.log.gz:2016-09-12 22:40:48.544721 7fbc727dd700  1 -- 172.21.4.112:6790/0 >> 172.21.4.112:6789/0 conn(0x7fbc854ad000 sd=21 :-1 s=STATE_OPEN pgs=5 cs=1 l=0). == tx == 0x7fbc86f1ab00 forward(mdsbeacon(4115/a-s up:active seq 286 v7) v7 caps allow * tid 765 con_features 1152921504336314367) to leader v3

remote/mira032/log/ceph-mon.b.log.gz:2016-09-12 22:40:48.544817 7fe340fe8700  1 -- 172.21.4.112:6789/0 >> 172.21.4.112:6790/0 conn(0x7fe353203800 sd=30 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=3 cs=1 l=0). == rx == mon.2 seq 2510 0x7fe354495600 forward(mdsbeacon(4115/a-s up:active seq 286 v7) v7 caps allow * tid 765 con_features 1152921504336314367) to leader v3
remote/mira032/log/ceph-mon.b.log.gz:2016-09-12 22:40:48.545342 7fe33ffe6700  1 -- 172.21.4.112:6789/0 <== mon.2 172.21.4.112:6790/0 2510 ==== forward(mdsbeacon(4115/a-s up:active seq 286 v7) v7 caps allow * tid 765 con_features 1152921504336314367) to leader v3 ==== 634+0+0 (1224176700 0 0) 0x7fe354495600 con 0x7fe353203800
remote/mira032/log/ceph-mon.b.log.gz:2016-09-12 22:40:48.545391 7fe33ffe6700 10 mon.b@0(leader).paxosservice(mdsmap 1..8) dispatch 0x7fe354892340 mdsbeacon(4115/a-s up:active seq 286 v7) v7 from mds.0 172.21.4.112:6808/103879 con 0x7fe35359b100
remote/mira032/log/ceph-mon.b.log.gz:2016-09-12 22:40:48.545401 7fe33ffe6700 10 mon.b@0(leader).mds e8 preprocess_query mdsbeacon(4115/a-s up:active seq 286 v7) v7 from mds.0 172.21.4.112:6808/103879
remote/mira032/log/ceph-mon.b.log.gz:2016-09-12 22:40:48.545411 7fe33ffe6700 12 mon.b@0(leader).mds e8 preprocess_beacon mdsbeacon(4115/a-s up:active seq 286 v7) v7 from mds.0 172.21.4.112:6808/103879 compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=file layout v2}
remote/mira032/log/ceph-mon.b.log.gz:2016-09-12 22:40:48.545416 7fe33ffe6700  7 mon.b@0(leader).mds e8 mds_beacon mdsbeacon(4115/a-s up:active seq 286 v7) v7 is not in fsmap (state up:active)
remote/mira032/log/ceph-mon.b.log.gz:2016-09-12 22:40:48.545431 7fe33ffe6700 15 mon.b@0(leader) e1 send_reply routing reply to 172.21.4.112:6808/103879 via 172.21.4.112:6790/0 for request mdsbeacon(4115/a-s up:active seq 286 v7) v7


Related issues 1 (0 open1 closed)

Related to CephFS - Feature #17309: qa: mon_thrash test for CephFSResolvedJos Collin

Actions
Actions

Also available in: Atom PDF