Actions
Bug #9212
closedmon election delays mds beacon
% Done:
0%
Source:
Q/A
Tags:
Backport:
Regression:
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
MDS
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
ubuntu@teuthology:/a/teuthology-2014-08-22_23:04:01-fs-master-testing-basic-multi/444359
-1> 2014-08-23 10:45:49.604837 7f49bde2a700 1 -- 10.214.136.20:6810/6997 --> 10.214.136.20:6790/0 -- mon_subscribe({mdsmap=13+,monmap=2+}) v2 -- ?+0 0x37bb2a0 con 0xc7723c0 0> 2014-08-23 10:45:50.925167 7f49bf62d700 -1 mds/CInode.cc: In function 'virtual void C_Inode_StoredBacktrace::finish(int)' thread 7f49bf62d700 time 2014-08-23 10:45:33.541131 mds/CInode.cc: 1024: FAILED assert(r == 0) ceph version 0.84-562-g8d40600 (8d406001d9b84d9809d181077c61ad9181934752) 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x8d8f0b] 2: ceph-mds() [0x755ea5] 3: (Context::complete(int)+0x9) [0x58b4b9] 4: (Objecter::handle_osd_op_reply(MOSDOpReply*)+0xf3e) [0x7a936e] 5: (MDS::handle_core_message(Message*)+0xb37) [0x588c67] 6: (MDS::_dispatch(Message*)+0x32) [0x588e62] 7: (MDS::ms_dispatch(Message*)+0x8b) [0x58a82b] 8: (DispatchQueue::entry()+0x649) [0x9a18f9] 9: (DispatchQueue::DispatchThread::entry()+0xd) [0x8c2a2d] 10: (()+0x8182) [0x7f49c43ab182] 11: (clone()+0x6d) [0x7f49c2d2338d] NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Updated by Greg Farnum over 9 years ago
Did we identify why it was blacklisted? I don't think we have any tests that should make it that slow or whatever.
Updated by Sage Weil over 9 years ago
- Subject changed from mds/CInode.cc: 1024: FAILED assert(r == 0) in C_Inode_StoredBacktrace::finish(int) to mds blacklisted during non-thrashing qa run
- Status changed from Rejected to 12
Updated by Zheng Yan over 9 years ago
mds sent beacon
2014-08-23 10:45:06.242821 7f49bcd27700 10 mds.0.2 beacon_send up:active seq 1259 (currently up:active) 2014-08-23 10:45:06.242850 7f49bcd27700 1 -- 10.214.136.20:6810/6997 --> 10.214.136.20:6790/0 -- mdsbeacon(4108/a-s up:active seq 1259 v12) v2 -- ?+0 0x1384b700 con 0xc7723c0
mon.c received the beacon. but paxos is not readable.
2014-08-23 10:45:10.108818 7fc85e01a700 1 -- 10.214.136.20:6790/0 <== mds.0 10.214.136.20:6810/6997 254 ==== mdsbeacon(4108/a-s up:active seq 1259 v12) v2 ==== 345+0+0 (1305681894 0 0) 0x58b2dc0 con 0x3090260 2014-08-23 10:45:10.108823 7fc85e01a700 20 mon.c@2(peon) e1 have connection 2014-08-23 10:45:10.108824 7fc85e01a700 20 mon.c@2(peon) e1 ms_dispatch existing session MonSession: mds.0 10.214.136.20:6810/6997 is openallow * for mds.0 10.214.136.20:6810/6997 2014-08-23 10:45:10.108828 7fc85e01a700 20 mon.c@2(peon) e1 caps allow * 2014-08-23 10:45:10.108829 7fc85e01a700 10 mon.c@2(peon).paxosservice(mdsmap 1..12) dispatch mdsbeacon(4108/a-s up:active seq 1259 v12) v2 from mds.0 10.214.136.20:6810/6997 2014-08-23 10:45:10.108832 7fc85e01a700 1 mon.c@2(peon).paxos(paxos updating c 3766..4295) is_readable now=2014-08-23 10:45:10.108833 lease_expire=0.000000 has v0 lc 4295 2014-08-23 10:45:10.108835 7fc85e01a700 10 mon.c@2(peon).paxosservice(mdsmap 1..12) waiting for paxos -> readable (v12)
mon.c forward the beacon to mon.a 17 seconds later
2014-08-23 10:45:27.992235 7fc85e01a700 10 mon.c@2(peon).paxosservice(mdsmap 1..12) dispatch mdsbeacon(4108/a-s up:active seq 1259 v12) v2 from mds.0 10.214.136.20:6810/6997 2014-08-23 10:45:27.992242 7fc85e01a700 1 mon.c@2(peon).paxos(paxos active c 3766..4298) is_readable now=2014-08-23 10:45:27.992243 lease_expire=2014-08-23 10:45:29.841523 has v0 lc 4298 2014-08-23 10:45:27.992250 7fc85e01a700 10 mon.c@2(peon).mds e12 preprocess_query mdsbeacon(4108/a-s up:active seq 1259 v12) v2 from mds.0 10.214.136.20:6810/6997 2014-08-23 10:45:27.992255 7fc85e01a700 20 is_capable service=mds command= exec on cap allow * 2014-08-23 10:45:27.992257 7fc85e01a700 20 allow so far , doing grant allow * 2014-08-23 10:45:27.992258 7fc85e01a700 20 allow all 2014-08-23 10:45:27.992260 7fc85e01a700 12 mon.c@2(peon).mds e12 preprocess_beacon mdsbeacon(4108/a-s up:active seq 1259 v12) v2 from mds.0 10.214.136.20:6810/6997 compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table} 2014-08-23 10:45:27.992273 7fc85e01a700 10 mon.c@2(peon) e1 forward_request 740 request mdsbeacon(4108/a-s up:active seq 1259 v12) v2 2014-08-23 10:45:27.992276 7fc85e01a700 1 -- 10.214.136.20:6790/0 --> mon.0 10.214.136.18:6789/0 -- forward(mdsbeacon(4108/a-s up:active seq 1259 v12) v2 caps allow * tid 740 con_features 17592186044415) to leader v2 -- ?+0 0x501b20
mon.a received the beacon, but paxos is not readable
2014-08-23 10:45:27.993322 7f354ea4f700 10 mon.a@0(leader).paxosservice(mdsmap 1..12) dispatch mdsbeacon(4108/a-s up:active seq 1259 v12) v2 from mds.0 10.214.136.20:6810/6997 2014-08-23 10:45:27.993325 7f354ea4f700 10 mon.a@0(leader).paxosservice(mdsmap 1..12) waiting for paxos -> readable (v12)
mon.a processed the beacon about 40 seconds later.
2014-08-23 10:45:39.226301 7f354ea4f700 10 mon.a@0(leader).paxosservice(mdsmap 1..14) dispatch mdsbeacon(4108/a-s up:active seq 1259 v12) v2 from mds.0 10.214.136.20:6810/6997 2014-08-23 10:45:39.226305 7f354ea4f700 1 mon.a@0(leader).paxos(paxos updating c 3766..4302) is_readable now=2014-08-23 10:45:39.226306 lease_expire=2014-08-23 10:45:44.225784 has v0 lc 4302 2014-08-23 10:45:39.226309 7f354ea4f700 10 mon.a@0(leader).mds e14 preprocess_query mdsbeacon(4108/a-s up:active seq 1259 v12) v2 from mds.0 10.214.136.20:6810/6997 2014-08-23 10:45:39.226312 7f354ea4f700 20 is_capable service=mds command= exec on cap allow * 2014-08-23 10:45:39.226314 7f354ea4f700 20 allow so far , doing grant allow * 2014-08-23 10:45:39.226315 7f354ea4f700 20 allow all 2014-08-23 10:45:39.226315 7f354ea4f700 12 mon.a@0(leader).mds e14 preprocess_beacon mdsbeacon(4108/a-s up:active seq 1259 v12) v2 from mds.0 10.214.136.20:6810/6997 compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table}
Updated by Sage Weil over 9 years ago
- Subject changed from mds blacklisted during non-thrashing qa run to mon election delays mds beacon
- Assignee set to Sage Weil
Updated by Zheng Yan over 9 years ago
- Status changed from 12 to Won't Fix
2014-08-23 10:45:15.412870 7fc85e01a700 -1 mon.c@2(peon).paxos(paxos updating c 3766..4296) lease_expire from mon.0 10.214.136.18:6789/0 is 0.302315 seconds in the past; mons are probably laggy (or possibly clocks are too skewed) 2014-08-23 10:45:15.412875 7fc85e01a700 10 mon.c@2(peon).paxos(paxos active c 3766..4296) handle_lease on 4296 now 2014-08-23 10:45:15.110555
2014-08-23 10:45:23.385284 7fc85e01a700 -1 mon.c@2(peon).paxos(paxos updating c 3766..4297) lease_expire from mon.0 10.214.136.18:6789/0 is 0.353771 seconds in the past; mons are probably laggy (or possibly clocks are too skewed) 2014-08-23 10:45:23.385289 7fc85e01a700 10 mon.c@2(peon).paxos(paxos active c 3766..4297) handle_lease on 4297 now 2014-08-23 10:45:23.031513
looks like the delay was caused by clock skew among the monitors
Actions