Project

General

Profile

Actions

Bug #9212

closed

mon election delays mds beacon

Added by Sage Weil over 9 years ago. Updated almost 8 years ago.

Status:
Won't Fix
Priority:
Urgent
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Q/A
Tags:
Backport:
Regression:
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
MDS
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

ubuntu@teuthology:/a/teuthology-2014-08-22_23:04:01-fs-master-testing-basic-multi/444359

    -1> 2014-08-23 10:45:49.604837 7f49bde2a700  1 -- 10.214.136.20:6810/6997 --> 10.214.136.20:6790/0 -- mon_subscribe({mdsmap=13+,monmap=2+}) v2 -- ?+0 0x37bb2a0 con 0xc7723c0
     0> 2014-08-23 10:45:50.925167 7f49bf62d700 -1 mds/CInode.cc: In function 'virtual void C_Inode_StoredBacktrace::finish(int)' thread 7f49bf62d700 time 2014-08-23 10:45:33.541131
mds/CInode.cc: 1024: FAILED assert(r == 0)

 ceph version 0.84-562-g8d40600 (8d406001d9b84d9809d181077c61ad9181934752)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x8d8f0b]
 2: ceph-mds() [0x755ea5]
 3: (Context::complete(int)+0x9) [0x58b4b9]
 4: (Objecter::handle_osd_op_reply(MOSDOpReply*)+0xf3e) [0x7a936e]
 5: (MDS::handle_core_message(Message*)+0xb37) [0x588c67]
 6: (MDS::_dispatch(Message*)+0x32) [0x588e62]
 7: (MDS::ms_dispatch(Message*)+0x8b) [0x58a82b]
 8: (DispatchQueue::entry()+0x649) [0x9a18f9]
 9: (DispatchQueue::DispatchThread::entry()+0xd) [0x8c2a2d]
 10: (()+0x8182) [0x7f49c43ab182]
 11: (clone()+0x6d) [0x7f49c2d2338d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Actions #1

Updated by Sage Weil over 9 years ago

  • Status changed from New to Rejected

EBLACKLISTED

Actions #2

Updated by Greg Farnum over 9 years ago

Did we identify why it was blacklisted? I don't think we have any tests that should make it that slow or whatever.

Actions #3

Updated by Sage Weil over 9 years ago

  • Subject changed from mds/CInode.cc: 1024: FAILED assert(r == 0) in C_Inode_StoredBacktrace::finish(int) to mds blacklisted during non-thrashing qa run
  • Status changed from Rejected to 12
Actions #4

Updated by Zheng Yan over 9 years ago

mds sent beacon

2014-08-23 10:45:06.242821 7f49bcd27700 10 mds.0.2 beacon_send up:active seq 1259 (currently up:active)
2014-08-23 10:45:06.242850 7f49bcd27700  1 -- 10.214.136.20:6810/6997 --> 10.214.136.20:6790/0 -- mdsbeacon(4108/a-s up:active seq 1259 v12) v2 -- ?+0 0x1384b700 con 0xc7723c0

mon.c received the beacon. but paxos is not readable.

2014-08-23 10:45:10.108818 7fc85e01a700  1 -- 10.214.136.20:6790/0 <== mds.0 10.214.136.20:6810/6997 254 ==== mdsbeacon(4108/a-s up:active seq 1259 v12) v2 ==== 345+0+0 (1305681894 0 0) 0x58b2dc0 con 0x3090260
2014-08-23 10:45:10.108823 7fc85e01a700 20 mon.c@2(peon) e1 have connection
2014-08-23 10:45:10.108824 7fc85e01a700 20 mon.c@2(peon) e1 ms_dispatch existing session MonSession: mds.0 10.214.136.20:6810/6997 is openallow * for mds.0 10.214.136.20:6810/6997
2014-08-23 10:45:10.108828 7fc85e01a700 20 mon.c@2(peon) e1  caps allow *
2014-08-23 10:45:10.108829 7fc85e01a700 10 mon.c@2(peon).paxosservice(mdsmap 1..12) dispatch mdsbeacon(4108/a-s up:active seq 1259 v12) v2 from mds.0 10.214.136.20:6810/6997
2014-08-23 10:45:10.108832 7fc85e01a700  1 mon.c@2(peon).paxos(paxos updating c 3766..4295) is_readable now=2014-08-23 10:45:10.108833 lease_expire=0.000000 has v0 lc 4295
2014-08-23 10:45:10.108835 7fc85e01a700 10 mon.c@2(peon).paxosservice(mdsmap 1..12)  waiting for paxos -> readable (v12)

mon.c forward the beacon to mon.a 17 seconds later

2014-08-23 10:45:27.992235 7fc85e01a700 10 mon.c@2(peon).paxosservice(mdsmap 1..12) dispatch mdsbeacon(4108/a-s up:active seq 1259 v12) v2 from mds.0 10.214.136.20:6810/6997
2014-08-23 10:45:27.992242 7fc85e01a700  1 mon.c@2(peon).paxos(paxos active c 3766..4298) is_readable now=2014-08-23 10:45:27.992243 lease_expire=2014-08-23 10:45:29.841523 has v0 lc 4298
2014-08-23 10:45:27.992250 7fc85e01a700 10 mon.c@2(peon).mds e12 preprocess_query mdsbeacon(4108/a-s up:active seq 1259 v12) v2 from mds.0 10.214.136.20:6810/6997
2014-08-23 10:45:27.992255 7fc85e01a700 20 is_capable service=mds command= exec on cap allow *
2014-08-23 10:45:27.992257 7fc85e01a700 20  allow so far , doing grant allow *
2014-08-23 10:45:27.992258 7fc85e01a700 20  allow all
2014-08-23 10:45:27.992260 7fc85e01a700 12 mon.c@2(peon).mds e12 preprocess_beacon mdsbeacon(4108/a-s up:active seq 1259 v12) v2 from mds.0 10.214.136.20:6810/6997 compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table}
2014-08-23 10:45:27.992273 7fc85e01a700 10 mon.c@2(peon) e1 forward_request 740 request mdsbeacon(4108/a-s up:active seq 1259 v12) v2
2014-08-23 10:45:27.992276 7fc85e01a700  1 -- 10.214.136.20:6790/0 --> mon.0 10.214.136.18:6789/0 -- forward(mdsbeacon(4108/a-s up:active seq 1259 v12) v2 caps allow * tid 740 con_features 17592186044415) to leader v2 -- ?+0 0x501b20

mon.a received the beacon, but paxos is not readable

    2014-08-23 10:45:27.993322 7f354ea4f700 10 mon.a@0(leader).paxosservice(mdsmap 1..12) dispatch mdsbeacon(4108/a-s up:active seq 1259 v12) v2 from mds.0 10.214.136.20:6810/6997
    2014-08-23 10:45:27.993325 7f354ea4f700 10 mon.a@0(leader).paxosservice(mdsmap 1..12)  waiting for paxos -> readable (v12)

mon.a processed the beacon about 40 seconds later.

2014-08-23 10:45:39.226301 7f354ea4f700 10 mon.a@0(leader).paxosservice(mdsmap 1..14) dispatch mdsbeacon(4108/a-s up:active seq 1259 v12) v2 from mds.0 10.214.136.20:6810/6997
2014-08-23 10:45:39.226305 7f354ea4f700  1 mon.a@0(leader).paxos(paxos updating c 3766..4302) is_readable now=2014-08-23 10:45:39.226306 lease_expire=2014-08-23 10:45:44.225784 has v0 lc 4302
2014-08-23 10:45:39.226309 7f354ea4f700 10 mon.a@0(leader).mds e14 preprocess_query mdsbeacon(4108/a-s up:active seq 1259 v12) v2 from mds.0 10.214.136.20:6810/6997
2014-08-23 10:45:39.226312 7f354ea4f700 20 is_capable service=mds command= exec on cap allow *
2014-08-23 10:45:39.226314 7f354ea4f700 20  allow so far , doing grant allow *
2014-08-23 10:45:39.226315 7f354ea4f700 20  allow all
2014-08-23 10:45:39.226315 7f354ea4f700 12 mon.a@0(leader).mds e14 preprocess_beacon mdsbeacon(4108/a-s up:active seq 1259 v12) v2 from mds.0 10.214.136.20:6810/6997 compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table}

Actions #5

Updated by Sage Weil over 9 years ago

  • Subject changed from mds blacklisted during non-thrashing qa run to mon election delays mds beacon
  • Assignee set to Sage Weil
Actions #6

Updated by Zheng Yan over 9 years ago

  • Status changed from 12 to Won't Fix
2014-08-23 10:45:15.412870 7fc85e01a700 -1 mon.c@2(peon).paxos(paxos updating c 3766..4296) lease_expire from mon.0 10.214.136.18:6789/0 is 0.302315 seconds in the past; mons are probably laggy (or possibly clocks are too skewed)
2014-08-23 10:45:15.412875 7fc85e01a700 10 mon.c@2(peon).paxos(paxos active c 3766..4296) handle_lease on 4296 now 2014-08-23 10:45:15.110555
2014-08-23 10:45:23.385284 7fc85e01a700 -1 mon.c@2(peon).paxos(paxos updating c 3766..4297) lease_expire from mon.0 10.214.136.18:6789/0 is 0.353771 seconds in the past; mons are probably laggy (or possibly clocks are too skewed)
2014-08-23 10:45:23.385289 7fc85e01a700 10 mon.c@2(peon).paxos(paxos active c 3766..4297) handle_lease on 4297 now 2014-08-23 10:45:23.031513

looks like the delay was caused by clock skew among the monitors

Actions #7

Updated by Greg Farnum almost 8 years ago

  • Component(FS) MDS added
Actions

Also available in: Atom PDF