Bug #8651


crashing mds in an active-active mds setup

Added by Kenneth Waegeman almost 10 years ago. Updated over 9 years ago.

Status:
Won't Fix
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Community (user)
Tags:
Backport:
Regression:
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2 active MDSs, crashing while writing 4 rsync streams to the filesystem via the kernel client (ceph.ko):

{ "mdsmap": { "epoch": 1428,
"flags": 0,
"created": "2014-06-17 16:18:31.237442",
"modified": "2014-06-24 10:55:46.211994",
"tableserver": 0,
"root": 0,
"session_timeout": 60,
"session_autoclose": 300,
"max_file_size": 1099511627776,
"last_failure": 1419,
"last_failure_osd_epoch": 814,
"compat": { "compat": {},
"ro_compat": {},
"incompat": { "feature_1": "base v0.20",
"feature_2": "client writeable ranges",
"feature_3": "default file layouts on dirs",
"feature_4": "dir inode in separate object",
"feature_5": "mds uses versioned encoding",
"feature_6": "dirfrag is stored in omap",
"feature_8": "no anchor table"}},
"max_mds": 2,
"in": [
0,
1],
"up": { "mds_0": 23863,
"mds_1": 24119},
"failed": [],
"stopped": [],
"info": { "gid_23863": { "gid": 23863,
"name": "ceph002.cubone.os",
"rank": 0,
"incarnation": 147,
"state": "up:active",
"state_seq": 82,
"addr": "10.141.8.181:6833\/49442",
"standby_for_rank": -1,
"standby_for_name": "",
"export_targets": []},
"gid_24119": { "gid": 24119,
"name": "ceph001.cubone.os",
"rank": 1,
"incarnation": 34,
"state": "up:active",
"state_seq": 19,
"addr": "10.141.8.180:6831\/37285",
"standby_for_rank": -1,
"standby_for_name": "",
"export_targets": []}},
"data_pools": [
0],
"metadata_pool": 1},
"mdsmap_first_committed": 751,
"mdsmap_last_committed": 1428}

   -69> 2014-06-24 10:49:07.213947 7f76628c6700 10 monclient(hunting): renew_subs
   -68> 2014-06-24 10:49:07.213952 7f76628c6700  1 -- 10.141.8.182:6832/43239 <== osd.10 10.141.8.181:6800/24877 10005 ==== osd_op_reply(70278 1000009f90c.00000000 [??? 1~0,omap-set-header 0~222,omap-set-vals 0~6417,omap-rm-keys 0~433] v811'232250 uv232250 ondisk = 0) v6 ==== 313+0+0 (4229282455 0 0) 0x2624b0c80 con 0x375a3c0
   -67> 2014-06-24 10:49:07.214029 7f76628c6700  5 mds.1.33 is_laggy 79.848239 > 15 since last acked beacon
   -66> 2014-06-24 10:49:07.214035 7f76628c6700  5 mds.1.33 initiating monitor reconnect; maybe we're not the slow one
   -65> 2014-06-24 10:49:07.214037 7f76628c6700 10 monclient(hunting): _reopen_session rank -1 name 
   -64> 2014-06-24 10:49:07.214038 7f76628c6700  1 -- 10.141.8.182:6832/43239 mark_down 0x1c277aaa0 -- 0x2624b0f00
   -63> 2014-06-24 10:49:07.214055 7f765a67c700  2 -- 10.141.8.182:6832/43239 >> 10.141.8.181:6789/0 pipe(0x2624b0f00 sd=103 :0 s=4 pgs=0 cs=0 l=1 c=0x1c277aaa0).connect couldn't read banner, (0) Success
   -62> 2014-06-24 10:49:07.214066 7f76628c6700 10 monclient(hunting): picked mon.ceph001 con 0xb74461a0 addr 10.141.8.180:6789/0
   -61> 2014-06-24 10:49:07.214086 7f76628c6700 10 monclient(hunting): _send_mon_message to mon.ceph001 at 10.141.8.180:6789/0
   -60> 2014-06-24 10:49:07.214091 7f76628c6700  1 -- 10.141.8.182:6832/43239 --> 10.141.8.180:6789/0 -- auth(proto 0 42 bytes epoch 1) v1 -- ?+0 0x21d300b40 con 0xb74461a0
   -59> 2014-06-24 10:49:07.214101 7f76628c6700 10 monclient(hunting): renew_subs
   -58> 2014-06-24 10:49:07.214081 7f765a67c700  3 -- 10.141.8.182:6832/43239 >> 10.141.8.181:6789/0 pipe(0x2624b0f00 sd=103 :0 s=4 pgs=0 cs=0 l=1 c=0x1c277aaa0).connect fault, but state = closed != connecting, stopping
   -57> 2014-06-24 10:49:07.214111 7f76628c6700  1 -- 10.141.8.182:6832/43239 <== osd.21 10.141.8.182:6803/26308 10015 ==== osd_op_reply(65181 1000006b4f2.00000000 [??? 1~0,omap-set-header 0~222,omap-set-vals 0~399] v811'87242 uv87242 ondisk = 0) v6 ==== 271+0+0 (2860133359 0 0) 0x1b57fd000 con 0x3759fa0
   -56> 2014-06-24 10:49:07.214136 7f76628c6700  5 mds.1.33 is_laggy 79.848346 > 15 since last acked beacon
   -55> 2014-06-24 10:49:07.214141 7f76628c6700  5 mds.1.33 initiating monitor reconnect; maybe we're not the slow one
   -54> 2014-06-24 10:49:07.214143 7f76628c6700 10 monclient(hunting): _reopen_session rank -1 name 
   -53> 2014-06-24 10:49:07.214152 7f76628c6700  1 -- 10.141.8.182:6832/43239 mark_down 0xb74461a0 -- 0x1b57fcd80
   -52> 2014-06-24 10:49:07.214179 7f76628c6700 10 monclient(hunting): picked mon.ceph003 con 0xb7447380 addr 10.141.8.182:6789/0
   -51> 2014-06-24 10:49:07.214167 7f765986e700  2 -- 10.141.8.182:6832/43239 >> 10.141.8.180:6789/0 pipe(0x1b57fcd80 sd=130 :0 s=4 pgs=0 cs=0 l=1 c=0xb74461a0).connect couldn't read banner, (0) Success
   -50> 2014-06-24 10:49:07.214189 7f76628c6700 10 monclient(hunting): _send_mon_message to mon.ceph003 at 10.141.8.182:6789/0
   -49> 2014-06-24 10:49:07.214194 7f76628c6700  1 -- 10.141.8.182:6832/43239 --> 10.141.8.182:6789/0 -- auth(proto 0 42 bytes epoch 1) v1 -- ?+0 0x21d303f00 con 0xb7447380
   -48> 2014-06-24 10:49:07.214206 7f76628c6700 10 monclient(hunting): renew_subs
   -47> 2014-06-24 10:49:07.214214 7f76628c6700  1 -- 10.141.8.182:6832/43239 <== osd.10 10.141.8.181:6800/24877 10006 ==== osd_op_reply(70249 200000ab00b.00000000 [??? 1~0,omap-set-header 0~222,omap-set-vals 0~196281,omap-rm-keys 0~13579] v811'250306 uv250306 ondisk = 0) v6 ==== 313+0+0 (1741933792 0 0) 0x26258fd00 con 0x375a3c0
   -46> 2014-06-24 10:49:07.214191 7f765986e700  3 -- 10.141.8.182:6832/43239 >> 10.141.8.180:6789/0 pipe(0x1b57fcd80 sd=130 :0 s=4 pgs=0 cs=0 l=1 c=0xb74461a0).connect fault, but state = closed != connecting, stopping
   -45> 2014-06-24 10:49:07.216106 7f76628c6700  5 mds.1.33 is_laggy 79.850316 > 15 since last acked beacon
   -44> 2014-06-24 10:49:07.216117 7f76628c6700  5 mds.1.33 initiating monitor reconnect; maybe we're not the slow one
   -43> 2014-06-24 10:49:07.216118 7f76628c6700 10 monclient(hunting): _reopen_session rank -1 name 
   -42> 2014-06-24 10:49:07.216121 7f76628c6700  1 -- 10.141.8.182:6832/43239 mark_down 0xb7447380 -- 0x2624b0c80
   -41> 2014-06-24 10:49:07.216169 7f7658f65700  2 -- 10.141.8.182:6832/43239 >> 10.141.8.182:6789/0 pipe(0x2624b0c80 sd=103 :50609 s=4 pgs=20558 cs=1 l=1 c=0xb7447380).reader couldn't read tag, (0) Success
   -40> 2014-06-24 10:49:07.216186 7f76628c6700 10 monclient(hunting): picked mon.ceph001 con 0x13dcd4ba0 addr 10.141.8.180:6789/0
   -39> 2014-06-24 10:49:07.216209 7f76628c6700 10 monclient(hunting): _send_mon_message to mon.ceph001 at 10.141.8.180:6789/0
   -38> 2014-06-24 10:49:07.216199 7f7658f65700  2 -- 10.141.8.182:6832/43239 >> 10.141.8.182:6789/0 pipe(0x2624b0c80 sd=103 :50609 s=4 pgs=20558 cs=1 l=1 c=0xb7447380).fault (0) Success
   -37> 2014-06-24 10:49:07.216216 7f76628c6700  1 -- 10.141.8.182:6832/43239 --> 10.141.8.180:6789/0 -- auth(proto 0 42 bytes epoch 1) v1 -- ?+0 0x21d304140 con 0x13dcd4ba0
   -36> 2014-06-24 10:49:07.216224 7f76628c6700 10 monclient(hunting): renew_subs
   -35> 2014-06-24 10:49:07.216232 7f76628c6700  5 mds.1.33 ms_handle_connect on 10.141.8.182:6789/0
   -34> 2014-06-24 10:49:07.216241 7f76628c6700  1 -- 10.141.8.182:6832/43239 <== osd.21 10.141.8.182:6803/26308 10016 ==== osd_op_reply(65216 1000006b900.00000000 [??? 1~0,omap-set-header 0~222,omap-set-vals 0~400] v811'93660 uv93660 ondisk = 0) v6 ==== 271+0+0 (2766708515 0 0) 0x1b57fd280 con 0x3759fa0
   -33> 2014-06-24 10:49:07.216280 7f76628c6700  5 mds.1.33 is_laggy 79.850489 > 15 since last acked beacon
   -32> 2014-06-24 10:49:07.216287 7f76628c6700  5 mds.1.33 initiating monitor reconnect; maybe we're not the slow one
   -31> 2014-06-24 10:49:07.216288 7f76628c6700 10 monclient(hunting): _reopen_session rank -1 name 
   -30> 2014-06-24 10:49:07.216292 7f76628c6700  1 -- 10.141.8.182:6832/43239 mark_down 0x13dcd4ba0 -- 0x1b57fd000
   -29> 2014-06-24 10:49:07.216322 7f76628c6700 10 monclient(hunting): picked mon.ceph003 con 0x2c119a20 addr 10.141.8.182:6789/0
   -28> 2014-06-24 10:49:07.216332 7f76628c6700 10 monclient(hunting): _send_mon_message to mon.ceph003 at 10.141.8.182:6789/0
   -27> 2014-06-24 10:49:07.216310 7f765986e700  2 -- 10.141.8.182:6832/43239 >> 10.141.8.180:6789/0 pipe(0x1b57fd000 sd=130 :0 s=4 pgs=0 cs=0 l=1 c=0x13dcd4ba0).connect couldn't read banner, (0) Success
   -26> 2014-06-24 10:49:07.216341 7f76628c6700  1 -- 10.141.8.182:6832/43239 --> 10.141.8.182:6789/0 -- auth(proto 0 42 bytes epoch 1) v1 -- ?+0 0x21d3072c0 con 0x2c119a20
   -25> 2014-06-24 10:49:07.216354 7f76628c6700 10 monclient(hunting): renew_subs
   -24> 2014-06-24 10:49:07.216350 7f765986e700  3 -- 10.141.8.182:6832/43239 >> 10.141.8.180:6789/0 pipe(0x1b57fd000 sd=130 :0 s=4 pgs=0 cs=0 l=1 c=0x13dcd4ba0).connect fault, but state = closed != connecting, stopping
   -23> 2014-06-24 10:49:07.216361 7f76628c6700  1 -- 10.141.8.182:6832/43239 <== osd.10 10.141.8.181:6800/24877 10007 ==== osd_op_reply(70317 100000a0f12.00000000 [??? 1~0,omap-set-header 0~222,omap-set-vals 0~1594] v811'289159 uv289159 ondisk = 0) v6 ==== 271+0+0 (1696876278 0 0) 0x26258fa80 con 0x375a3c0
   -22> 2014-06-24 10:49:07.216408 7f76628c6700  5 mds.1.33 is_laggy 79.850617 > 15 since last acked beacon
   -21> 2014-06-24 10:49:07.216411 7f76628c6700  5 mds.1.33 initiating monitor reconnect; maybe we're not the slow one
   -20> 2014-06-24 10:49:07.216412 7f76628c6700 10 monclient(hunting): _reopen_session rank -1 name 
   -19> 2014-06-24 10:49:07.216414 7f76628c6700  1 -- 10.141.8.182:6832/43239 mark_down 0x2c119a20 -- 0x26258fd00
   -18> 2014-06-24 10:49:07.216432 7f76628c6700 10 monclient(hunting): picked mon.ceph002 con 0x2c11d540 addr 10.141.8.181:6789/0
   -17> 2014-06-24 10:49:07.216438 7f76628c6700 10 monclient(hunting): _send_mon_message to mon.ceph002 at 10.141.8.181:6789/0
   -16> 2014-06-24 10:49:07.216440 7f76628c6700  1 -- 10.141.8.182:6832/43239 --> 10.141.8.181:6789/0 -- auth(proto 0 42 bytes epoch 1) v1 -- ?+0 0x21ca686c0 con 0x2c11d540
   -15> 2014-06-24 10:49:07.216445 7f76628c6700 10 monclient(hunting): renew_subs
   -14> 2014-06-24 10:49:07.216451 7f76628c6700  1 -- 10.141.8.182:6832/43239 <== osd.1 10.141.8.180:6803/28178 7803 ==== osd_op_reply(26839 1000009992c.00000000 [??? 1~0,omap-set-header 0~222,omap-set-vals 0~4859,omap-rm-keys 0~371] v813'77416 uv77416 ondisk = 0) v6 ==== 313+0+0 (2412616183 0 0) 0x25bb8f800 con 0x375b2e0
   -13> 2014-06-24 10:49:07.216497 7f76628c6700  5 mds.1.33 is_laggy 79.850707 > 15 since last acked beacon
   -12> 2014-06-24 10:49:07.216503 7f76628c6700  5 mds.1.33 initiating monitor reconnect; maybe we're not the slow one
   -11> 2014-06-24 10:49:07.216504 7f76628c6700 10 monclient(hunting): _reopen_session rank -1 name 
   -10> 2014-06-24 10:49:07.216507 7f76628c6700  1 -- 10.141.8.182:6832/43239 mark_down 0x2c11d540 -- 0x1b57fd280
    -9> 2014-06-24 10:49:07.216484 7f7658f65700  2 -- 10.141.8.182:6832/43239 >> 10.141.8.182:6789/0 pipe(0x26258fd00 sd=103 :0 s=4 pgs=0 cs=0 l=1 c=0x2c119a20).connect couldn't read banner, (0) Success
    -8> 2014-06-24 10:49:07.216534 7f7658f65700  3 -- 10.141.8.182:6832/43239 >> 10.141.8.182:6789/0 pipe(0x26258fd00 sd=103 :0 s=4 pgs=0 cs=0 l=1 c=0x2c119a20).connect fault, but state = closed != connecting, stopping
    -7> 2014-06-24 10:49:07.216530 7f765986e700  2 -- 10.141.8.182:6832/43239 >> 10.141.8.181:6789/0 pipe(0x1b57fd280 sd=130 :0 s=4 pgs=0 cs=0 l=1 c=0x2c11d540).connect couldn't read banner, (0) Success
    -6> 2014-06-24 10:49:07.216551 7f76628c6700 10 monclient(hunting): picked mon.ceph001 con 0x2c11d960 addr 10.141.8.180:6789/0
    -5> 2014-06-24 10:49:07.216569 7f76628c6700 10 monclient(hunting): _send_mon_message to mon.ceph001 at 10.141.8.180:6789/0
    -4> 2014-06-24 10:49:07.216577 7f76628c6700  1 -- 10.141.8.182:6832/43239 --> 10.141.8.180:6789/0 -- auth(proto 0 42 bytes epoch 1) v1 -- ?+0 0x21ca6ca40 con 0x2c11d960
    -3> 2014-06-24 10:49:07.216554 7f765986e700  3 -- 10.141.8.182:6832/43239 >> 10.141.8.181:6789/0 pipe(0x1b57fd280 sd=130 :0 s=4 pgs=0 cs=0 l=1 c=0x2c11d540).connect fault, but state = closed != connecting, stopping
    -2> 2014-06-24 10:49:07.216587 7f76628c6700 10 monclient(hunting): renew_subs
    -1> 2014-06-24 10:49:07.216594 7f76628c6700  1 -- 10.141.8.182:6832/43239 <== osd.0 10.141.8.180:6800/26720 7386 ==== osd_op_reply(69777 200000a8488.00000000 [??? 1~0,omap-set-header 0~222,omap-set-vals 0~7618,omap-rm-keys 0~512] v0'0 uv0 ondisk = -108 ((108) Cannot send after transport endpoint shutdown)) v6 ==== 313+0+0 (1160011566 0 0) 0x258993480 con 0x3759e40
     0> 2014-06-24 10:49:07.235120 7f76628c6700 -1 mds/CDir.cc: In function 'virtual void C_Dir_Committed::finish(int)' thread 7f76628c6700 time 2014-06-24 10:49:07.219091
mds/CDir.cc: 1767: FAILED assert(r == 0)

ceph version 0.81 (8de9501df275a5fe29f2c64cb44f195130e4a8fc)
 1: /usr/bin/ceph-mds() [0x75d361]
 2: (Context::complete(int)+0x9) [0x56a829]
 3: (C_Gather::sub_finish(Context*, int)+0x227) [0x56c3a7]
 4: (C_Gather::C_GatherSub::finish(int)+0x12) [0x56c4b2]
 5: (Context::complete(int)+0x9) [0x56a829]
 6: (Objecter::handle_osd_op_reply(MOSDOpReply*)+0xf56) [0x7cf306]
 7: (MDS::handle_core_message(Message*)+0xb27) [0x58c447]
 8: (MDS::_dispatch(Message*)+0x45) [0x58c655]
 9: (MDS::ms_dispatch(Message*)+0xc3) [0x58e0a3]
 10: (DispatchQueue::entry()+0x509) [0x9a9409]
 11: (DispatchQueue::DispatchThread::entry()+0xd) [0x8c205d]
 12: (()+0x7df3) [0x7f766750fdf3]
 13: (clone()+0x6d) [0x7f76663e63dd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   0/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 keyvaluestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /var/log/ceph/ceph-mds.ceph003.cubone.os.log
 ceph version 0.81 (8de9501df275a5fe29f2c64cb44f195130e4a8fc)
 1: /usr/bin/ceph-mds() [0x7f4b76]
 2: (()+0xf130) [0x7f7667517130]
 3: (gsignal()+0x39) [0x7f7666325989]
 4: (abort()+0x148) [0x7f7666327098]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f7666c389d5]
 6: (()+0x5e946) [0x7f7666c36946]
 7: (()+0x5e973) [0x7f7666c36973]
 8: (()+0x5eb9f) [0x7f7666c36b9f]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1ef) [0x8daddf]
 10: /usr/bin/ceph-mds() [0x75d361]
 11: (Context::complete(int)+0x9) [0x56a829]
 12: (C_Gather::sub_finish(Context*, int)+0x227) [0x56c3a7]
 13: (C_Gather::C_GatherSub::finish(int)+0x12) [0x56c4b2]
 14: (Context::complete(int)+0x9) [0x56a829]
 15: (Objecter::handle_osd_op_reply(MOSDOpReply*)+0xf56) [0x7cf306]
 16: (MDS::handle_core_message(Message*)+0xb27) [0x58c447]
 17: (MDS::_dispatch(Message*)+0x45) [0x58c655]
 18: (MDS::ms_dispatch(Message*)+0xc3) [0x58e0a3]
 19: (DispatchQueue::entry()+0x509) [0x9a9409]
 20: (DispatchQueue::DispatchThread::entry()+0xd) [0x8c205d]
 21: (()+0x7df3) [0x7f766750fdf3]
 22: (clone()+0x6d) [0x7f76663e63dd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---
     0> 2014-06-24 10:49:07.490111 7f76628c6700 -1 *** Caught signal (Aborted) **
 in thread 7f76628c6700

 ceph version 0.81 (8de9501df275a5fe29f2c64cb44f195130e4a8fc)
 1: /usr/bin/ceph-mds() [0x7f4b76]
 2: (()+0xf130) [0x7f7667517130]
 3: (gsignal()+0x39) [0x7f7666325989]
 4: (abort()+0x148) [0x7f7666327098]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f7666c389d5]
 6: (()+0x5e946) [0x7f7666c36946]
 7: (()+0x5e973) [0x7f7666c36973]
 8: (()+0x5eb9f) [0x7f7666c36b9f]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1ef) [0x8daddf]
 10: /usr/bin/ceph-mds() [0x75d361]
 11: (Context::complete(int)+0x9) [0x56a829]
 12: (C_Gather::sub_finish(Context*, int)+0x227) [0x56c3a7]
 13: (C_Gather::C_GatherSub::finish(int)+0x12) [0x56c4b2]
 14: (Context::complete(int)+0x9) [0x56a829]
 15: (Objecter::handle_osd_op_reply(MOSDOpReply*)+0xf56) [0x7cf306]
 16: (MDS::handle_core_message(Message*)+0xb27) [0x58c447]
 17: (MDS::_dispatch(Message*)+0x45) [0x58c655]
 18: (MDS::ms_dispatch(Message*)+0xc3) [0x58e0a3]
 19: (DispatchQueue::entry()+0x509) [0x9a9409]
 20: (DispatchQueue::DispatchThread::entry()+0xd) [0x8c205d]
 21: (()+0x7df3) [0x7f766750fdf3]
 22: (clone()+0x6d) [0x7f76663e63dd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Files

ceph-log.tar.bz (121 KB), Kenneth Waegeman, 06/24/2014 02:15 AM
