Bug #8651: crashing mds in an active-active mds setup
Status: Closed
Description
Two active MDS daemons; they crash while writing 4 rsync streams to the filesystem with the kernel client (cephko). The mdsmap at the time of the crash:
{ "mdsmap": { "epoch": 1428,
"flags": 0,
"created": "2014-06-17 16:18:31.237442",
"modified": "2014-06-24 10:55:46.211994",
"tableserver": 0,
"root": 0,
"session_timeout": 60,
"session_autoclose": 300,
"max_file_size": 1099511627776,
"last_failure": 1419,
"last_failure_osd_epoch": 814,
"compat": { "compat": {},
"ro_compat": {},
"incompat": { "feature_1": "base v0.20",
"feature_2": "client writeable ranges",
"feature_3": "default file layouts on dirs",
"feature_4": "dir inode in separate object",
"feature_5": "mds uses versioned encoding",
"feature_6": "dirfrag is stored in omap",
"feature_8": "no anchor table"}},
"max_mds": 2,
"in": [
0,
1],
"up": { "mds_0": 23863,
"mds_1": 24119},
"failed": [],
"stopped": [],
"info": { "gid_23863": { "gid": 23863,
"name": "ceph002.cubone.os",
"rank": 0,
"incarnation": 147,
"state": "up:active",
"state_seq": 82,
"addr": "10.141.8.181:6833\/49442",
"standby_for_rank": -1,
"standby_for_name": "",
"export_targets": []},
"gid_24119": { "gid": 24119,
"name": "ceph001.cubone.os",
"rank": 1,
"incarnation": 34,
"state": "up:active",
"state_seq": 19,
"addr": "10.141.8.180:6831\/37285",
"standby_for_rank": -1,
"standby_for_name": "",
"export_targets": []}},
"data_pools": [
0],
"metadata_pool": 1},
"mdsmap_first_committed": 751,
"mdsmap_last_committed": 1428}
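For quick triage, the per-daemon states can be pulled out of a map like the one above with a few lines of Python. This is a minimal sketch; MDSMAP_JSON is a trimmed stand-in for the full dump (only the fields the sketch reads), not the complete map.

```python
import json

# Trimmed stand-in for the mdsmap dump above (hypothetical subset).
MDSMAP_JSON = """
{ "mdsmap": { "max_mds": 2, "failed": [],
  "info": { "gid_23863": { "name": "ceph002.cubone.os", "rank": 0, "state": "up:active" },
            "gid_24119": { "name": "ceph001.cubone.os", "rank": 1, "state": "up:active" } } } }
"""

def summarize(dump: str) -> list[str]:
    """One line per MDS daemon: name, rank, and state."""
    m = json.loads(dump)["mdsmap"]
    return [f"{d['name']} rank={d['rank']} {d['state']}"
            for d in m["info"].values()]

print("\n".join(summarize(MDSMAP_JSON)))
```

Both ranks report up:active here, which matches max_mds = 2; the crash below happens on one of these active ranks while it is laggy, not on a standby.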
-69> 2014-06-24 10:49:07.213947 7f76628c6700 10 monclient(hunting): renew_subs
-68> 2014-06-24 10:49:07.213952 7f76628c6700 1 -- 10.141.8.182:6832/43239 <== osd.10 10.141.8.181:6800/24877 10005 ==== osd_op_reply(70278 1000009f90c.00000000 [??? 1~0,omap-set-header 0~222,omap-set-vals 0~6417,omap-rm-keys 0~433] v811'232250 uv232250 ondisk = 0) v6 ==== 313+0+0 (4229282455 0 0) 0x2624b0c80 con 0x375a3c0
-67> 2014-06-24 10:49:07.214029 7f76628c6700 5 mds.1.33 is_laggy 79.848239 > 15 since last acked beacon
-66> 2014-06-24 10:49:07.214035 7f76628c6700 5 mds.1.33 initiating monitor reconnect; maybe we're not the slow one
-65> 2014-06-24 10:49:07.214037 7f76628c6700 10 monclient(hunting): _reopen_session rank -1 name
-64> 2014-06-24 10:49:07.214038 7f76628c6700 1 -- 10.141.8.182:6832/43239 mark_down 0x1c277aaa0 -- 0x2624b0f00
-63> 2014-06-24 10:49:07.214055 7f765a67c700 2 -- 10.141.8.182:6832/43239 >> 10.141.8.181:6789/0 pipe(0x2624b0f00 sd=103 :0 s=4 pgs=0 cs=0 l=1 c=0x1c277aaa0).connect couldn't read banner, (0) Success
-62> 2014-06-24 10:49:07.214066 7f76628c6700 10 monclient(hunting): picked mon.ceph001 con 0xb74461a0 addr 10.141.8.180:6789/0
-61> 2014-06-24 10:49:07.214086 7f76628c6700 10 monclient(hunting): _send_mon_message to mon.ceph001 at 10.141.8.180:6789/0
-60> 2014-06-24 10:49:07.214091 7f76628c6700 1 -- 10.141.8.182:6832/43239 --> 10.141.8.180:6789/0 -- auth(proto 0 42 bytes epoch 1) v1 -- ?+0 0x21d300b40 con 0xb74461a0
-59> 2014-06-24 10:49:07.214101 7f76628c6700 10 monclient(hunting): renew_subs
-58> 2014-06-24 10:49:07.214081 7f765a67c700 3 -- 10.141.8.182:6832/43239 >> 10.141.8.181:6789/0 pipe(0x2624b0f00 sd=103 :0 s=4 pgs=0 cs=0 l=1 c=0x1c277aaa0).connect fault, but state = closed != connecting, stopping
-57> 2014-06-24 10:49:07.214111 7f76628c6700 1 -- 10.141.8.182:6832/43239 <== osd.21 10.141.8.182:6803/26308 10015 ==== osd_op_reply(65181 1000006b4f2.00000000 [??? 1~0,omap-set-header 0~222,omap-set-vals 0~399] v811'87242 uv87242 ondisk = 0) v6 ==== 271+0+0 (2860133359 0 0) 0x1b57fd000 con 0x3759fa0
-56> 2014-06-24 10:49:07.214136 7f76628c6700 5 mds.1.33 is_laggy 79.848346 > 15 since last acked beacon
-55> 2014-06-24 10:49:07.214141 7f76628c6700 5 mds.1.33 initiating monitor reconnect; maybe we're not the slow one
-54> 2014-06-24 10:49:07.214143 7f76628c6700 10 monclient(hunting): _reopen_session rank -1 name
-53> 2014-06-24 10:49:07.214152 7f76628c6700 1 -- 10.141.8.182:6832/43239 mark_down 0xb74461a0 -- 0x1b57fcd80
-52> 2014-06-24 10:49:07.214179 7f76628c6700 10 monclient(hunting): picked mon.ceph003 con 0xb7447380 addr 10.141.8.182:6789/0
-51> 2014-06-24 10:49:07.214167 7f765986e700 2 -- 10.141.8.182:6832/43239 >> 10.141.8.180:6789/0 pipe(0x1b57fcd80 sd=130 :0 s=4 pgs=0 cs=0 l=1 c=0xb74461a0).connect couldn't read banner, (0) Success
-50> 2014-06-24 10:49:07.214189 7f76628c6700 10 monclient(hunting): _send_mon_message to mon.ceph003 at 10.141.8.182:6789/0
-49> 2014-06-24 10:49:07.214194 7f76628c6700 1 -- 10.141.8.182:6832/43239 --> 10.141.8.182:6789/0 -- auth(proto 0 42 bytes epoch 1) v1 -- ?+0 0x21d303f00 con 0xb7447380
-48> 2014-06-24 10:49:07.214206 7f76628c6700 10 monclient(hunting): renew_subs
-47> 2014-06-24 10:49:07.214214 7f76628c6700 1 -- 10.141.8.182:6832/43239 <== osd.10 10.141.8.181:6800/24877 10006 ==== osd_op_reply(70249 200000ab00b.00000000 [??? 1~0,omap-set-header 0~222,omap-set-vals 0~196281,omap-rm-keys 0~13579] v811'250306 uv250306 ondisk = 0) v6 ==== 313+0+0 (1741933792 0 0) 0x26258fd00 con 0x375a3c0
-46> 2014-06-24 10:49:07.214191 7f765986e700 3 -- 10.141.8.182:6832/43239 >> 10.141.8.180:6789/0 pipe(0x1b57fcd80 sd=130 :0 s=4 pgs=0 cs=0 l=1 c=0xb74461a0).connect fault, but state = closed != connecting, stopping
-45> 2014-06-24 10:49:07.216106 7f76628c6700 5 mds.1.33 is_laggy 79.850316 > 15 since last acked beacon
-44> 2014-06-24 10:49:07.216117 7f76628c6700 5 mds.1.33 initiating monitor reconnect; maybe we're not the slow one
-43> 2014-06-24 10:49:07.216118 7f76628c6700 10 monclient(hunting): _reopen_session rank -1 name
-42> 2014-06-24 10:49:07.216121 7f76628c6700 1 -- 10.141.8.182:6832/43239 mark_down 0xb7447380 -- 0x2624b0c80
-41> 2014-06-24 10:49:07.216169 7f7658f65700 2 -- 10.141.8.182:6832/43239 >> 10.141.8.182:6789/0 pipe(0x2624b0c80 sd=103 :50609 s=4 pgs=20558 cs=1 l=1 c=0xb7447380).reader couldn't read tag, (0) Success
-40> 2014-06-24 10:49:07.216186 7f76628c6700 10 monclient(hunting): picked mon.ceph001 con 0x13dcd4ba0 addr 10.141.8.180:6789/0
-39> 2014-06-24 10:49:07.216209 7f76628c6700 10 monclient(hunting): _send_mon_message to mon.ceph001 at 10.141.8.180:6789/0
-38> 2014-06-24 10:49:07.216199 7f7658f65700 2 -- 10.141.8.182:6832/43239 >> 10.141.8.182:6789/0 pipe(0x2624b0c80 sd=103 :50609 s=4 pgs=20558 cs=1 l=1 c=0xb7447380).fault (0) Success
-37> 2014-06-24 10:49:07.216216 7f76628c6700 1 -- 10.141.8.182:6832/43239 --> 10.141.8.180:6789/0 -- auth(proto 0 42 bytes epoch 1) v1 -- ?+0 0x21d304140 con 0x13dcd4ba0
-36> 2014-06-24 10:49:07.216224 7f76628c6700 10 monclient(hunting): renew_subs
-35> 2014-06-24 10:49:07.216232 7f76628c6700 5 mds.1.33 ms_handle_connect on 10.141.8.182:6789/0
-34> 2014-06-24 10:49:07.216241 7f76628c6700 1 -- 10.141.8.182:6832/43239 <== osd.21 10.141.8.182:6803/26308 10016 ==== osd_op_reply(65216 1000006b900.00000000 [??? 1~0,omap-set-header 0~222,omap-set-vals 0~400] v811'93660 uv93660 ondisk = 0) v6 ==== 271+0+0 (2766708515 0 0) 0x1b57fd280 con 0x3759fa0
-33> 2014-06-24 10:49:07.216280 7f76628c6700 5 mds.1.33 is_laggy 79.850489 > 15 since last acked beacon
-32> 2014-06-24 10:49:07.216287 7f76628c6700 5 mds.1.33 initiating monitor reconnect; maybe we're not the slow one
-31> 2014-06-24 10:49:07.216288 7f76628c6700 10 monclient(hunting): _reopen_session rank -1 name
-30> 2014-06-24 10:49:07.216292 7f76628c6700 1 -- 10.141.8.182:6832/43239 mark_down 0x13dcd4ba0 -- 0x1b57fd000
-29> 2014-06-24 10:49:07.216322 7f76628c6700 10 monclient(hunting): picked mon.ceph003 con 0x2c119a20 addr 10.141.8.182:6789/0
-28> 2014-06-24 10:49:07.216332 7f76628c6700 10 monclient(hunting): _send_mon_message to mon.ceph003 at 10.141.8.182:6789/0
-27> 2014-06-24 10:49:07.216310 7f765986e700 2 -- 10.141.8.182:6832/43239 >> 10.141.8.180:6789/0 pipe(0x1b57fd000 sd=130 :0 s=4 pgs=0 cs=0 l=1 c=0x13dcd4ba0).connect couldn't read banner, (0) Success
-26> 2014-06-24 10:49:07.216341 7f76628c6700 1 -- 10.141.8.182:6832/43239 --> 10.141.8.182:6789/0 -- auth(proto 0 42 bytes epoch 1) v1 -- ?+0 0x21d3072c0 con 0x2c119a20
-25> 2014-06-24 10:49:07.216354 7f76628c6700 10 monclient(hunting): renew_subs
-24> 2014-06-24 10:49:07.216350 7f765986e700 3 -- 10.141.8.182:6832/43239 >> 10.141.8.180:6789/0 pipe(0x1b57fd000 sd=130 :0 s=4 pgs=0 cs=0 l=1 c=0x13dcd4ba0).connect fault, but state = closed != connecting, stopping
-23> 2014-06-24 10:49:07.216361 7f76628c6700 1 -- 10.141.8.182:6832/43239 <== osd.10 10.141.8.181:6800/24877 10007 ==== osd_op_reply(70317 100000a0f12.00000000 [??? 1~0,omap-set-header 0~222,omap-set-vals 0~1594] v811'289159 uv289159 ondisk = 0) v6 ==== 271+0+0 (1696876278 0 0) 0x26258fa80 con 0x375a3c0
-22> 2014-06-24 10:49:07.216408 7f76628c6700 5 mds.1.33 is_laggy 79.850617 > 15 since last acked beacon
-21> 2014-06-24 10:49:07.216411 7f76628c6700 5 mds.1.33 initiating monitor reconnect; maybe we're not the slow one
-20> 2014-06-24 10:49:07.216412 7f76628c6700 10 monclient(hunting): _reopen_session rank -1 name
-19> 2014-06-24 10:49:07.216414 7f76628c6700 1 -- 10.141.8.182:6832/43239 mark_down 0x2c119a20 -- 0x26258fd00
-18> 2014-06-24 10:49:07.216432 7f76628c6700 10 monclient(hunting): picked mon.ceph002 con 0x2c11d540 addr 10.141.8.181:6789/0
-17> 2014-06-24 10:49:07.216438 7f76628c6700 10 monclient(hunting): _send_mon_message to mon.ceph002 at 10.141.8.181:6789/0
-16> 2014-06-24 10:49:07.216440 7f76628c6700 1 -- 10.141.8.182:6832/43239 --> 10.141.8.181:6789/0 -- auth(proto 0 42 bytes epoch 1) v1 -- ?+0 0x21ca686c0 con 0x2c11d540
-15> 2014-06-24 10:49:07.216445 7f76628c6700 10 monclient(hunting): renew_subs
-14> 2014-06-24 10:49:07.216451 7f76628c6700 1 -- 10.141.8.182:6832/43239 <== osd.1 10.141.8.180:6803/28178 7803 ==== osd_op_reply(26839 1000009992c.00000000 [??? 1~0,omap-set-header 0~222,omap-set-vals 0~4859,omap-rm-keys 0~371] v813'77416 uv77416 ondisk = 0) v6 ==== 313+0+0 (2412616183 0 0) 0x25bb8f800 con 0x375b2e0
-13> 2014-06-24 10:49:07.216497 7f76628c6700 5 mds.1.33 is_laggy 79.850707 > 15 since last acked beacon
-12> 2014-06-24 10:49:07.216503 7f76628c6700 5 mds.1.33 initiating monitor reconnect; maybe we're not the slow one
-11> 2014-06-24 10:49:07.216504 7f76628c6700 10 monclient(hunting): _reopen_session rank -1 name
-10> 2014-06-24 10:49:07.216507 7f76628c6700 1 -- 10.141.8.182:6832/43239 mark_down 0x2c11d540 -- 0x1b57fd280
-9> 2014-06-24 10:49:07.216484 7f7658f65700 2 -- 10.141.8.182:6832/43239 >> 10.141.8.182:6789/0 pipe(0x26258fd00 sd=103 :0 s=4 pgs=0 cs=0 l=1 c=0x2c119a20).connect couldn't read banner, (0) Success
-8> 2014-06-24 10:49:07.216534 7f7658f65700 3 -- 10.141.8.182:6832/43239 >> 10.141.8.182:6789/0 pipe(0x26258fd00 sd=103 :0 s=4 pgs=0 cs=0 l=1 c=0x2c119a20).connect fault, but state = closed != connecting, stopping
-7> 2014-06-24 10:49:07.216530 7f765986e700 2 -- 10.141.8.182:6832/43239 >> 10.141.8.181:6789/0 pipe(0x1b57fd280 sd=130 :0 s=4 pgs=0 cs=0 l=1 c=0x2c11d540).connect couldn't read banner, (0) Success
-6> 2014-06-24 10:49:07.216551 7f76628c6700 10 monclient(hunting): picked mon.ceph001 con 0x2c11d960 addr 10.141.8.180:6789/0
-5> 2014-06-24 10:49:07.216569 7f76628c6700 10 monclient(hunting): _send_mon_message to mon.ceph001 at 10.141.8.180:6789/0
-4> 2014-06-24 10:49:07.216577 7f76628c6700 1 -- 10.141.8.182:6832/43239 --> 10.141.8.180:6789/0 -- auth(proto 0 42 bytes epoch 1) v1 -- ?+0 0x21ca6ca40 con 0x2c11d960
-3> 2014-06-24 10:49:07.216554 7f765986e700 3 -- 10.141.8.182:6832/43239 >> 10.141.8.181:6789/0 pipe(0x1b57fd280 sd=130 :0 s=4 pgs=0 cs=0 l=1 c=0x2c11d540).connect fault, but state = closed != connecting, stopping
-2> 2014-06-24 10:49:07.216587 7f76628c6700 10 monclient(hunting): renew_subs
-1> 2014-06-24 10:49:07.216594 7f76628c6700 1 -- 10.141.8.182:6832/43239 <== osd.0 10.141.8.180:6800/26720 7386 ==== osd_op_reply(69777 200000a8488.00000000 [??? 1~0,omap-set-header 0~222,omap-set-vals 0~7618,omap-rm-keys 0~512] v0'0 uv0 ondisk = -108 ((108) Cannot send after transport endpoint shutdown)) v6 ==== 313+0+0 (1160011566 0 0) 0x258993480 con 0x3759e40
0> 2014-06-24 10:49:07.235120 7f76628c6700 -1 mds/CDir.cc: In function 'virtual void C_Dir_Committed::finish(int)' thread 7f76628c6700 time 2014-06-24 10:49:07.219091
mds/CDir.cc: 1767: FAILED assert(r == 0)
ceph version 0.81 (8de9501df275a5fe29f2c64cb44f195130e4a8fc)
 1: /usr/bin/ceph-mds() [0x75d361]
 2: (Context::complete(int)+0x9) [0x56a829]
 3: (C_Gather::sub_finish(Context*, int)+0x227) [0x56c3a7]
 4: (C_Gather::C_GatherSub::finish(int)+0x12) [0x56c4b2]
 5: (Context::complete(int)+0x9) [0x56a829]
 6: (Objecter::handle_osd_op_reply(MOSDOpReply*)+0xf56) [0x7cf306]
 7: (MDS::handle_core_message(Message*)+0xb27) [0x58c447]
 8: (MDS::_dispatch(Message*)+0x45) [0x58c655]
 9: (MDS::ms_dispatch(Message*)+0xc3) [0x58e0a3]
 10: (DispatchQueue::entry()+0x509) [0x9a9409]
 11: (DispatchQueue::DispatchThread::entry()+0xd) [0x8c205d]
 12: (()+0x7df3) [0x7f766750fdf3]
 13: (clone()+0x6d) [0x7f76663e63dd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   0/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 keyvaluestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent 10000
  max_new 1000
  log_file /var/log/ceph/ceph-mds.ceph003.cubone.os.log
ceph version 0.81 (8de9501df275a5fe29f2c64cb44f195130e4a8fc)
 1: /usr/bin/ceph-mds() [0x7f4b76]
 2: (()+0xf130) [0x7f7667517130]
 3: (gsignal()+0x39) [0x7f7666325989]
 4: (abort()+0x148) [0x7f7666327098]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f7666c389d5]
 6: (()+0x5e946) [0x7f7666c36946]
 7: (()+0x5e973) [0x7f7666c36973]
 8: (()+0x5eb9f) [0x7f7666c36b9f]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1ef) [0x8daddf]
 10: /usr/bin/ceph-mds() [0x75d361]
 11: (Context::complete(int)+0x9) [0x56a829]
 12: (C_Gather::sub_finish(Context*, int)+0x227) [0x56c3a7]
 13: (C_Gather::C_GatherSub::finish(int)+0x12) [0x56c4b2]
 14: (Context::complete(int)+0x9) [0x56a829]
 15: (Objecter::handle_osd_op_reply(MOSDOpReply*)+0xf56) [0x7cf306]
 16: (MDS::handle_core_message(Message*)+0xb27) [0x58c447]
 17: (MDS::_dispatch(Message*)+0x45) [0x58c655]
 18: (MDS::ms_dispatch(Message*)+0xc3) [0x58e0a3]
 19: (DispatchQueue::entry()+0x509) [0x9a9409]
 20: (DispatchQueue::DispatchThread::entry()+0xd) [0x8c205d]
 21: (()+0x7df3) [0x7f766750fdf3]
 22: (clone()+0x6d) [0x7f76663e63dd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
--- begin dump of recent events ---
0> 2014-06-24 10:49:07.490111 7f76628c6700 -1 *** Caught signal (Aborted) ** in thread 7f76628c6700
ceph version 0.81 (8de9501df275a5fe29f2c64cb44f195130e4a8fc)
 1: /usr/bin/ceph-mds() [0x7f4b76]
 2: (()+0xf130) [0x7f7667517130]
 3: (gsignal()+0x39) [0x7f7666325989]
 4: (abort()+0x148) [0x7f7666327098]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f7666c389d5]
 6: (()+0x5e946) [0x7f7666c36946]
 7: (()+0x5e973) [0x7f7666c36973]
 8: (()+0x5eb9f) [0x7f7666c36b9f]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1ef) [0x8daddf]
 10: /usr/bin/ceph-mds() [0x75d361]
 11: (Context::complete(int)+0x9) [0x56a829]
 12: (C_Gather::sub_finish(Context*, int)+0x227) [0x56c3a7]
 13: (C_Gather::C_GatherSub::finish(int)+0x12) [0x56c4b2]
 14: (Context::complete(int)+0x9) [0x56a829]
 15: (Objecter::handle_osd_op_reply(MOSDOpReply*)+0xf56) [0x7cf306]
 16: (MDS::handle_core_message(Message*)+0xb27) [0x58c447]
 17: (MDS::_dispatch(Message*)+0x45) [0x58c655]
 18: (MDS::ms_dispatch(Message*)+0xc3) [0x58e0a3]
 19: (DispatchQueue::entry()+0x509) [0x9a9409]
 20: (DispatchQueue::DispatchThread::entry()+0xd) [0x8c205d]
 21: (()+0x7df3) [0x7f766750fdf3]
 22: (clone()+0x6d) [0x7f76663e63dd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
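The trace points at C_Dir_Committed::finish() asserting on a non-zero OSD reply: while the MDS was laggy and hunting for a monitor, one dirfrag commit came back ondisk = -108 (ESHUTDOWN, "Cannot send after transport endpoint shutdown") instead of 0, and the bare assert(r == 0) at mds/CDir.cc:1767 turned that transient transport error into an abort of the whole daemon. A hedged Python sketch of that failure mode (not Ceph source; dir_committed_finish is a made-up stand-in for the C++ callback):

```python
import errno
import os

def dir_committed_finish(r: int) -> str:
    """Stand-in for the v0.81 completion callback: any r != 0 is fatal."""
    if r != 0:
        # the real code hits `assert(r == 0)` here and the MDS aborts
        raise AssertionError(f"FAILED assert(r == 0): r={r} ({os.strerror(-r)})")
    return "committed"

print(dir_committed_finish(0))  # the usual case seen in the log: ondisk = 0

try:
    # the reply that killed the MDS: ondisk = -108 (ESHUTDOWN)
    dir_committed_finish(-errno.ESHUTDOWN)
except AssertionError as exc:
    print(exc)
```

In other words, the assert treats every commit error as a programming bug, so an error that only means "my session to that OSD was torn down while I was laggy" crashes the rank instead of being resent once the monitor session is re-established.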