Actions
Bug #5079
closedassert in MDCache::_recovered()
Status:
Resolved
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:
0%
Source:
other
Tags:
Backport:
Regression:
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
While trying to reproduce 4999 with the requested logging, I got this MDS assert.
I'm running cuttlefish branch @ commit 4c0d3eb72c04 with 576 OSDs (24 OSD/host), 3 mons, 1 active + 2 standby MDS.
-32> 2013-05-15 12:20:40.425901 7ffff36e7700 10 monclient: _reopen_session -31> 2013-05-15 12:20:40.425988 7ffff36e7700 10 monclient: _pick_new_mon picked mon.cs30 con 0x7cfac00 addr 172.17.131.30:6789/0 -30> 2013-05-15 12:20:40.426019 7ffff36e7700 10 monclient: _send_mon_message to mon.cs30 at 172.17.131.30:6789/0 -29> 2013-05-15 12:20:40.426093 7ffff36e7700 10 monclient: renew_subs -28> 2013-05-15 12:20:40.426530 7ffff36e7700 5 mds.0.9 ms_handle_connect on 172.17.131.30:6789/0 -27> 2013-05-15 12:20:40.426600 7ffff36e7700 5 mds.0.9 is_laggy 91.408902 > 15 since last acked beacon -26> 2013-05-15 12:20:40.426615 7ffff36e7700 5 mds.0.9 initiating monitor reconnect; maybe we're not the slow one -25> 2013-05-15 12:20:40.426617 7ffff36e7700 10 monclient: _reopen_session -24> 2013-05-15 12:20:40.426679 7ffff36e7700 10 monclient: _pick_new_mon picked mon.cs28 con 0x7cfe1a0 addr 172.17.131.28:6789/0 -23> 2013-05-15 12:20:40.426698 7ffff36e7700 10 monclient: _send_mon_message to mon.cs28 at 172.17.131.28:6789/0 -22> 2013-05-15 12:20:40.426779 7ffff36e7700 10 monclient: renew_subs -21> 2013-05-15 12:20:40.426843 7ffff36e7700 5 mds.0.9 is_laggy 91.409145 > 15 since last acked beacon -20> 2013-05-15 12:20:40.426875 7ffff36e7700 5 mds.0.9 initiating monitor reconnect; maybe we're not the slow one -19> 2013-05-15 12:20:40.426877 7ffff36e7700 10 monclient: _reopen_session -18> 2013-05-15 12:20:40.426957 7ffff36e7700 10 monclient: _pick_new_mon picked mon.cs30 con 0x1508b00 addr 172.17.131.30:6789/0 -17> 2013-05-15 12:20:40.426976 7ffff36e7700 10 monclient: _send_mon_message to mon.cs30 at 172.17.131.30:6789/0 -16> 2013-05-15 12:20:40.427090 7ffff36e7700 10 monclient: renew_subs -15> 2013-05-15 12:20:40.427146 7ffff36e7700 5 mds.0.9 is_laggy 91.409449 > 15 since last acked beacon -14> 2013-05-15 12:20:40.427160 7ffff36e7700 5 mds.0.9 initiating monitor reconnect; maybe we're not the slow one -13> 2013-05-15 12:20:40.427163 7ffff36e7700 10 monclient: _reopen_session -12> 2013-05-15 12:20:40.427225 7ffff36e7700 10 monclient: _pick_new_mon picked mon.cs28 con 0x7cfb2e0 addr 172.17.131.28:6789/0 -11> 2013-05-15 12:20:40.427330 7ffff36e7700 10 monclient: _send_mon_message to mon.cs28 at 172.17.131.28:6789/0 -10> 2013-05-15 12:20:40.427455 7ffff36e7700 10 monclient: renew_subs -9> 2013-05-15 12:20:40.427986 7ffff36e7700 5 mds.0.9 ms_handle_connect on 172.17.131.28:6789/0 -8> 2013-05-15 12:20:40.428043 7ffff36e7700 5 mds.0.9 is_laggy 91.410346 > 15 since last acked beacon -7> 2013-05-15 12:20:40.428057 7ffff36e7700 5 mds.0.9 initiating monitor reconnect; maybe we're not the slow one -6> 2013-05-15 12:20:40.428059 7ffff36e7700 10 monclient: _reopen_session -5> 2013-05-15 12:20:40.428117 7ffff36e7700 10 monclient: _pick_new_mon picked mon.cs29 con 0x7cfb180 addr 172.17.131.29:6789/0 -4> 2013-05-15 12:20:40.428134 7ffff36e7700 10 monclient: _send_mon_message to mon.cs29 at 172.17.131.29:6789/0 -3> 2013-05-15 12:20:40.428139 7ffff36e7700 10 monclient: renew_subs -2> 2013-05-15 12:20:40.428368 7ffff36e7700 0 mds.0.cache recovery error! -108 -1> 2013-05-15 12:20:40.428435 7ffff36e7700 1 mds.0.9 suicide. wanted down:dne, now up:active 0> 2013-05-15 12:20:40.505064 7ffff36e7700 -1 mds/MDCache.cc: In function 'void MDCache::_recovered(CInode*, int, uint64_t, utime_t)' thread 7ffff36e7700 time 2013-05-15 12:20:40.501908 mds/MDCache.cc: 5658: FAILED assert(0) ceph version 0.61.2-3-g40f8381 (40f838143a2a647f61f95b85a214695af1d9d1d5) 1: (MDCache::_recovered(CInode*, int, unsigned long, utime_t)+0x109) [0x577e89] 2: (Filer::_probed(Filer::Probe*, object_t const&, unsigned long, utime_t)+0xb3a) [0x72d6aa] 3: (Objecter::C_Stat::finish(int)+0xc9) [0x72e079] 4: (Objecter::handle_osd_op_reply(MOSDOpReply*)+0xe58) [0x71e478] 5: (MDS::handle_core_message(Message*)+0x957) [0x4cae07] 6: (MDS::_dispatch(Message*)+0x2f) [0x4caebf] 7: (MDS::ms_dispatch(Message*)+0x1fb) [0x4cc9eb] 8: (DispatchQueue::entry()+0x3e1) [0x83ccf1] 9: (DispatchQueue::DispatchThread::entry()+0xd) [0x79973d] 10: (()+0x77f1) [0x7ffff76507f1] 11: (clone()+0x6d) [0x7ffff6ca5ccd] NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this. --- logging levels --- 0/ 5 none 0/ 1 lockdep 0/ 1 context 1/ 1 crush 1/ 5 mds 1/ 5 mds_balancer 1/ 5 mds_locker 1/ 5 mds_log 1/ 5 mds_log_expire 1/ 5 mds_migrator 0/ 1 buffer 0/ 1 timer 0/ 1 filer 0/ 1 striper 0/ 1 objecter 0/ 5 rados 0/ 5 rbd 0/ 5 journaler 0/ 5 objectcacher 0/ 5 client 0/ 5 osd 0/ 5 optracker 0/ 5 objclass 1/ 3 filestore 1/ 3 journal 0/ 0 ms 1/ 5 mon 0/10 monc 0/ 5 paxos 0/ 5 tp 1/ 5 auth 1/ 5 crypto 1/ 1 finisher 1/ 5 heartbeatmap 1/ 5 perfcounter 1/ 5 rgw 1/ 5 hadoop 1/ 5 javaclient 1/ 5 asok 1/ 1 throttle -2/-2 (syslog threshold) -1/-1 (stderr threshold) max_recent 10000 max_new 1000 log_file /var/ceph/log/mds.cs30.log --- end dump of recent events --- 2013-05-15 12:20:40.793947 7ffff36e7700 -1 *** Caught signal (Aborted) ** in thread 7ffff36e7700 ceph version 0.61.2-3-g40f8381 (40f838143a2a647f61f95b85a214695af1d9d1d5) 1: /usr/bin/ceph-mds() [0x85e0c9] 2: (()+0xf4a0) [0x7ffff76584a0] 3: (gsignal()+0x35) [0x7ffff6bf2885] 4: (abort()+0x175) [0x7ffff6bf4065] 5: (__gnu_cxx::__verbose_terminate_handler()+0x12d) [0x3be84bea7d] NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this. --- begin dump of recent events --- 0> 2013-05-15 12:20:40.793947 7ffff36e7700 -1 *** Caught signal (Aborted) ** in thread 7ffff36e7700 ceph version 0.61.2-3-g40f8381 (40f838143a2a647f61f95b85a214695af1d9d1d5) 1: /usr/bin/ceph-mds() [0x85e0c9] 2: (()+0xf4a0) [0x7ffff76584a0] 3: (gsignal()+0x35) [0x7ffff6bf2885] 4: (abort()+0x175) [0x7ffff6bf4065] 5: (__gnu_cxx::__verbose_terminate_handler()+0x12d) [0x3be84bea7d] NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this. --- logging levels --- 0/ 5 none 0/ 1 lockdep 0/ 1 context 1/ 1 crush 1/ 5 mds 1/ 5 mds_balancer 1/ 5 mds_locker 1/ 5 mds_log 1/ 5 mds_log_expire 1/ 5 mds_migrator 0/ 1 buffer 0/ 1 timer 0/ 1 filer 0/ 1 striper 0/ 1 objecter 0/ 5 rados 0/ 5 rbd 0/ 5 journaler 0/ 5 objectcacher 0/ 5 client 0/ 5 osd 0/ 5 optracker 0/ 5 objclass 1/ 3 filestore 1/ 3 journal 0/ 0 ms 1/ 5 mon 0/10 monc 0/ 5 paxos 0/ 5 tp 1/ 5 auth 1/ 5 crypto 1/ 1 finisher 1/ 5 heartbeatmap 1/ 5 perfcounter 1/ 5 rgw 1/ 5 hadoop 1/ 5 javaclient 1/ 5 asok 1/ 1 throttle -2/-2 (syslog threshold) -1/-1 (stderr threshold) max_recent 10000 max_new 1000 log_file /var/ceph/log/mds.cs30.log --- end dump of recent events ---
gdb had this to say:
Program terminated with signal 6, Aborted. #0 0x00007ffff765836b in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:42 warning: Source file is more recent than executable. 42 sig); Missing separate debuginfos, use: debuginfo-install gperftools-libs-2.0-3.el6.2.x86_64 libgcc-4.4.6-3.el6.x86_64 libstdc++-4.4.6-3.el6.x86_64 libuuid-2.17.2-12.4.el6.x86_64 nspr-4.9-1.el6.x86_64 nss-3.13.3-8.el6.x86_64 nss-softokn-3.12.9-11.el6.x86_64 nss-softokn-freebl-3.12.9-11.el6.x86_64 nss-util-3.13.3-2.el6.x86_64 sqlite-3.6.20-1.el6.x86_64 (gdb) bt #0 0x00007ffff765836b in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:42 #1 0x000000000085e4f7 in reraise_fatal (signum=6) at global/signal_handler.cc:58 #2 handle_fatal_signal (signum=6) at global/signal_handler.cc:104 #3 <signal handler called> #4 0x00007ffff6bf2885 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64 #5 0x00007ffff6bf4065 in abort () at abort.c:92 #6 0x0000003be84bea7d in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib64/libstdc++.so.6 #7 0x0000003be84bcc06 in ?? () from /usr/lib64/libstdc++.so.6 #8 0x0000003be84bcc33 in std::terminate() () from /usr/lib64/libstdc++.so.6 #9 0x0000003be84bcd2e in __cxa_throw () from /usr/lib64/libstdc++.so.6 #10 0x00000000007552c9 in ceph::__ceph_assert_fail (assertion=0x1438000 "\001", file=0x4c83dc0 "P|\310\004", line=5658, func=0x877c20 "void MDCache::_recovered(CInode*, int, uint64_t, utime_t)") at common/assert.cc:77 #11 0x0000000000577e89 in MDCache::_recovered (this=0x1570000, in=<value optimized out>, r=-108, size=0, mtime=<value optimized out>) at mds/MDCache.cc:5658 #12 0x000000000072d6aa in Filer::_probed (this=0x13f8140, probe=0x1540690, oid=..., size=0, mtime=...) at osdc/Filer.cc:134 #13 0x000000000072e079 in Objecter::C_Stat::finish (this=0x2780380, r=-108) at osdc/Objecter.h:796 #14 0x000000000071e478 in Objecter::handle_osd_op_reply (this=0x1568000, m=0x13f36600) at osdc/Objecter.cc:1480 #15 0x00000000004cae07 in MDS::handle_core_message (this=0x14f8b00, m=0x13f36600) at mds/MDS.cc:1733 #16 0x00000000004caebf in MDS::_dispatch (this=0x14f8b00, m=0x13f36600) at mds/MDS.cc:1857 #17 0x00000000004cc9eb in MDS::ms_dispatch (this=0x14f8b00, m=0x13f36600) at mds/MDS.cc:1668 #18 0x000000000083ccf1 in ms_deliver_dispatch (this=0x14f80e8) at msg/Messenger.h:566 #19 DispatchQueue::entry (this=0x14f80e8) at msg/DispatchQueue.cc:119 #20 0x000000000079973d in DispatchQueue::DispatchThread::entry (this=<value optimized out>) at msg/DispatchQueue.h:104 #21 0x00007ffff76507f1 in start_thread (arg=0x7ffff36e7700) at pthread_create.c:301 #22 0x00007ffff6ca5ccd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115 (gdb) f 11 #11 0x0000000000577e89 in MDCache::_recovered (this=0x1570000, in=<value optimized out>, r=-108, size=0, mtime=<value optimized out>) at mds/MDCache.cc:5658 5658 assert(0); (gdb) l 5653 5654 if (r != 0) { 5655 dout(0) << "recovery error! " << r << dendl; 5656 if (r == -EBLACKLISTED) 5657 mds->suicide(); 5658 assert(0); 5659 } 5660 5661 file_recovering.erase(in); 5662 in->state_clear(CInode::STATE_RECOVERING); (gdb) f 12 #12 0x000000000072d6aa in Filer::_probed (this=0x13f8140, probe=0x1540690, oid=..., size=0, mtime=...) at osdc/Filer.cc:134 134 probe->onfinish->finish(probe->err); (gdb) l 129 130 if (!probe->ops.empty()) 131 return; // waiting for more! 132 133 if (probe->err) { // we hit an error, propagate back up 134 probe->onfinish->finish(probe->err); 135 delete probe->onfinish; 136 delete probe; 137 return; 138 } (gdb) f 13 #13 0x000000000072e079 in Objecter::C_Stat::finish (this=0x2780380, r=-108) at osdc/Objecter.h:796 796 fin->finish(r); (gdb) l 791 if (psize) 792 *psize = s; 793 if (pmtime) 794 *pmtime = m; 795 } 796 fin->finish(r); 797 delete fin; 798 } 799 }; 800 (gdb) f 14 #14 0x000000000071e478 in Objecter::handle_osd_op_reply (this=0x1568000, m=0x13f36600) at osdc/Objecter.cc:1480 1480 onack->finish(rc); (gdb) l 1475 1476 ldout(cct, 5) << num_unacked << " unacked, " << num_uncommitted << " uncommitted" << dendl; 1477 1478 // do callbacks 1479 if (onack) { 1480 onack->finish(rc); 1481 delete onack; 1482 } 1483 if (oncommit) { 1484 oncommit->finish(rc); (gdb)
Updated by Sage Weil almost 11 years ago
- Status changed from New to Resolved
thanks, this one was easy to fix.
commit:64871e093159ad06d84fb2a84c7808a81800dfc4
Updated by John Spray over 7 years ago
- Project changed from Ceph to CephFS
- Category deleted (
1)
Bulk updating project=ceph category=mds bugs so that I can remove the MDS category from the Ceph project to avoid confusion.
Actions