Project

General

Profile

Actions

Bug #5079

closed

assert in MDCache::_recovered()

Added by Jim Schutt almost 11 years ago. Updated over 7 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
other
Tags:
Backport:
Regression:
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

While trying to reproduce 4999 with the requested logging, I got this MDS assert.

I'm running cuttlefish branch @ commit 4c0d3eb72c04 with 576 OSDs (24 OSD/host), 3 mons, 1 active + 2 standby MDS.

   -32> 2013-05-15 12:20:40.425901 7ffff36e7700 10 monclient: _reopen_session
   -31> 2013-05-15 12:20:40.425988 7ffff36e7700 10 monclient: _pick_new_mon picked mon.cs30 con 0x7cfac00 addr 172.17.131.30:6789/0
   -30> 2013-05-15 12:20:40.426019 7ffff36e7700 10 monclient: _send_mon_message to mon.cs30 at 172.17.131.30:6789/0
   -29> 2013-05-15 12:20:40.426093 7ffff36e7700 10 monclient: renew_subs
   -28> 2013-05-15 12:20:40.426530 7ffff36e7700  5 mds.0.9 ms_handle_connect on 172.17.131.30:6789/0
   -27> 2013-05-15 12:20:40.426600 7ffff36e7700  5 mds.0.9 is_laggy 91.408902 > 15 since last acked beacon
   -26> 2013-05-15 12:20:40.426615 7ffff36e7700  5 mds.0.9 initiating monitor reconnect; maybe we're not the slow one
   -25> 2013-05-15 12:20:40.426617 7ffff36e7700 10 monclient: _reopen_session
   -24> 2013-05-15 12:20:40.426679 7ffff36e7700 10 monclient: _pick_new_mon picked mon.cs28 con 0x7cfe1a0 addr 172.17.131.28:6789/0
   -23> 2013-05-15 12:20:40.426698 7ffff36e7700 10 monclient: _send_mon_message to mon.cs28 at 172.17.131.28:6789/0
   -22> 2013-05-15 12:20:40.426779 7ffff36e7700 10 monclient: renew_subs
   -21> 2013-05-15 12:20:40.426843 7ffff36e7700  5 mds.0.9 is_laggy 91.409145 > 15 since last acked beacon
   -20> 2013-05-15 12:20:40.426875 7ffff36e7700  5 mds.0.9 initiating monitor reconnect; maybe we're not the slow one
   -19> 2013-05-15 12:20:40.426877 7ffff36e7700 10 monclient: _reopen_session
   -18> 2013-05-15 12:20:40.426957 7ffff36e7700 10 monclient: _pick_new_mon picked mon.cs30 con 0x1508b00 addr 172.17.131.30:6789/0
   -17> 2013-05-15 12:20:40.426976 7ffff36e7700 10 monclient: _send_mon_message to mon.cs30 at 172.17.131.30:6789/0
   -16> 2013-05-15 12:20:40.427090 7ffff36e7700 10 monclient: renew_subs
   -15> 2013-05-15 12:20:40.427146 7ffff36e7700  5 mds.0.9 is_laggy 91.409449 > 15 since last acked beacon
   -14> 2013-05-15 12:20:40.427160 7ffff36e7700  5 mds.0.9 initiating monitor reconnect; maybe we're not the slow one
   -13> 2013-05-15 12:20:40.427163 7ffff36e7700 10 monclient: _reopen_session
   -12> 2013-05-15 12:20:40.427225 7ffff36e7700 10 monclient: _pick_new_mon picked mon.cs28 con 0x7cfb2e0 addr 172.17.131.28:6789/0
   -11> 2013-05-15 12:20:40.427330 7ffff36e7700 10 monclient: _send_mon_message to mon.cs28 at 172.17.131.28:6789/0
   -10> 2013-05-15 12:20:40.427455 7ffff36e7700 10 monclient: renew_subs
    -9> 2013-05-15 12:20:40.427986 7ffff36e7700  5 mds.0.9 ms_handle_connect on 172.17.131.28:6789/0
    -8> 2013-05-15 12:20:40.428043 7ffff36e7700  5 mds.0.9 is_laggy 91.410346 > 15 since last acked beacon
    -7> 2013-05-15 12:20:40.428057 7ffff36e7700  5 mds.0.9 initiating monitor reconnect; maybe we're not the slow one
    -6> 2013-05-15 12:20:40.428059 7ffff36e7700 10 monclient: _reopen_session
    -5> 2013-05-15 12:20:40.428117 7ffff36e7700 10 monclient: _pick_new_mon picked mon.cs29 con 0x7cfb180 addr 172.17.131.29:6789/0
    -4> 2013-05-15 12:20:40.428134 7ffff36e7700 10 monclient: _send_mon_message to mon.cs29 at 172.17.131.29:6789/0
    -3> 2013-05-15 12:20:40.428139 7ffff36e7700 10 monclient: renew_subs
    -2> 2013-05-15 12:20:40.428368 7ffff36e7700  0 mds.0.cache recovery error! -108
    -1> 2013-05-15 12:20:40.428435 7ffff36e7700  1 mds.0.9 suicide.  wanted down:dne, now up:active
     0> 2013-05-15 12:20:40.505064 7ffff36e7700 -1 mds/MDCache.cc: In function 'void MDCache::_recovered(CInode*, int, uint64_t, utime_t)' thread 7ffff36e7700 time 2013-05-15 12:20:40.501908
mds/MDCache.cc: 5658: FAILED assert(0)

 ceph version 0.61.2-3-g40f8381 (40f838143a2a647f61f95b85a214695af1d9d1d5)
 1: (MDCache::_recovered(CInode*, int, unsigned long, utime_t)+0x109) [0x577e89]
 2: (Filer::_probed(Filer::Probe*, object_t const&, unsigned long, utime_t)+0xb3a) [0x72d6aa]
 3: (Objecter::C_Stat::finish(int)+0xc9) [0x72e079]
 4: (Objecter::handle_osd_op_reply(MOSDOpReply*)+0xe58) [0x71e478]
 5: (MDS::handle_core_message(Message*)+0x957) [0x4cae07]
 6: (MDS::_dispatch(Message*)+0x2f) [0x4caebf]
 7: (MDS::ms_dispatch(Message*)+0x1fb) [0x4cc9eb]
 8: (DispatchQueue::entry()+0x3e1) [0x83ccf1]
 9: (DispatchQueue::DispatchThread::entry()+0xd) [0x79973d]
 10: (()+0x77f1) [0x7ffff76507f1]
 11: (clone()+0x6d) [0x7ffff6ca5ccd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   0/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 journal
   0/ 0 ms
   1/ 5 mon
   0/10 monc
   0/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/ 5 hadoop
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /var/ceph/log/mds.cs30.log
--- end dump of recent events ---
2013-05-15 12:20:40.793947 7ffff36e7700 -1 *** Caught signal (Aborted) **
 in thread 7ffff36e7700

 ceph version 0.61.2-3-g40f8381 (40f838143a2a647f61f95b85a214695af1d9d1d5)
 1: /usr/bin/ceph-mds() [0x85e0c9]
 2: (()+0xf4a0) [0x7ffff76584a0]
 3: (gsignal()+0x35) [0x7ffff6bf2885]
 4: (abort()+0x175) [0x7ffff6bf4065]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x12d) [0x3be84bea7d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---
     0> 2013-05-15 12:20:40.793947 7ffff36e7700 -1 *** Caught signal (Aborted) **
 in thread 7ffff36e7700

 ceph version 0.61.2-3-g40f8381 (40f838143a2a647f61f95b85a214695af1d9d1d5)
 1: /usr/bin/ceph-mds() [0x85e0c9]
 2: (()+0xf4a0) [0x7ffff76584a0]
 3: (gsignal()+0x35) [0x7ffff6bf2885]
 4: (abort()+0x175) [0x7ffff6bf4065]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x12d) [0x3be84bea7d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   0/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 journal
   0/ 0 ms
   1/ 5 mon
   0/10 monc
   0/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/ 5 hadoop
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /var/ceph/log/mds.cs30.log
--- end dump of recent events ---

gdb had this to say:

Program terminated with signal 6, Aborted.
#0  0x00007ffff765836b in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:42

warning: Source file is more recent than executable.
42                 sig);
Missing separate debuginfos, use: debuginfo-install gperftools-libs-2.0-3.el6.2.x86_64 libgcc-4.4.6-3.el6.x86_64 libstdc++-4.4.6-3.el6.x86_64 libuuid-2.17.2-12.4.el6.x86_64 nspr-4.9-1.el6.x86_64 nss-3.13.3-8.el6.x86_64 nss-softokn-3.12.9-11.el6.x86_64 nss-softokn-freebl-3.12.9-11.el6.x86_64 nss-util-3.13.3-2.el6.x86_64 sqlite-3.6.20-1.el6.x86_64
(gdb) bt
#0  0x00007ffff765836b in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:42
#1  0x000000000085e4f7 in reraise_fatal (signum=6) at global/signal_handler.cc:58
#2  handle_fatal_signal (signum=6) at global/signal_handler.cc:104
#3  <signal handler called>
#4  0x00007ffff6bf2885 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#5  0x00007ffff6bf4065 in abort () at abort.c:92
#6  0x0000003be84bea7d in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib64/libstdc++.so.6
#7  0x0000003be84bcc06 in ?? () from /usr/lib64/libstdc++.so.6
#8  0x0000003be84bcc33 in std::terminate() () from /usr/lib64/libstdc++.so.6
#9  0x0000003be84bcd2e in __cxa_throw () from /usr/lib64/libstdc++.so.6
#10 0x00000000007552c9 in ceph::__ceph_assert_fail (assertion=0x1438000 "\001", file=0x4c83dc0 "P|\310\004", line=5658, 
    func=0x877c20 "void MDCache::_recovered(CInode*, int, uint64_t, utime_t)") at common/assert.cc:77
#11 0x0000000000577e89 in MDCache::_recovered (this=0x1570000, in=<value optimized out>, r=-108, size=0, mtime=<value optimized out>) at mds/MDCache.cc:5658
#12 0x000000000072d6aa in Filer::_probed (this=0x13f8140, probe=0x1540690, oid=..., size=0, mtime=...) at osdc/Filer.cc:134
#13 0x000000000072e079 in Objecter::C_Stat::finish (this=0x2780380, r=-108) at osdc/Objecter.h:796
#14 0x000000000071e478 in Objecter::handle_osd_op_reply (this=0x1568000, m=0x13f36600) at osdc/Objecter.cc:1480
#15 0x00000000004cae07 in MDS::handle_core_message (this=0x14f8b00, m=0x13f36600) at mds/MDS.cc:1733
#16 0x00000000004caebf in MDS::_dispatch (this=0x14f8b00, m=0x13f36600) at mds/MDS.cc:1857
#17 0x00000000004cc9eb in MDS::ms_dispatch (this=0x14f8b00, m=0x13f36600) at mds/MDS.cc:1668
#18 0x000000000083ccf1 in ms_deliver_dispatch (this=0x14f80e8) at msg/Messenger.h:566
#19 DispatchQueue::entry (this=0x14f80e8) at msg/DispatchQueue.cc:119
#20 0x000000000079973d in DispatchQueue::DispatchThread::entry (this=<value optimized out>) at msg/DispatchQueue.h:104
#21 0x00007ffff76507f1 in start_thread (arg=0x7ffff36e7700) at pthread_create.c:301
#22 0x00007ffff6ca5ccd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115
(gdb) f 11
#11 0x0000000000577e89 in MDCache::_recovered (this=0x1570000, in=<value optimized out>, r=-108, size=0, mtime=<value optimized out>) at mds/MDCache.cc:5658
5658        assert(0);
(gdb) l
5653    
5654      if (r != 0) {
5655        dout(0) << "recovery error! " << r << dendl;
5656        if (r == -EBLACKLISTED)
5657          mds->suicide();
5658        assert(0);
5659      }
5660    
5661      file_recovering.erase(in);
5662      in->state_clear(CInode::STATE_RECOVERING);
(gdb) f 12
#12 0x000000000072d6aa in Filer::_probed (this=0x13f8140, probe=0x1540690, oid=..., size=0, mtime=...) at osdc/Filer.cc:134
134        probe->onfinish->finish(probe->err);
(gdb) l
129    
130      if (!probe->ops.empty()) 
131        return;  // waiting for more!
132    
133      if (probe->err) { // we hit an error, propagate back up
134        probe->onfinish->finish(probe->err);
135        delete probe->onfinish;
136        delete probe;
137        return;
138      }
(gdb) f 13
#13 0x000000000072e079 in Objecter::C_Stat::finish (this=0x2780380, r=-108) at osdc/Objecter.h:796
796          fin->finish(r);
(gdb) l
791        if (psize)
792          *psize = s;
793        if (pmtime)
794          *pmtime = m;
795          }
796          fin->finish(r);
797          delete fin;
798        }
799      };
800        
(gdb) f 14
#14 0x000000000071e478 in Objecter::handle_osd_op_reply (this=0x1568000, m=0x13f36600) at osdc/Objecter.cc:1480
1480        onack->finish(rc);
(gdb) l
1475      
1476      ldout(cct, 5) << num_unacked << " unacked, " << num_uncommitted << " uncommitted" << dendl;
1477    
1478      // do callbacks
1479      if (onack) {
1480        onack->finish(rc);
1481        delete onack;
1482      }
1483      if (oncommit) {
1484        oncommit->finish(rc);
(gdb)

Actions #1

Updated by Sage Weil almost 11 years ago

  • Status changed from New to Resolved

thanks, this one was easy to fix.

commit:64871e093159ad06d84fb2a84c7808a81800dfc4

Actions #2

Updated by John Spray over 7 years ago

  • Project changed from Ceph to CephFS
  • Category deleted (1)

Bulk updating project=ceph category=mds bugs so that I can remove the MDS category from the Ceph project to avoid confusion.

Actions

Also available in: Atom PDF