Bug #38835

closed

MDSTableServer.cc: 83: FAILED assert(version == tid)

Added by Dan van der Ster about 5 years ago. Updated over 4 years ago.

Status: Resolved
Priority: High
Assignee:
Category: -
Target version:
% Done: 0%
Source: Community (user)
Tags:
Backport: nautilus,mimic
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): MDS
Labels (FS): crash
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We just hit this on a v13.2.5 cluster with 1 active MDS:

    -5> 2019-03-21 10:15:33.943 7fd43241a700 10 monclient: _send_mon_message to mon.cephkelly-mon-39bee08afe at 137.138.64.13:6789/0
    -4> 2019-03-21 10:15:35.544 7fd43441e700 10 monclient: tick
    -3> 2019-03-21 10:15:35.544 7fd43441e700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2019-03-21 10:15:05.544674)
    -2> 2019-03-21 10:15:35.856 7fd432c1b700  2 mds.0.cache check_memory_usage total 455120, rss 28648, heap 323804, baseline 323804, buffers 0, 16 / 28 inodes have caps, 35 caps, 1.25 caps per inode
    -1> 2019-03-21 10:15:37.943 7fd43241a700 10 monclient: _send_mon_message to mon.cephkelly-mon-39bee08afe at 137.138.64.13:6789/0
     0> 2019-03-21 10:15:40.087 7fd42f414700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.5/rpm/el7/BUILD/ceph-13.2.5/src/mds/MDSTableServer.cc: In function 'void MDSTableServer::_prepare_logged(MMDSTableRequest*, version_t)' thread 7fd42f414700 time 2019-03-21 10:15:40.085937

/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.5/rpm/el7/BUILD/ceph-13.2.5/src/mds/MDSTableServer.cc: 83: FAILED assert(version == tid)

 ceph version 13.2.5 (cbff874f9007f1869bfd3821b7e33b2a6ffd4988) mimic (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0xff) [0x7fd43ecddfbf]
 2: (()+0x26d187) [0x7fd43ecde187]
 3: (MDSTableServer::_prepare_logged(MMDSTableRequest*, unsigned long)+0x5d1) [0x5573985bf791]
 4: (MDSIOContextBase::complete(int)+0x119) [0x5573985ec3d9]
 5: (MDSLogContextBase::complete(int)+0x40) [0x5573985ec560]
 6: (Finisher::finisher_thread_entry()+0x12e) [0x7fd43ecdc53e]
 7: (()+0x7dd5) [0x7fd43c91edd5]
 8: (clone()+0x6d) [0x7fd43b9fbead]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

The coredump is available at ceph-post-file: aea2e0f9-ed81-4b45-ab90-e084e0ca184e


Related issues 2 (0 open, 2 closed)

Copied to CephFS - Backport #39211: nautilus: MDSTableServer.cc: 83: FAILED assert(version == tid) (Resolved, Wei-Chung Cheng)
Copied to CephFS - Backport #39212: mimic: MDSTableServer.cc: 83: FAILED assert(version == tid) (Resolved, Nathan Cutler)
Actions #1

Updated by Patrick Donnelly about 5 years ago

  • Priority changed from Normal to High
  • Target version set to v15.0.0
  • Start date deleted (03/21/2019)
  • Backport set to nautilus,mimic,luminous
  • Labels (FS) crash added
Actions #2

Updated by Zheng Yan about 5 years ago

I am having trouble inspecting the coredump file. Please use gdb to print 'tid' and 'version' if you can.

Actions #3

Updated by Dan van der Ster about 5 years ago

Zheng Yan wrote:

I am having trouble inspecting the coredump file. Please use gdb to print 'tid' and 'version' if you can.

(gdb) p tid
$1 = <optimized out>
(gdb) p version
$2 = 2

Is there another way to get the tid?

Actions #4

Updated by Zheng Yan about 5 years ago

Go to frame 4 and try casting the MDSIOContextBase to C_Prepare; tid is stored in C_Prepare.

Was the cluster upgraded from luminous? Was it the first time that a mimic MDS ran?

If the answer to both is yes, the following change can explain the assertion:

diff --git a/src/mds/SnapServer.h b/src/mds/SnapServer.h
index 0fee9db9a1..a3b430f1d3 100644
--- a/src/mds/SnapServer.h
+++ b/src/mds/SnapServer.h
@@ -105,6 +105,7 @@ public:
     if (get_version() == 0) {
       // version 0 confuses snapclient code
       reset_state();
+      projected_version = version;
       upgraded = true;
     }
     if (snaprealm_v2_since == CEPH_NOSNAP) {
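
To illustrate why a stale projected_version would trip the assert, here is a minimal toy model of the bookkeeping. It is a sketch under stated assumptions, not Ceph code: ToyTableServer, upgrade_reset, begin_prepare, finish_prepare and first_prepare_ok are hypothetical names, and the model assumes that the reset leaves version at 1, that a prepare takes its tid from an incremented projected_version, and that applying the prepare increments version. Those assumptions are chosen to be consistent with the values seen in gdb on this ticket (version = 2, tid = 1).

```cpp
#include <cstdint>

// Toy model of the version bookkeeping; NOT the real Ceph classes.
struct ToyTableServer {
  std::uint64_t version = 0;            // committed table version
  std::uint64_t projected_version = 0;  // version the next journaled update should produce

  // Models the upgrade path touched by the diff: a version-0 table is reset.
  // Assumption: the reset leaves `version` at 1.
  void upgrade_reset(bool apply_fix) {
    version = 1;
    if (apply_fix) {
      projected_version = version;  // the one-line change from the diff
    }
  }

  // Models handing out a tid when a prepare request is journaled.
  std::uint64_t begin_prepare() { return ++projected_version; }

  // Models applying the prepare once the journal entry is durable.
  // Returns whether the `version == tid` invariant holds.
  bool finish_prepare(std::uint64_t tid) {
    ++version;
    return version == tid;
  }
};

// Runs one upgrade followed by one prepare and reports whether the
// invariant held for that first prepare.
bool first_prepare_ok(bool apply_fix) {
  ToyTableServer s;
  s.upgrade_reset(apply_fix);
  const std::uint64_t tid = s.begin_prepare();
  return s.finish_prepare(tid);
}
```

In this model, first_prepare_ok(false) ends with version 2 and tid 1, so the invariant fails; first_prepare_ok(true) ends with both at 2, so it holds.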

Actions #5

Updated by Dan van der Ster about 5 years ago

Yes, this cluster was upgraded from luminous, and yes, that was the first time the MDS was running mimic.

(gdb) up
#9  0x00005573985ec3d9 in complete (r=0, this=0x55739a114b40) at /usr/src/debug/ceph-13.2.5/src/include/Context.h:77
77        finish(r);
(gdb) list
72    
73     public:
74      Context() {}
75      virtual ~Context() {}       // we want a virtual destructor!!!
76      virtual void complete(int r) {
77        finish(r);
78        delete this;
79      }
80      virtual bool sync_complete(int r) {
81        if (sync_finish(r)) {
(gdb) p ((C_Prepare *)this)->tid
$3 = 1
Actions #6

Updated by Zheng Yan about 5 years ago

After restarting the MDS, the crash will not happen again.

Actions #7

Updated by Zheng Yan about 5 years ago

  • Backport changed from nautilus,mimic,luminous to nautilus,mimic
Actions #8

Updated by Nathan Cutler about 5 years ago

  • Status changed from New to Fix Under Review
Actions #9

Updated by Nathan Cutler about 5 years ago

  • Pull request ID set to 27238
Actions #10

Updated by Patrick Donnelly about 5 years ago

  • Status changed from Fix Under Review to Pending Backport
  • Assignee set to Zheng Yan
Actions #11

Updated by Nathan Cutler about 5 years ago

  • Copied to Backport #39211: nautilus: MDSTableServer.cc: 83: FAILED assert(version == tid) added
Actions #12

Updated by Nathan Cutler about 5 years ago

  • Copied to Backport #39212: mimic: MDSTableServer.cc: 83: FAILED assert(version == tid) added
Actions #13

Updated by Nathan Cutler over 4 years ago

  • Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".
