Bug #38835
closed
MDSTableServer.cc: 83: FAILED assert(version == tid)
Description
We just hit this on a v13.2.5 cluster with 1 active MDS:
-5> 2019-03-21 10:15:33.943 7fd43241a700 10 monclient: _send_mon_message to mon.cephkelly-mon-39bee08afe at 137.138.64.13:6789/0
-4> 2019-03-21 10:15:35.544 7fd43441e700 10 monclient: tick
-3> 2019-03-21 10:15:35.544 7fd43441e700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2019-03-21 10:15:05.544674)
-2> 2019-03-21 10:15:35.856 7fd432c1b700 2 mds.0.cache check_memory_usage total 455120, rss 28648, heap 323804, baseline 323804, buffers 0, 16 / 28 inodes have caps, 35 caps, 1.25 caps per inode
-1> 2019-03-21 10:15:37.943 7fd43241a700 10 monclient: _send_mon_message to mon.cephkelly-mon-39bee08afe at 137.138.64.13:6789/0
0> 2019-03-21 10:15:40.087 7fd42f414700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.5/rpm/el7/BUILD/ceph-13.2.5/src/mds/MDSTableServer.cc: In function 'void MDSTableServer::_prepare_logged(MMDSTableRequest*, version_t)' thread 7fd42f414700 time 2019-03-21 10:15:40.085937
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.5/rpm/el7/BUILD/ceph-13.2.5/src/mds/MDSTableServer.cc: 83: FAILED assert(version == tid)
ceph version 13.2.5 (cbff874f9007f1869bfd3821b7e33b2a6ffd4988) mimic (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0xff) [0x7fd43ecddfbf]
2: (()+0x26d187) [0x7fd43ecde187]
3: (MDSTableServer::_prepare_logged(MMDSTableRequest*, unsigned long)+0x5d1) [0x5573985bf791]
4: (MDSIOContextBase::complete(int)+0x119) [0x5573985ec3d9]
5: (MDSLogContextBase::complete(int)+0x40) [0x5573985ec560]
6: (Finisher::finisher_thread_entry()+0x12e) [0x7fd43ecdc53e]
7: (()+0x7dd5) [0x7fd43c91edd5]
8: (clone()+0x6d) [0x7fd43b9fbead]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
The coredump is available at ceph-post-file: aea2e0f9-ed81-4b45-ab90-e084e0ca184e
Updated by Patrick Donnelly about 5 years ago
- Priority changed from Normal to High
- Target version set to v15.0.0
- Start date deleted (03/21/2019)
- Backport set to nautilus,mimic,luminous
- Labels (FS) crash added
Updated by Zheng Yan about 5 years ago
I am having trouble examining the coredump file. Please use gdb to print 'tid' and 'version' if you can.
Updated by Dan van der Ster about 5 years ago
Zheng Yan wrote:
I am having trouble examining the coredump file. Please use gdb to print 'tid' and 'version' if you can.
(gdb) p tid
$1 = <optimized out>
(gdb) p version
$2 = 2
Is there another way to get the tid?
Updated by Zheng Yan about 5 years ago
Go to frame 4 and try casting the MDSIOContextBase to C_Prepare. The tid is stored in C_Prepare.
Was the cluster upgraded from luminous? Was this the first time a mimic-version MDS ran?
If the answer to both is yes, the following change can explain the assertion:
diff --git a/src/mds/SnapServer.h b/src/mds/SnapServer.h
index 0fee9db9a1..a3b430f1d3 100644
--- a/src/mds/SnapServer.h
+++ b/src/mds/SnapServer.h
@@ -105,6 +105,7 @@ public:
     if (get_version() == 0) {
       // version 0 confuses snapclient code
       reset_state();
+      projected_version = version;
       upgraded = true;
     }
     if (snaprealm_v2_since == CEPH_NOSNAP) {
Updated by Dan van der Ster about 5 years ago
Yes this was upgraded from luminous and yes that was the first time the MDS was running in mimic.
(gdb) up
#9 0x00005573985ec3d9 in complete (r=0, this=0x55739a114b40) at /usr/src/debug/ceph-13.2.5/src/include/Context.h:77
77 finish(r);
(gdb) list
72
73 public:
74 Context() {}
75 virtual ~Context() {} // we want a virtual destructor!!!
76 virtual void complete(int r) {
77 finish(r);
78 delete this;
79 }
80 virtual bool sync_complete(int r) {
81 if (sync_finish(r)) {
(gdb) p ((C_Prepare *)this)->tid
$3 = 1
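The gdb values reported in this thread (tid = 1, version = 2) fit a simple picture of two counters that drifted apart during the upgrade path. The following is a minimal sketch of that picture, not Ceph's actual code: `ToyTable`, `upgrade_reset`, `begin_prepare`, `prepare_logged`, and `invariant_holds_after_upgrade` are hypothetical names, and the assumption that the reset leaves the table at version 1 is inferred from the printed values rather than taken from the source tree.

```cpp
#include <cassert>
#include <cstdint>

using version_t = std::uint64_t;

// Toy stand-in for the table server's two counters; NOT Ceph's real class.
struct ToyTable {
  version_t version = 0;            // committed table version
  version_t projected_version = 0;  // version once in-flight prepares land

  // Models the upgrade path where the state is reset.
  // `sync_projected` models the added line `projected_version = version;`.
  void upgrade_reset(bool sync_projected) {
    version = 1;  // assumption: the reset leaves the table at version 1
    if (sync_projected) projected_version = version;
  }

  // Models queuing a prepare: the returned tid is handed to the log callback.
  version_t begin_prepare() { return ++projected_version; }

  // Models the callback after the entry is logged: bump the committed
  // version and report whether the invariant `version == tid` holds.
  bool prepare_logged(version_t tid) {
    ++version;
    return version == tid;
  }
};

// Runs one prepare after the simulated upgrade and reports the invariant.
inline bool invariant_holds_after_upgrade(bool sync_projected) {
  ToyTable t;
  t.upgrade_reset(sync_projected);
  version_t tid = t.begin_prepare();
  return t.prepare_logged(tid);
}

// Sanity check of the two scenarios described in the thread.
inline void self_check() {
  assert(!invariant_holds_after_upgrade(false));  // tid 1 vs version 2
  assert(invariant_holds_after_upgrade(true));    // tid 2 vs version 2
}
```

In this model, skipping the sync yields tid 1 against version 2, the same pair gdb printed, while syncing the projected counter after the reset makes both reach 2.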
Updated by Zheng Yan about 5 years ago
After restarting the MDS, the crash will not happen again.
Updated by Zheng Yan about 5 years ago
- Backport changed from nautilus,mimic,luminous to nautilus,mimic
Updated by Nathan Cutler about 5 years ago
- Status changed from New to Fix Under Review
Updated by Patrick Donnelly about 5 years ago
- Status changed from Fix Under Review to Pending Backport
- Assignee set to Zheng Yan
Updated by Nathan Cutler about 5 years ago
- Copied to Backport #39211: nautilus: MDSTableServer.cc: 83: FAILED assert(version == tid) added
Updated by Nathan Cutler about 5 years ago
- Copied to Backport #39212: mimic: MDSTableServer.cc: 83: FAILED assert(version == tid) added
Updated by Nathan Cutler over 4 years ago
- Status changed from Pending Backport to Resolved
While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".