Bug #22626

closed

mds: sessionmap version mismatch when replaying ESessions

Added by Zhi Zhang over 6 years ago. Updated about 5 years ago.

Status: Rejected
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: Community (dev)
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): MDS
Labels (FS): multimds
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We run Ceph 10.2.10 and backported this commit a long time ago: https://github.com/ceph/ceph/commit/a49726e10ef23be124d92872470fd258a1938d9e#diff-23bc98a965757649a7e2d936e1eb7092. Our cluster had been running well under multi-MDS for a long time until we recently hit the following crash and had to reset the journal to start the MDS again.

2018-01-07 15:45:21.353734 7f4e7e6d2700 10 mds.0.sessionmap _load_finish loaded version 450929328
2018-01-07 15:45:21.356874 7f4e7e6d2700 10 mds.0.sessionmap _load_finish: continue omap load from 'client.156699143'
2018-01-07 15:45:21.359749 7f4e7e6d2700 10 MDSIOContextBase::complete: 12C_IO_SM_Load
2018-01-07 15:45:21.360497 7f4e7e6d2700 10 mds.0.sessionmap _load_finish: omap load complete
2018-01-07 15:45:21.360528 7f4e7e6d2700 10 mds.0.sessionmap _load_finish: v 450929328, 1360 sessions
...
2018-01-07 15:45:24.134781 7f4e7c6ce700 10 mds.0.journal ESession.replay inotable 909867 < 909868 remove
2018-01-07 15:45:24.134783 7f4e7c6ce700 10 mds.0.inotable: replay_release_ids [1001a2e4dd7~134,1001a2f320f~1f5]
2018-01-07 15:45:24.134787 7f4e7c6ce700 10 mds.0.log _replay 8133887349887~198 / 8133887835152 2018-01-07 14:37:24.997607: ESession client.138079503 x.x.x.x:0/679872855 close cmapv 450928332
2018-01-07 15:45:24.134790 7f4e7c6ce700 10 mds.0.journal ESession.replay sessionmap 450929328 >= 450928332, noop
2018-01-07 15:45:24.134792 7f4e7c6ce700 10 mds.0.log _replay 8133887350105~198 / 8133887835152 2018-01-07 14:37:24.997614: ESession client.155696931 x.x.x.x:0/4292818376 close cmapv 450928333
2018-01-07 15:45:24.134796 7f4e7c6ce700 10 mds.0.journal ESession.replay sessionmap 450929328 >= 450928333, noop
2018-01-07 15:45:24.134797 7f4e7c6ce700 10 mds.0.log _replay 8133887350323~198 / 8133887835152 2018-01-07 14:37:24.997618: ESession client.155699679 x.x.x.x:0/375524559 close cmapv 450928334
2018-01-07 15:45:24.134801 7f4e7c6ce700 10 mds.0.journal ESession.replay sessionmap 450929328 >= 450928334, noop
2018-01-07 15:45:24.134803 7f4e7c6ce700 10 mds.0.log _replay 8133887350541~198 / 8133887835152 2018-01-07 14:37:24.997622: ESession client.156692858 x.x.x.x:0/383900987 close cmapv 450928335
2018-01-07 15:45:24.134810 7f4e7c6ce700 10 mds.0.journal ESession.replay sessionmap 450929328 >= 450928335, noop
2018-01-07 15:45:24.135005 7f4e7c6ce700 10 mds.0.log _replay 8133887350759~155947 / 8133887835152 2018-01-07 14:37:25.553532: ESessions 1019 opens cmapv 450929354
2018-01-07 15:45:24.135010 7f4e7c6ce700 10 mds.0.journal ESessions.replay sessionmap 450929328 < 450929354
2018-01-07 15:45:24.135273 7f4e7c6ce700 10 mds.0.journal ESessions.replay after open_sessions sessionmap 450930347 cmapv 450929354
2018-01-07 15:45:24.136229 7f4e7c6ce700 -1 mds/journal.cc: In function 'virtual void ESessions::replay(MDSRank*)' thread 7f4e7c6ce700 time 2018-01-07 15:45:24.135277
mds/journal.cc: 1850: FAILED assert(mds->sessionmap.get_version() == cmapv)

 ceph version 10.2.10-102-g0b468fc (0b468fcd815759a473385384686d6a3ee6063f41)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x7f4e8a3301ab]
 2: (ESessions::replay(MDSRank*)+0x9f6) [0x7f4e8a210b66]
 3: (MDLog::_replay_thread()+0x5df) [0x7f4e8a1a6eaf]
 4: (MDLog::ReplayThread::entry()+0xd) [0x7f4e89f6ed4d]
 5: (()+0x7df3) [0x7f4e8912adf3]
 6: (clone()+0x6d) [0x7f4e87bf81bd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

We have seen this crash before, when the sessionmap version was increased only once after open_sessions. That is why we backported the above commit into 10.2.x. But now, as the log above shows, the sessionmap version after open_sessions (450930347) is larger than cmapv (450929354).
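
The numbers in the log can be checked with a small model of the version check. This is only a sketch, not the Ceph code: the function name `simulate_esessions_replay` is hypothetical, and it assumes that replaying an ESessions event bumps the sessionmap version once per opened session (consistent with the log, since 450929328 + 1019 = 450930347) and that an event whose cmapv is not newer than the loaded version is skipped, as the ESession "noop" lines suggest.

```python
from typing import Tuple


def simulate_esessions_replay(loaded_version: int, num_opens: int, cmapv: int) -> Tuple[int, bool]:
    """Model the version check done while replaying one ESessions event.

    Returns (version_after_replay, check_passes).
    Assumption (not taken from the Ceph source): each opened session
    increments the sessionmap version by exactly one.
    """
    if loaded_version >= cmapv:
        # The loaded sessionmap is already at or past this event: nothing to do.
        return loaded_version, True
    version_after = loaded_version + num_opens
    # Stand-in for: assert(mds->sessionmap.get_version() == cmapv)
    return version_after, version_after == cmapv


if __name__ == "__main__":
    # Values copied from the log above.
    loaded, opens, cmapv = 450929328, 1019, 450929354
    after, ok = simulate_esessions_replay(loaded, opens, cmapv)
    print("version after open_sessions:", after)
    print("cmapv in journal event:     ", cmapv)
    print("check passes:               ", ok)
    print("overshoot:                  ", after - cmapv)
```

Under these assumptions the journal event expects the version to advance by only 26 (450929354 - 450929328), while 1019 opens advance it by 1019, so the replayed version overshoots cmapv by 993 and the assert fails.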

Actions #1

Updated by Patrick Donnelly over 6 years ago

  • Status changed from New to Rejected

Zhang, we are not accepting bugs for multimds clusters on jewel. You can still seek help/advice on ceph-users if you like.

We would recommend upgrading to Luminous if that is possible.

Actions #2

Updated by Patrick Donnelly about 5 years ago

  • Category deleted (90)
  • Labels (FS) multimds added