Bug #22610

closed

MDS: assert failure when the inode for the cap_export from other MDS happened not in MDCache

Added by Jianyu Li over 6 years ago. Updated about 6 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
luminous jewel
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We use two active MDS in our online environment. Recently mds.1 restarted, and during its rejoin phase mds.0 hit an assert failure while processing the weak rejoin request from mds.1. Below is a log snippet:

-2> 2018-01-04 20:50:50.638943 7f9fb9cfb700 5 mds.mmcommcephsz11 handle_mds_map epoch 694747 from mds.1
-1> 2018-01-04 20:50:50.638952 7f9fb9cfb700 5 mds.mmcommcephsz11 old map epoch 694747 <= 694747, discarding
0> 2018-01-04 20:50:50.652715 7f9fb9cfb700 -1 mds/MDCache.cc: In function 'void MDCache::handle_cache_rejoin_weak(MMDSCacheRejoin*)' thread 7f9fb9cfb700 time 2018-01-04 20:50:50.650286
mds/MDCache.cc: 4325: FAILED assert(in && in->is_auth())

ceph version 10.2.9-102-g820619c (820619cc59a3790ab36be1945a135eb826c558f1)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0x7f9fc0981205]
2: (MDCache::handle_cache_rejoin_weak(MMDSCacheRejoin*)+0x63e) [0x7f9fc068759e]
3: (MDCache::handle_cache_rejoin(MMDSCacheRejoin*)+0x25b) [0x7f9fc068d04b]
4: (MDCache::dispatch(Message*)+0xa5) [0x7f9fc069d975]
5: (MDSRank::handle_deferrable_message(Message*)+0x5ef) [0x7f9fc058792f]
6: (MDSRank::_dispatch(Message*, bool)+0x1e0) [0x7f9fc0592250]
7: (MDSRankDispatcher::ms_dispatch(Message*)+0x15) [0x7f9fc05933e5]
8: (MDSDaemon::ms_dispatch(Message*)+0xf3) [0x7f9fc0578a03]
9: (Messenger::ms_deliver_dispatch(Message*)+0x77) [0x7f9fc0af2017]
10: (C_handle_dispatch::do_request(int)+0x11) [0x7f9fc0af2761]
11: (EventCenter::process_events(int)+0x90a) [0x7f9fc0a92aba]
12: (Worker::entry()+0x1f0) [0x7f9fc0a68170]
13: (()+0x7dc5) [0x7f9fbf755dc5]
14: (clone()+0x6d) [0x7f9fbe22129d]

After checking the related code, the assert(in && in->is_auth()) seems too strict: the inode for this cap_export may already have been expired from the cache. Changing the assert to assert(!in || in->is_auth()) looks more reasonable.


Related issues 2 (0 open, 2 closed)

Copied to CephFS - Backport #22867: luminous: MDS: assert failure when the inode for the cap_export from other MDS happened not in MDCache (Resolved, Prashant D)
Copied to CephFS - Backport #22868: jewel: MDS: assert failure when the inode for the cap_export from other MDS happened not in MDCache (Rejected)
#1

Updated by Jianyu Li over 6 years ago

#2

Updated by Patrick Donnelly over 6 years ago

  • Status changed from New to In Progress
  • Assignee set to Jianyu Li
#3

Updated by Zheng Yan over 6 years ago

  • Status changed from In Progress to Fix Under Review
#4

Updated by Patrick Donnelly about 6 years ago

  • Status changed from Fix Under Review to Pending Backport
  • Backport set to jewel,luminous
#5

Updated by Nathan Cutler about 6 years ago

  • Copied to Backport #22867: luminous: MDS: assert failure when the inode for the cap_export from other MDS happened not in MDCache added
#6

Updated by Nathan Cutler about 6 years ago

  • Copied to Backport #22868: jewel: MDS: assert failure when the inode for the cap_export from other MDS happened not in MDCache added
#7

Updated by Patrick Donnelly about 6 years ago

  • Backport changed from jewel,luminous to luminous
#8

Updated by Nathan Cutler about 6 years ago

  • Backport changed from luminous to luminous jewel

Re-adding rejected jewel backport to appease backport tooling.

#9

Updated by Nathan Cutler about 6 years ago

  • Status changed from Pending Backport to Resolved
