Bug #22610

MDS: assert failure when the inode for the cap_export from other MDS happened not in MDCache

Added by Jianyu Li about 6 years ago. Updated about 6 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport: luminous jewel
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We run two active MDS daemons in our production environment. Recently mds.1 restarted, and during its rejoin phase mds.0 hit an assert failure while processing the weak rejoin request from mds.1. Below is a log snippet:

-2> 2018-01-04 20:50:50.638943 7f9fb9cfb700 5 mds.mmcommcephsz11 handle_mds_map epoch 694747 from mds.1
-1> 2018-01-04 20:50:50.638952 7f9fb9cfb700 5 mds.mmcommcephsz11 old map epoch 694747 <= 694747, discarding
0> 2018-01-04 20:50:50.652715 7f9fb9cfb700 -1 mds/MDCache.cc: In function 'void MDCache::handle_cache_rejoin_weak(MMDSCacheRejoin*)' thread 7f9fb9cfb700 time 2018-01-04 20:50:50.650286
mds/MDCache.cc: 4325: FAILED assert(in && in->is_auth())

ceph version 10.2.9-102-g820619c (820619cc59a3790ab36be1945a135eb826c558f1)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0x7f9fc0981205]
2: (MDCache::handle_cache_rejoin_weak(MMDSCacheRejoin*)+0x63e) [0x7f9fc068759e]
3: (MDCache::handle_cache_rejoin(MMDSCacheRejoin*)+0x25b) [0x7f9fc068d04b]
4: (MDCache::dispatch(Message*)+0xa5) [0x7f9fc069d975]
5: (MDSRank::handle_deferrable_message(Message*)+0x5ef) [0x7f9fc058792f]
6: (MDSRank::_dispatch(Message*, bool)+0x1e0) [0x7f9fc0592250]
7: (MDSRankDispatcher::ms_dispatch(Message*)+0x15) [0x7f9fc05933e5]
8: (MDSDaemon::ms_dispatch(Message*)+0xf3) [0x7f9fc0578a03]
9: (Messenger::ms_deliver_dispatch(Message*)+0x77) [0x7f9fc0af2017]
10: (C_handle_dispatch::do_request(int)+0x11) [0x7f9fc0af2761]
11: (EventCenter::process_events(int)+0x90a) [0x7f9fc0a92aba]
12: (Worker::entry()+0x1f0) [0x7f9fc0a68170]
13: (()+0x7dc5) [0x7f9fbf755dc5]
14: (clone()+0x6d) [0x7f9fbe22129d]

After checking the related code, it seems that assert(in && in->is_auth()) is too strict: the inode for this cap_export may have been expired from the MDCache, so changing the assert to assert(!in || in->is_auth()) is more reasonable.
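For illustration, here is a minimal, self-contained sketch of the reasoning behind the weaker assertion. The types and helpers (CInodeSketch, get_inode, handle_cap_export) are hypothetical stand-ins, not the real MDCache code:

// Standalone sketch (hypothetical types) of why the weaker assertion is
// safer: a cached inode may have been trimmed, so lookup can return
// nullptr, and only inodes that are still cached must be auth here.
#include <cassert>
#include <cstdint>
#include <map>

struct CInodeSketch {
  bool auth = true;                 // whether this MDS is auth for the inode
  bool is_auth() const { return auth; }
};

// Toy stand-in for MDCache: entries can be trimmed (erased) at any time.
std::map<uint64_t, CInodeSketch> cache = { {1, {}}, {2, {}} };

CInodeSketch* get_inode(uint64_t ino) {
  auto it = cache.find(ino);
  return it == cache.end() ? nullptr : &it->second;
}

void handle_cap_export(uint64_t ino) {
  CInodeSketch *in = get_inode(ino);

  // Old check aborts if the inode was trimmed from the cache:
  //   assert(in && in->is_auth());

  // Proposed check tolerates a trimmed inode but still demands that a
  // cached inode be auth on this MDS:
  assert(!in || in->is_auth());
  if (!in)
    return;  // inode expired from cache; nothing to import caps onto

  // ... import the exported client caps onto 'in' ...
}

int main() {
  cache.erase(2);        // simulate cache trimming of inode 2
  handle_cap_export(1);  // still cached and auth: proceeds normally
  handle_cap_export(2);  // trimmed: old assert would abort, new one passes
}

The weaker check still enforces the invariant that any inode we do hold for a cap_export must be auth here; it merely stops treating an expired cache entry as fatal.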


Related issues

Copied to CephFS - Backport #22867: luminous: MDS: assert failure when the inode for the cap_export from other MDS happened not in MDCache Resolved
Copied to CephFS - Backport #22868: jewel: MDS: assert failure when the inode for the cap_export from other MDS happened not in MDCache Rejected

History

#2 Updated by Patrick Donnelly about 6 years ago

  • Status changed from New to In Progress
  • Assignee set to Jianyu Li

#3 Updated by Zheng Yan about 6 years ago

  • Status changed from In Progress to Fix Under Review

#4 Updated by Patrick Donnelly about 6 years ago

  • Status changed from Fix Under Review to Pending Backport
  • Backport set to jewel,luminous

#5 Updated by Nathan Cutler about 6 years ago

  • Copied to Backport #22867: luminous: MDS: assert failure when the inode for the cap_export from other MDS happened not in MDCache added

#6 Updated by Nathan Cutler about 6 years ago

  • Copied to Backport #22868: jewel: MDS: assert failure when the inode for the cap_export from other MDS happened not in MDCache added

#7 Updated by Patrick Donnelly about 6 years ago

  • Backport changed from jewel,luminous to luminous

#8 Updated by Nathan Cutler about 6 years ago

  • Backport changed from luminous to luminous jewel

Re-adding rejected jewel backport to appease backport tooling.

#9 Updated by Nathan Cutler about 6 years ago

  • Status changed from Pending Backport to Resolved
