Bug #22610

MDS: assert failure when the inode for the cap_export from other MDS happened not in MDCache

Added by Jianyu Li about 1 year ago. Updated about 1 year ago.

Status: Resolved
Priority: Normal
Assignee:
Category: -
Target version: -
Start date: 01/08/2018
Due date:
% Done: 0%
Source:
Tags:
Backport: luminous jewel
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS):
Pull request ID:

Description

We run two active MDS daemons in our production environment. Recently mds.1 restarted, and during its rejoin phase mds.0 hit an assertion failure while processing the weak rejoin request from mds.1. The relevant log snippet is below:

-2> 2018-01-04 20:50:50.638943 7f9fb9cfb700 5 mds.mmcommcephsz11 handle_mds_map epoch 694747 from mds.1
-1> 2018-01-04 20:50:50.638952 7f9fb9cfb700 5 mds.mmcommcephsz11 old map epoch 694747 <= 694747, discarding
0> 2018-01-04 20:50:50.652715 7f9fb9cfb700 -1 mds/MDCache.cc: In function 'void MDCache::handle_cache_rejoin_weak(MMDSCacheRejoin*)' thread 7f9fb9cfb700 time 2018-01-04 20:50:50.650286
mds/MDCache.cc: 4325: FAILED assert(in && in->is_auth())

ceph version 10.2.9-102-g820619c (820619cc59a3790ab36be1945a135eb826c558f1)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0x7f9fc0981205]
2: (MDCache::handle_cache_rejoin_weak(MMDSCacheRejoin*)+0x63e) [0x7f9fc068759e]
3: (MDCache::handle_cache_rejoin(MMDSCacheRejoin*)+0x25b) [0x7f9fc068d04b]
4: (MDCache::dispatch(Message*)+0xa5) [0x7f9fc069d975]
5: (MDSRank::handle_deferrable_message(Message*)+0x5ef) [0x7f9fc058792f]
6: (MDSRank::_dispatch(Message*, bool)+0x1e0) [0x7f9fc0592250]
7: (MDSRankDispatcher::ms_dispatch(Message*)+0x15) [0x7f9fc05933e5]
8: (MDSDaemon::ms_dispatch(Message*)+0xf3) [0x7f9fc0578a03]
9: (Messenger::ms_deliver_dispatch(Message*)+0x77) [0x7f9fc0af2017]
10: (C_handle_dispatch::do_request(int)+0x11) [0x7f9fc0af2761]
11: (EventCenter::process_events(int)+0x90a) [0x7f9fc0a92aba]
12: (Worker::entry()+0x1f0) [0x7f9fc0a68170]
13: (()+0x7dc5) [0x7f9fbf755dc5]
14: (clone()+0x6d) [0x7f9fbe22129d]

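After checking the related code, it seems that assert(in && in->is_auth()) is too strict: the inode for this cap_export may already have expired from the cache, in which case in is NULL and the assert fires. Changing it to assert(!in || in->is_auth()) seems more reasonable, since that tolerates a missing inode but still requires any cached inode to be auth.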
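
To make the difference between the two conditions concrete, here is a minimal standalone sketch. It is not Ceph code: FakeInode, FakeCache, lookup and process_cap_export are hypothetical stand-ins invented for illustration, and skipping a missing inode is only an assumption about what a handler might do, not something this report states.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Hypothetical stand-in for an inode held in the MDS cache.
struct FakeInode {
  bool auth;
  bool is_auth() const { return auth; }
};

// Hypothetical stand-in for the cache: inode number -> inode.
using FakeCache = std::unordered_map<std::uint64_t, FakeInode>;

// Returns nullptr when the inode is not (or no longer) in the cache.
inline const FakeInode* lookup(const FakeCache& cache, std::uint64_t ino) {
  auto it = cache.find(ino);
  return it == cache.end() ? nullptr : &it->second;
}

// Condition asserted today: the inode must be cached AND auth.
inline bool strict_check(const FakeInode* in) {
  return in && in->is_auth();
}

// Proposed condition: a missing inode is tolerated; a cached one must be auth.
inline bool relaxed_check(const FakeInode* in) {
  return !in || in->is_auth();
}

// Assumed handler shape (illustration only): assert the relaxed condition,
// then skip inodes that are no longer cached. Returns true if processed.
inline bool process_cap_export(const FakeCache& cache, std::uint64_t ino) {
  const FakeInode* in = lookup(cache, ino);
  assert(relaxed_check(in));
  if (!in)
    return false;  // inode expired from the cache: nothing to update
  return true;
}
```

With a cache holding only an auth inode 1, both checks pass for inode 1. For an inode number that is absent, strict_check is false (this is the case that aborts mds.0 above) while relaxed_check is true. A cached non-auth inode still fails both checks.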


Related issues

Copied to fs - Backport #22867: luminous: MDS: assert failure when the inode for the cap_export from other MDS happened not in MDCache (Resolved)
Copied to fs - Backport #22868: jewel: MDS: assert failure when the inode for the cap_export from other MDS happened not in MDCache (Rejected)

History

#2 Updated by Patrick Donnelly about 1 year ago

  • Status changed from New to In Progress
  • Assignee set to Jianyu Li

#3 Updated by Zheng Yan about 1 year ago

  • Status changed from In Progress to Need Review

#4 Updated by Patrick Donnelly about 1 year ago

  • Status changed from Need Review to Pending Backport
  • Backport set to jewel,luminous

#5 Updated by Nathan Cutler about 1 year ago

  • Copied to Backport #22867: luminous: MDS: assert failure when the inode for the cap_export from other MDS happened not in MDCache added

#6 Updated by Nathan Cutler about 1 year ago

  • Copied to Backport #22868: jewel: MDS: assert failure when the inode for the cap_export from other MDS happened not in MDCache added

#7 Updated by Patrick Donnelly about 1 year ago

  • Backport changed from jewel,luminous to luminous

#8 Updated by Nathan Cutler about 1 year ago

  • Backport changed from luminous to luminous jewel

Re-adding rejected jewel backport to appease backport tooling.

#9 Updated by Nathan Cutler about 1 year ago

  • Status changed from Pending Backport to Resolved
