Bug #46906

closed

mds: fix file recovery crash after replaying delayed requests

Added by Zhi Zhang over 3 years ago. Updated over 3 years ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: Correctness/Safety
Target version:
% Done: 0%
Source: Community (dev)
Tags:
Backport: octopus,nautilus
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): MDS
Labels (FS): multimds
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

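When the client replay stage or the active stage had just started, the MDS first replayed delayed requests and then tried to recover the files that had no caps during rejoin. However, those files might have acquired caps while the requests were being replayed, which is consistent with the failed assert in Locker::file_recover() in the log below.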

2020-08-07 23:01:01.630609 7f374ccfc700  1 mds.2.43320 active_start
2020-08-07 23:01:02.606164 7f374f473700  0 -- xxx:6800/2408984685 >> - conn(0x7f38f6e21800 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 l=0)._process_connection accept peer addr is really xxx:0/2314926973 (socket is -)
2020-08-07 23:01:08.992473 7f374ccfc700 -1 /root/rpmbuild/BUILD/ceph-12.2.12-459-gb23a1c3/src/mds/Locker.cc: In function 'void Locker::file_recover(ScatterLock*)' thread 7f374ccfc700 time 2020-08-07 23:01:08.989966
/root/rpmbuild/BUILD/ceph-12.2.12-459-gb23a1c3/src/mds/Locker.cc: 5142: FAILED assert(lock->get_state() == LOCK_PRE_SCAN)

 ceph version 12.2.12-459-gb23a1c3 (b23a1c3c1eec9c367634f72129001975ff218df0) luminous (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x7f37551d00f0]
 2: (Locker::file_recover(ScatterLock*)+0x1f9) [0x7f37550240c9]
 3: (MDCache::start_files_to_recover()+0xc3) [0x7f3754f395b3]
 4: (MDSRank::active_start()+0xa6) [0x7f3754e6e4b6]
 5: (MDSRankDispatcher::handle_mds_map(MMDSMap*, MDSMap*)+0x1307) [0x7f3754e83147]
 6: (MDSDaemon::handle_mds_map(MMDSMap*)+0xdcd) [0x7f3754e5aced]
 7: (MDSDaemon::handle_core_message(Message*)+0x7ab) [0x7f3754e5fb6b]
 8: (MDSDaemon::ms_dispatch(Message*)+0xdb) [0x7f3754e5fe0b]
 9: (DispatchQueue::entry()+0x792) [0x7f37554dac32]
 10: (DispatchQueue::DispatchThread::entry()+0xd) [0x7f375525847d]
 11: (()+0x7e25) [0x7f3752afce25]
 12: (clone()+0x6d) [0x7f3751be035d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
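
The ordering problem described above can be illustrated with a small standalone model. This is only a sketch and not Ceph code: the `File` struct, the `LockState` enum (apart from the idea of a pre-scan state, which mirrors `LOCK_PRE_SCAN` in the assert), and all function names are hypothetical. It also assumes that issuing caps during replay moves the lock out of the pre-scan state, which is inferred from the failed assert rather than confirmed by this report. The reordering shown is one possible way to preserve the invariant, not necessarily the fix that was merged.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified lock states; only PreScan mirrors a name from the report.
enum class LockState { Sync, PreScan, Excl };

// Hypothetical stand-in for an inode with a file lock.
struct File {
  bool has_caps = false;
  LockState lock = LockState::Sync;
};

// Rejoin (modeled): files without caps are queued for recovery in PreScan.
std::vector<File*> queue_capless_files(std::vector<File>& files) {
  std::vector<File*> queue;
  for (File& f : files) {
    if (!f.has_caps) {
      f.lock = LockState::PreScan;
      queue.push_back(&f);
    }
  }
  return queue;
}

// Assumption: a replayed delayed request issues caps and moves the lock
// out of PreScan.
void replay_delayed_request(File& f) {
  f.has_caps = true;
  f.lock = LockState::Excl;
}

// Recovery (modeled): counts queued files that would trip an assert of the
// form lock->get_state() == LOCK_PRE_SCAN, and recovers the rest.
std::size_t recover_queued_files(const std::vector<File*>& queue) {
  std::size_t violations = 0;
  for (File* f : queue) {
    if (f->lock != LockState::PreScan) {
      ++violations;  // the real MDS would abort here
      continue;
    }
    f->lock = LockState::Sync;  // recovery finished
  }
  return violations;
}

// Runs both orderings on two cap-less files and reports how many queued
// files violate the precondition when recovery starts.
std::size_t violations_at_recovery(bool replay_first) {
  std::vector<File> files(2);
  std::vector<File*> queue = queue_capless_files(files);
  if (replay_first) {
    replay_delayed_request(files[0]);    // order described in this report
    return recover_queued_files(queue);
  }
  std::size_t violations = recover_queued_files(queue);
  replay_delayed_request(files[0]);      // replay only after recovery started
  return violations;
}
```

In this model, `violations_at_recovery(true)` reports one file that would hit the assert, while `violations_at_recovery(false)` reports none, because recovery sees every queued file still in the pre-scan state.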

Related issues 2 (0 open, 2 closed)

Copied to CephFS - Backport #48095: nautilus: mds: fix file recovery crash after replaying delayed requests (Resolved, Wei-Chung Cheng)
Copied to CephFS - Backport #48096: octopus: mds: fix file recovery crash after replaying delayed requests (Resolved, Wei-Chung Cheng)