Bug #46906

mds: fix file recovery crash after replaying delayed requests

Added by Zhi Zhang over 3 years ago. Updated almost 3 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Category:
Correctness/Safety
Target version:
% Done:

0%

Source:
Community (dev)
Tags:
Backport:
octopus,nautilus
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
MDS
Labels (FS):
multimds
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

When the client-replay or active stage has just started, the MDS first replays delayed requests and then tries to recover files that had no caps during rejoin. However, those files may have already gained caps while the delayed requests were being replayed, which leads to the assertion failure below.

2020-08-07 23:01:01.630609 7f374ccfc700  1 mds.2.43320 active_start
2020-08-07 23:01:02.606164 7f374f473700  0 -- xxx:6800/2408984685 >> - conn(0x7f38f6e21800 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 l=0)._process_connection accept peer addr is really xxx:0/2314926973 (socket is -)
2020-08-07 23:01:08.992473 7f374ccfc700 -1 /root/rpmbuild/BUILD/ceph-12.2.12-459-gb23a1c3/src/mds/Locker.cc: In function 'void Locker::file_recover(ScatterLock*)' thread 7f374ccfc700 time 2020-08-07 23:01:08.989966
/root/rpmbuild/BUILD/ceph-12.2.12-459-gb23a1c3/src/mds/Locker.cc: 5142: FAILED assert(lock->get_state() == LOCK_PRE_SCAN)

 ceph version 12.2.12-459-gb23a1c3 (b23a1c3c1eec9c367634f72129001975ff218df0) luminous (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x7f37551d00f0]
 2: (Locker::file_recover(ScatterLock*)+0x1f9) [0x7f37550240c9]
 3: (MDCache::start_files_to_recover()+0xc3) [0x7f3754f395b3]
 4: (MDSRank::active_start()+0xa6) [0x7f3754e6e4b6]
 5: (MDSRankDispatcher::handle_mds_map(MMDSMap*, MDSMap*)+0x1307) [0x7f3754e83147]
 6: (MDSDaemon::handle_mds_map(MMDSMap*)+0xdcd) [0x7f3754e5aced]
 7: (MDSDaemon::handle_core_message(Message*)+0x7ab) [0x7f3754e5fb6b]
 8: (MDSDaemon::ms_dispatch(Message*)+0xdb) [0x7f3754e5fe0b]
 9: (DispatchQueue::entry()+0x792) [0x7f37554dac32]
 10: (DispatchQueue::DispatchThread::entry()+0xd) [0x7f375525847d]
 11: (()+0x7e25) [0x7f3752afce25]
 12: (clone()+0x6d) [0x7f3751be035d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
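For illustration, here is a minimal, self-contained C++ sketch of the failure mode described above. It is not the actual Ceph MDS code: the types are drastically simplified, the lock states are reduced to a small enum, and replay_delayed_request() is a hypothetical stand-in for a replayed client request that issues caps. The flow mirrors the backtrace, though: start_files_to_recover() walks the files queued for recovery during rejoin and calls file_recover(), which asserts the lock is still in LOCK_PRE_SCAN; if a replayed request has already moved the lock out of that state, the assert fires.

// Hypothetical, simplified sketch -- NOT the real Ceph sources.
#include <cassert>
#include <vector>

enum LockState { LOCK_PRE_SCAN, LOCK_SCAN, LOCK_MIX };

struct ScatterLock {
  LockState state = LOCK_PRE_SCAN;        // queued for recovery during rejoin
  LockState get_state() const { return state; }
  void set_state(LockState s) { state = s; }
};

struct Locker {
  void file_recover(ScatterLock *lock) {
    // The assertion from the report: recovery assumes the lock is still in
    // the pre-scan state it was queued with during rejoin.
    assert(lock->get_state() == LOCK_PRE_SCAN);
    lock->set_state(LOCK_SCAN);            // start scanning the file
  }
};

struct MDCache {
  Locker locker;
  std::vector<ScatterLock *> rejoin_recover_q;   // files with no caps at rejoin

  // Hypothetical stand-in for replaying a delayed client request that issues
  // caps on a queued file and changes its lock state.
  void replay_delayed_request(ScatterLock *lock) {
    lock->set_state(LOCK_MIX);
  }

  void start_files_to_recover() {
    for (ScatterLock *lock : rejoin_recover_q)
      locker.file_recover(lock);           // aborts if the state already changed
  }
};

int main() {
  MDCache cache;
  ScatterLock file;
  cache.rejoin_recover_q.push_back(&file);

  cache.replay_delayed_request(&file);     // delayed request replayed first...
  cache.start_files_to_recover();          // ...then recovery hits the assert
}

Running this sketch aborts on the assert, reproducing the same failure pattern as the backtrace above.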

Related issues

Copied to CephFS - Backport #48095: nautilus: mds: fix file recovery crash after replaying delayed requests Resolved
Copied to CephFS - Backport #48096: octopus: mds: fix file recovery crash after replaying delayed requests Resolved

History

#1 Updated by Zhi Zhang over 3 years ago

  • Pull request ID set to 36575

#2 Updated by Patrick Donnelly over 3 years ago

  • Status changed from New to Fix Under Review
  • Target version set to v16.0.0
  • Backport set to octopus,nautilus
  • Pull request ID changed from 36575 to 36532

#3 Updated by Patrick Donnelly about 3 years ago

  • Status changed from Fix Under Review to Pending Backport

#4 Updated by Nathan Cutler about 3 years ago

  • Copied to Backport #48095: nautilus: mds: fix file recovery crash after replaying delayed requests added

#5 Updated by Nathan Cutler about 3 years ago

  • Copied to Backport #48096: octopus: mds: fix file recovery crash after replaying delayed requests added

#6 Updated by Nathan Cutler almost 3 years ago

  • Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".