Bug #38263

closed

mds: fix potential re-evaluate stray dentry in _unlink_local_finish

Added by Zhi Zhang about 5 years ago. Updated about 5 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Correctness/Safety
Target version:
% Done: 0%
Source: Community (dev)
Tags:
Backport: mimic,luminous
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): MDS
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

```
2019-01-25 10:08:13.917522 7f882dcca700 1 /data/build_ceph/ceph-build-luminous/BUILD/ceph-12.2.8-217-gaf1d23f093/src/mds/StrayManager.cc: In function 'bool StrayManager::_eval_stray(CDentry*, bool)' thread 7f882dcca700 time 2019-01-25 10:08:13.915560
/data/build_ceph/ceph-build-luminous/BUILD/ceph-12.2.8-217-gaf1d23f093/src/mds/StrayManager.cc: 421: FAILED assert(!dn->state_test(CDentry::STATE_PURGING))

ceph version 12.2.8-217-gaf1d23f093 (af1d23f093441e0fb7550afff43153bd0bb09e3c) luminous (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x7f883d211f00]
2: (StrayManager::_eval_stray(CDentry*, bool)+0xd13) [0x7f883d045553]
3: (StrayManager::eval_stray(CDentry*, bool)+0x1e) [0x7f883d04565e]
4: (Server::_unlink_local_finish(boost::intrusive_ptr<MDRequestImpl>&, CDentry*, CDentry*, unsigned long)+0x393) [0x7f883cf31273]
5: (MDSIOContextBase::complete(int)+0xa4) [0x7f883d15ac44]
6: (MDSLogContextBase::complete(int)+0x3f) [0x7f883d15b06f]
7: (Finisher::finisher_thread_entry()+0x198) [0x7f883d210e08]
8: (()+0x7dc5) [0x7f883acefdc5]
9: (clone()+0x6d) [0x7f8839dd574d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
```

This crash occurred on an MDS under very heavy load. The likely root cause is the following sequence:
1. The MDS, in handle_client_unlink, sends an early reply to the client.
2. The client processes the reply quickly and sends a cap release to the MDS.
3. The MDS processes handle_client_cap_release before _unlink_local_finish.
4. The MDS then processes _unlink_local_finish, which:
4.1 drops locks and decrements the reference count in respond_to_request, triggering eval_stray for the first time;
4.2 calls notify_stray and enters eval_stray a second time, at which point the assertion fails and the MDS crashes.

Normally, _unlink_local_finish is processed before handle_client_cap_release, so eval_stray is called only from handle_client_cap_release.
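
The two orderings described above can be illustrated with a small toy model. This is only a sketch under stated assumptions: `ToyStray`, `toy_eval_stray`, `toy_put`, `Event` and `replay` are hypothetical names invented for illustration and are not the Ceph MDS classes or functions. The model assumes the stray dentry starts with two references (one client cap, one pin held by the unlink request), that dropping the last reference triggers an evaluation, and that an exception stands in for the failed assert from the trace.

```cpp
#include <cassert>
#include <stdexcept>
#include <vector>

// Toy model only: hypothetical types, not the Ceph MDS classes.
struct ToyStray {
    int refs = 2;            // assumed: one client cap + one pin from the unlink request
    bool purging = false;    // stands in for CDentry::STATE_PURGING
    int purges_started = 0;  // how many times purging was started
};

// Stand-in for eval_stray. Throws where the trace shows the failed assert.
void toy_eval_stray(ToyStray& dn) {
    if (dn.purging)
        throw std::logic_error("stray dentry evaluated again while purging");
    if (dn.refs == 0) {      // nothing references the dentry: start purging
        dn.purging = true;
        ++dn.purges_started;
    }
}

// Drops one reference; dropping the last one triggers an evaluation.
void toy_put(ToyStray& dn) {
    assert(dn.refs > 0);
    if (--dn.refs == 0)
        toy_eval_stray(dn);
}

enum class Event { CapRelease, UnlinkFinish };

// Replays the two handlers in the given order.
// Returns true if the run completes, false if the double evaluation is hit.
bool replay(const std::vector<Event>& order, ToyStray& dn) {
    try {
        for (Event e : order) {
            if (e == Event::CapRelease) {
                toy_put(dn);         // the client's cap goes away
            } else {
                toy_put(dn);         // step 4.1: the reply drops the request's pin
                toy_eval_stray(dn);  // step 4.2: notify_stray evaluates again
            }
        }
    } catch (const std::logic_error&) {
        return false;
    }
    return true;
}
```

In this model, `replay({Event::UnlinkFinish, Event::CapRelease}, dn)` completes and starts purging exactly once, from the cap release. `replay({Event::CapRelease, Event::UnlinkFinish}, dn)` starts purging in step 4.1 and then hits the second evaluation in step 4.2, so it returns false.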


Related issues 2 (0 open, 2 closed)

Copied to CephFS - Backport #38335: mimic: mds: fix potential re-evaluate stray dentry in _unlink_local_finish (Resolved, Prashant D)
Copied to CephFS - Backport #38336: luminous: mds: fix potential re-evaluate stray dentry in _unlink_local_finish (Resolved, Prashant D)
Actions #1

Updated by Patrick Donnelly about 5 years ago

  • Status changed from New to Fix Under Review
  • Assignee set to Zhi Zhang
  • Target version set to v14.0.0
  • Start date deleted (02/12/2019)
  • Backport set to mimic,luminous
  • Pull request ID set to 26374
Actions #2

Updated by Patrick Donnelly about 5 years ago

  • Status changed from Fix Under Review to Pending Backport
Actions #3

Updated by Nathan Cutler about 5 years ago

  • Copied to Backport #38335: mimic: mds: fix potential re-evaluate stray dentry in _unlink_local_finish added
Actions #4

Updated by Nathan Cutler about 5 years ago

  • Copied to Backport #38336: luminous: mds: fix potential re-evaluate stray dentry in _unlink_local_finish added
Actions #5

Updated by Nathan Cutler about 5 years ago

  • Status changed from Pending Backport to Resolved