Bug #38263

mds: fix potential re-evaluate stray dentry in _unlink_local_finish

Added by Zhi Zhang about 1 month ago. Updated 23 days ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Correctness/Safety
Target version:
Start date:
Due date:
% Done: 0%
Source: Community (dev)
Tags:
Backport: mimic,luminous
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): MDS
Labels (FS):
Pull request ID:

Description

```
2019-01-25 10:08:13.917522 7f882dcca700 -1 /data/build_ceph/ceph-build-luminous/BUILD/ceph-12.2.8-217-gaf1d23f093/src/mds/StrayManager.cc: In function 'bool StrayManager::_eval_stray(CDentry*, bool)' thread 7f882dcca700 time 2019-01-25 10:08:13.915560
/data/build_ceph/ceph-build-luminous/BUILD/ceph-12.2.8-217-gaf1d23f093/src/mds/StrayManager.cc: 421: FAILED assert(!dn->state_test(CDentry::STATE_PURGING))

ceph version 12.2.8-217-gaf1d23f093 (af1d23f093441e0fb7550afff43153bd0bb09e3c) luminous (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x7f883d211f00]
2: (StrayManager::_eval_stray(CDentry*, bool)+0xd13) [0x7f883d045553]
3: (StrayManager::eval_stray(CDentry*, bool)+0x1e) [0x7f883d04565e]
4: (Server::_unlink_local_finish(boost::intrusive_ptr<MDRequestImpl>&, CDentry*, CDentry*, unsigned long)+0x393) [0x7f883cf31273]
5: (MDSIOContextBase::complete(int)+0xa4) [0x7f883d15ac44]
6: (MDSLogContextBase::complete(int)+0x3f) [0x7f883d15b06f]
7: (Finisher::finisher_thread_entry()+0x198) [0x7f883d210e08]
8: (()+0x7dc5) [0x7f883acefdc5]
9: (clone()+0x6d) [0x7f8839dd574d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
```

This crash occurred on an MDS under very heavy load. The likely root cause is the following sequence:
1. The MDS, in handle_client_unlink, sends an early reply to the client.
2. The client processes the reply quickly and sends a cap release to the MDS.
3. The MDS processes handle_client_cap_release before _unlink_local_finish.
4. The MDS then processes _unlink_local_finish:
4.1 It drops locks and decrements references in respond_to_request, which triggers eval_stray for the first time.
4.2 It calls notify_stray and enters eval_stray a second time, at which point the crash occurs.

Normally _unlink_local_finish is processed before handle_client_cap_release, so eval_stray is called only from handle_client_cap_release.
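
To make the ordering easier to follow, here is a minimal toy model of the sequence described above. It is only a sketch and not Ceph code: ToyDentry, toy_eval_stray, toy_drop_ref, toy_cap_release, toy_unlink_finish and simulate are all hypothetical names, the starting count of two references (one request pin, one client cap) is an assumption, and so is the rule that dropping the last reference triggers an evaluation. Instead of asserting, the model records a flag at the point where the real MDS would hit the failed assert. The guard parameter shows one possible way to avoid the second evaluation; it is not claimed to be the change made in the pull request.

```cpp
#include <cassert>

// Hypothetical stand-in for a stray dentry; this is not Ceph's CDentry.
struct ToyDentry {
  int refs = 2;          // assumed: one request pin plus one client cap
  bool purging = false;  // models CDentry::STATE_PURGING
  bool reevaluated_while_purging = false;  // set where the real MDS would assert
};

// Models a stray evaluation: start purging once nothing references the dentry.
void toy_eval_stray(ToyDentry& dn) {
  if (dn.purging) {
    dn.reevaluated_while_purging = true;  // the real code asserts here
    return;
  }
  if (dn.refs == 0) dn.purging = true;
}

// Assumed behaviour: dropping the last reference triggers an evaluation.
void toy_drop_ref(ToyDentry& dn) {
  assert(dn.refs > 0);
  if (--dn.refs == 0) toy_eval_stray(dn);
}

// Models handle_client_cap_release: the client gives up its cap.
void toy_cap_release(ToyDentry& dn) { toy_drop_ref(dn); }

// Models _unlink_local_finish.
void toy_unlink_finish(ToyDentry& dn, bool guard) {
  toy_drop_ref(dn);                 // step 4.1: drop the request's reference
  if (guard && dn.purging) return;  // illustrative guard, not the actual fix
  toy_eval_stray(dn);               // step 4.2: notify_stray evaluates again
}

// Runs both handlers in the chosen order and returns the final dentry state.
ToyDentry simulate(bool cap_release_first, bool guard) {
  ToyDentry dn;
  if (cap_release_first) {
    toy_cap_release(dn);
    toy_unlink_finish(dn, guard);
  } else {
    toy_unlink_finish(dn, guard);
    toy_cap_release(dn);
  }
  return dn;
}
```

In this model, simulate(false, false) follows the normal order and starts the purge only from the cap release. simulate(true, false) follows the racy order and sets reevaluated_while_purging, and simulate(true, true) follows the racy order without re-evaluating the dentry.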


Related issues

Copied to fs - Backport #38335: mimic: mds: fix potential re-evaluate stray dentry in _unlink_local_finish Resolved
Copied to fs - Backport #38336: luminous: mds: fix potential re-evaluate stray dentry in _unlink_local_finish Resolved

History

#1 Updated by Patrick Donnelly about 1 month ago

  • Status changed from New to Need Review
  • Assignee set to Zhi Zhang
  • Target version set to v14.0.0
  • Start date deleted (02/12/2019)
  • Backport set to mimic,luminous
  • Pull request ID set to 26374

#2 Updated by Patrick Donnelly about 1 month ago

  • Status changed from Need Review to Pending Backport

#3 Updated by Nathan Cutler about 1 month ago

  • Copied to Backport #38335: mimic: mds: fix potential re-evaluate stray dentry in _unlink_local_finish added

#4 Updated by Nathan Cutler about 1 month ago

  • Copied to Backport #38336: luminous: mds: fix potential re-evaluate stray dentry in _unlink_local_finish added

#5 Updated by Nathan Cutler 23 days ago

  • Status changed from Pending Backport to Resolved
