Bug #24101

mds: deadlock during fsstress workunit with 9 actives

Added by Patrick Donnelly almost 6 years ago. Updated almost 6 years ago.

Status:
Closed
Priority:
Urgent
Assignee:
Category:
Correctness/Safety
Target version:
% Done:

0%

Source:
Q/A
Tags:
Backport:
luminous
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Client, MDS
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

http://pulpito.ceph.com/pdonnell-2018-05-11_00:47:01-multimds-wip-pdonnell-testing-20180510.225359-testing-basic-smithi/2520224/

The op the client was blocked on:

pdonnell@smithi104:/var/log/ceph$ sudo ceph --admin-daemon=/var/run/ceph/ceph-client.0.11792.asok mds_requests
{
    "request": {
        "tid": 11153,
        "op": "lookup",
        "path": "#0x10000000107/f31",
        "path2": "",
        "ino": "0x10000000107",
        "hint_ino": "0x0",
        "sent_stamp": "2018-05-11 12:41:53.192579",
        "mds": 5,
        "resend_mds": -1,
        "send_to_auth": 0,
        "sent_on_mseq": 1,
        "retry_attempt": 2,
        "got_unsafe": 0,
        "uid": 0,
        "gid": 0,
        "oldest_client_tid": 11118,
        "mdsmap_epoch": 0,
        "flags": 0,
        "num_retry": 0,
        "num_fwd": 0,
        "num_releases": 0,
        "abort_rc": 0
    }   
}   

Rank 5 was waiting on an rdlock from rank 0. Rank 0 was waiting on a capability release from the single client.

Just to see what would happen, I tried failing mds.b (rank 5). That didn't do anything, so I tried failing mds.i (rank 0), which mds.b was waiting on for an rdlock. That didn't do anything either.

Finally, I evicted the client so that we would get logs from the run.


Related issues

Related to CephFS - Bug #23837: client: deleted inode's Bufferhead which was in STATE::Tx would lead a assert fail (Resolved)

History

#2 Updated by Zheng Yan almost 6 years ago

  • Status changed from New to 12

#3 Updated by Patrick Donnelly almost 6 years ago

  • Assignee set to Zheng Yan

#4 Updated by Patrick Donnelly almost 6 years ago

  • Related to Bug #23837: client: deleted inode's Bufferhead which was in STATE::Tx would lead a assert fail added

#5 Updated by Ivan Guan almost 6 years ago

Zheng Yan wrote:

caused by https://github.com/ceph/ceph/pull/21615

Sorry, I don't understand why this PR can cause #24101. Could you explain it in detail?

#6 Updated by Zheng Yan almost 6 years ago

Ivan Guan wrote:

Zheng Yan wrote:

caused by https://github.com/ceph/ceph/pull/21615

Sorry, I don't understand why this PR can cause #24101. Could you explain it in detail?

void ObjectCacher::discard_writeback(ObjectSet *oset,
                                     const vector<ObjectExtent>& exls,
                                     Context* on_finish)
{
  assert(lock.is_locked());
  bool was_dirty = oset->dirty_or_tx > 0;

  C_GatherBuilder gather(cct);
  _discard(oset, exls, &gather);

  if (gather.has_subs()) {
    bool flushed = was_dirty && oset->dirty_or_tx == 0;
    gather.set_finisher(new FunctionContext(
      [this, oset, flushed, on_finish](int) {
        assert(lock.is_locked());
        if (flushed && flush_set_callback)
          flush_set_callback(flush_set_callback_arg, oset);
        if (on_finish)
          on_finish->complete(0);
      }));
    gather.activate();
    return;
  }

  _discard_finish(oset, was_dirty, on_finish);
}

"oset->dirty_or_tx > 0" is true before _discard() get called. After _discard() return, oset->dirty_or_tx become 0 and gather.has_subs() is true. flush_set_callback() doesn't get called in this case, which causes leakage of Fcb caps

#7 Updated by Patrick Donnelly almost 6 years ago

  • Status changed from 12 to Closed

Apparently resolved by the revert.
