Bug #22058

mds: admin socket wait for scrub completion is racy

Added by Patrick Donnelly over 6 years ago. Updated about 6 years ago.

Status: Resolved
Priority: Normal
Category: -
Target version: -
% Done: 0%
Source: Development
Tags:
Backport: luminous
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite: fs
Component(FS): MDS
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

From: http://pulpito.ceph.com/pdonnell-2017-11-06_22:14:57-fs-wip-pdonnell-testing-20171106.200337-testing-basic-smithi/1821087/

The code to wait for the scrub completion is racy:

https://github.com/ceph/ceph/blob/2a848b430162648a470adb3f7d8ae0fa74799f10/src/mds/MDSRank.cc#L2112-L2116

If the scrub completes before the wait begins, the admin socket thread waits forever.

The mutex we're using is wrong. We need to implement our own C_SaferCond which takes the mds_lock.


Related issues

Copied to CephFS - Backport #22907: luminous: mds: admin socket wait for scrub completion is racy (Resolved)

History

#1 Updated by Zheng Yan over 6 years ago

class C_SaferCond : public Context {
  Mutex lock;    ///< Mutex to take
  Cond cond;     ///< Cond to signal
  bool done;     ///< true after finish() has been called
  int rval;      ///< return value
public:
  C_SaferCond() : lock("C_SaferCond"), done(false), rval(0) {}
  void finish(int r) override { complete(r); }

  /// We overload complete in order to not delete the context
  void complete(int r) override {
    Mutex::Locker l(lock);
    done = true;
    rval = r;
    cond.Signal();
  }

  /// Returns rval once the Context is called
  int wait() {
    Mutex::Locker l(lock);
    while (!done)
      cond.Wait(lock);
    return rval;
  }
};

I don't understand why it waits forever. The variable 'done' is already true if the scrub completes before wait() is called.
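
To illustrate the point about 'done', here is a minimal sketch of the same pattern. It is not Ceph code: it assumes std::mutex and std::condition_variable in place of Ceph's Mutex and Cond, and the names SaferCondSketch, complete_before_wait, complete_after_wait and run_examples are hypothetical. It only shows that a waiter guarded by a 'done' flag returns whether completion happens before or after the wait starts.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical stand-in for C_SaferCond, built on the standard library
// rather than Ceph's Mutex/Cond wrappers.
class SaferCondSketch {
  std::mutex lock;               // protects done and rval
  std::condition_variable cond;  // signalled by complete()
  bool done = false;             // true once complete() has run
  int rval = 0;                  // value handed to complete()
public:
  void complete(int r) {
    std::lock_guard<std::mutex> l(lock);
    done = true;
    rval = r;
    cond.notify_all();
  }

  // Returns immediately if complete() already ran; otherwise blocks
  // until it does. The predicate re-checks done under the lock.
  int wait() {
    std::unique_lock<std::mutex> l(lock);
    cond.wait(l, [this] { return done; });
    return rval;
  }
};

// Completion happens first: wait() must not block.
int complete_before_wait() {
  SaferCondSketch c;
  c.complete(7);
  return c.wait();
}

// Completion comes from another thread, possibly after wait() starts.
int complete_after_wait() {
  SaferCondSketch c;
  std::thread t([&c] { c.complete(-2); });
  int r = c.wait();
  t.join();
  return r;
}

// Exercises both orderings.
void run_examples() {
  assert(complete_before_wait() == 7);
  assert(complete_after_wait() == -2);
}
```

Calling run_examples() returns in both orderings, which is consistent with the observation that the 'done' flag alone should not let the waiter miss a completion.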

#2 Updated by Patrick Donnelly over 6 years ago

Hrm, I missed that bit of logic. Yes, I don't know why it waits forever either.

#3 Updated by Zheng Yan over 6 years ago

  • Status changed from New to Need More Info

No log; waiting for it to happen again.

#4 Updated by Zheng Yan over 6 years ago

  • Status changed from Need More Info to Fix Under Review
  • Backport deleted (luminous)

#5 Updated by Patrick Donnelly over 6 years ago

  • Status changed from Fix Under Review to Resolved

#6 Updated by Patrick Donnelly about 6 years ago

  • Status changed from Resolved to Pending Backport
  • Backport set to luminous

Needs backport as the bug will be introduced by:

https://github.com/ceph/ceph/pull/18858

#7 Updated by Nathan Cutler about 6 years ago

  • Copied to Backport #22907: luminous: mds: admin socket wait for scrub completion is racy added

#8 Updated by Nathan Cutler about 6 years ago

  • Status changed from Pending Backport to Resolved
