Bug #12722

closed

osd: OSD::do_mon_report stuck acquiring osd_lock (for more than 10 minutes), causing the OSD to be marked down

Added by Guang Yang over 8 years ago. Updated over 8 years ago.

Status: Resolved
Priority: High
Assignee:
Category: -
Target version: -
% Done: 0%
Source: Community (dev)
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Copied from the mailing list discussion:

----------------------------------------
> Date: Tue, 18 Aug 2015 08:42:04 -0700
> Subject: Re: OSD::do_mon_report - do we need holding osd_lock
> From: sjust@redhat.com
> To: yguang11@outlook.com
> CC: ceph-devel@vger.kernel.org
>
> Probably! A quick glance at do_mon_report doesn't seem to turn up
> anything I'd expect to be really hard to refactor. You do need to
> break out the required data (into OSDService, I'd think) so that the
> lock is not necessary.
> -Sam
>
> On Mon, Aug 17, 2015 at 6:10 PM, GuangYang <yguang11@outlook.com> wrote:
> > Hi Sam,
> > Today I noticed a scenario in which the monitor marked an OSD down because it had not received PG stats from that OSD. Further investigation showed that the OSD did not report stats because it failed to acquire the osd_lock. What happened was:
> > 1. One PG was undergoing long-running peering (searching for missing objects).
> > 2. An op held the osd_lock and tried to acquire the PG lock, which was held by 1).
> > 3. The OSD tick thread failed to acquire the osd_lock and was stuck for 10 minutes, so it failed to send its stats to the monitor.
> > 4. The monitor marked the OSD down.
> >
> > After looking at the code, we found several assertions (that osd_lock should be held) around OSD::do_mon_report. Is that required? Is there any chance of overcoming the problem described above by refactoring the locking there?
> >
> > Thanks,
> > Guang
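
For illustration only, the refactoring suggested in the reply above (moving the data the report needs out from under osd_lock) might look roughly like the sketch below. This is not the code from the eventual fix. ReportData, ReportService, stat_lock_ and report_while_big_lock_held are hypothetical names, and the real OSDService and do_mon_report are considerably more involved. The idea shown is only that the report path takes a small dedicated lock, so a long-held big lock cannot stall it.

```cpp
#include <cstdint>
#include <mutex>

// Hypothetical stand-in for the data a monitor report needs (not Ceph's real type).
struct ReportData {
    std::uint64_t epoch = 0;
    std::uint64_t num_pgs = 0;
};

// Hypothetical stand-in for OSDService: owns the report data behind its own
// small lock, so readers never need the big osd_lock.
class ReportService {
public:
    // Called by whoever updates the stats; holds only stat_lock_.
    void update(const ReportData& d) {
        std::lock_guard<std::mutex> l(stat_lock_);
        data_ = d;
    }

    // Called on the report path; copies the data out under stat_lock_ only.
    ReportData snapshot() const {
        std::lock_guard<std::mutex> l(stat_lock_);
        return data_;
    }

private:
    mutable std::mutex stat_lock_;
    ReportData data_;
};

// Simulates the scenario from the email: the big lock (playing the role of
// osd_lock) is held by a stuck op, yet the report can still be built because
// the report path does not take it. A real tick thread would call snapshot();
// it is called inline here to keep the sketch single-threaded.
bool report_while_big_lock_held() {
    std::mutex big_lock;
    ReportService service;

    ReportData d;
    d.epoch = 42;
    d.num_pgs = 7;
    service.update(d);

    std::lock_guard<std::mutex> stuck_op(big_lock);  // big lock stays held
    ReportData out = service.snapshot();             // does not touch big_lock
    return out.epoch == 42 && out.num_pgs == 7;
}
```

The trade-off in a design like this is that the snapshot can be slightly stale relative to state guarded by the big lock, which seems acceptable for periodic stats reporting.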
Actions #1

Updated by Sage Weil over 8 years ago

  • Status changed from New to 12
  • Priority changed from Normal to High
Actions #2

Updated by Guang Yang over 8 years ago

Sorry for the delay on this one; I will open a PR next week.

Actions #3

Updated by Guang Yang over 8 years ago

  • Status changed from 12 to Fix Under Review
Actions #4

Updated by Loïc Dachary over 8 years ago

Which PR is it?

Actions #5

Updated by Sage Weil over 8 years ago

  • Status changed from Fix Under Review to Resolved

commit:2074be2db7eebd420faaf15fa9d65ff1f6a7bf4d
