Bug #12722

closed

osd: OSD::do_mon_report stuck acquiring osd_lock (for more than 10 minutes), causing the OSD to be marked down

Added by Guang Yang over 8 years ago. Updated over 8 years ago.

Status: Resolved
Priority: High
Assignee:
Category: -
Target version: -
% Done: 0%
Source: Community (dev)
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Copied from the mailing list discussion:

----------------------------------------
> Date: Tue, 18 Aug 2015 08:42:04 -0700
> Subject: Re: OSD::do_mon_report - do we need holding osd_lock
> From: sjust@redhat.com
> To: yguang11@outlook.com
> CC: ceph-devel@vger.kernel.org
>
> Probably! A quick glance at do_mon_report doesn't seem to turn up
> anything I'd expect to be really hard to refactor. You do need to
> break out the required data (into OSDService, I'd think) so that the
> lock is not necessary.
> -Sam
>
> On Mon, Aug 17, 2015 at 6:10 PM, GuangYang <yguang11@outlook.com> wrote:
> > Hi Sam,
> > Today I noticed a scenario in which the monitor marked an OSD down because it did not receive PG stats from the OSD. Further investigation showed that the OSD did not report stats because it failed to acquire the osd_lock. What happened was:
> > 1. One PG is undergoing long-running peering (searching for missing objects).
> > 2. An op holds the osd_lock and tries to acquire the PG lock, which is held by 1).
> > 3. The OSD tick thread fails to acquire osd_lock and is stuck for 10 minutes, so it fails to send its stats to the monitor.
> > 4. The monitor marks it down.
> >
> > After looking at the code, we found several assertions (that osd_lock should be held) around OSD::do_mon_report. Are they required? Is there any chance of overcoming the problem described above by refactoring the locking there?
> >
> > Thanks,
> > Guang
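
The refactoring suggested in the thread (break the data needed for the report out of the state guarded by osd_lock) can be sketched roughly as follows. This is only an illustration, not Ceph code: StatsSnapshot, ServiceSketch, OsdSketch and all function names are hypothetical stand-ins, and the real OSD and OSDService classes look different. The sketch assumes the report data can be copied under a small dedicated mutex, so the tick path never has to wait behind an op that holds the coarse lock.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <future>
#include <mutex>

// Hypothetical stand-in for the data the monitor report needs.
struct StatsSnapshot {
    std::uint64_t epoch = 0;
    std::uint64_t num_pgs = 0;
};

// Hypothetical stand-in for OSDService: it owns the report data
// behind its own small lock, independent of the coarse osd_lock.
class ServiceSketch {
public:
    void publish(const StatsSnapshot& s) {
        std::lock_guard<std::mutex> g(stats_lock_);
        latest_ = s;
    }
    StatsSnapshot snapshot() const {
        std::lock_guard<std::mutex> g(stats_lock_);
        return latest_;
    }
private:
    mutable std::mutex stats_lock_;
    StatsSnapshot latest_;
};

// Hypothetical stand-in for the OSD with its coarse lock.
struct OsdSketch {
    std::timed_mutex osd_lock;
    ServiceSketch service;
};

// Old shape: the report path requires osd_lock and gives up on timeout.
bool report_with_osd_lock(OsdSketch& osd, std::chrono::milliseconds timeout,
                          StatsSnapshot& out) {
    std::unique_lock<std::timed_mutex> l(osd.osd_lock, timeout);
    if (!l.owns_lock()) {
        return false;  // lock is busy, so no report is produced
    }
    out = osd.service.snapshot();
    return true;
}

// Refactored shape: the report path only copies the snapshot.
StatsSnapshot report_without_osd_lock(const OsdSketch& osd) {
    return osd.service.snapshot();
}

struct DemoResult {
    bool old_path_reported;
    std::uint64_t new_path_epoch;
};

// Simulates steps 2 and 3 of the scenario: this thread plays the op that
// holds osd_lock, while a separate "tick" thread tries both report paths.
DemoResult demo_tick_while_osd_lock_held() {
    OsdSketch osd;
    StatsSnapshot s;
    s.epoch = 42;
    s.num_pgs = 8;
    osd.service.publish(s);

    std::lock_guard<std::timed_mutex> stuck_op(osd.osd_lock);
    auto tick = std::async(std::launch::async, [&osd] {
        StatsSnapshot ignored;
        DemoResult r;
        r.old_path_reported = report_with_osd_lock(
            osd, std::chrono::milliseconds(50), ignored);
        r.new_path_epoch = report_without_osd_lock(osd).epoch;
        return r;
    });
    return tick.get();  // osd_lock stays held until the tick thread finishes
}
```

In this sketch the old path times out while the lock is held, whereas the snapshot path still returns the published stats. Whether every field used by do_mon_report can be moved out this way would need to be checked against the actual code.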