Bug #11856

closed

osd - scrubbing slot leaked

Added by Guang Yang almost 9 years ago. Updated about 8 years ago.

Status:
Can't reproduce
Priority:
High
Assignee:
-
Category:
-
Target version:
-
% Done:
0%

Source:
other
Regression:
No
Severity:
3 - minor

Description

ceph version: v0.80.4
platform: RHEL6

With our production cluster, we found that the 'pg repair' command was not honored for an inconsistent PG. A deeper dive showed that for one of the OSDs within the PG, the scrubbing slot was occupied, even though there was no scrubbing PG in the cluster.

I am thinking of two possibilities (a minimal model of the first follows this list):
  1. The unreserve scrubbing request from the primary to a replica was not processed properly (e.g. the message got lost?).
  2. In some cases, the scrubbing reservation was not cleared properly.
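
For the first possibility, here is a minimal sketch of a replica-side reservation slot, assuming a reserve/release exchange between the primary and the replica. The names (ReplicaScrubSlot, handle_reserve_request, handle_release) are hypothetical, not the actual v0.80 message or class names; the point is only that a dropped release message leaves the slot occupied indefinitely.

    #include <iostream>

    // Hypothetical model of a replica's scrub reservation slot.
    // Names are illustrative, not the Ceph v0.80 types.
    struct ReplicaScrubSlot {
      bool reserved = false;

      bool handle_reserve_request() {
        if (reserved)
          return false;       // slot already taken, reject the primary
        reserved = true;      // grant the reservation
        return true;
      }

      void handle_release() { // only runs if the release message arrives
        reserved = false;
      }
    };

    int main() {
      ReplicaScrubSlot slot;
      slot.handle_reserve_request();  // primary reserves the slot
      // If the primary's release message is lost in transit,
      // handle_release() is never invoked and the slot stays occupied:
      std::cout << "slot reserved: " << slot.reserved << "\n";  // prints 1
    }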

Since I don't have much logging for the scrubbing reservation/un-reservation path, it is hard to tell which one it is.

As a starting point, we might want to expose the scrubbing slot state via the admin socket to make troubleshooting easier.
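
For illustration, a minimal sketch of what such a dump could report, assuming the OSD tracks pending and active scrub counts against an osd_max_scrubs limit. The function name and output shape are assumptions for this sketch, not an existing v0.80 admin socket command.

    #include <iostream>
    #include <sstream>
    #include <string>

    // Hypothetical counters mirroring the OSD's scrub bookkeeping;
    // in Ceph these live inside the OSD service, simplified here.
    struct ScrubState {
      int scrubs_pending = 0;
      int scrubs_active  = 0;
      int max_scrubs     = 1;  // cf. the osd_max_scrubs option (default 1)
    };

    // Sketch of an admin-socket style dump; the command name and JSON
    // layout are assumptions for illustration.
    std::string dump_scrub_reservations(const ScrubState& s) {
      std::ostringstream os;
      os << "{ \"scrubs_pending\": " << s.scrubs_pending
         << ", \"scrubs_active\": " << s.scrubs_active
         << ", \"osd_max_scrubs\": " << s.max_scrubs << " }";
      return os.str();
    }

    int main() {
      ScrubState s;
      s.scrubs_pending = 1;  // a leaked pending slot would show up here
      std::cout << dump_scrub_reservations(s) << "\n";
    }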

#2

Updated by Guang Yang almost 9 years ago

Going through the code, the following looks like a potential leak:

When there is a map change for the PG, it relies on ReplicatedPG::on_change to release the reservation, which in turn calls scrub_clear_state. That function checks whether the scrubber is active and, if it is, decreases the active scrub count; however, it does not touch the pending scrub count. This means that if the PG's scrub has not yet turned from pending to active, the pending slot is leaked.
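
A simplified model of the suspected leak. The counter names follow the scrubs_pending/scrubs_active bookkeeping in the Ceph source, but the structure is reduced here for illustration; the real logic spans ReplicatedPG::on_change and scrub_clear_state.

    #include <cassert>

    // Simplified model of the OSD-wide scrub counters; names follow
    // the scrubs_pending/scrubs_active bookkeeping in Ceph, but the
    // logic is reduced to show the suspected leak.
    struct OsdScrubCounters {
      int scrubs_pending = 0;
      int scrubs_active  = 0;
    };

    struct PgScrubber {
      bool active = false;  // set once the scrub turns pending -> active

      void reserve(OsdScrubCounters& osd) {
        ++osd.scrubs_pending;      // take a pending slot first
      }

      void start(OsdScrubCounters& osd) {
        --osd.scrubs_pending;      // pending -> active
        ++osd.scrubs_active;
        active = true;
      }

      // Modeled on scrub_clear_state() as described above: only the
      // active count is released.
      void clear_state(OsdScrubCounters& osd) {
        if (active) {
          --osd.scrubs_active;
          active = false;
        }
        // BUG: scrubs_pending is not decremented for a still-pending scrub.
      }
    };

    int main() {
      OsdScrubCounters osd;
      PgScrubber pg;
      pg.reserve(osd);      // scrub reserved but not yet active
      pg.clear_state(osd);  // map change arrives before the scrub starts
      assert(osd.scrubs_pending == 1);  // slot leaked: never released
      return 0;
    }

Running this, the assert passes: the pending slot taken at reservation time is never returned, which matches the occupied slot we observed.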

#3

Updated by Sage Weil almost 9 years ago

  • Priority changed from Normal to High
#5

Updated by Samuel Just about 8 years ago

  • Status changed from New to Can't reproduce