Bug #11856

closed

osd - scrubbing slot leaked

Added by Guang Yang almost 9 years ago. Updated about 8 years ago.

Status:
Can't reproduce
Priority:
High
Assignee:
-
Category:
-
Target version:
-
% Done:
0%

Source:
other
Regression:
No
Severity:
3 - minor

Description

ceph version: v0.80.4
platform: RHEL6

With our production cluster, we found that the 'pg repair' command was not honored for an inconsistent PG. A deeper dive showed that for one of the OSDs within the PG, the scrubbing slot was occupied, even though there was no scrubbing PG in the cluster.

I am thinking of two possibilities (a minimal model of the first follows this list):
  1. The unreserve scrubbing request from the primary to a replica was not processed properly (e.g. the message got lost?).
  2. In some cases, the scrubbing reservation was not cleared properly.
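
For the first possibility, here is a minimal sketch of a replica-side reservation slot, assuming a reserve/release exchange between the primary and the replica. The names (ReplicaScrubSlot, handle_reserve_request, handle_release) are hypothetical, not the actual v0.80 message or class names; the point is only that a dropped release message leaves the slot occupied indefinitely.

    #include <iostream>

    // Hypothetical model of a replica's scrub reservation slot.
    // Names are illustrative, not the Ceph v0.80 types.
    struct ReplicaScrubSlot {
      bool reserved = false;

      bool handle_reserve_request() {
        if (reserved)
          return false;       // slot already taken, reject the primary
        reserved = true;      // grant the reservation
        return true;
      }

      void handle_release() { // only runs if the release message arrives
        reserved = false;
      }
    };

    int main() {
      ReplicaScrubSlot slot;
      slot.handle_reserve_request();  // primary reserves the slot
      // If the primary's release message is lost in transit,
      // handle_release() is never invoked and the slot stays occupied:
      std::cout << "slot reserved: " << slot.reserved << "\n";  // prints 1
    }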

Since I don't have much logging for the scrubbing reservation/un-reservation path, it is hard to tell which one it is.

As a starting point, we might want to expose the scrubbing slot state via the admin socket to make troubleshooting easier.
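
For illustration, a minimal sketch of what such a dump could report, assuming the OSD tracks pending and active scrub counts against an osd_max_scrubs limit. The function name and output shape are assumptions for this sketch, not an existing v0.80 admin socket command.

    #include <iostream>
    #include <sstream>
    #include <string>

    // Hypothetical counters mirroring the OSD's scrub bookkeeping;
    // in Ceph these live inside the OSD service, simplified here.
    struct ScrubState {
      int scrubs_pending = 0;
      int scrubs_active  = 0;
      int max_scrubs     = 1;  // cf. the osd_max_scrubs option (default 1)
    };

    // Sketch of an admin-socket style dump; the command name and JSON
    // layout are assumptions for illustration.
    std::string dump_scrub_reservations(const ScrubState& s) {
      std::ostringstream os;
      os << "{ \"scrubs_pending\": " << s.scrubs_pending
         << ", \"scrubs_active\": " << s.scrubs_active
         << ", \"osd_max_scrubs\": " << s.max_scrubs << " }";
      return os.str();
    }

    int main() {
      ScrubState s;
      s.scrubs_pending = 1;  // a leaked pending slot would show up here
      std::cout << dump_scrub_reservations(s) << "\n";
    }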

#2

Updated by Guang Yang almost 9 years ago

Going through the code, the following looks like a potential leak:

When there is a map change for the PG, it relies on ReplicatedPG::on_change to release the reservation, which in turn calls scrub_clear_state. That function checks whether the scrubber is active and, if it is, decreases the active scrub count; however, it does not touch the pending scrub count. This means that if the PG's scrub has not yet turned from pending to active, the pending slot is leaked.
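
A simplified model of the suspected leak. The counter names follow the scrubs_pending/scrubs_active bookkeeping in the Ceph source, but the structure is reduced here for illustration; the real logic spans ReplicatedPG::on_change and scrub_clear_state.

    #include <cassert>

    // Simplified model of the OSD-wide scrub counters; names follow
    // the scrubs_pending/scrubs_active bookkeeping in Ceph, but the
    // logic is reduced to show the suspected leak.
    struct OsdScrubCounters {
      int scrubs_pending = 0;
      int scrubs_active  = 0;
    };

    struct PgScrubber {
      bool active = false;  // set once the scrub turns pending -> active

      void reserve(OsdScrubCounters& osd) {
        ++osd.scrubs_pending;      // take a pending slot first
      }

      void start(OsdScrubCounters& osd) {
        --osd.scrubs_pending;      // pending -> active
        ++osd.scrubs_active;
        active = true;
      }

      // Modeled on scrub_clear_state() as described above: only the
      // active count is released.
      void clear_state(OsdScrubCounters& osd) {
        if (active) {
          --osd.scrubs_active;
          active = false;
        }
        // BUG: scrubs_pending is not decremented for a still-pending scrub.
      }
    };

    int main() {
      OsdScrubCounters osd;
      PgScrubber pg;
      pg.reserve(osd);      // scrub reserved but not yet active
      pg.clear_state(osd);  // map change arrives before the scrub starts
      assert(osd.scrubs_pending == 1);  // slot leaked: never released
      return 0;
    }

Running this, the assert passes: the pending slot taken at reservation time is never returned, which matches the occupied slot we observed.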

#3

Updated by Sage Weil almost 9 years ago

  • Priority changed from Normal to High
#5

Updated by Samuel Just about 8 years ago

  • Status changed from New to Can't reproduce