Bug #11856
closed
osd - scrubbing slot leaked
Description
ceph version: v0.80.4
platform: RHEL6
On our production cluster, we found that the 'pg repair' command was not honored for an inconsistent PG. A deeper dive showed that, on one of the OSDs in the PG, the scrubbing slot was occupied even though no PG in the cluster was scrubbing.
I am thinking of two possibilities:
- The unreserve scrubbing request from the primary to the replica was not processed properly (e.g. the message got lost?)
- In some cases, the scrubbing reservation was not cleared properly.
Since I don't have many logs for the scrubbing reservation/un-reservation, it is hard to tell which of the two it is.
As a starting point, we might want to expose the scrubbing slot state through the admin socket to make troubleshooting easier.
Updated by Guang Yang almost 9 years ago
Going through the code, the following seems to be a potential leak:
When there is a map change for the PG, it relies on ReplicatedPG::on_change to release the reservation, which in turn calls scrub_clear_state. That function checks whether the scrubber is active and, if it is, decrements the active scrubs count. However, it does not change the pending scrubs count, which means that if the PG's scrub has not yet turned from pending to active, the pending slot will be leaked.
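The suspected leak can be illustrated with a small standalone model. This is only a sketch under assumptions and is not the Ceph source: the ScrubSlots type, its method names, and the "fixed" cleanup variant are all hypothetical. It models a per-OSD limit on pending plus active scrubs and a cleanup path that releases only active slots.

```cpp
#include <cassert>

// Simplified, hypothetical model of per-OSD scrub slot accounting.
// Names are illustrative and do not mirror the Ceph source.
struct ScrubSlots {
  int max_scrubs;   // assumed limit on pending + active scrubs
  int pending = 0;  // slots reserved but not yet scrubbing
  int active = 0;   // slots held by running scrubs

  explicit ScrubSlots(int max) : max_scrubs(max) {}

  // Reserve a slot; fails once pending + active reaches the limit.
  bool reserve() {
    if (pending + active >= max_scrubs) return false;
    ++pending;
    return true;
  }

  // Scrub starts: the reservation moves from pending to active.
  void activate() {
    assert(pending > 0);
    --pending;
    ++active;
  }

  // Suspected behavior: only an active scrub gives its slot back.
  void clear_state_leaky(bool scrubber_active) {
    if (scrubber_active) {
      assert(active > 0);
      --active;
    }
  }

  // One possible fix (an assumption): also release a reservation
  // that never became active.
  void clear_state_fixed(bool scrubber_active, bool reserved) {
    if (scrubber_active) {
      assert(active > 0);
      --active;
    } else if (reserved) {
      assert(pending > 0);
      --pending;
    }
  }
};

// Returns true if a later reservation is blocked after a map change
// interrupts a scrub that is still pending (never activated).
bool leaks_after_map_change(bool use_fix) {
  ScrubSlots slots(1);
  bool reserved = slots.reserve();
  // The map change arrives here, before activate() is called.
  if (use_fix) {
    slots.clear_state_fixed(false, reserved);
  } else {
    slots.clear_state_leaky(false);
  }
  return !slots.reserve();
}
```

In this model, with a limit of one slot, the leaky cleanup leaves the pending count at one, so every later reservation on that OSD is refused. That would match the symptom of 'pg repair' not being honored while no PG is scrubbing.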
Updated by Guang Yang almost 9 years ago
Updated by Samuel Just about 8 years ago
- Status changed from New to Can't reproduce