Bug #7368

closed

ceph osd repair * blocks after some minutes and prevents other ceph pg repair commands

Added by Yann Dupont about 10 years ago. Updated over 9 years ago.

Status:
Can't reproduce
Priority:
Normal
Assignee:
-
Category:
OSD
Target version:
-
% Done:

0%

Source:
other
Tags:
Backport:
Regression:
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hello,
This is a follow-up to http://tracker.ceph.com/issues/7367

An unfortunate update to 0.75 ended with lots (~3000) of PGs flagged inconsistent.
As iterating over the inconsistent PGs is slow, I tried the ceph osd repair * command.

At first it works fine and lots of PGs are fixed. After some minutes the rate of fixed PGs decreases quite steadily, and it finally halts after ~10 minutes.

Repeating the command ceph osd repair * works again at the same speed, then slows down and halts after approximately the same time.
I won't say the time is exponential, because after 10 minutes no more PGs get fixed at all.

This is the first problem.

The second one is that after the ceph osd repair * command has halted,
all ceph pg repair p.xx commands are ignored (maybe they are queued, as the OSD seems to be instructed to do the check, but a process seems stuck somewhere and prevents the OSD from executing the check).

Restarting the OSD cures the problem, and ceph pg repair p.xx works again.

Observed with 0.76 & 0.72 too.
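
For reference, the per-PG iteration mentioned above could be scripted roughly as follows. This is only a sketch under stated assumptions: it assumes that `ceph health detail` prints lines of the form `pg 2.1f is active+clean+inconsistent, acting [3,7]` (the exact format may differ between versions), and the function names `inconsistent_pgs`, `repair_commands` and `run_repairs` are made up for illustration. It does no throttling and does not address the stuck OSD described in this report.

```python
import re
import subprocess

# ASSUMPTION: `ceph health detail` prints lines shaped like
#   pg 2.1f is active+clean+inconsistent, acting [3,7]
# The exact format may vary between Ceph versions.
PG_LINE = re.compile(r"^\s*pg\s+(\d+\.[0-9a-f]+)\s+is\s+(\S+)")


def inconsistent_pgs(health_detail_text):
    """Return the PG ids whose state includes 'inconsistent'."""
    pgs = []
    for line in health_detail_text.splitlines():
        m = PG_LINE.match(line)
        if not m:
            continue
        # The state is a '+'-joined list, possibly followed by a comma.
        states = m.group(2).rstrip(",").split("+")
        if "inconsistent" in states:
            pgs.append(m.group(1))
    return pgs


def repair_commands(pgids):
    """Build one `ceph pg repair <pgid>` argv list per PG."""
    return [["ceph", "pg", "repair", pgid] for pgid in pgids]


def run_repairs(pgids, dry_run=True):
    """Print the commands, or run them when dry_run is False."""
    for cmd in repair_commands(pgids):
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)
```

On a node with admin access, the input text could come from `subprocess.run(["ceph", "health", "detail"], capture_output=True, text=True).stdout`; leaving `dry_run=True` only prints the commands so they can be reviewed first.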


Related issues 1 (0 open, 1 closed)

Related to Ceph - Bug #7367: fail to run mds and mount rbd (v0.76) Closed 02/07/2014

Actions #1

Updated by Ian Colle about 10 years ago

  • Target version deleted (v0.77)
Actions #2

Updated by Loïc Dachary over 9 years ago

Another mention of things slowing down when repair is almost complete: http://tracker.ceph.com/issues/9566 (not sure if it is related, though).

Actions #3

Updated by Samuel Just over 9 years ago

  • Status changed from New to Can't reproduce
Actions #4

Updated by Yann Dupont over 9 years ago

Loic, if I understand correctly, #9566 is "normal" backfilling, and Sage's explanation is clear. In my case, I had lots of inconsistent PGs, and I had to manually repair those PGs.

Samuel, did you try with a recent version (Firefly or later) or with an older version? It could well be an issue fixed in newer versions.

Right now, this faulty experimental cluster is long gone, so I can't test anymore. It could be an interesting test to do, but I don't know how.
