Bug #7368
closed
ceph osd repair * blocks after some minutes and prevents other ceph pg repair commands
Description
Hello,
this is a follow up of http://tracker.ceph.com/issues/7367
An unfortunate update to 0.75 ended with lots (~3000) of PGs flagged inconsistent.
As iterating over the inconsistent PGs is slow, I tried the ceph osd repair * command.
At first it works fine and lots of PGs are fixed. After some minutes the rate of fixed PGs decreases quite steadily and finally halts after ~10 minutes.
Repeating the command ceph osd repair * works again at the same speed, then slows down and halts after approximately the same time.
I won't say the slowdown is exponential, because after 10 minutes no more PGs get fixed at all.
This is the first problem.
The second one is that after the ceph osd repair * command has halted,
all ceph pg repair p.xx commands are ignored (maybe they are queued, as the OSD seems to be instructed to do the check, but a process seems stuck somewhere and prevents the OSD from executing the check).
Restarting the OSD cures the problem and ceph pg repair p.xx works again.
Observed with 0.76 & 0.72 too.
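For reference, the per-PG iteration mentioned above can be sketched roughly as follows. This is only a minimal sketch, not the script that was run on this cluster. It assumes that `ceph health detail` prints lines of the form `pg 3.1f is active+clean+inconsistent, acting [2,5]`; the helper names `parse_inconsistent_pgs` and `repair_all` are hypothetical, and the default dry-run mode only prints the `ceph pg repair` commands instead of executing them.

```python
import re
import subprocess
from typing import List

# Assumed line format of `ceph health detail` (may differ between releases):
#   pg 3.1f is active+clean+inconsistent, acting [2,5]
PG_LINE = re.compile(r"^\s*pg\s+(\d+\.[0-9a-f]+)\s+is\s+(\S+)", re.MULTILINE)


def parse_inconsistent_pgs(health_detail: str) -> List[str]:
    """Return ids of PGs whose state includes 'inconsistent', in input order."""
    pgids: List[str] = []
    for pgid, state in PG_LINE.findall(health_detail):
        flags = state.rstrip(",").split("+")
        if "inconsistent" in flags and pgid not in pgids:
            pgids.append(pgid)
    return pgids


def repair_all(pgids: List[str], dry_run: bool = True) -> List[List[str]]:
    """Build one `ceph pg repair <pgid>` command per PG; run it unless dry_run."""
    commands = []
    for pgid in pgids:
        cmd = ["ceph", "pg", "repair", pgid]
        commands.append(cmd)
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    # Hypothetical sample output, used here instead of calling the real CLI.
    sample = (
        "HEALTH_ERR 2 pgs inconsistent; 3 scrub errors\n"
        "pg 3.1f is active+clean+inconsistent, acting [2,5]\n"
        "pg 0.a is active+clean, acting [1,4]\n"
        "pg 12.3c is active+clean+inconsistent, acting [0,7]\n"
    )
    repair_all(parse_inconsistent_pgs(sample))
```

On a live cluster the sample text would be replaced by the captured output of `ceph health detail`, with `dry_run=False` once the printed commands look right. Note that this only issues the repair requests; it does not work around the stuck OSD described above.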
Updated by Loïc Dachary over 9 years ago
Another mention of things slowing down when repair is almost complete: http://tracker.ceph.com/issues/9566 . Not sure if it is related though.
Updated by Samuel Just over 9 years ago
- Status changed from New to Can't reproduce
Updated by Yann Dupont over 9 years ago
Loic, if I understand correctly, #9566 is "normal" backfilling, and Sage's explanation is clear. In my case, I had lots of inconsistent PGs, and I had to manually repair those PGs.
Samuel, did you try with a recent version (Firefly or later) or with an older one? It could well be an issue fixed in newer versions.
Right now, this faulty experimental cluster is long gone, so I can't test anymore, but it could be an interesting test to do. I don't know how, though.