Bug #8758

closed

PGs get stuck in “replay”, but drop it upon osd restarts

Added by Alexandre Oliva almost 10 years ago. Updated about 7 years ago.

Status: Won't Fix
Priority: High
Assignee: -
Category: OSD
Target version: -
% Done: 0%
Source: Community (dev)
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Sometimes, after restarting all OSDs (which I often do when I want them to recover quickly, since otherwise my memory-starved servers often start thrashing), some PGs remain in “active+replay+(degraded|clean)” state for a very long time, and (some?) operations directed at them appear to get delayed indefinitely. At times I waited for hours without any change in this status. This is one issue.
The other issue is that, when I restart any of the OSDs hosting these PGs, the PGs recover successfully, without ever going through a replay state AFAICT. Now, this is a bit concerning: if some operations needed to be replayed before, it can't be the case that, just by restarting an OSD, they no longer need to be, right? So, is there any reason to believe that the ops had in fact been replayed and the replay status (which causes ops to be delayed) just failed to clear, or is there a bug in the post-recovery status setting that causes the need to replay earlier ops to be dropped after a subsequent recovery?
I've observed this with 0.80.1.
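
For anyone trying to spot the same symptom, here is a minimal sketch of how the affected PGs could be picked out of a set of PG state strings. It assumes the states have already been collected into a mapping of PG id to state string (for instance from the output of `ceph pg dump`, which is not parsed here). The function name `pgs_in_replay` and the sample data are hypothetical and only for illustration.

```python
def pgs_in_replay(pg_states):
    """Return the sorted ids of PGs whose state includes the 'replay' flag.

    pg_states is an assumed input shape: a dict mapping a PG id to its
    '+'-separated state string, e.g. 'active+replay+degraded'.
    """
    return sorted(
        pgid
        for pgid, state in pg_states.items()
        if "replay" in state.split("+")
    )


# Made-up sample data, not taken from a real cluster.
sample = {
    "0.1": "active+clean",
    "0.2": "active+replay+degraded",
    "1.a": "active+replay+clean",
}
print(pgs_in_replay(sample))
```

Running this before and after restarting one of the hosting OSDs would show whether the listed PGs leave the replay state only after the restart.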


Files

osd-requeue-delayed-replay.patch (2.65 KB): Patch that fixes “PGs stuck in replay”. Alexandre Oliva, 07/27/2014 01:44 PM