Bug #8758: PGs get stuck in “replay”, but drop it upon osd restarts - Ceph - Ceph

Bug #8758

closed

PGs get stuck in “replay”, but drop it upon osd restarts

Added by Alexandre Oliva almost 10 years ago. Updated about 7 years ago.

Status: Won't Fix

Priority: High

Category: OSD

Source: Community (dev)

Severity: 3 - minor

Description

Sometimes, after restarting all OSDs (which I often do when I want them to recover quickly, since otherwise my memory-starved servers often start thrashing), some PGs remain in “active+replay+(degraded|clean)” state for a very long time, and (some?) operations directed at them appear to get delayed indefinitely. At times, I waited for hours and there would be no change to this status. This is one issue.
The other issue is that, when I restart any of the OSDs hosting these PGs, the PGs recover successfully without ever going through a replay state, AFAICT. This is a bit concerning: if some operations needed to be replayed before, restarting an OSD should not make that need go away, right? So, is there reason to believe that the ops had already been replayed and only the replay status failed to clear (which causes ops to be delayed), or is there a bug in the post-recovery status setting that drops the need to replay earlier ops after a subsequent recovery?
I've observed this with 0.80.1.
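To make the symptom concrete, here is a minimal Python sketch of how one might flag PGs that stay in replay across several observations. It assumes only what the report states: that a PG state is a “+”-joined string such as “active+replay+degraded”. The function names, the (pgid, state) input shape, and the sample PG ids are hypothetical and are not taken from Ceph or from this report; how the state strings are collected from the cluster is left out on purpose.

```python
def pgs_in_replay(pg_states):
    """Return ids of PGs whose state string carries the 'replay' flag.

    pg_states: iterable of (pgid, state) pairs, where state is a '+'-joined
    flag string such as 'active+replay+degraded' (hypothetical input shape).
    """
    return [pgid for pgid, state in pg_states if "replay" in state.split("+")]


def persistently_in_replay(snapshots):
    """Return sorted ids of PGs that are in replay in every snapshot.

    snapshots: list of pg_states iterables taken at different times.
    """
    sets = [set(pgs_in_replay(snapshot)) for snapshot in snapshots]
    return sorted(set.intersection(*sets)) if sets else []


# Made-up sample data for illustration only; not taken from the report.
earlier = [
    ("0.1", "active+clean"),
    ("0.2", "active+replay+degraded"),
    ("0.3", "active+replay+clean"),
]
later = [
    ("0.1", "active+clean"),
    ("0.2", "active+clean"),
    ("0.3", "active+replay+clean"),
]
print(persistently_in_replay([earlier, later]))
```

A PG that shows up in the result across snapshots taken hours apart would match the “stuck in replay” behaviour described above. A single snapshot cannot tell a stuck PG from one that is merely passing through replay.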

Files

osd-requeue-delayed-replay.patch (2.65 KB)

Patch that fixes “PGs stuck in replay”

Alexandre Oliva, 07/27/2014 01:44 PM


Updated by Sage Weil almost 10 years ago

Updated by Greg Farnum almost 10 years ago

Updated by Alexandre Oliva almost 10 years ago

Updated by Greg Farnum almost 10 years ago

Updated by Alexandre Oliva almost 10 years ago

Updated by Sage Weil almost 10 years ago

Updated by Alexandre Oliva almost 10 years ago

Updated by Alexandre Oliva almost 10 years ago

Updated by Wenjun Huang almost 9 years ago

Updated by Greg Farnum almost 9 years ago

Updated by Sage Weil about 7 years ago