Bug #8758 (closed)

PGs get stuck in “replay”, but drop it upon osd restarts

Added by Alexandre Oliva almost 10 years ago. Updated about 7 years ago.

Status:
Won't Fix
Priority:
High
Assignee:
-
Category:
OSD
Target version:
-
% Done:
0%
Source:
Community (dev)
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Sometimes, after restarting all OSDs (which I often do when I want them to recover quickly, since otherwise my memory-starved servers often start thrashing), some PGs remain in “active+replay+(degraded|clean)” state for a very long time, and (some?) operations directed at them appear to get delayed indefinitely. At times, I waited for hours and there would be no change to this status. This is one issue.
The other issue is that, when I restart any of the OSDs hosting these PGs, the PGs recover successfully, without ever going through a replay state AFAICT. Now, this is a bit concerning: if some operations needed to be replayed before, it can't be the case that, just by restarting an osd, they no longer need to, right? So, is there any reason to believe that ops might have been replayed and just failed to clear the replay status (which causes ops to be delayed), or is there a bug in the post-recovery status setting that causes the need for replaying earlier ops to be dropped after a subsequent recovery?
I've observed this with 0.80.1.


Files

osd-requeue-delayed-replay.patch (2.65 KB): Patch that fixes “PGs stuck in replay”. Alexandre Oliva, 07/27/2014 01:44 PM
Actions #1

Updated by Sage Weil almost 10 years ago

  • Status changed from New to Need More Info

Two things:

1. When the PG gets stuck in replay next time, can you do a 'ceph pg <pgid> query' and see if the OSD also has it in that state? It may be that the OSD just isn't refreshing the PG state after the replay period expires.

2. The conditions under which we go into the replay state are non-trivial. If I remember correctly, it is only if all OSDs in the previous interval go down before the PG is able to repeer, or something along those lines. Restarting a single OSD is generally not enough to trigger it because the surviving OSDs will still have the unstable writes (and their ordering).

Actions #2

Updated by Greg Farnum almost 10 years ago

That's a good summary; see PG::may_need_replay(). I glanced over this and it looks like the PG is placed on the OSD's "replay_queue" when it activates (which is when it gets put in the REPLAY state), and the replay_queue is processed every time the OSD ticks by calling replay_queued_ops() on any PGs which have finished that state.
And replay_queued_ops() calls publish_stats_to_osd(), which sets the stats up to get published to the Monitor. Which nixes my first thought, that the replay PGs simply weren't exiting that state and telling the monitors about it if there were no incoming ops. :/
But there still could be something more subtle going on.
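
For illustration, here is a heavily simplified C++ sketch of the flow described above: the OSD keeps a replay_queue of PGs that went active in the REPLAY state and drains it from its periodic tick. Only replay_queue, replay_queued_ops() and publish_stats_to_osd() are names taken from the comment above; the rest of the structure (the tick helper, the types, the timing) is guessed here for illustration and is not the actual Ceph source.

    // Illustrative sketch only: the queue/tick structure and all types here
    // are invented; replay_queued_ops() and publish_stats_to_osd() are the
    // functions named in the comment above.
    #include <ctime>
    #include <list>
    #include <utility>

    struct PG {
      bool replaying = false;          // stands in for the REPLAY state bit

      void publish_stats_to_osd() {
        // queue the updated PG stats so they get reported to the monitors
      }

      void replay_queued_ops() {
        // ... re-submit the client ops queued during the replay window ...
        replaying = false;             // leave the REPLAY state
        publish_stats_to_osd();        // so the monitors learn about it
      }
    };

    struct OSD {
      // PGs that entered REPLAY when they went active, with their deadlines
      std::list<std::pair<std::time_t, PG*>> replay_queue;

      // called from the OSD's periodic tick
      void check_replay_queue() {
        std::time_t now = std::time(nullptr);
        while (!replay_queue.empty() && replay_queue.front().first <= now) {
          PG *pg = replay_queue.front().second;
          replay_queue.pop_front();
          pg->replay_queued_ops();     // exits REPLAY and publishes new stats
        }
      }
    };

If any link in that chain were skipped for a given PG, the REPLAY bit would linger in the reported state much as the report describes.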

Actions #3

Updated by Alexandre Oliva almost 10 years ago

1. Will do.
2. My concern is that the replay bit appears to be lost because of a restart. Say osds 0, 1 and 2 (all holding PG 0.0) are restarted. They repeer, and the status of 0.0 gets stuck in active+replay+degraded. Then I restart osd 2, and after repeering it gets all the way to active+degraded, without ever indicating that the replay required after the prior recovery was performed. Indeed, it matters little whether it actually did the replay or not: it doesn't look like the complex conditions that trigger a replay would carry an earlier, still-pending need for replay over to the subsequent recovery.

Actions #4

Updated by Greg Farnum almost 10 years ago

Hummm, it looks like you're right: we start a replay period (which is time-based, 30s by default) as we go active and fill in info.history.last_epoch_started...which means that if the primary goes away during the replay period, the next one (possibly the same) won't go back to replay when it finishes the next peering round thanks to that check.
At least, if I'm reading all the interval checks correctly in this function. Which I might not be.
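
To make the suspected interaction concrete, below is a hypothetical, boiled-down sketch of the kind of check being described. It is not the real PG::may_need_replay(); the types and the no_member_survived() helper are invented. The point of interest is the early break on info.history.last_epoch_started: because that value is advanced as the PG goes active, i.e. when the replay window starts, a new peering round after the primary dies mid-replay no longer considers the interval that made the replay necessary.

    // Hypothetical sketch, not the Ceph source.
    #include <cstdint>
    #include <map>
    #include <vector>

    using epoch_t = std::uint32_t;

    struct Interval {
      epoch_t first = 0, last = 0;   // first/last epoch of the past interval
      bool maybe_went_rw = false;    // could writes have happened in it?
      std::vector<int> acting;       // OSDs that were acting during it
    };

    struct History {
      epoch_t last_epoch_started = 0;  // bumped as the PG goes active
    };

    // Invented stub: the real decision consults peer infos to see whether
    // any acting OSD survived past the end of the interval.
    bool no_member_survived(const Interval &interval) {
      (void)interval;
      return false;
    }

    bool may_need_replay_sketch(const History &history,
                                const std::map<epoch_t, Interval> &past_intervals) {
      for (auto p = past_intervals.rbegin(); p != past_intervals.rend(); ++p) {
        const Interval &interval = p->second;
        if (interval.last < history.last_epoch_started)
          break;       // interval predates the last activation: stop looking
        if (!interval.maybe_went_rw)
          continue;    // no writes could have happened: nothing to replay
        if (no_member_survived(interval))
          return true; // acked-but-possibly-uncommitted writes may need replay
      }
      return false;    // no replay needed
    }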

Actions #5

Updated by Alexandre Oliva almost 10 years ago

ceph pg <pgid> query shows that PGs stuck in the active+replay+degraded state have that state set on their primary. Replicas, OTOH, are shown either as peering (still?) or as active+degraded; some PGs had one replica in each state. Is this the info you were looking for? If not, what should I look for?

Actions #6

Updated by Sage Weil almost 10 years ago

  • Priority changed from Normal to High
  • Source changed from other to Community (dev)
Actions #7

Updated by Alexandre Oliva almost 10 years ago

Here's a patch that addresses the “stuck in replay” problem (but not the “replay is dropped after osd re-peering” one).
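
(The attached patch is not reproduced here. Purely to illustrate the general idea its name suggests, re-queueing ops that were delayed while the PG sat in replay so they are not left parked once the replay window ends, a minimal, hypothetical sketch might look like the following; every name except replay_queued_ops() is invented, and this is neither the attached patch nor the Ceph source.)

    #include <deque>

    struct Op {};                        // placeholder for a delayed client op

    struct PG {
      std::deque<Op*> replay_delayed;    // ops that arrived during the replay window

      void enqueue_front(Op *) {
        // hand the op back to the OSD's op work queue
      }

      // invented helper: push the delayed ops back, preserving their order
      void requeue_delayed() {
        while (!replay_delayed.empty()) {
          Op *op = replay_delayed.back();
          replay_delayed.pop_back();
          enqueue_front(op);             // back-to-front keeps the ordering
        }
      }

      void replay_queued_ops() {
        // ... perform the replay itself ...
        // The crucial step: without re-queueing the delayed ops, they stay
        // parked forever and the PG looks stuck in "replay" to clients.
        requeue_delayed();
      }
    };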

Actions #8

Updated by Alexandre Oliva almost 10 years ago

As for the issue of losing replay states upon member osd restarts... Could the fix be as simple as not setting interval.maybe_went_rw when the state of the PG has the REPLAY bit set? Or do we need something far more elaborate, to account for the fact that we clear the REPLAY bit before we even get started replaying the messages, and we might propagate this state before all messages are fully processed?
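
For concreteness, here is a hypothetical sketch of the simple version of that idea: leave maybe_went_rw unset when the interval being closed still had the REPLAY bit set. The structure is invented and is not how Ceph actually records past intervals; as the question above notes, it also ignores the race where the REPLAY bit is cleared before the queued messages have been fully processed.

    #include <cstdint>

    using epoch_t = std::uint32_t;

    struct PGStateBits {
      bool replay = false;        // the REPLAY bit from the PG state
    };

    struct Interval {
      epoch_t first = 0, last = 0;
      bool maybe_went_rw = false; // set when writes could have happened
    };

    // invented helper, called when a past interval is being closed off
    void record_interval_sketch(Interval &interval,
                                const PGStateBits &state,
                                bool was_active_and_up) {
      if (state.replay) {
        // A replay was still pending, so treat the interval as if no new
        // writes happened; the intent is that the pending replay carries
        // over to the next peering round instead of being forgotten.
        interval.maybe_went_rw = false;
      } else {
        interval.maybe_went_rw = was_active_and_up;
      }
    }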

Actions #9

Updated by Wenjun Huang almost 9 years ago

Hi

Has this fix been merged into the master code base? Why do I still see the active+replay+clean state in my cluster running version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)? I also found that the fix has not been backported to the firefly or dumpling releases.

Thanks!

Actions #10

Updated by Greg Farnum almost 9 years ago

  • Category set to OSD
  • Status changed from Need More Info to 12
  • Regression set to No

This has a patch fixing at least one half; I'm not sure why it got stuck in Need More Info.

Actions #11

Updated by Sage Weil about 7 years ago

  • Status changed from 12 to Won't Fix