Bug #17954

closed

standby-replay daemons can sometimes miss events

Added by John Spray over 7 years ago. Updated about 7 years ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
Correctness/Safety
Target version:
% Done:

0%

Source:
Tags:
Backport:
jewel
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
MDS
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

The symptom is that a standby-replay daemon logs messages like "waiting for subtree_map. (skipping " at times other than startup. As a result, the standby-replay MDS ends up with incorrect state in its cache, which usually goes unnoticed until it becomes active.
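One way to check an existing standby-replay log for this symptom is to scan for the message outside the startup window. The sketch below is only illustrative: the helper name `late_skips` and the `startup_lines` parameter are hypothetical, and how many leading lines count as "startup" is an assumption the caller has to supply, since it depends on the daemon and its log level.

```python
# Substring taken from the log message quoted in the description.
MARKER = "waiting for subtree_map"


def late_skips(lines, startup_lines=0):
    """Return (line_number, text) pairs for marker lines seen after startup.

    `lines` is any iterable of log lines. `startup_lines` is a hypothetical
    knob: the number of leading lines the caller treats as normal startup,
    where the message is expected and harmless.
    """
    hits = []
    for number, line in enumerate(lines, start=1):
        if number > startup_lines and MARKER in line:
            hits.append((number, line.rstrip("\n")))
    return hits
```

Any hit returned here is only a hint that events were skipped mid-replay; it does not confirm the cause.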

I think this can happen because MDLog::standby_trim_segments removes the still-in-progress segment, after which we ignore all events until we see another subtreemap. standby_trim_segments can drop a segment if its end position is behind expire_pos, but this is inconsistent with the check in MDSRank::_standby_replay_restart, where we only respawn if we fall behind the trimmed pos (not the expire pos).
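The mismatch between the two checks can be illustrated with a toy model of journal positions. This is not the MDS code: the `Segment` class, the function names, and the exact comparison operators are assumptions made for illustration, based only on the description above.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: int  # journal offset where the segment begins
    end: int    # journal offset where the segment ends


def trim_drops_segment(seg, expire_pos):
    # Assumed form of the trim condition: a segment is dropped once its
    # end position is behind expire_pos.
    return seg.end < expire_pos


def restart_required(read_pos, trimmed_pos):
    # Assumed form of the restart check: the standby only respawns if its
    # read position has fallen behind the trimmed position.
    return read_pos < trimmed_pos


def in_blind_spot(read_pos, seg, trimmed_pos, expire_pos):
    """True when the segment being replayed is dropped without a respawn."""
    in_progress = seg.start <= read_pos < seg.end
    return (
        in_progress
        and trim_drops_segment(seg, expire_pos)
        and not restart_required(read_pos, trimmed_pos)
    )
```

With `trimmed_pos=100`, `expire_pos=300`, a segment spanning 150 to 250, and a read position of 200, the segment qualifies for trimming while the restart check sees nothing wrong. In this model, that gap between the trimmed and expire positions is where events would be silently skipped.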

Saw this when working on the test for http://tracker.ceph.com/issues/16919 (https://github.com/ceph/ceph-qa-suite/pull/1111): the standby was sometimes failing to unlink purged strays, which turned out to be because it was ignoring some events due to this bug. It shows up especially when calling the "flush journal" asok command on the active MDS, because that trims things much more quickly than would otherwise happen.


Related issues 1 (0 open, 1 closed)

Copied to CephFS - Backport #18192: jewel: standby-replay daemons can sometimes miss events (Resolved, Nathan Cutler)
Actions #1

Updated by John Spray over 7 years ago

  • Status changed from New to Fix Under Review
Actions #2

Updated by John Spray over 7 years ago

  • Status changed from Fix Under Review to Pending Backport
  • Backport set to jewel
Actions #3

Updated by Loïc Dachary over 7 years ago

  • Copied to Backport #18192: jewel: standby-replay daemons can sometimes miss events added
Actions #4

Updated by Nathan Cutler about 7 years ago

  • Status changed from Pending Backport to Resolved