Bug #17954

standby-replay daemons can sometimes miss events

Added by John Spray 5 months ago. Updated 3 months ago.

Status: Resolved
Priority: High
Assignee:
Category: Correctness/Safety
Target version:
Start date: 11/18/2016
Due date:
% Done: 0%
Source:
Tags:
Backport: jewel
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Release:
Component(FS): MDS
Needs Doc: No

Description

The symptom is that a standby-replay daemon emits log messages like "waiting for subtree_map. (skipping " at times other than startup. As a result, the standby-replay MDS ends up with incorrect state in its cache (this usually isn't obvious unless it goes active).

I think this can happen due to MDLog::standby_trim_segments removing the still-in-progress segment (whereupon we will ignore all events until we see another subtreemap). standby_trim_segments can drop a segment if its end position is behind expire_pos, but this is inconsistent with the check we do in MDSRank::_standby_replay_restart, where we only respawn if we fall behind the trimmed pos (not the expire pos).
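
To make the suspected inconsistency concrete, here is a minimal sketch of the two checks as described above. It is not the actual Ceph code: Segment, JournalState and all three functions are hypothetical names, and the exact comparisons (<= versus <) are assumptions made for illustration only.

```cpp
#include <cstdint>

// Hypothetical, simplified model; not the real MDLog/MDSRank types.
struct Segment {
  uint64_t start;  // journal offset where the segment begins
  uint64_t end;    // journal offset where the segment ends
};

struct JournalState {
  uint64_t trimmed_pos;  // everything before this has been trimmed
  uint64_t expire_pos;   // everything before this has expired
};

// Assumed model of the trim rule: a segment is dropped once its end
// position is at or behind expire_pos.
bool trim_would_drop(const Segment& seg, const JournalState& j) {
  return seg.end <= j.expire_pos;
}

// Assumed model of the restart rule: the standby only respawns if its
// read position has fallen behind trimmed_pos.
bool restart_would_respawn(uint64_t read_pos, const JournalState& j) {
  return read_pos < j.trimmed_pos;
}

// The problematic window: the standby is still replaying a segment that
// the trim rule drops, yet the restart rule sees nothing wrong.
bool misses_events(const Segment& seg, uint64_t read_pos,
                   const JournalState& j) {
  const bool in_progress = seg.start <= read_pos && read_pos < seg.end;
  return in_progress && trim_would_drop(seg, j) &&
         !restart_would_respawn(read_pos, j);
}
```

In this model, a standby reading at offset 150 inside a segment spanning 100 to 200, with trimmed_pos at 50 and expire_pos at 250, loses its in-progress segment without being told to respawn. Using the same position in both checks would close that window.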

Saw this when working on the test for http://tracker.ceph.com/issues/16919 (https://github.com/ceph/ceph-qa-suite/pull/1111): the standby was sometimes failing to unlink purged strays, which turned out to be because it was ignoring some events due to this bug. We see this especially when calling the "flush journal" asok command on the active MDS, because that trims things much more quickly than would otherwise happen.


Related issues

Copied to Backport #18192: jewel: standby-replay daemons can sometimes miss events Resolved

History

#1 Updated by John Spray 5 months ago

  • Status changed from New to Need Review

#2 Updated by John Spray 5 months ago

  • Status changed from Need Review to Pending Backport
  • Backport set to jewel

#3 Updated by Loic Dachary 5 months ago

  • Copied to Backport #18192: jewel: standby-replay daemons can sometimes miss events added

#4 Updated by Nathan Cutler 3 months ago

  • Status changed from Pending Backport to Resolved
