Bug #22043

closed

rados/singleton/all/recovery-preemption.yaml fails with "egrep \'"\'"\'(defer backfill|defer recovery)\'"\'"\' /var/log/ceph/ceph-osd.*.log\''"

Added by Kefu Chai over 6 years ago. Updated almost 3 years ago.

Status:
Can't reproduce
Priority:
High
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
luminous
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Actions #1

Updated by Kefu Chai over 6 years ago

  • Priority changed from Normal to High

it's reproducible.

Actions #2

Updated by Kefu Chai over 6 years ago

  • Description updated (diff)
Actions #3

Updated by Kefu Chai over 6 years ago

i reproduced this issue on master and on 7b1c77a643516c12de4df558f89924b8c0fe45e7 by manually repeating the steps in recovery-preemption.yaml on a vstart cluster with memstore. after the recovery/backfill finished, none of the backfill/recovery operations had been reported as preempted: no (defer backfill|defer recovery) showed up in the osd log.
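For reference, the check that fails here can be approximated outside the test harness. The following is a minimal sketch that applies the same pattern as the egrep to a list of log lines; the helper name and the sample lines are made up for illustration and are not real OSD output.

```python
import re

# Same alternation the test greps for in /var/log/ceph/ceph-osd.*.log.
PREEMPT_PATTERN = re.compile(r"(defer backfill|defer recovery)")


def find_preemption_lines(log_lines):
    """Return the lines that contain either preemption message."""
    return [line for line in log_lines if PREEMPT_PATTERN.search(line)]


# Hypothetical lines, invented only to exercise the pattern.
sample_lines = [
    "placeholder line: nothing of interest",
    "placeholder line: defer backfill",
    "placeholder line: defer recovery",
]

matches = find_preemption_lines(sample_lines)
# The test fails when this list is empty, which is what happened in the run above.
print(len(matches), "matching line(s)")
```

An empty result corresponds to egrep exiting non-zero, which is how the failure surfaces in the test.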

debug-reserver=10 shows that some victims do get preempted, but the following patch does not print any "DeferBackfill" or "DeferRecovery" message in the log. will keep looking.

diff --git a/src/osd/PG.cc b/src/osd/PG.cc
index 005333530f..f8523ad5c3 100644
--- a/src/osd/PG.cc
+++ b/src/osd/PG.cc
@@ -5810,6 +5810,13 @@ void PG::process_peering_event(RecoveryCtx *rctx)

 void PG::queue_peering_event(CephPeeringEvtRef evt)
 {
+
+  if (evt->get_event().dynamic_type() == PG::DeferBackfill::static_type()) {
+    derr << __func__ << " DeferBackfill" << dendl;
+  } else if (evt->get_event().dynamic_type() == PG::DeferRecovery::static_type()) {
+    derr << __func__ << " DeferRecovery" << dendl;
+  }
+
   if (old_peering_evt(evt))
     return;
   peering_queue.push_back(evt);
Actions #4

Updated by Kefu Chai over 6 years ago

  • Status changed from New to Fix Under Review
Actions #5

Updated by Kefu Chai over 6 years ago

  • Backport set to luminous
Actions #8

Updated by Sage Weil over 6 years ago

my wip-22043 helped, see http://pulpito.ceph.com/sage-2017-12-05_21:42:54-rados:singleton-master-distro-basic-mira/

the problem seems to be that the timing of the test isn't reliably triggering backfill when the osd is marked out. i think we need to (1) set pg logs short, (2) do the initial long bench run, (3) mark the osd out (now guaranteed to have backfill), then (4) set pg logs long again, (5) do the short run to dirty the pgs, and (6) let the down osd recover (should reliably do log recovery). that's what leads to the preemption message in the log.
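The reasoning behind that ordering can be sketched with a rough mental model. It assumes the choice between backfill and log recovery depends only on whether the pg log still covers the writes a peer is missing. This is a simplification, not Ceph's actual peering logic, and the function name and all numbers are hypothetical.

```python
def recovery_mode(missed_writes: int, pg_log_entries: int) -> str:
    """Simplified model: log recovery is possible only while the pg log
    still covers every write the peer missed; otherwise backfill is needed."""
    if missed_writes <= pg_log_entries:
        return "log recovery"
    return "backfill"


# Hypothetical values, chosen only to illustrate the proposed step order.
SHORT_LOG, LONG_LOG = 10, 10_000
LONG_BENCH_WRITES, SHORT_BENCH_WRITES = 5_000, 100

# Steps (1)-(3): short logs plus a long bench run, then mark the osd out.
after_long_bench = recovery_mode(LONG_BENCH_WRITES, SHORT_LOG)

# Steps (4)-(6): long logs plus a short bench run, then the down osd recovers.
after_short_bench = recovery_mode(SHORT_BENCH_WRITES, LONG_LOG)

print(after_long_bench, "/", after_short_bench)
```

Under this model the first phase always lands in backfill and the second in log recovery, so the two kinds of work overlap without depending on timing.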

Actions #9

Updated by Sage Weil almost 3 years ago

  • Status changed from Fix Under Review to Can't reproduce