Bug #47508

open

Multiple read errors cause repeated entry/exit recovery for each error

Added by David Zafman over 3 years ago. Updated over 2 years ago.

Status: In Progress
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: Development
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

After looking at https://github.com/ceph/ceph/pull/36989, I realized that after the first read error, all the others get queued by block_for_clean(). If backfill or recovery is already in progress, I assume that adding the information to the pg log will get the object handled before the PG leaves backfill or recovery.

This should be tested.

Actions #1

Updated by David Zafman over 3 years ago

--- a/src/osd/PrimaryLogPG.cc
+++ b/src/osd/PrimaryLogPG.cc
@@ -15092,8 +15092,8 @@ int PrimaryLogPG::rep_repair_primary_object(const hobject_t& soid, OpContext *ct
   dout(10) << __func__ << " " << soid
           << " peers osd.{" << get_acting_recovery_backfill() << "}" << dendl;

-  if (!is_clean()) {
-    block_for_clean(soid, op);
+  if (!is_clean() && !is_backfilling() && !is_recovering()) {
+    block_for_clean(soid, op);  // XXX: Fix function name?
     return -EAGAIN;
   }

@@ -15115,9 +15115,7 @@ int PrimaryLogPG::rep_repair_primary_object(const hobject_t& soid, OpContext *ct
   waiting_for_unreadable_object[soid].push_back(op);
   op->mark_delayed("waiting for missing object");

-  if (!eio_errors_to_process) {
-    eio_errors_to_process = true;
-    ceph_assert(is_clean());
+  if (is_clean()) {
     state_set(PG_STATE_REPAIR);
     state_clear(PG_STATE_CLEAN);
     queue_peering_event(
@@ -15127,6 +15125,8 @@ int PrimaryLogPG::rep_repair_primary_object(const hobject_t& soid, OpContext *ct
          get_osdmap_epoch(),
          PeeringState::DoRecovery())));
   } else {
+    // Set repair in case we are the first read error and we happen to be backfilling or recovering
+    state_set(PG_STATE_REPAIR);
     // A prior error must have already cleared clean state and queued recovery
     // or a map change has triggered re-peering.
     // Not inlining the recovery by calling maybe_kick_recovery(soid);

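The change to the first hunk's condition can be read in isolation. The following is a minimal sketch, not Ceph code: `PgFlags`, `blocks_before_patch`, and `blocks_after_patch` are hypothetical names that stand in for the `is_clean()`, `is_backfilling()`, and `is_recovering()` checks in the diff above.

```cpp
// Hypothetical stand-in for the PG state consulted in
// rep_repair_primary_object(); not a real Ceph type.
struct PgFlags {
  bool clean;
  bool backfilling;
  bool recovering;
};

// Condition before the patch: any read error on a non-clean PG is blocked.
constexpr bool blocks_before_patch(const PgFlags& pg) {
  return !pg.clean;
}

// Condition after the patch: a read error is blocked only when the PG is
// not clean and neither backfill nor recovery is currently running.
constexpr bool blocks_after_patch(const PgFlags& pg) {
  return !pg.clean && !pg.backfilling && !pg.recovering;
}

// A recovering PG blocks the op before the patch but not after it.
static_assert(blocks_before_patch(PgFlags{false, false, true}),
              "unpatched: blocked while recovering");
static_assert(!blocks_after_patch(PgFlags{false, false, true}),
              "patched: not blocked while recovering");
```

With the patched condition, a read error that arrives during backfill or recovery falls through to the rest of the function instead of returning -EAGAIN. Whether it is then actually handled before the PG leaves backfill or recovery is the assumption that the description says still needs testing.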
Actions #2

Updated by David Zafman over 3 years ago

  • Status changed from New to In Progress
Actions #3

Updated by David Zafman over 3 years ago

Without this fix, every object gets its own recovery. The log below comes from a build with only 2 dout()s added.

2020-09-16T20:27:59.306-0700 7f285c50d700 20 osd.2 pg_epoch: 34 pg[2.0( v 34'102 lc 24'1 (0'0,34'102] local-lis/les=33/34 n=100 ec=22/22 lis/c=33/33 les/c/f=34/34/0 sis=33) [2,1,0] r=0 lpr=33 crt=34'102 lcod 34'102 mlcod 34'102 active+clean m=1] rep_repair_primary_object First read error starting recovery for 2:ff7b1f36:::obj1:head
2020-09-16T20:27:59.514-0700 7f285c50d700 20 osd.2 pg_epoch: 34 pg[2.0( v 34'102 (0'0,34'102] local-lis/les=33/34 n=100 ec=22/22 lis/c=33/33 les/c/f=34/34/0 sis=33) [2,1,0] r=0 lpr=33 crt=34'102 mlcod 34'102 active+recovering+repair mbc={255={}}] rep_repair_primary_object Blocked by PG state 2:104778fc:::obj2:head
2020-09-16T20:27:59.534-0700 7f285c50d700 20 osd.2 pg_epoch: 34 pg[2.0( v 34'102 lc 27'2 (0'0,34'102] local-lis/les=33/34 n=100 ec=22/22 lis/c=33/33 les/c/f=34/34/0 sis=33) [2,1,0] r=0 lpr=33 crt=34'102 lcod 34'102 mlcod 34'102 active+clean+repair m=1] rep_repair_primary_object First read error starting recovery for 2:104778fc:::obj2:head
2020-09-16T20:27:59.742-0700 7f285c50d700 20 osd.2 pg_epoch: 34 pg[2.0( v 34'102 (0'0,34'102] local-lis/les=33/34 n=100 ec=22/22 lis/c=33/33 les/c/f=34/34/0 sis=33) [2,1,0] r=0 lpr=33 crt=34'102 mlcod 34'102 active+recovering+repair mbc={255={}}] rep_repair_primary_object Blocked by PG state 2:8dd16f86:::obj3:head
2020-09-16T20:27:59.770-0700 7f285c50d700 20 osd.2 pg_epoch: 34 pg[2.0( v 34'102 lc 27'3 (0'0,34'102] local-lis/les=33/34 n=100 ec=22/22 lis/c=33/33 les/c/f=34/34/0 sis=33) [2,1,0] r=0 lpr=33 crt=34'102 lcod 34'102 mlcod 34'102 active+clean+repair m=1] rep_repair_primary_object First read error starting recovery for 2:8dd16f86:::obj3:head
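The pattern in the log above (start recovery for one object, block the next, return to clean, start recovery again) can be illustrated with a toy model. This is only a sketch under stated assumptions, not Ceph code: `count_recovery_cycles` is a hypothetical helper, all read errors are assumed to arrive while the first recovery is still running, and the patched branch assumes that errors arriving during recovery are picked up by that same recovery, which is the untested assumption from the description.

```cpp
#include <cstddef>

// Toy model (hypothetical, not Ceph code): counts how many times the PG
// enters recovery while servicing num_errors read errors that arrive
// back to back.
std::size_t count_recovery_cycles(std::size_t num_errors,
                                  bool block_while_recovering) {
  std::size_t remaining = num_errors;
  std::size_t cycles = 0;
  while (remaining > 0) {
    ++cycles;     // PG leaves clean and enters recovery
    --remaining;  // the error that triggered this cycle is repaired
    if (!block_while_recovering) {
      // Assumed patched behavior: the other errors join the running recovery.
      remaining = 0;
    }
    // Unpatched behavior: the other errors stay blocked until the PG is
    // clean, then each one starts its own cycle on the next iteration.
  }
  return cycles;
}
```

In this model, three read errors cost three recovery cycles when ops are blocked during recovery, matching obj1, obj2, and obj3 in the log, and one cycle when they are not.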
Actions #4

Updated by David Zafman over 3 years ago

  • Pull request ID set to 37205
Actions #5

Updated by Neha Ojha over 3 years ago

  • Status changed from In Progress to Fix Under Review
Actions #6

Updated by David Zafman over 3 years ago

  • Status changed from Fix Under Review to In Progress
Actions #7

Updated by Josh Durgin over 2 years ago

  • Assignee deleted (David Zafman)