Bug #47508: Multiple read errors cause repeated entry/exit recovery for each error - RADOS - Ceph

Bug #47508

open

Multiple read errors cause repeated entry/exit recovery for each error

Added by David Zafman over 3 years ago. Updated over 2 years ago.

Status: In Progress
Priority: Normal
Source: Development
Severity: 3 - minor
Pull request ID: 37205

Description

After looking at https://github.com/ceph/ceph/pull/36989, I realized that after the first read error, all the others get queued by block_for_clean(). If backfill or recovery is already in progress, I assume that adding the information to the pg log will get it handled before the PG leaves backfill or recovery.

This should be tested.

Updated by David Zafman over 3 years ago

--- a/src/osd/PrimaryLogPG.cc
+++ b/src/osd/PrimaryLogPG.cc
@@ -15092,8 +15092,8 @@ int PrimaryLogPG::rep_repair_primary_object(const hobject_t& soid, OpContext *ct
   dout(10) << __func__ << " " << soid
           << " peers osd.{" << get_acting_recovery_backfill() << "}" << dendl;

-  if (!is_clean()) {
-    block_for_clean(soid, op);
+  if (!is_clean() && !is_backfilling() && !is_recovering()) {
+    block_for_clean(soid, op);  // XXX: Fix function name?
     return -EAGAIN;
   }

@@ -15115,9 +15115,7 @@ int PrimaryLogPG::rep_repair_primary_object(const hobject_t& soid, OpContext *ct
   waiting_for_unreadable_object[soid].push_back(op);
   op->mark_delayed("waiting for missing object");

-  if (!eio_errors_to_process) {
-    eio_errors_to_process = true;
-    ceph_assert(is_clean());
+  if (is_clean()) {
     state_set(PG_STATE_REPAIR);
     state_clear(PG_STATE_CLEAN);
     queue_peering_event(
@@ -15127,6 +15125,8 @@ int PrimaryLogPG::rep_repair_primary_object(const hobject_t& soid, OpContext *ct
          get_osdmap_epoch(),
          PeeringState::DoRecovery())));
   } else {
+    // Set repair in case we are the first read error and we happen to be backfilling or recovering
+    state_set(PG_STATE_REPAIR);
     // A prior error must have already cleared clean state and queued recovery
     // or a map change has triggered re-peering.
     // Not inlining the recovery by calling maybe_kick_recovery(soid);

Updated by David Zafman over 3 years ago

Status changed from New to In Progress

Updated by David Zafman over 3 years ago

Without this fix, every object triggers its own recovery. The log below was captured with only two dout()s added.

2020-09-16T20:27:59.306-0700 7f285c50d700 20 osd.2 pg_epoch: 34 pg[2.0( v 34'102 lc 24'1 (0'0,34'102] local-lis/les=33/34 n=100 ec=22/22 lis/c=33/33 les/c/f=34/34/0 sis=33) [2,1,0] r=0 lpr=33 crt=34'102 lcod 34'102 mlcod 34'102 active+clean m=1] rep_repair_primary_object First read error starting recovery for 2:ff7b1f36:::obj1:head
2020-09-16T20:27:59.514-0700 7f285c50d700 20 osd.2 pg_epoch: 34 pg[2.0( v 34'102 (0'0,34'102] local-lis/les=33/34 n=100 ec=22/22 lis/c=33/33 les/c/f=34/34/0 sis=33) [2,1,0] r=0 lpr=33 crt=34'102 mlcod 34'102 active+recovering+repair mbc={255={}}] rep_repair_primary_object Blocked by PG state 2:104778fc:::obj2:head
2020-09-16T20:27:59.534-0700 7f285c50d700 20 osd.2 pg_epoch: 34 pg[2.0( v 34'102 lc 27'2 (0'0,34'102] local-lis/les=33/34 n=100 ec=22/22 lis/c=33/33 les/c/f=34/34/0 sis=33) [2,1,0] r=0 lpr=33 crt=34'102 lcod 34'102 mlcod 34'102 active+clean+repair m=1] rep_repair_primary_object First read error starting recovery for 2:104778fc:::obj2:head
2020-09-16T20:27:59.742-0700 7f285c50d700 20 osd.2 pg_epoch: 34 pg[2.0( v 34'102 (0'0,34'102] local-lis/les=33/34 n=100 ec=22/22 lis/c=33/33 les/c/f=34/34/0 sis=33) [2,1,0] r=0 lpr=33 crt=34'102 mlcod 34'102 active+recovering+repair mbc={255={}}] rep_repair_primary_object Blocked by PG state 2:8dd16f86:::obj3:head
2020-09-16T20:27:59.770-0700 7f285c50d700 20 osd.2 pg_epoch: 34 pg[2.0( v 34'102 lc 27'3 (0'0,34'102] local-lis/les=33/34 n=100 ec=22/22 lis/c=33/33 les/c/f=34/34/0 sis=33) [2,1,0] r=0 lpr=33 crt=34'102 lcod 34'102 mlcod 34'102 active+clean+repair m=1] rep_repair_primary_object First read error starting recovery for 2:8dd16f86:::obj3:head
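
The pattern in the log (each object first "Blocked by PG state", then "First read error starting recovery" once the PG is clean again) can be illustrated with a small toy model. This is only a sketch and not Ceph code: GatePolicy and count_recovery_passes are hypothetical names, and the JoinActiveRecovery branch encodes the untested assumption from the description that an error recorded during an active recovery is handled before that recovery ends.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>

// Toy model only; none of these names exist in Ceph.
enum class GatePolicy {
  BlockUntilClean,     // mirrors the original check: if (!is_clean()) block the op
  JoinActiveRecovery   // mirrors the patched check: do not block while recovering/backfilling
};

// Counts how many separate recovery passes are needed to repair `num_errors`
// objects when every read error after the first one arrives while the first
// recovery is still running.
std::size_t count_recovery_passes(std::size_t num_errors, GatePolicy policy) {
  std::deque<std::size_t> pending;  // ops that still have to hit their read error
  for (std::size_t i = 0; i < num_errors; ++i) {
    pending.push_back(i);
  }

  std::size_t passes = 0;
  std::size_t repaired = 0;
  while (!pending.empty()) {
    // The PG is clean here, so the next op is a "first read error":
    // it leaves the clean state and starts a recovery pass.
    pending.pop_front();
    ++passes;
    ++repaired;

    if (policy == GatePolicy::JoinActiveRecovery) {
      // Assumption: errors seen during the pass are picked up by the same pass.
      repaired += pending.size();
      pending.clear();
    }
    // With BlockUntilClean the remaining ops stay queued and are retried only
    // after this pass ends, so each of them starts a pass of its own.
  }

  assert(repaired == num_errors);
  return passes;
}
```

For three failing objects, as in the log above, count_recovery_passes(3, GatePolicy::BlockUntilClean) returns 3, while count_recovery_passes(3, GatePolicy::JoinActiveRecovery) returns 1. Whether the real PG behaves like the second case is exactly what still needs to be tested.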
