Bug #47508
open
Multiple read errors cause repeated entry/exit recovery for each error
Added by David Zafman over 3 years ago.
Updated over 2 years ago.
Description
After looking at https://github.com/ceph/ceph/pull/36989 I realized that after the first read error, all the others get saved by block_for_clean(). If backfill or recovery is already in progress, I assume that adding the information to the pg log will get the error handled before backfill or recovery finishes.
This should be tested.
--- a/src/osd/PrimaryLogPG.cc
+++ b/src/osd/PrimaryLogPG.cc
@@ -15092,8 +15092,8 @@ int PrimaryLogPG::rep_repair_primary_object(const hobject_t& soid, OpContext *ct
dout(10) << __func__ << " " << soid
<< " peers osd.{" << get_acting_recovery_backfill() << "}" << dendl;
- if (!is_clean()) {
- block_for_clean(soid, op);
+ if (!is_clean() && !is_backfilling() && !is_recovering()) {
+ block_for_clean(soid, op); // XXX: Fix function name?
return -EAGAIN;
}
@@ -15115,9 +15115,7 @@ int PrimaryLogPG::rep_repair_primary_object(const hobject_t& soid, OpContext *ct
waiting_for_unreadable_object[soid].push_back(op);
op->mark_delayed("waiting for missing object");
- if (!eio_errors_to_process) {
- eio_errors_to_process = true;
- ceph_assert(is_clean());
+ if (is_clean()) {
state_set(PG_STATE_REPAIR);
state_clear(PG_STATE_CLEAN);
queue_peering_event(
@@ -15127,6 +15125,8 @@ int PrimaryLogPG::rep_repair_primary_object(const hobject_t& soid, OpContext *ct
get_osdmap_epoch(),
PeeringState::DoRecovery())));
} else {
+ // Set repair in case this is the first read error and we happen to be backfilling or recovering
+ state_set(PG_STATE_REPAIR);
// A prior error must have already cleared clean state and queued recovery
// or a map change has triggered re-peering.
// Not inlining the recovery by calling maybe_kick_recovery(soid);
- Status changed from New to In Progress
Without this fix, every object with a read error triggers its own recovery cycle. The log below was captured with only 2 dout()s added.
2020-09-16T20:27:59.306-0700 7f285c50d700 20 osd.2 pg_epoch: 34 pg[2.0( v 34'102 lc 24'1 (0'0,34'102] local-lis/les=33/34 n=100 ec=22/22 lis/c=33/33 les/c/f=34/34/0 sis=33) [2,1,0] r=0 lpr=33 crt=34'102 lcod 34'102 mlcod 34'102 active+clean m=1] rep_repair_primary_object First read error starting recovery for 2:ff7b1f36:::obj1:head
2020-09-16T20:27:59.514-0700 7f285c50d700 20 osd.2 pg_epoch: 34 pg[2.0( v 34'102 (0'0,34'102] local-lis/les=33/34 n=100 ec=22/22 lis/c=33/33 les/c/f=34/34/0 sis=33) [2,1,0] r=0 lpr=33 crt=34'102 mlcod 34'102 active+recovering+repair mbc={255={}}] rep_repair_primary_object Blocked by PG state 2:104778fc:::obj2:head
2020-09-16T20:27:59.534-0700 7f285c50d700 20 osd.2 pg_epoch: 34 pg[2.0( v 34'102 lc 27'2 (0'0,34'102] local-lis/les=33/34 n=100 ec=22/22 lis/c=33/33 les/c/f=34/34/0 sis=33) [2,1,0] r=0 lpr=33 crt=34'102 lcod 34'102 mlcod 34'102 active+clean+repair m=1] rep_repair_primary_object First read error starting recovery for 2:104778fc:::obj2:head
2020-09-16T20:27:59.742-0700 7f285c50d700 20 osd.2 pg_epoch: 34 pg[2.0( v 34'102 (0'0,34'102] local-lis/les=33/34 n=100 ec=22/22 lis/c=33/33 les/c/f=34/34/0 sis=33) [2,1,0] r=0 lpr=33 crt=34'102 mlcod 34'102 active+recovering+repair mbc={255={}}] rep_repair_primary_object Blocked by PG state 2:8dd16f86:::obj3:head
2020-09-16T20:27:59.770-0700 7f285c50d700 20 osd.2 pg_epoch: 34 pg[2.0( v 34'102 lc 27'3 (0'0,34'102] local-lis/les=33/34 n=100 ec=22/22 lis/c=33/33 les/c/f=34/34/0 sis=33) [2,1,0] r=0 lpr=33 crt=34'102 lcod 34'102 mlcod 34'102 active+clean+repair m=1] rep_repair_primary_object First read error starting recovery for 2:8dd16f86:::obj3:head
- Pull request ID set to 37205
- Status changed from In Progress to Fix Under Review
- Status changed from Fix Under Review to In Progress
- Assignee deleted (David Zafman)