Bug #47508
Multiple read errors cause repeated entry/exit recovery for each error
Status:
In Progress
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:
0%
Source:
Development
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
After looking at https://github.com/ceph/ceph/pull/36989 I realized that after the first read error, all the later errors get held up by block_for_clean(). If backfill or recovery is already in progress, I assume that adding the information to the pg log will get the object handled before backfill or recovery finishes.
This should be tested.
Updated by David Zafman over 3 years ago
--- a/src/osd/PrimaryLogPG.cc
+++ b/src/osd/PrimaryLogPG.cc
@@ -15092,8 +15092,8 @@ int PrimaryLogPG::rep_repair_primary_object(const hobject_t& soid, OpContext *ct
   dout(10) << __func__ << " " << soid
           << " peers osd.{" << get_acting_recovery_backfill() << "}"
           << dendl;
-  if (!is_clean()) {
-    block_for_clean(soid, op);
+  if (!is_clean() && !is_backfilling() && !is_recovering()) {
+    block_for_clean(soid, op);  // XXX: Fix function name?
     return -EAGAIN;
   }
@@ -15115,9 +15115,7 @@ int PrimaryLogPG::rep_repair_primary_object(const hobject_t& soid, OpContext *ct
   waiting_for_unreadable_object[soid].push_back(op);
   op->mark_delayed("waiting for missing object");

-  if (!eio_errors_to_process) {
-    eio_errors_to_process = true;
-    ceph_assert(is_clean());
+  if (is_clean()) {
     state_set(PG_STATE_REPAIR);
     state_clear(PG_STATE_CLEAN);
     queue_peering_event(
@@ -15127,6 +15125,8 @@ int PrimaryLogPG::rep_repair_primary_object(const hobject_t& soid, OpContext *ct
         get_osdmap_epoch(),
         PeeringState::DoRecovery())));
   } else {
+    // Set repair in case this is the first read error and we happen to be backfilling or recovering
+    state_set(PG_STATE_REPAIR);
     // A prior error must have already cleared clean state and queued recovery
     // or a map change has triggered re-peering.
     // Not inlining the recovery by calling maybe_kick_recovery(soid);
Updated by David Zafman over 3 years ago
Without this fix, every object with a read error gets its own recovery cycle. The log below is from a build with only 2 dout()s added.
2020-09-16T20:27:59.306-0700 7f285c50d700 20 osd.2 pg_epoch: 34 pg[2.0( v 34'102 lc 24'1 (0'0,34'102] local-lis/les=33/34 n=100 ec=22/22 lis/c=33/33 les/c/f=34/34/0 sis=33) [2,1,0] r=0 lpr=33 crt=34'102 lcod 34'102 mlcod 34'102 active+clean m=1] rep_repair_primary_object First read error starting recovery for 2:ff7b1f36:::obj1:head
2020-09-16T20:27:59.514-0700 7f285c50d700 20 osd.2 pg_epoch: 34 pg[2.0( v 34'102 (0'0,34'102] local-lis/les=33/34 n=100 ec=22/22 lis/c=33/33 les/c/f=34/34/0 sis=33) [2,1,0] r=0 lpr=33 crt=34'102 mlcod 34'102 active+recovering+repair mbc={255={}}] rep_repair_primary_object Blocked by PG state 2:104778fc:::obj2:head
2020-09-16T20:27:59.534-0700 7f285c50d700 20 osd.2 pg_epoch: 34 pg[2.0( v 34'102 lc 27'2 (0'0,34'102] local-lis/les=33/34 n=100 ec=22/22 lis/c=33/33 les/c/f=34/34/0 sis=33) [2,1,0] r=0 lpr=33 crt=34'102 lcod 34'102 mlcod 34'102 active+clean+repair m=1] rep_repair_primary_object First read error starting recovery for 2:104778fc:::obj2:head
2020-09-16T20:27:59.742-0700 7f285c50d700 20 osd.2 pg_epoch: 34 pg[2.0( v 34'102 (0'0,34'102] local-lis/les=33/34 n=100 ec=22/22 lis/c=33/33 les/c/f=34/34/0 sis=33) [2,1,0] r=0 lpr=33 crt=34'102 mlcod 34'102 active+recovering+repair mbc={255={}}] rep_repair_primary_object Blocked by PG state 2:8dd16f86:::obj3:head
2020-09-16T20:27:59.770-0700 7f285c50d700 20 osd.2 pg_epoch: 34 pg[2.0( v 34'102 lc 27'3 (0'0,34'102] local-lis/les=33/34 n=100 ec=22/22 lis/c=33/33 les/c/f=34/34/0 sis=33) [2,1,0] r=0 lpr=33 crt=34'102 lcod 34'102 mlcod 34'102 active+clean+repair m=1] rep_repair_primary_object First read error starting recovery for 2:8dd16f86:::obj3:head
Updated by Neha Ojha over 3 years ago
- Status changed from In Progress to Fix Under Review
Updated by David Zafman over 3 years ago
- Status changed from Fix Under Review to In Progress