Feature #13120
openosd: prioritize more degraded PGs for recovery by considering the missing_loc of the PG
% Done: 0%
Source: Community (dev)
Description
Copied from the email thread below:
>> Date: Wed, 16 Sep 2015 09:53:34 -0700
>> Subject: Re: Priority backfill/recovery based on the degradation level of the PG
>> From: sjust@redhat.com
>> To: yguang11@outlook.com
>>
>> Right, when we request the reservation from our peers (and from
>> ourself), we can make the priority dependent on the objects actually
>> degraded in missing_loc. We can also, later, use missing_loc to
>> determine the order in which to recover the objects. I don't see what
>> considering the history gets us that we can't get from simply
>> examining missing_loc.
>> -Sam
>>
>> On Wed, Sep 16, 2015 at 9:41 AM, GuangYang <yguang11@outlook.com> wrote:
>>> Thanks Sam, my comments inline..
>>>
>>> ----------------------------------------
>>>> Date: Wed, 16 Sep 2015 09:29:08 -0700
>>>> Subject: Re: Priority backfill/recovery based on the degradation level of the PG
>>>> From: sjust@redhat.com
>>>> To: yguang11@outlook.com
>>>>
>>>> From Aug 26:
>>>>
>>>> Aug 26 10:14:47 <yguang11> but I think it might make more sense
>>>> to calculate the priority based on how real degrade on objects level?
>>>> For example, during peering, we check each interval to get a sense of
>>>> how degraded for each interval ( pg_interval_t::check_new_interval).
>>>> Aug 26 10:15:09 <sjusthm> yguang11: I don't understand
>>>> Aug 26 10:15:26 * ngoswami has quit (Quit: Leaving)
>>>> Aug 26 10:15:33 <yguang11> For example, we have 8 + 3 as EC profile
>>>> Aug 26 10:16:02 <yguang11> and the PG went through interval 1, 2,
>>>> 3, during interval 1, it has 8 acting OSDs, interval 2, it has 9
>>>> Aug 26 10:16:46 <yguang11> interval 3, it has 10, and at some
>>>> point after interval 3, it starts backfilling, at which point, it has
>>>> 10 acting OSDs
>>>> Aug 26 10:17:03 <yguang11> but actually, the objects written at
>>>> interval 1 are more degraded, properly than others?
>>>> Aug 26 10:17:14 <sjusthm> those objects are recovery
>>>> Aug 26 10:17:14 * sankarshan has quit (Quit: Are you sure you
>>>> want to quit this channel (Cancel/Ok) ?)
>>>> Aug 26 10:17:20 <sjusthm> and that happens at a higher priority iirc
>>>> Aug 26 10:17:26 <sjusthm> log based recovery happens before backfill
>>>> Aug 26 10:17:30 <yguang11> right
>>>> Aug 26 10:17:39 <sjusthm> we do those first because they block writes
>>>> Aug 26 10:17:44 <sjusthm> backfill doesn't block writes
>>>> Aug 26 10:17:58 <sjusthm> (except briefly while the object is
>>>> actually being recovered)
>>>> Aug 26 10:18:27 <sjusthm> yguang11: yeah, it could do that
>>>> Aug 26 10:18:33 <sjusthm> currently recovery happens in version
>>>> order more or less
>>>> Aug 26 10:18:45 <sjusthm> unless an operation forces an object
>>>> to be recovered first
>>>> Aug 26 10:18:55 * swami1 has quit (Ping timeout: 480 seconds)
>>>> Aug 26 10:19:05 <yguang11> right, my concern is, when consider
>>>> *degraded* on those patches, they only check the *current* state of
>>>> the PG
>>>> Aug 26 10:19:05 <sjusthm> but since there are never more than
>>>> <log length> missing or degraded objects, it doesn't seem like an
>>>> important optimization
>>>> Aug 26 10:19:18 <yguang11> rather than a historic view of the
>>>> intervals it went through
>>>> Aug 26 10:19:29 <sjusthm> but only the current state matters?
>>>> Aug 26 10:19:40 <sjusthm> I don't understand the distinction
>>>> Aug 26 10:19:52 <sjusthm> we know which objects need to be
>>>> recovered in log based recovery
>>>> Aug 26 10:19:58 <sjusthm> we also know how degraded each is (missing_loc)
>>>> Aug 26 10:20:07 <sjusthm> so we could recover the ones with
>>>> fewer copies first
>>>> Aug 26 10:20:11 <sjusthm> is that what you mean?
>>>> Aug 26 10:20:18 <sjusthm> but this is all separate from backfill
>>>> Aug 26 10:20:23 <sage> one object might be missing 2 replicas (from
>>>> an earlier interval that never fully recovered), but the current
>>>> interval is only down by 1 replica.
>>>> Aug 26 10:20:27 <sjusthm> which doesn't happen until log based
>>>> recovery is done at the moment
>>>> Aug 26 10:20:42 <sjusthm> sage: yes, but we don't need to look
>>>> at the intervals for that
>>>> Aug 26 10:20:48 <sjusthm> missing_loc would tell us how many
>>>> copies we have
>>>> Aug 26 10:20:52 <sage> yeah
>>>> Aug 26 10:20:59 <sjusthm> yguang11: is that what you mean?
>>>> Aug 26 10:21:43 <yguang11> Does the priority consider the missing_loc?
>>>> Aug 26 10:21:49 <sjusthm> no, but it could
>>>>
>>>> I think the conclusion was that the history would be irrelevant for
>>>> backfill, and would only matter for log based recovery. We already
>>>> prioritize that case over backfill, and missing_loc tells us exactly
>>>> how degraded each object is. I don't see how the history helps us --
>>>> we could simply use missing_loc directly to determine the priority
>>>> when requesting recovery reservations without bothering to look at the
>>>> past intervals.
>>> As I understand it, missing_loc only handles the priority of objects to recover within a PG, given that the PG has already been chosen for recovery. One concern I have (I might be wrong here) is that when we choose which PG to recover, that decision does not consider the history of intervals but only checks the PG's acting set for the time being?
>>>
>>>> -Sam
>>>>
>>>> On Tue, Sep 15, 2015 at 5:26 PM, GuangYang <yguang11@outlook.com> wrote:
>>>>> Thanks Sam.
>>>>>
>>>>> For recovering, there are actually two sets of priorities:
>>>>> 1. Choose which PG to do recovery first
>>>>> 2. Once the PG is chosen, pick which object to start with
>>>>>
>>>>> Our discussion last time was mostly around the second one, while my concern in this thread is mostly the first one.
>>>>>
>>>>> Thanks,
>>>>> Guang
>>>>>
>>>>> ----------------------------------------
>>>>>> Date: Tue, 15 Sep 2015 10:33:14 -0700
>>>>>> Subject: Re: Priority backfill/recovery based on the degradation level of the PG
>>>>>> From: sjust@redhat.com
>>>>>> To: yguang11@outlook.com
>>>>>> CC: dzafman@redhat.com
>>>>>>
>>>>>> I think we discussed this a few weeks ago, but the logs are on my
>>>>>> machine at home. I'll get back to you once I take a look tomorrow.
>>>>>> -Sam
>>>>>>
>>>>>> On Tue, Sep 15, 2015 at 9:30 AM, GuangYang <yguang11@outlook.com> wrote:
>>>>>>> Hi David and Sam,
>>>>>>> Any thoughts on this?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Guang
>>>>>>>
>>>>>>> ----------------------------------------
>>>>>>>> From: yguang11@outlook.com
>>>>>>>> To: sjust@redhat.com
>>>>>>>> CC: ceph-devel@vger.kernel.org; dzafman@redhat.com
>>>>>>>> Subject: Priority backfill/recovery based on the degradation level of the PG
>>>>>>>> Date: Mon, 14 Sep 2015 14:29:39 -0700
>>>>>>>>
>>>>>>>> Hi Sam,
>>>>>>>> We discussed this briefly on IRC, I think it might be better to recap with an email.
>>>>>>>>
>>>>>>>> Currently we schedule the backfill/recovery based on how degraded the PG is, with a factor distinguishing recovery vs. backfill (recovery always has higher priority). The way to calculate the degradation level of a PG is: {expected_pool_size} - {acting_set_size}. I think there are two issues with the current approach:
>>>>>>>>
>>>>>>>> 1. The current {acting_set_size} might not capture the degradation level over the past intervals. For example, we have two PGs (Erasure Coding with 8 data and 3 parity chunks) 1.0 and 1.1:
>>>>>>>> 1.1 At t1, PG 1.0's acting set size becomes 8 while PG 1.1's acting set is 11
>>>>>>>> 1.2 At t2, PG 1.0's acting set size becomes 10 while PG 1.1's acting set is 9
>>>>>>>> 1.3 At t3, we start recovering (e.g. mark out some OSDs)
>>>>>>>> With the current algorithm, PG 1.1 will recover first and then PG 1.0 (if the concurrency is configured as 1); however, from a data durability perspective, the data written between t1 and t2 is more degraded and at greater risk.
>>>>>>>>
>>>>>>>> 2. The algorithm does not take EC vs. replication (and the EC profile) into account, which might also be important when considering data durability.
>>>>>>>>
>>>>>>>> Is my understanding correct here?
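To make the two-PG example from the original email concrete, here is a minimal Python sketch, not Ceph code. It compares the current ordering ({expected_pool_size} - {acting_set_size}) with an ordering that also looks at the worst-off object recorded in missing_loc, as Sam suggests. The PG class, the function names, and the modelling of missing_loc as a plain map from object name to the number of shards that can be located are all assumptions made for illustration; the real OSD structures differ.

```python
from dataclasses import dataclass, field


@dataclass
class PG:
    """Toy model of a placement group; every field is a simplification."""
    name: str
    pool_size: int    # expected shards or replicas (k + m for an EC pool)
    acting_size: int  # size of the current acting set
    # Hypothetical stand-in for missing_loc:
    # object name -> number of shards that can currently be located.
    missing_loc: dict = field(default_factory=dict)


def degradation_by_acting_set(pg: PG) -> int:
    """Current scheme from the thread: expected_pool_size - acting_set_size."""
    return pg.pool_size - pg.acting_size


def degradation_by_missing_loc(pg: PG) -> int:
    """Sketch of the proposal: also count the worst-off object in missing_loc."""
    worst_object = max(
        (pg.pool_size - located for located in pg.missing_loc.values()),
        default=0,
    )
    return max(degradation_by_acting_set(pg), worst_object)


def recovery_order(pgs, degradation):
    """Return PG names, most degraded first."""
    return [pg.name for pg in sorted(pgs, key=degradation, reverse=True)]


# State at t3 in the 8+3 example: PG 1.0 is back to 10 acting OSDs but still
# holds objects written while only 8 were up; PG 1.1 is at 9 acting OSDs.
pgs = [
    PG("1.0", pool_size=11, acting_size=10, missing_loc={"obj_written_t1_t2": 8}),
    PG("1.1", pool_size=11, acting_size=9),
]

print(recovery_order(pgs, degradation_by_acting_set))   # ['1.1', '1.0']
print(recovery_order(pgs, degradation_by_missing_loc))  # ['1.0', '1.1']
```

Under these assumptions the acting-set rule picks PG 1.1 first, while the missing_loc-aware rule picks PG 1.0 first, which is the ordering the email argues for on durability grounds. The sketch deliberately ignores the recovery vs. backfill factor and the EC profile raised in point 2.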