Feature #13120

osd: prioritize more degraded PGs for recovery by considering the missing_loc of the PG

Added by Guang Yang almost 7 years ago.

Status:
New
Priority:
Normal
Assignee:
Category:
OSD
Target version:
-
% Done:

0%

Source:
Community (dev)
Tags:
Backport:
Reviewed:
Affected Versions:
Pull request ID:

Description

Copying the email thread below:

>> Date: Wed, 16 Sep 2015 09:53:34 -0700
>> Subject: Re: Priority backfill/recovery based on the degradation level of the PG
>> From: sjust@redhat.com
>> To: yguang11@outlook.com
>>
>> Right, when we request the reservation from our peers (and from
>> ourself), we can make the priority dependent on the objects actually
>> degraded in missing_loc. We can also, later, use missing_loc to
>> determine the order in which to recover the objects. I don't see what
>> considering the history gets us that we can't get from simply
>> examining missing_loc.
>> -Sam
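
[A minimal sketch of the per-object ordering idea above, not actual Ceph code: treat missing_loc as a map from object to the set of OSDs known to still hold a copy or shard (hypothetical names), and recover the objects with the fewest located copies first.]

```python
# Hypothetical sketch: order degraded objects for recovery using a
# missing_loc-like map of {object: set of OSDs still holding a copy/shard}.
def recovery_order(missing_loc):
    # Fewer known locations => more degraded => recover earlier.
    return sorted(missing_loc, key=lambda obj: len(missing_loc[obj]))

locs = {
    "obj_a": {0, 1, 2},   # 3 copies remain
    "obj_b": {4},         # 1 copy remains: most at risk
    "obj_c": {5, 6},      # 2 copies remain
}
print(recovery_order(locs))  # ['obj_b', 'obj_c', 'obj_a']
```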
>>
>> On Wed, Sep 16, 2015 at 9:41 AM, GuangYang <yguang11@outlook.com> wrote:
>>> Thanks Sam, my comments inline..
>>>
>>> ----------------------------------------
>>>> Date: Wed, 16 Sep 2015 09:29:08 -0700
>>>> Subject: Re: Priority backfill/recovery based on the degradation level of the PG
>>>> From: sjust@redhat.com
>>>> To: yguang11@outlook.com
>>>>
>>>> From Aug 26:
>>>>
>>>> Aug 26 10:14:47 <yguang11> but I think it might make more sense
>>>> to calculate the priority based on the real degradation at the object level?
>>>> For example, during peering, we check each interval to get a sense of
>>>> how degraded for each interval ( pg_interval_t::check_new_interval).
>>>> Aug 26 10:15:09 <sjusthm> yguang11: I don't understand
>>>> Aug 26 10:15:26 * ngoswami has quit (Quit: Leaving)
>>>> Aug 26 10:15:33 <yguang11> For example, we have 8 + 3 as EC profile
>>>> Aug 26 10:16:02 <yguang11> and the PG went through interval 1, 2,
>>>> 3, during interval 1, it has 8 acting OSDs, interval 2, it has 9
>>>> Aug 26 10:16:46 <yguang11> interval 3, it has 10, and at some
>>>> point after interval 3, it starts backfilling, at which point,it has
>>>> 10 acting OSDs
>>>> Aug 26 10:17:03 <yguang11> but actually, the objects written at
>>>> interval 1 are more degraded, probably, than others?
>>>> Aug 26 10:17:14 <sjusthm> those objects are recovery
>>>> Aug 26 10:17:14 * sankarshan has quit (Quit: Are you sure you
>>>> want to quit this channel (Cancel/Ok) ?)
>>>> Aug 26 10:17:20 <sjusthm> and that happens at a higher priority iirc
>>>> Aug 26 10:17:26 <sjusthm> log based recovery happens before backfill
>>>> Aug 26 10:17:30 <yguang11> right
>>>> Aug 26 10:17:39 <sjusthm> we do those first because they block writes
>>>> Aug 26 10:17:44 <sjusthm> backfill doesn't block writes
>>>> Aug 26 10:17:58 <sjusthm> (except briefly while the object is
>>>> actually being recovered)
>>>> Aug 26 10:18:27 <sjusthm> yguang11: yeah, it could do that
>>>> Aug 26 10:18:33 <sjusthm> currently recovery happens in version
>>>> order more or less
>>>> Aug 26 10:18:45 <sjusthm> unless an operation forces an object
>>>> to be recovered first
>>>> Aug 26 10:18:55 * swami1 has quit (Ping timeout: 480 seconds)
>>>> Aug 26 10:19:05 <yguang11> right, my concern is, when consider
>>>> *degraded* on those patches, they only check the *current* state of
>>>> the PG
>>>> Aug 26 10:19:05 <sjusthm> but since there are never more than
>>>> <log length> missing or degraded objects, it doesn't seem like an
>>>> important optimization
>>>> Aug 26 10:19:18 <yguang11> rather than a historic view of the
>>>> intervals it went through
>>>> Aug 26 10:19:29 <sjusthm> but only the current state matters?
>>>> Aug 26 10:19:40 <sjusthm> I don't understand the distinction
>>>> Aug 26 10:19:52 <sjusthm> we know which objects need to be
>>>> recovered in log based recovery
>>>> Aug 26 10:19:58 <sjusthm> we also know how degraded each is (missing_loc)
>>>> Aug 26 10:20:07 <sjusthm> so we could recover the ones with
>>>> fewer copies first
>>>> Aug 26 10:20:11 <sjusthm> is that what you mean?
>>>> Aug 26 10:20:18 <sjusthm> but this is all separate from backfill
>>>> Aug 26 10:20:23 <sage> one object might be missing 2 replicas (from
>>>> an earlier interval that never fully recovered), but the current
>>>> interval is only down by 1 replica.
>>>> Aug 26 10:20:27 <sjusthm> which doesn't happen until log based
>>>> recovery is done at the moment
>>>> Aug 26 10:20:42 <sjusthm> sage: yes, but we don't need to look
>>>> at the intervals for that
>>>> Aug 26 10:20:48 <sjusthm> missing_loc would tell us how many
>>>> copies we have
>>>> Aug 26 10:20:52 <sage> yeah
>>>> Aug 26 10:20:59 <sjusthm> yguang11: is that what you mean?
>>>> Aug 26 10:21:43 <yguang11> Does the priority consider the missing_loc?
>>>> Aug 26 10:21:49 <sjusthm> no, but it could
>>>>
>>>> I think the conclusion was that the history would be irrelevant for
>>>> backfill, and would only matter for log based recovery. We already
>>>> prioritize that case over backfill, and missing_loc tells us exactly
>>>> how degraded each object is. I don't see how the history helps us --
>>>> we could simply use missing_loc directly to determine the priority
>>>> when requesting recovery reservations without bothering to look at the
>>>> past intervals.
>>> As I understand it, missing_loc only handles the priority of objects within a PG to recover, given that the PG has already been chosen for recovery. One concern I have (I might be wrong here) is that when we choose which PG to recover, that decision does not consider the history of intervals, but only checks the current acting set of the PG?
>>>
>>>> -Sam
>>>>
>>>> On Tue, Sep 15, 2015 at 5:26 PM, GuangYang <yguang11@outlook.com> wrote:
>>>>> Thanks Sam.
>>>>>
>>>>> For recovering, there are actually two sets of priorities:
>>>>> 1. Choose which PG to do recovery first
>>>>> 2. Once the PG is chosen, pick which object to start with
>>>>>
>>>>> Our discussion last time was mostly around the second one, while my concerns in this thread is mostly the first one.
>>>>>
>>>>> Thanks,
>>>>> Guang
>>>>>
>>>>> ----------------------------------------
>>>>>> Date: Tue, 15 Sep 2015 10:33:14 -0700
>>>>>> Subject: Re: Priority backfill/recovery based on the degradation level of the PG
>>>>>> From: sjust@redhat.com
>>>>>> To: yguang11@outlook.com
>>>>>> CC: dzafman@redhat.com
>>>>>>
>>>>>> I think we discussed this a few weeks ago, but the logs are on my
>>>>>> machine at home. I'll get back to you once I take a look tomorrow.
>>>>>> -Sam
>>>>>>
>>>>>> On Tue, Sep 15, 2015 at 9:30 AM, GuangYang <yguang11@outlook.com> wrote:
>>>>>>> Hi David and Sam,
>>>>>>> Any thoughts on this?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Guang
>>>>>>>
>>>>>>> ----------------------------------------
>>>>>>>> From: yguang11@outlook.com
>>>>>>>> To: sjust@redhat.com
>>>>>>>> CC: ceph-devel@vger.kernel.org; dzafman@redhat.com
>>>>>>>> Subject: Priority backfill/recovery based on the degradation level of the PG
>>>>>>>> Date: Mon, 14 Sep 2015 14:29:39 -0700
>>>>>>>>
>>>>>>>> Hi Sam,
>>>>>>>> We discussed this briefly on IRC, I think it might be better to recap with an email.
>>>>>>>>
>>>>>>>> Currently we schedule backfill/recovery based on how degraded the PG is, with a factor distinguishing recovery vs. backfill (recovery always has higher priority). The degradation level of a PG is calculated as: {expected_pool_size} - {acting_set_size}. I think there are two issues with the current approach:
>>>>>>>>
>>>>>>>> 1. The current {acting_set_size} might not capture the degradation level over past intervals. For example, we have two PGs (Erasure Coding with 8 data and 3 parity chunks) 1.0 and 1.1:
>>>>>>>> 1.1 At t1, PG 1.0's acting set size becomes 8 while PG 1.1's acting set is 11
>>>>>>>> 1.2 At t2, PG 1.0's acting set size becomes 10 while PG 1.1's acting set is 9
>>>>>>>> 1.3 At t3, we start recovering (e.g. mark out some OSDs)
>>>>>>>> With the current algorithm, PG 1.1 will recover first and then PG 1.0 (if the concurrency is configured as 1); however, from a data durability perspective, the data written to PG 1.0 between t1 and t2 is more degraded and at higher risk.
>>>>>>>>
>>>>>>>> 2. The algorithm does not take EC vs. replication into account (nor the EC profile), which might also be important to consider for data durability.
>>>>>>>>
>>>>>>>> Is my understanding correct here?
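
[A minimal sketch contrasting the two priority schemes discussed in the thread, under the two-PG example above; all names are hypothetical and this is not actual Ceph code. With an EC 8+3 pool (expected size 11), the current acting-set-based priority ranks PG 1.1 as more degraded, while a missing_loc-based priority, driven by the worst-off object in each PG, would rank PG 1.0 first.]

```python
# Hypothetical sketch of PG-level recovery priority, where a larger
# number means "more degraded, recover sooner".
POOL_SIZE = 11  # EC profile 8 + 3

def current_priority(acting_set):
    # Today's scheme: only the current acting-set size matters.
    return POOL_SIZE - len(acting_set)

def missing_loc_priority(missing_loc):
    # Proposed scheme: the worst-off object in the PG sets the priority.
    # missing_loc: {object: set of OSDs known to hold its shards}.
    return max(POOL_SIZE - len(locs) for locs in missing_loc.values())

# PG 1.0: acting set is back to 10 OSDs, but an object written while
# only 8 OSDs were acting still has just 8 shards located.
pg10_acting = set(range(10))
pg10_missing_loc = {"old_obj": set(range(8)), "new_obj": set(range(10))}

# PG 1.1: acting set shrank to 9 OSDs; all its objects have 9 shards.
pg11_acting = set(range(9))
pg11_missing_loc = {"obj": set(range(9))}

print(current_priority(pg10_acting), current_priority(pg11_acting))
# 1 2  -> current scheme recovers PG 1.1 first
print(missing_loc_priority(pg10_missing_loc), missing_loc_priority(pg11_missing_loc))
# 3 2  -> missing_loc-based scheme recovers PG 1.0 first
```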