Bug #41924: asynchronous recovery can not function under certain circumstances - RADOS - Ceph

Bug #41924

closed

asynchronous recovery can not function under certain circumstances

Added by xie xingguo over 4 years ago. Updated over 4 years ago.

Status: Resolved
Priority: Normal
Assignee: xie xingguo
Category: Peering
Source: Community (dev)
Backport: nautilus
Severity: 3 - minor
Pull request ID: 30466

Description

guoracle reports that:

In the asynchronous recovery feature, the asynchronous recovery target OSDs are selected by last_update.version. After peering completes, an asynchronous recovery target OSD updates its last_update.version. If it then goes down again and later comes back online, the next peering finds no pglog difference between the asynchronous recovery target and the authoritative OSD, so no asynchronous recovery takes place.
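
The reported sequence can be illustrated with a small sketch. This is not Ceph code: the PeerInfo class, the is_async_target helper, and the MIN_COST threshold are hypothetical names used only for illustration, and the model assumes that a target is chosen purely from the gap between the two last_update versions.

```python
from dataclasses import dataclass


@dataclass
class PeerInfo:
    """Hypothetical stand-in for the per-OSD info exchanged during peering."""
    last_update_version: int  # version part of last_update


def approx_missing(auth: PeerInfo, peer: PeerInfo) -> int:
    """Estimate missing objects from the pglog version gap alone."""
    return abs(auth.last_update_version - peer.last_update_version)


def is_async_target(auth: PeerInfo, peer: PeerInfo, min_cost: int) -> bool:
    """A peer qualifies for async recovery only if the gap exceeds min_cost."""
    return approx_missing(auth, peer) > min_cost


MIN_COST = 100  # assumed threshold, for illustration only

auth = PeerInfo(last_update_version=1500)
peer = PeerInfo(last_update_version=1000)

# First peering: the version gap is 500, so the peer is selected.
first_peering = is_async_target(auth, peer, MIN_COST)

# Peering completes and the peer's last_update catches up with the
# authoritative log, even though its objects have not been recovered yet.
peer.last_update_version = auth.last_update_version

# The peer goes down and comes back: the gap is now 0, so it is not selected.
second_peering = is_async_target(auth, peer, MIN_COST)

print(first_peering, second_peering)
```

In this model the second peering has no way to tell that the peer still has unrecovered objects, which is the behaviour described in the report.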

https://github.com/ceph/ceph/pull/24004 aimed to solve the problem by persisting the number of missing objects to disk when peering was done, so that both the new approximate number of missing objects (estimated from last_update) and the historical num_objects_missing could be taken into account when determining async_recovery_targets in any follow-up peering cycle.
However, this holds only if the num_objects_missing field is kept up to date for each pg instance under all circumstances, which is unfortunately not true for replicas that have completed peering but never started recovery afterwards (7de35629f562436d2bdb85788bdf97b10db3f556 makes sure that num_objects_missing is updated for the primary when peering is done, and keeps it up to date as each missing object is recovered).
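
A rough sketch of why a stale num_objects_missing defeats this approach follows. It assumes the estimate is simply the version gap plus the persisted counter; the estimated_cost function and the MIN_COST threshold are hypothetical and do not reproduce the actual Ceph implementation.

```python
def estimated_cost(auth_version: int, peer_version: int,
                   num_objects_missing: int) -> int:
    """Combine the pglog version gap with the persisted missing-object count."""
    return abs(auth_version - peer_version) + num_objects_missing


MIN_COST = 100  # assumed threshold, for illustration only

# Peer whose counter was kept up to date: 500 objects are still missing,
# so it is selected even though the version gap is zero.
cost_fresh = estimated_cost(1500, 1500, num_objects_missing=500)

# Replica that finished peering but never started recovery: in this
# scenario its counter was never updated and still reads 0.
cost_stale = estimated_cost(1500, 1500, num_objects_missing=0)

print(cost_fresh > MIN_COST, cost_stale > MIN_COST)
```

With an accurate counter the peer is still chosen as an asynchronous recovery target; with the stale counter the estimate collapses to the version gap and the original problem reappears.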

Note that guoracle also suggests fixing the same problem by using last_complete.version to calculate the pglog difference, and by updating the last_complete of the asynchronous recovery target OSD in the copy of peer_info to the latest value once recovery completes. That should not work well, because we might reset last_complete to 0'0 whenever we trim the pglog past the minimal need-version of the missing set.

Related issues 1 (0 open — 1 closed)

Updated by xie xingguo over 4 years ago

Pull request ID set to 30466

Updated by xie xingguo over 4 years ago

Backport changed from mimic,nautilus to nautilus

Updated by Greg Farnum over 4 years ago

Status changed from New to 17

Updated by Neha Ojha over 4 years ago

Status changed from 17 to Fix Under Review

Updated by xie xingguo over 4 years ago

Status changed from Fix Under Review to Pending Backport

Updated by Nathan Cutler over 4 years ago

Copied to Backport #42141: nautilus: asynchronous recovery can not function under certain circumstances added

Updated by Nathan Cutler over 4 years ago

Backport changed from nautilus to nautilus, mimic

Adding mimic backport, since the first attempted fix (see https://github.com/ceph/ceph/pull/30459) targeted mimic.

Updated by Neha Ojha over 4 years ago

@Nathan Weinberg The PR that merged is based on https://github.com/ceph/ceph/pull/24004, which has not been backported to mimic. I don't think we should backport this fix to mimic.

Updated by Neha Ojha over 4 years ago

Backport changed from nautilus, mimic to nautilus

Updated by Nathan Cutler over 4 years ago

Neha Ojha wrote:

@Nathan Weinberg The PR that merged is based on https://github.com/ceph/ceph/pull/24004, which has not been backported to mimic. I don't think we should backport this fix to mimic.

Thanks, Neha.

Updated by Nathan Cutler over 4 years ago

Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".
