Bug #39099

Give recovery for inactive PGs a higher priority

Added by David Zafman about 2 years ago. Updated over 1 year ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
nautilus
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Inactive backfill gets base priority 220. We should make sure that an inactive PG which needs only recovery, and would normally get base priority 180, is also raised to base priority 220.
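A minimal sketch of the proposed behavior. The two constants (180 and 220) come from the description above; the function name and signature are illustrative, not Ceph's actual get_recovery_priority() code.

```python
RECOVERY_PRIORITY_BASE = 180   # normal base priority for recovery
INACTIVE_PRIORITY_BASE = 220   # elevated base already used for inactive backfill

def recovery_base_priority(pg_is_active: bool) -> int:
    """Pick the base priority for a PG that needs recovery only.

    An inactive PG gets the same elevated base (220) that inactive
    backfill already receives, instead of the normal 180.
    """
    return RECOVERY_PRIORITY_BASE if pg_is_active else INACTIVE_PRIORITY_BASE
```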


Related issues

Related to RADOS - Feature #39339: prioritize backfill of metadata pools, automatically In Progress
Related to RADOS - Documentation #39011: Document how get_recovery_priority() and get_backfill_priority() impacts recovery order Resolved 03/28/2019
Duplicated by RADOS - Bug #38930: ceph osd safe-to-destroy wrongly approves any out osd Duplicate 03/25/2019
Precedes RADOS - Bug #35808: ceph osd ok-to-stop result doesn't match the real situation Rejected
Copied to RADOS - Backport #39504: nautilus: Give recovery for inactive PGs a higher priority Resolved
Copied to RADOS - Backport #39505: luminous: Give recovery for inactive PGs a higher priority Rejected
Copied to RADOS - Backport #39506: mimic: Give recovery for inactive PGs a higher priority Rejected

History

#1 Updated by David Zafman about 2 years ago

  • Subject changed from Check if we should add an inactive check to get_recovery_priority() to Give recovery for inactive PGs a higher priority
  • Pull request ID set to 27503

#2 Updated by David Zafman about 2 years ago

  • Status changed from New to In Progress

#3 Updated by David Zafman about 2 years ago

Checking acting.size() < pool.info.min_size is wrong. During recovery acting == up, so if acting.size() < pool.info.min_size, the result of the recovery will still leave too few replicas to allow I/O. This would be a very uncommon case, and it doesn't help us prioritize that recovery. Also, recovery allows simultaneous client operations, because even if the data hasn't transferred yet, those operations can be logged on all OSDs.

I'm not sure how this interacts with pg log size limiting. If recovery is slow due to large objects and clients send lots of small operations, the count of log entries will keep growing. Will log trimming slow client requests so the log doesn't grow unbounded?

When recovery is degraded, we don't differentiate whether or not there are at least min_size replicas of every object. In the below-min_size case we would want a priority boost, but I'm not sure of the best way to distinguish between the two cases. Also, as recovery proceeds the degraded count goes down; but at the point where example 1 has 100 degraded objects like example 2, that is really 100 objects on OSD 1 and 50 objects each on OSDs 3 and 4. So in actual fact 50 objects are still below min_size, assuming we push the same object to both OSDs as recovery proceeds.

Example 1, size 3 min_size 2 replicated pool
Should get a priority boost, because 100 objects have 200 degraded copies (below min_size, with only 100 replicas existing)
PG 1.0 [1, 0, 2] active+clean
OSDs 0 and 2 go down and out
PG 1.0 [1, 3, 4] active+recovering+degraded

Example 2, size 3 min_size 2 replicated pool
No priority boost, because 100 objects have 100 degraded copies (min_size 2 satisfied, with 200 replicas existing)
PG 1.0 [1, 0, 2] active+clean
OSD 2 goes down and out
PG 1.0 [1, 0, 3] active+recovering+degraded
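The arithmetic behind the two examples can be sketched as a toy model (not Ceph code; the function names are illustrative, and it assumes whole replicas are missing uniformly, as at the start of recovery):

```python
def degraded_count(num_objects: int, missing_replicas: int) -> int:
    # Each object is counted as degraded once per missing replica.
    return num_objects * missing_replicas

def objects_below_min_size(num_objects: int, size: int, min_size: int,
                           missing_replicas: int) -> int:
    # Every object has (size - missing_replicas) complete copies; if that
    # falls below min_size, I/O on those objects would be blocked.
    remaining = size - missing_replicas
    return num_objects if remaining < min_size else 0

# Example 1: size 3, min_size 2, OSDs 0 and 2 down/out -> 2 replicas missing
print(degraded_count(100, 2))            # 200 degraded
print(objects_below_min_size(100, 3, 2, 2))  # 100 objects below min_size -> boost

# Example 2: size 3, min_size 2, only OSD 2 down/out -> 1 replica missing
print(degraded_count(100, 1))            # 100 degraded
print(objects_below_min_size(100, 3, 2, 1))  # 0 objects below min_size -> no boost
```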

#4 Updated by David Zafman about 2 years ago

  • Related to Bug #38930: ceph osd safe-to-destroy wrongly approves any out osd added

#5 Updated by David Zafman about 2 years ago

  • Related to Feature #39339: prioritize backfill of metadata pools, automatically added

#6 Updated by David Zafman about 2 years ago

  • Status changed from In Progress to Pending Backport
  • Backport set to luminous, mimic, nautilus

#7 Updated by David Zafman about 2 years ago

  • Precedes Bug #35808: ceph osd ok-to-stop result doesn't match the real situation added

#8 Updated by Nathan Cutler about 2 years ago

  • Copied to Backport #39504: nautilus: Give recovery for inactive PGs a higher priority added

#9 Updated by Nathan Cutler about 2 years ago

  • Copied to Backport #39505: luminous: Give recovery for inactive PGs a higher priority added

#10 Updated by Nathan Cutler about 2 years ago

  • Copied to Backport #39506: mimic: Give recovery for inactive PGs a higher priority added

#11 Updated by Neha Ojha about 2 years ago

David, we should discuss whether we want to backport this all the way to luminous or just to nautilus.

#12 Updated by David Zafman about 2 years ago

  • Related to Documentation #39011: Document how get_recovery_priority() and get_backfill_priority() impacts recovery order added

#13 Updated by Nathan Cutler about 2 years ago

  • Duplicated by Bug #38930: ceph osd safe-to-destroy wrongly approves any out osd added

#14 Updated by Nathan Cutler about 2 years ago

  • Related to deleted (Bug #38930: ceph osd safe-to-destroy wrongly approves any out osd)

#15 Updated by David Zafman over 1 year ago

  • Status changed from Pending Backport to Resolved
  • Backport changed from luminous, mimic, nautilus to nautilus
