Feature #55303

pg autoscaler: only warn about changes that will take many days

Added by Dan van der Ster 10 months ago. Updated 7 months ago.

Status: New
Priority: Urgent
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport: pacific,quincy
Reviewed:
Affected Versions:
Pull request ID:

Description

Motivation: https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/Z2UPETNARNVEPTYYA5Q6J5QBCUWKWTZ2/

We just upgraded our 640 OSD cluster to Ceph 16.2.7 and the resulting rebalancing of
misplaced objects is overwhelming the cluster and impacting MON DB compaction, deep scrub
repairs and us upgrading legacy bluestore OSDs. We have to pause the rebalancing of
misplaced objects or we're going to fall over.

Autoscaler-status tells us that we are reducing our PGs by 700'ish, which will take us
over 100 days to complete at our current recovery speed.

The autoscaler should not trigger such changes behind the back of the operator.
I propose that it should estimate the amount of time to carry out a split or merge operation, and only "warn" if that operation would take longer than a day (configurable), even if autoscale_mode is on.
Seeing the HEALTH_WARN, the operator can then schedule and carry out the pg split or merge at a time that suits their operations.
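
To make the proposal concrete, here is a minimal Python sketch of the decision described above. It is not the pg_autoscaler module's actual code: the names PgChange, estimate_days and decide, the max_days threshold, and the idea of measuring progress as "PGs split or merged per day" are all assumptions made for illustration. The numbers mirror the report above (roughly 700 PGs to merge, roughly 100 days).

```python
from dataclasses import dataclass


@dataclass
class PgChange:
    """Hypothetical description of a pending pg_num change for one pool."""
    pool: str
    current_pg_num: int
    target_pg_num: int


def estimate_days(change: PgChange, pgs_per_day: float) -> float:
    """Rough duration estimate: PGs to split or merge divided by an observed rate.

    pgs_per_day is an assumed input; a real estimate would have to be
    derived from the cluster's measured recovery throughput.
    """
    if pgs_per_day <= 0:
        return float("inf")  # no progress observed, so never auto-apply
    return abs(change.target_pg_num - change.current_pg_num) / pgs_per_day


def decide(change: PgChange, pgs_per_day: float, max_days: float = 1.0) -> str:
    """Return "apply" for short operations, "warn" for long ones.

    max_days stands in for the configurable threshold proposed above.
    """
    days = estimate_days(change, pgs_per_day)
    return "warn" if days > max_days else "apply"


# Roughly the situation in the report: about 700 PGs to merge at about 7 PGs per day.
big = PgChange(pool="data", current_pg_num=2048, target_pg_num=1348)
small = PgChange(pool="meta", current_pg_num=32, target_pg_num=36)

print(decide(big, pgs_per_day=7.0))    # long merge: only raise a warning
print(decide(small, pgs_per_day=7.0))  # short split: apply automatically
```

With a threshold like this, a change on the scale of the one in the report would surface only as a HEALTH_WARN, while small adjustments would still be applied automatically.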

History

#1 Updated by Neha Ojha 10 months ago

  • Assignee set to Kamoltat (Junior) Sirivadhna

#2 Updated by Radoslaw Zarzynski 7 months ago

  • Tracker changed from Bug to Feature
