Bug #49364 (closed)

pg_autoscaler causes device_health_metrics pool to use 128 pgs preventing rgw from being deployed

Added by Daniel Pivonka about 3 years ago. Updated over 2 years ago.

Status: Resolved
Priority: Urgent
% Done: 0%
Backport: pacific, octopus
Regression: No
Severity: 1 - critical
Pull request ID: 43999

Description

When trying to deploy rgw on a master or pacific cluster, the device_health_metrics pool is using 128 PGs. When you try to set up an rgw service, it attempts to create its pools, but this fails because there are not enough PGs left to stay under the mon_max_pg_per_osd limit.

This could be the result of changes to the pg_autoscaler: https://github.com/ceph/ceph/pull/38805 and https://github.com/ceph/ceph/pull/39248
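
For illustration, here is a minimal sketch of the budget math behind this failure, assuming mon_max_pg_per_osd at its recent default of 250; the pool list, replica sizes, and the pg_replicas_per_osd helper below are hypothetical, not Ceph code:

    # Hypothetical sketch: why creating the rgw pools trips mon_max_pg_per_osd.
    # Only mon_max_pg_per_osd is a real Ceph option; everything else is made up.
    MON_MAX_PG_PER_OSD = 250

    def pg_replicas_per_osd(pools, num_osds):
        """Projected PG replicas per OSD: sum of pg_num * size over all pools, spread across OSDs."""
        return sum(pg_num * size for pg_num, size in pools) / num_osds

    # Small 3-OSD cluster where the autoscaler pushed device_health_metrics to 128 PGs.
    existing = [(128, 3)]                  # (pg_num, size) for device_health_metrics
    rgw = [(32, 3)] * 4                    # default pools an rgw deployment tries to create

    before = pg_replicas_per_osd(existing, num_osds=3)        # 128 * 3 / 3 = 128
    after = pg_replicas_per_osd(existing + rgw, num_osds=3)   # (384 + 384) / 3 = 256

    print(before, after, MON_MAX_PG_PER_OSD)                  # 128.0 256.0 250
    if after > MON_MAX_PG_PER_OSD:
        print("pool creation refused: projected PGs per OSD exceed mon_max_pg_per_osd")

With device_health_metrics already at 128 PGs, even four modest 32-PG rgw pools push the projection past the limit, matching the reported failure mode.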


Files

LXKPcmdN.txt (3.14 KB), Daniel Pivonka, 02/18/2021 07:45 PM

Related issues (2: 0 open, 2 closed)

Copied to mgr - Backport #52519: pacific: pg_autoscaler causes device_health_metrics pool to use 128 pgs preventing rgw from being deployed (Resolved, Kamoltat (Junior) Sirivadhna)
Copied to mgr - Backport #52520: octopus: pg_autoscaler causes device_health_metrics pool to use 128 pgs preventing rgw from being deployed (Rejected)
#1 Updated by Neha Ojha about 3 years ago

reverting in octopus for now: https://github.com/ceph/ceph/pull/39560

#2 Updated by Neha Ojha about 3 years ago

  • Project changed from Ceph to mgr
  • Subject changed from device_health_metrics pool is using 128 pgs preventing rgw from being deployed to pg_autoscaler causes device_health_metrics pool to use 128 pgs preventing rgw from being deployed
#3 Updated by Neha Ojha about 3 years ago

  • Priority changed from Normal to Urgent
#4 Updated by Yuri Weinstein about 3 years ago

Neha Ojha wrote:

reverting in octopus for now: https://github.com/ceph/ceph/pull/39560

merged

#5 Updated by Kamoltat (Junior) Sirivadhna about 3 years ago

  • Assignee set to Kamoltat (Junior) Sirivadhna
#6 Updated by Daniel Pivonka about 3 years ago

  • Blocks Bug #49435: cephadm: rgw not getting deployed due to HEALTH_WARN added
#7 Updated by Neha Ojha about 3 years ago

  • Backport set to pacific, octopus, nautilus
#8 Updated by Juan Miguel Olmo Martínez about 3 years ago

  • Severity changed from 3 - minor to 1 - critical
#9 Updated by Juan Miguel Olmo Martínez about 3 years ago

Not being able to deploy/use any Ceph service is a critical issue.

#10 Updated by Neha Ojha about 3 years ago

  • Status changed from New to Fix Under Review
  • Pull request ID set to 39833
#11 Updated by Neha Ojha about 3 years ago

reverting the original patch in pacific for now https://github.com/ceph/ceph/pull/39921

#12 Updated by Sebastian Wagner about 3 years ago

  • Blocks deleted (Bug #49435: cephadm: rgw not getting deployed due to HEALTH_WARN)
#13 Updated by Loïc Dachary over 2 years ago

  • Status changed from Fix Under Review to Pending Backport
  • Backport changed from pacific, octopus, nautilus to pacific, octopus
#14 Updated by Backport Bot over 2 years ago

  • Copied to Backport #52519: pacific: pg_autoscaler causes device_health_metrics pool to use 128 pgs preventing rgw from being deployed added
#15 Updated by Backport Bot over 2 years ago

  • Copied to Backport #52520: octopus: pg_autoscaler causes device_health_metrics pool to use 128 pgs preventing rgw from being deployed added
#16 Updated by Kamoltat (Junior) Sirivadhna over 2 years ago

  • Status changed from Pending Backport to Resolved
  • Pull request ID changed from 39833 to 43999

https://github.com/ceph/ceph/pull/43999 resolved the issue in master and pacific; the PR ensures that newly created default meta pools won't scale their PGs up by a large amount and block the deployment of other services. For octopus, we have reverted the scale-down feature (https://github.com/ceph/ceph/pull/39560), so there won't be a problem there either.
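
For context only, a hypothetical sketch of the capping idea that comment describes; this is not the code from PR 43999, and the function name, pool list, and cap value are invented:

    # Hypothetical illustration of clamping autoscaler suggestions for
    # built-in meta pools; not the actual change in ceph/ceph#43999.
    META_POOL_PG_CAP = 32  # assumed small cap for pools like device_health_metrics

    def clamp_suggestion(pool_name, suggested_pg_num, meta_pools=("device_health_metrics",)):
        """Keep built-in meta pools from scaling to a PG count that starves other pools."""
        if pool_name in meta_pools:
            return min(suggested_pg_num, META_POOL_PG_CAP)
        return suggested_pg_num

    assert clamp_suggestion("device_health_metrics", 128) == 32
    assert clamp_suggestion("default.rgw.buckets.data", 128) == 128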
