Bug #23117

PGs stuck in "activating" after osd_max_pg_per_osd_hard_ratio has been exceeded once

Added by Oliver Freyermuth about 6 years ago. Updated over 1 year ago.

Status:
Fix Under Review
Priority:
High
Assignee:
Category:
Administration/Usability
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Monitor, OSD
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

In the following setup:
  • 6 OSD hosts
  • Each host with 32 disks = 32 OSDs
  • Pool with 2048 PGs, EC, k=4, m=2, crush failure domain host

When the 6th host is reinstalled and the first OSD is created on it, PG overdose protection kicks in immediately,
since all PGs need to place a shard on the 6th host and initially all of those shards map to that single OSD.
As a result, the PGs enter the "activating" state and get stuck there (see the rough math below).
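As a rough back-of-the-envelope sketch of why the limit is hit (assuming the defaults of the time, mon_max_pg_per_osd = 200 and osd_max_pg_per_osd_hard_ratio = 2; exact defaults differ between releases):

  2048 PGs * 6 shards (k=4, m=2)             = 12288 PG shards in the pool
  spread over 6 hosts * 32 OSDs = 192 OSDs   -> roughly 64 PG shards per OSD in the healthy case
  only 1 OSD on the reinstalled host         -> all 2048 PGs map their host-local shard to that one OSD
  hard limit per OSD: 200 * 2                = 400
  2048 > 400, so the OSD refuses to accept the new PG shards and peering stalls in "activating".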

However, even after all 32 OSDs have been added on the 6th host, the PGs remain stuck in "activating" and the data stays unavailable.
This situation does not resolve by itself.

This issue can be resolved by setting:

osd_max_pg_per_osd_hard_ratio = 32

before the redeployment of a host, thus effectively turning off overdose protection.
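As a minimal sketch of the workaround (assuming the option is placed in the [osd] or [global] section of ceph.conf on the OSD hosts before the OSDs start; 32 is simply chosen large enough to make the limit irrelevant in practice):

[osd]
osd_max_pg_per_osd_hard_ratio = 32

Depending on the release, injecting the value into already-running OSDs may also unstick the PGs:

# ceph tell osd.* injectargs '--osd_max_pg_per_osd_hard_ratio 32'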

For one example PG in the stuck state:
# ceph pg dump all | grep 2.7f6
dumped all
2.7f6     38086                  0    38086         0       0 2403961148 1594     1594           activating+undersized+degraded+remapped 2018-02-24 19:50:01.654185 39755'134350  39946:274873  [153,6,42,95,115,167]        153 [153,NONE,42,95,115,167]            153 39559'109078 2018-02-24 04:01:57.991376     36022'53756 2018-02-22 18:03:40.386421             0 
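Note that the up set in this line is [153,6,42,95,115,167] while the acting set is [153,NONE,42,95,115,167], i.e. the shard that should land on the freshly created osd.6 never becomes active. For a closer look at why a PG is stuck, it can also be queried directly (a sketch; the exact output layout varies between releases):

# ceph pg 2.7f6 query

The "recovery_state" section of the output shows the peering state and any OSDs blocking activation.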

I have uploaded OSD logs from all involved OSDs:
  • c3953bf7-b482-4705-a7a3-df354453a933 for osd.6 (which was reinstalled, so maybe this is irrelevant)
  • 833c07e2-09ff-409c-b68f-1a87e7bfc353 for osd.4, which was the first OSD reinstalled on the new OSD host, so it should have been affected by overdose protection
  • cb146d33-e6cb-4c84-8b15-543728bbc5dd for osd.42
  • f716a2d1-e7ef-46d7-b4fc-dfc440e6fe59 for osd.95
  • fc7ec27a-82c9-4fb4-94dc-5dd64335e3b4 for osd.115
  • 51213f5f-1b91-42b0-8c0c-8acf3622195f for osd.153
  • 3d67f227-4dba-4c93-9fe1-7951d3d32f30 for osd.167

I have also uploaded the ceph.conf of osd001, which was the reinstalled OSD host:
64744f9a-e136-40f9-a392-4a6f1b34a74e
All other OSD hosts have

osd_max_pg_per_osd_hard_ratio = 32

set (which prevents the issue).

Additionally, I have uploaded all OSD logs of the reinstalled osd001 machine:
38ddd08f-6c66-4a88-8e83-f4eff0ae5d10
(so this includes osd.4 and osd.6 already linked above).


Related issues (1 open, 1 closed)

Related to RADOS - Bug #48298: hitting mon_max_pg_per_osd right after creating OSD, then decreases slowly (New)

Is duplicate of RADOS - Bug #57185: EC 4+2 PG stuck in activating+degraded+remapped (Duplicate)

