Bug #18145: pgs stuck in remapped after recovery on cluster with many osds down - Ceph - Ceph

closed

Added by Brad Hubbard over 7 years ago. Updated about 7 years ago.

Status: Can't reproduce
Priority: High
Assignee: Brad Hubbard
Category: OSD
Target version:
% Done:
Source: Support
Tags:
Backport:
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions: v0.94.3
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

During a maintenance event in which many OSDs were removed and some existing OSDs were down, the cluster went into recovery. Once recovery had completed, ten PGs were left stuck in the "remapped" state. Only a restart of the primaries involved resolved the issue and allowed these PGs to complete peering successfully.

Updated by Brad Hubbard over 7 years ago

Correction: the issue was resolved not by restarting the primaries but merely by marking them down, which forced peering to restart.
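
For illustration, the workaround above could be scripted roughly as follows. This is a minimal sketch, not the procedure actually used on the affected cluster: it only builds the `ceph osd down` command line and does not run it. The helper name `build_mark_down_command` and the example OSD ids are assumptions made for this sketch.

```python
def build_mark_down_command(osd_ids):
    """Return the argv list for `ceph osd down` covering the given OSD ids.

    Hypothetical helper for illustration; it does not execute anything.
    """
    if not osd_ids:
        raise ValueError("at least one OSD id is required")
    # Drop duplicate ids while keeping the original order, since one OSD
    # can be the primary for several PGs.
    unique_ids = list(dict.fromkeys(osd_ids))
    return ["ceph", "osd", "down"] + [str(osd_id) for osd_id in unique_ids]


# Example ids only (assumed); substitute the primaries of the stuck PGs.
cmd = build_mark_down_command([214, 124, 214])
print(" ".join(cmd))  # prints "ceph osd down 214 124"
```

An OSD whose daemon is still running is expected to report itself up again after being marked down, which is what restarts peering without a daemon restart.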

Updated by Samuel Just over 7 years ago

pg_stat   state     up             up_primary  acting     acting_primary
13.0      remapped  [214,224,120]  214         [214,120]  214
40.105f   remapped  [256,248,124]  256         [124,258]  124
40.3abb   remapped  [202,51,218]   202         [202,51]   202
11.2      remapped  [214,224,120]  214         [214,120]  214
40.32b8   remapped  [198,144,204]  198         [198,144]  198
40.507    remapped  [91,205,197]   91          [91,205]   91
40.345b   remapped  [177,65,120]   177         [65,120]   65
40.31fa   remapped  [58,169,224]   58          [58,169]   58
40.13f    remapped  [185,15,174]   185         [185,15]   185
40.13b4   remapped  [74,164,247]   74          [74,164]   74

pg_temp 11.2 [214,120,261]
pg_temp 13.0 [214,120,261]
pg_temp 40.13f [185,15,261]
pg_temp 40.507 [91,205,261]
pg_temp 40.105f [124,258,261]
pg_temp 40.13b4 [74,164,261]
pg_temp 40.31fa [58,169,261]
pg_temp 40.32b8 [198,144,261]
pg_temp 40.345b [65,120,261]
pg_temp 40.3abb [202,51,261]

We can infer that osd.261 is down: it appears in every pg_temp entry but in none of the acting sets. The state being plain "remapped" is interesting. If the PGs were in Peering, the state would include +peering. Hmm.
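
The inference about 261 can be checked mechanically. The following is a small sketch that copies the acting sets and pg_temp entries from the listings above into plain dictionaries and reports, for each PG, which OSDs are in pg_temp but not in acting. The names `ACTING`, `PG_TEMP`, and `temp_only_osds` are made up for this example and are not Ceph APIs.

```python
# Acting sets copied from the pg_stat listing above.
ACTING = {
    "13.0": [214, 120],
    "40.105f": [124, 258],
    "40.3abb": [202, 51],
    "11.2": [214, 120],
    "40.32b8": [198, 144],
    "40.507": [91, 205],
    "40.345b": [65, 120],
    "40.31fa": [58, 169],
    "40.13f": [185, 15],
    "40.13b4": [74, 164],
}

# pg_temp entries copied from the listing above.
PG_TEMP = {
    "11.2": [214, 120, 261],
    "13.0": [214, 120, 261],
    "40.13f": [185, 15, 261],
    "40.507": [91, 205, 261],
    "40.105f": [124, 258, 261],
    "40.13b4": [74, 164, 261],
    "40.31fa": [58, 169, 261],
    "40.32b8": [198, 144, 261],
    "40.345b": [65, 120, 261],
    "40.3abb": [202, 51, 261],
}


def temp_only_osds(acting, pg_temp):
    """Map each PG to the OSDs listed in its pg_temp entry but absent from acting."""
    return {
        pgid: [osd for osd in temp if osd not in acting[pgid]]
        for pgid, temp in pg_temp.items()
    }


missing = temp_only_osds(ACTING, PG_TEMP)
# OSDs that are missing from the acting set of every stuck PG.
common = set.intersection(*(set(osds) for osds in missing.values()))
print(common)  # prints {261}
```

For all ten PGs the only OSD in pg_temp but not in acting is 261, which matches the observation above.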
