Bug #18145

closed

pgs stuck in remapped after recovery on cluster with many osds down

Added by Brad Hubbard over 7 years ago. Updated about 7 years ago.

Status:
Can't reproduce
Priority:
High
Assignee:
Category:
OSD
Target version:
-
% Done:

0%

Source:
Support
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

During a maintenance event in which many osds were removed and some existing osds were down, the cluster went into recovery. Once recovery had completed, ten pgs were left stuck in the "remapped" state. Only a restart of the primaries involved resolved the issue and allowed these pgs to complete peering successfully.

Actions #1

Updated by Brad Hubbard over 7 years ago

Correction: the issue was resolved not by restarting the primaries but merely by marking them down, which forced peering to restart.

Actions #2

Updated by Samuel Just over 7 years ago

pg_stat  state     up             up_primary  acting     acting_primary
13.0     remapped  [214,224,120]  214         [214,120]  214
40.105f  remapped  [256,248,124]  256         [124,258]  124
40.3abb  remapped  [202,51,218]   202         [202,51]   202
11.2     remapped  [214,224,120]  214         [214,120]  214
40.32b8  remapped  [198,144,204]  198         [198,144]  198
40.507   remapped  [91,205,197]   91          [91,205]   91
40.345b  remapped  [177,65,120]   177         [65,120]   65
40.31fa  remapped  [58,169,224]   58          [58,169]   58
40.13f   remapped  [185,15,174]   185         [185,15]   185
40.13b4  remapped  [74,164,247]   74          [74,164]   74

pg_temp 11.2 [214,120,261]
pg_temp 13.0 [214,120,261]
pg_temp 40.13f [185,15,261]
pg_temp 40.507 [91,205,261]
pg_temp 40.105f [124,258,261]
pg_temp 40.13b4 [74,164,261]
pg_temp 40.31fa [58,169,261]
pg_temp 40.32b8 [198,144,261]
pg_temp 40.345b [65,120,261]
pg_temp 40.3abb [202,51,261]

We can infer that 261 is down (it is in pg_temp, but not in acting). The state being plain "remapped" is interesting: if these pgs were in Peering, the state would include +peering. Hmm.
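
The comparison behind that inference can be reproduced with a short script. This is only a sketch: the data is copied by hand from the two listings in this comment, the names `PGS` and `missing_from_acting` are made up for illustration, and the script only computes which osds appear in pg_temp but not in acting. That such an osd is down remains the inference made above, not something the script checks.

```python
# Data copied by hand from the listings in this comment:
# pg id -> (up set, acting set, pg_temp entry).
PGS = {
    "13.0":    ([214, 224, 120], [214, 120], [214, 120, 261]),
    "40.105f": ([256, 248, 124], [124, 258], [124, 258, 261]),
    "40.3abb": ([202, 51, 218],  [202, 51],  [202, 51, 261]),
    "11.2":    ([214, 224, 120], [214, 120], [214, 120, 261]),
    "40.32b8": ([198, 144, 204], [198, 144], [198, 144, 261]),
    "40.507":  ([91, 205, 197],  [91, 205],  [91, 205, 261]),
    "40.345b": ([177, 65, 120],  [65, 120],  [65, 120, 261]),
    "40.31fa": ([58, 169, 224],  [58, 169],  [58, 169, 261]),
    "40.13f":  ([185, 15, 174],  [185, 15],  [185, 15, 261]),
    "40.13b4": ([74, 164, 247],  [74, 164],  [74, 164, 261]),
}


def missing_from_acting(pgs):
    """Return {osd: [pg ids]} for osds listed in pg_temp but absent from acting."""
    missing = {}
    for pgid, (_up, acting, pg_temp) in sorted(pgs.items()):
        for osd in pg_temp:
            if osd not in acting:
                missing.setdefault(osd, []).append(pgid)
    return missing


if __name__ == "__main__":
    for osd, pgids in missing_from_acting(PGS).items():
        print(f"osd.{osd}: in pg_temp but not acting for {len(pgids)} pgs")
```

For the ten pgs listed here, the only osd that shows up in pg_temp without being in acting is 261, and it does so for every one of them.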

Actions #3

Updated by Samuel Just over 7 years ago

  • Priority changed from Normal to High
Actions #4

Updated by Brad Hubbard about 7 years ago

  • Status changed from New to Need More Info

An extensive code review failed to reveal how this might eventuate.

Without more information I don't think we can proceed further.

Actions #5

Updated by Brad Hubbard about 7 years ago

  • Status changed from Need More Info to Can't reproduce