Bug #18145 (Closed)
pgs stuck in remapped after recovery on cluster with many osds down
Description
During a maintenance event in which many osds were removed and some existing osds were down, the cluster went into recovery. Once recovery had completed, ten pgs were left stuck in the "remapped" state; only restarting the primaries involved resolved the issue and allowed these pgs to complete peering successfully.
Updated by Brad Hubbard over 7 years ago
Correction: the issue was resolved not by restarting the primaries but merely by marking them down to force peering to restart.
Updated by Samuel Just over 7 years ago
pg_stat state up up_primary acting acting_primary
13.0 remapped [214,224,120] 214 [214,120] 214
40.105f remapped [256,248,124] 256 [124,258] 124
40.3abb remapped [202,51,218] 202 [202,51] 202
11.2 remapped [214,224,120] 214 [214,120] 214
40.32b8 remapped [198,144,204] 198 [198,144] 198
40.507 remapped [91,205,197] 91 [91,205] 91
40.345b remapped [177,65,120] 177 [65,120] 65
40.31fa remapped [58,169,224] 58 [58,169] 58
40.13f remapped [185,15,174] 185 [185,15] 185
40.13b4 remapped [74,164,247] 74 [74,164] 74
pg_temp 11.2 [214,120,261]
pg_temp 13.0 [214,120,261]
pg_temp 40.13f [185,15,261]
pg_temp 40.507 [91,205,261]
pg_temp 40.105f [124,258,261]
pg_temp 40.13b4 [74,164,261]
pg_temp 40.31fa [58,169,261]
pg_temp 40.32b8 [198,144,261]
pg_temp 40.345b [65,120,261]
pg_temp 40.3abb [202,51,261]
We can infer that osd.261 is down (it is present in pg_temp but not in any acting set). state = remapped is interesting: if these pgs were in Peering, the state would include +peering. Hmm
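The inference above can be sketched directly from the pasted mappings. The dictionaries below are transcribed from the output in this report (a minimal standalone sketch, not Ceph code): an osd that is requested via pg_temp but never appears in any acting set is not actually serving, i.e. it is down.

```python
# pg_temp mappings transcribed from the report above.
pg_temp = {
    "11.2":    [214, 120, 261],
    "13.0":    [214, 120, 261],
    "40.13f":  [185, 15, 261],
    "40.507":  [91, 205, 261],
    "40.105f": [124, 258, 261],
    "40.13b4": [74, 164, 261],
    "40.31fa": [58, 169, 261],
    "40.32b8": [198, 144, 261],
    "40.345b": [65, 120, 261],
    "40.3abb": [202, 51, 261],
}

# acting sets transcribed from the pg_stat table above.
acting = {
    "13.0":    [214, 120],
    "40.105f": [124, 258],
    "40.3abb": [202, 51],
    "11.2":    [214, 120],
    "40.32b8": [198, 144],
    "40.507":  [91, 205],
    "40.345b": [65, 120],
    "40.31fa": [58, 169],
    "40.13f":  [185, 15],
    "40.13b4": [74, 164],
}

# osds requested via pg_temp but never serving in an acting set
suspect_down = set().union(*pg_temp.values()) - set().union(*acting.values())
print(suspect_down)  # → {261}
```

Running this against the pasted data leaves exactly one osd id, 261, consistent with the inference that osd.261 is down.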
Updated by Brad Hubbard about 7 years ago
- Status changed from New to Need More Info
An extensive code review failed to reveal how this might eventuate.
Without more information I don't think we can proceed further.
Updated by Brad Hubbard about 7 years ago
- Status changed from Need More Info to Can't reproduce