Bug #9111
PG stuck with 'active+remapped' forever with cluster-wide change (add/remove OSDs)
Status:
Won't Fix
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:
0%
Source:
Community (dev)
Tags:
Backport:
Regression:
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
After adding/removing OSDs, some PGs stuck with 'active+remapped' forever.
1. ceph -s
-bash-4.1$ ceph -s
    cluster 78c20789-deb2-4ec9-918f-5867d59fdd91
     health HEALTH_WARN 16 pgs degraded; 1 pgs incomplete; 1 pgs stuck inactive; 50 pgs stuck unclean; recovery 17706/6171055 objects degraded (0.287%)
     monmap e1: 1 mons at {osd140=10.193.207.72:6789/0}, election epoch 1, quorum 0 osd140
     osdmap e737: 29 osds: 29 up, 29 in
      pgmap v42187: 632 pgs, 9 pools, 2191 GB data, 548 kobjects
            2966 GB used, 9038 GB / 12004 GB avail
            17706/6171055 objects degraded (0.287%)
                 582 active+clean
                   1 incomplete
                  16 active+degraded
                  33 active+remapped
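For reference (not part of the original report), the stuck PGs can also be listed directly instead of picking them out of the full status; a generic sketch, output will vary by cluster:
-bash-4.1$ ceph health detail
-bash-4.1$ ceph pg dump_stuck unclean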
2. ceph osd tree
-bash-4.1$ ceph osd tree
# id    weight  type name       up/down reweight
-1      12.8    root default
-2      4.4             host osd141
1       0.4                     osd.1   DNE
3       0.4                     osd.3   up      1
6       0.4                     osd.6   up      1
7       0.4                     osd.7   up      1
8       0.4                     osd.8   up      1
9       0.4                     osd.9   up      1
10      0.4                     osd.10  up      1
11      0.4                     osd.11  up      1
12      0.4                     osd.12  up      1
13      0.4                     osd.13  up      1
0       0.4                     osd.0   DNE
-3      4               host osd142
2       0.4                     osd.2   up      1
14      0.4                     osd.14  up      1
15      0.4                     osd.15  up      1
16      0.4                     osd.16  up      1
17      0.4                     osd.17  up      1
18      0.4                     osd.18  up      1
19      0.4                     osd.19  up      1
20      0.4                     osd.20  up      1
21      0.4                     osd.21  up      1
22      0.4                     osd.22  up      1
-4      4.4             host osd143
4       0.4                     osd.4   DNE
5       0.4                     osd.5   up      1
23      0.4                     osd.23  up      1
24      0.4                     osd.24  up      1
25      0.4                     osd.25  up      1
26      0.4                     osd.26  up      1
27      0.4                     osd.27  up      1
28      0.4                     osd.28  up      1
29      0.4                     osd.29  up      1
30      0.4                     osd.30  up      1
31      0.4                     osd.31  up      1
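Not from the original report: the DNE entries are OSDs that were removed from the OSD map but are still referenced by the CRUSH map. One way to confirm which ids are in that state is to compare the two views (generic commands, not taken from this cluster):
-bash-4.1$ ceph osd ls                        # ids known to the osdmap
-bash-4.1$ ceph osd crush dump | grep '"id"'  # items still present in CRUSH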
3. ceph pg dump | grep remapped
17.f4 1162 0 0 0 4873781248 1162 1162 active+remapped 2014-08-14 07:40:55.607346 706'1162 737:4834 [29,8,13,2,18,16,22,17,2147483647,28,26] 29 [29,8,13,2,18,16,22,17,15,28,26] 29 706'1162 2014-08-13 10:08:32.270670 0'0 2014-08-11 07:31:38.793882
17.cb 1084 0 0 0 4546625536 1084 1084 active+remapped 2014-08-14 07:40:55.470562 706'1084 737:6710 [31,11,18,2147483647,13,19,3,14,5,23,8] 31 [31,11,18,6,13,19,3,14,5,23,8] 31 706'1084 2014-08-13 12:28:46.357220 0'0 2014-08-11 07:31:38.789353
17.dd 1066 0 1066 0 4471128064 1066 1066 active+remapped 2014-08-14 08:31:39.123884 706'1066 737:9773 [28,2147483647,31,2,21,7,26,27,14,30,12] 28 [28,23,31,2,21,7,26,27,14,30,12] 28 706'1066 2014-08-13 12:10:10.342203 0'0 2014-08-11 07:31:38.803181
17.db 1066 0 1066 0 4471128064 1066 1066 active+remapped 2014-08-14 08:30:55.143277 706'1066 737:9809 [20,11,6,3,10,26,21,15,13,2147483647,2] 20 [20,11,6,3,10,26,21,15,13,28,2] 20 706'1066 2014-08-13 10:31:58.141533 0'0 2014-08-11 07:31:38.797543
17.a8 1095 0 2190 0 4592762880 1095 1095 active+remapped 2014-08-14 08:17:44.932635 706'1095 737:11090 [12,2147483647,9,11,10,23,3,26,6,27,13] 12 [12,22,9,11,10,23,3,26,6,27,13] 12 706'1095 2014-08-13 08:36:39.518331 0'0 2014-08-11 07:31:36.788551
17.88 1098 0 0 0 4605345792 1098 1098 active+remapped 2014-08-14 08:08:31.066301 706'1098 737:2206 [28,27,29,14,31,10,19,2147483647,5,9,17] 28 [28,27,29,14,31,10,19,11,5,9,17] 28 706'1098 2014-08-13 12:04:21.338467 0'0 2014-08-11 07:31:39.871384
17.9f 1098 0 0 0 4605345792 1098 1098 active+remapped 2014-08-14 07:40:55.064181 706'1098 737:6770 [6,20,18,12,17,24,11,10,7,23,2147483647] 6 [6,20,18,12,17,24,11,10,7,23,5] 6 706'1098 2014-08-13 10:03:41.250542 0'0 2014-08-11 07:31:38.808509
Note that the up set here contains the placeholder value 2147483647 (0x7fffffff, CRUSH_ITEM_NONE), which CRUSH emits when it fails to fill a slot in the mapping.
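As an aside (not from the original report), the hole in the up set can also be seen by querying a single PG; a sketch using one of the PG ids above:
-bash-4.1$ ceph pg map 17.f4
-bash-4.1$ ceph pg 17.f4 query | less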
4. ceph -v
ceph version 0.82-456-g276dbfc (276dbfc4cbfe56e03615b3c387b5cbbebf21a1bc)
Updated by Guang Yang over 9 years ago
Right after I filed this bug I got some clue: the problem came from the removed OSDs (the ones with status DNE). After I did a crush remove for those OSDs, it was good. Is this behavior expected?
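For anyone hitting the same state, the cleanup described above is roughly the following (a sketch; osd.0, osd.1 and osd.4 are the DNE ids from the tree output above, adjust to your cluster):
-bash-4.1$ ceph osd crush remove osd.0
-bash-4.1$ ceph osd crush remove osd.1
-bash-4.1$ ceph osd crush remove osd.4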
Updated by Samuel Just over 9 years ago
Ah, I think it's your very wide EC stripe. Try increasing total retries on the crush rule for that pool.
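A sketch of one way to do that (generic commands, not verified on this cluster): export and decompile the CRUSH map, raise the retry count, then recompile and inject it. The retry count can be raised either via the global choose_total_tries tunable or with a set_choose_tries step inside the pool's rule:
-bash-4.1$ ceph osd getcrushmap -o crushmap.bin
-bash-4.1$ crushtool -d crushmap.bin -o crushmap.txt
  (edit crushmap.txt: e.g. set "tunable choose_total_tries 100",
   or add "step set_choose_tries 100" as the first step of the EC pool's rule)
-bash-4.1$ crushtool -c crushmap.txt -o crushmap.new
-bash-4.1$ ceph osd setcrushmap -i crushmap.new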
Updated by Samuel Just over 9 years ago
We probably want to add a heuristic that notices if a pool might have this problem and points the user at a doc page.