Bug #9111 (closed)

PG stuck in 'active+remapped' forever after a cluster-wide change (add/remove OSDs)

Added by Guang Yang over 9 years ago. Updated over 9 years ago.

Status: Won't Fix
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: Community (dev)
Tags:
Backport:
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

After adding/removing OSDs, some PGs remain stuck in 'active+remapped' forever.

1. ceph -s
-bash-4.1$ ceph -s

    cluster 78c20789-deb2-4ec9-918f-5867d59fdd91
     health HEALTH_WARN 16 pgs degraded; 1 pgs incomplete; 1 pgs stuck inactive; 50 pgs stuck unclean; recovery 17706/6171055 objects degraded (0.287%)
     monmap e1: 1 mons at {osd140=10.193.207.72:6789/0}, election epoch 1, quorum 0 osd140
     osdmap e737: 29 osds: 29 up, 29 in
      pgmap v42187: 632 pgs, 9 pools, 2191 GB data, 548 kobjects
            2966 GB used, 9038 GB / 12004 GB avail
            17706/6171055 objects degraded (0.287%)
                 582 active+clean
                   1 incomplete
                  16 active+degraded
                  33 active+remapped

2. ceph osd tree

-bash-4.1$ ceph osd tree
# id    weight    type name    up/down    reweight
-1    12.8    root default
-2    4.4        host osd141
1    0.4            osd.1    DNE        
3    0.4            osd.3    up    1    
6    0.4            osd.6    up    1    
7    0.4            osd.7    up    1    
8    0.4            osd.8    up    1    
9    0.4            osd.9    up    1    
10    0.4            osd.10    up    1    
11    0.4            osd.11    up    1    
12    0.4            osd.12    up    1    
13    0.4            osd.13    up    1    
0    0.4            osd.0    DNE        
-3    4        host osd142
2    0.4            osd.2    up    1    
14    0.4            osd.14    up    1    
15    0.4            osd.15    up    1    
16    0.4            osd.16    up    1    
17    0.4            osd.17    up    1    
18    0.4            osd.18    up    1    
19    0.4            osd.19    up    1    
20    0.4            osd.20    up    1    
21    0.4            osd.21    up    1    
22    0.4            osd.22    up    1    
-4    4.4        host osd143
4    0.4            osd.4    DNE        
5    0.4            osd.5    up    1    
23    0.4            osd.23    up    1    
24    0.4            osd.24    up    1    
25    0.4            osd.25    up    1    
26    0.4            osd.26    up    1    
27    0.4            osd.27    up    1    
28    0.4            osd.28    up    1    
29    0.4            osd.29    up    1    
30    0.4            osd.30    up    1    
31    0.4            osd.31    up    1    

3. ceph pg dump | grep remapped

17.f4    1162    0    0    0    4873781248    1162    1162    active+remapped    2014-08-14 07:40:55.607346    706'1162    737:4834    [29,8,13,2,18,16,22,17,2147483647,28,26]    29    [29,8,13,2,18,16,22,17,15,28,26]    29    706'1162    2014-08-13 10:08:32.270670    0'0    2014-08-11 07:31:38.793882
17.cb    1084    0    0    0    4546625536    1084    1084    active+remapped    2014-08-14 07:40:55.470562    706'1084    737:6710    [31,11,18,2147483647,13,19,3,14,5,23,8]    31    [31,11,18,6,13,19,3,14,5,23,8]    31    706'1084    2014-08-13 12:28:46.357220    0'0    2014-08-11 07:31:38.789353
17.dd    1066    0    1066    0    4471128064    1066    1066    active+remapped    2014-08-14 08:31:39.123884    706'1066    737:9773    [28,2147483647,31,2,21,7,26,27,14,30,12]    28    [28,23,31,2,21,7,26,27,14,30,12]    28    706'1066    2014-08-13 12:10:10.342203    0'0    2014-08-11 07:31:38.803181
17.db    1066    0    1066    0    4471128064    1066    1066    active+remapped    2014-08-14 08:30:55.143277    706'1066    737:9809    [20,11,6,3,10,26,21,15,13,2147483647,2]    20    [20,11,6,3,10,26,21,15,13,28,2]    20    706'1066    2014-08-13 10:31:58.141533    0'0    2014-08-11 07:31:38.797543
17.a8    1095    0    2190    0    4592762880    1095    1095    active+remapped    2014-08-14 08:17:44.932635    706'1095    737:11090    [12,2147483647,9,11,10,23,3,26,6,27,13]    12    [12,22,9,11,10,23,3,26,6,27,13]    12    706'1095    2014-08-13 08:36:39.518331    0'0    2014-08-11 07:31:36.788551
17.88    1098    0    0    0    4605345792    1098    1098    active+remapped    2014-08-14 08:08:31.066301    706'1098    737:2206    [28,27,29,14,31,10,19,2147483647,5,9,17]    28    [28,27,29,14,31,10,19,11,5,9,17]    28    706'1098    2014-08-13 12:04:21.338467    0'0    2014-08-11 07:31:39.871384
17.9f    1098    0    0    0    4605345792    1098    1098    active+remapped    2014-08-14 07:40:55.064181    706'1098    737:6770    [6,20,18,12,17,24,11,10,7,23,2147483647]    6    [6,20,18,12,17,24,11,10,7,23,5]    6    706'1098    2014-08-13 10:03:41.250542    0'0    2014-08-11 07:31:38.808509

Note that the up set here contains 2147483647 (INT_MAX), CRUSH's placeholder for a slot it could not map to any OSD, rather than a real OSD id. A crushtool check for this is sketched after the version output below.

4. ceph -v
ceph version 0.82-456-g276dbfc (276dbfc4cbfe56e03615b3c387b5cbbebf21a1bc)
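
For reference, the mapping failure can also be reproduced offline with crushtool; a rough sketch (the rule id 1 and the replica count of 11 are assumptions based on the 11-wide acting sets above, not values read from this cluster):

ceph osd getcrushmap -o crushmap.bin
# Replay the pool's rule against the map; slots CRUSH cannot fill are
# reported as bad mappings (the same slots show up as 2147483647 above).
crushtool -i crushmap.bin --test --rule 1 --num-rep 11 --show-bad-mappings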

Actions #1

Updated by Guang Yang over 9 years ago

Right after I filed this bug I got a clue: the problem came from the removed OSDs (the ones still showing status DNE in the osd tree). After I did a crush remove for those OSDs, everything recovered. Is this behavior expected?
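
For reference, the cleanup described above amounts to deleting the dead entries from the CRUSH map; a minimal sketch, assuming the three DNE entries shown in the osd tree output (osd.0, osd.1, osd.4):

# Remove the entries that no longer exist (status DNE) from the CRUSH map
# so CRUSH stops considering them during placement.
ceph osd crush remove osd.0
ceph osd crush remove osd.1
ceph osd crush remove osd.4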

Actions #2

Updated by Sage Weil over 9 years ago

  • Project changed from rgw to Ceph
Actions #3

Updated by Samuel Just over 9 years ago

Ah, I think it's your very wide EC stripe. Try increasing total retries on the crush rule for that pool.
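
For anyone hitting the same symptom: one way to raise the retry count is to decompile the CRUSH map and add a set_choose_tries step to the erasure-coded pool's rule. A rough sketch, not verified against this cluster (the rule name and the value 100 are illustrative):

ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt
# In crushmap.txt, raise the retry budget inside the EC pool's rule, e.g.:
#   rule ecpool {
#       ...
#       step set_chooseleaf_tries 5
#       step set_choose_tries 100
#       step take default
#       step chooseleaf indep 0 type host
#       step emit
#   }
crushtool -c crushmap.txt -o crushmap.new
ceph osd setcrushmap -i crushmap.new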

Actions #4

Updated by Samuel Just over 9 years ago

We probably want to add a heuristic that notices when a pool might have this problem and points the user at a doc page.

Actions #5

Updated by Sage Weil over 9 years ago

  • Status changed from New to Won't Fix